I send you the source code in a private mail.

Ernesto.

aurora escribió:

Thanks. I would like to give it a try. Is the source code available? I'm using a Python version of Lucene so it would need to be wrapped or ported :)

Hi Aurora

I develop a tool with this multiple languages issue. I found very useful
an nuke library "language-identifier". This jar have nuke dependencies,
but I delete all unnecessary code (for me obvious).

This language-identifier that I use work fine and is very simple:
For example:

LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(text);


This work for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and Others.

I can send you this touched jar, but remember that this jar is from
Nuke, for copyright (or left :).
http://www.nutch.org/LICENSE.txt

More comments above...

aurora escribió:

I'm trying to build some web search tool that could work for multiple languages. I understand that Lucene is shipped with StandardAnalyzer plus a German and Russian analyzers and some more in the sandbox. And that indexing and searching should use the same analyzer.

Now let's said I have an index with documents in multiple languages and analyzed by an assortment of analyzers. When user enter a query, what analyzer should be used? Should the user be asked for the language upfront? What to expect when the analyzer and the document doesn't match? Let's said the query is parsed using StandardAnalyzer. Would it match any documents done in German analyzer at all. Or would it end up in poor result?

When this happen, in the major cases you do not obtain matchs.

Also is there a good way to find out the languages used in a web page? There is a 'content-langage' header in http and a 'lang' attribute in HTML. Looks like people don't really use them. How can we recognize the language?

With language identifier. :)

Even more interesting is multiple languages used in one document, let's say half English and half French. Is there a good way to deal with those cases?

Language identifier only return one language. I look into
language-identifier and work with a score for each language, and return
the language with greater value.
Maybe you can modify the language-identifier for take the most greater
values.

Bye
Ernesto.

Thanks for any guidance.


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]






--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to