Hi Aurora
I develop a tool with this multiple languages issue. I found very useful
an nuke library "language-identifier". This jar have nuke dependencies,
but I delete all unnecessary code (for me obvious).
This language-identifier that I use work fine and is very simple:
For example:
LanguageIdentifier languageIdentifier =
LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(text);
This work for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and Others.
I can send you this touched jar, but remember that this jar is from
Nuke, for copyright (or left :).
http://www.nutch.org/LICENSE.txt
More comments above...
aurora escribió:
I'm trying to build some web search tool that could work for
multiple languages. I understand that Lucene is shipped with
StandardAnalyzer plus a German and Russian analyzers and some more
in the sandbox. And that indexing and searching should use the
same analyzer.
Now let's said I have an index with documents in multiple languages
and analyzed by an assortment of analyzers. When user enter a
query, what analyzer should be used? Should the user be asked for
the language upfront? What to expect when the analyzer and the
document doesn't match? Let's said the query is parsed using
StandardAnalyzer. Would it match any documents done in German
analyzer at all. Or would it end up in poor result?
When this happen, in the major cases you do not obtain matchs.
Also is there a good way to find out the languages used in a web
page? There is a 'content-langage' header in http and a 'lang'
attribute in HTML. Looks like people don't really use them. How
can we recognize the language?
With language identifier. :)
Even more interesting is multiple languages used in one document,
let's say half English and half French. Is there a good way to
deal with those cases?
Language identifier only return one language. I look into
language-identifier and work with a score for each language, and return
the language with greater value.
Maybe you can modify the language-identifier for take the most greater
values.
Bye
Ernesto.
Thanks for any guidance.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]