Lucene and multiple languages

2005-01-20 Thread aurora
I'm trying to build a web search tool that works for multiple
languages. I understand that Lucene ships with StandardAnalyzer plus
German and Russian analyzers, with some more in the sandbox, and that
indexing and searching should use the same analyzer.

Now let's say I have an index with documents in multiple languages,
analyzed by an assortment of analyzers. When a user enters a query, which
analyzer should be used? Should the user be asked for the language
up front? What should I expect when the analyzer and the document don't
match? Let's say the query is parsed using StandardAnalyzer. Would it
match any documents indexed with the German analyzer at all, or would it
just give poor results?

Also, is there a good way to find out the language used in a web page?
There is a 'Content-Language' header in HTTP and a 'lang' attribute in
HTML, but it looks like people don't really use them. How can we recognize
the language?

Even more interesting is multiple languages used in one document, let's  
say half English and half French. Is there a good way to deal with those  
cases?

Thanks for any guidance.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene and multiple languages

2005-01-20 Thread Daniel Naber
On Thursday 20 January 2005 21:08, aurora wrote:

> Now let's say I have an index with documents in multiple languages,
> analyzed by an assortment of analyzers. When a user enters a query, which
> analyzer should be used?

Use q1 OR q2, where q1 is the query parsed with the analyzer for language
1, q2 is the query parsed with the analyzer for language 2 (and so on). If
there are conflicts you could also add a required term query to each
subquery, like language:en^0 (the ^0 boost keeps that clause from
affecting the score), so that, for example, the English analyzer query
only searches documents that have been identified as English.
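
The OR-of-subqueries idea above could be sketched roughly as follows. This is
only an illustration of the shape of the combined query, written in Lucene
query syntax as a plain string. It assumes each document was indexed with a
"language" field, and the per-language "analyzers" here are toy suffix
strippers standing in for real Lucene analyzers; MultiLanguageQuery, build and
stripSuffix are made-up names, not Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.UnaryOperator;

class MultiLanguageQuery {

    // Toy stand-in for a language-specific analyzer (assumption, not Lucene):
    // lowercases every term and strips one suffix from it.
    static String stripSuffix(String query, String suffix) {
        StringJoiner terms = new StringJoiner(" ");
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            boolean strip = term.length() > suffix.length() && term.endsWith(suffix);
            terms.add(strip ? term.substring(0, term.length() - suffix.length()) : term);
        }
        return terms.toString();
    }

    // Runs the user query through each language's analyzer and ORs the results.
    // Each subquery carries a required, zero-boost language clause so that it
    // only matches documents tagged with that language.
    static String build(String userQuery, Map<String, UnaryOperator<String>> analyzers) {
        StringJoiner combined = new StringJoiner(" OR ");
        for (Map.Entry<String, UnaryOperator<String>> entry : analyzers.entrySet()) {
            String analyzed = entry.getValue().apply(userQuery);
            combined.add("(+language:" + entry.getKey() + "^0 +(" + analyzed + "))");
        }
        return combined.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the subqueries in insertion order.
        Map<String, UnaryOperator<String>> analyzers = new LinkedHashMap<>();
        analyzers.put("en", q -> stripSuffix(q, "s"));
        analyzers.put("de", q -> stripSuffix(q, "en"));
        System.out.println(build("Red Houses", analyzers));
    }
}
```

In a real application each subquery would presumably be produced by parsing
the user input with the matching analyzer and the parts combined as query
objects rather than strings; the string form is used here only to keep the
sketch free of library dependencies.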

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Lucene and multiple languages

2005-01-20 Thread Ernesto De Santis
Hi Aurora
I developed a tool that had to deal with this multiple-language issue. I
found the Nutch language-identifier library very useful. The jar has Nutch
dependencies, but I deleted all the code that was unnecessary (for me,
obviously).
The language-identifier I use works fine and is very simple:
For example:
LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(userInputText);
This works for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and others.
I can send you this modified jar, but remember that it comes from
Nutch, so mind the copyright (or copyleft :).
http://www.nutch.org/LICENSE.txt
More comments below...
aurora wrote:

> I'm trying to build a web search tool that works for multiple
> languages. I understand that Lucene ships with StandardAnalyzer plus
> German and Russian analyzers, with some more in the sandbox, and that
> indexing and searching should use the same analyzer.
>
> Now let's say I have an index with documents in multiple languages,
> analyzed by an assortment of analyzers. When a user enters a query,
> which analyzer should be used? Should the user be asked for the
> language up front? What should I expect when the analyzer and the
> document don't match? Let's say the query is parsed using
> StandardAnalyzer. Would it match any documents indexed with the German
> analyzer at all, or would it just give poor results?

When this happens, in most cases you will not get any matches.

> Also, is there a good way to find out the language used in a web
> page? There is a 'Content-Language' header in HTTP and a 'lang'
> attribute in HTML, but it looks like people don't really use them.
> How can we recognize the language?

With a language identifier. :)

> Even more interesting is multiple languages used in one document,
> let's say half English and half French. Is there a good way to deal
> with those cases?

The language identifier returns only one language. I looked into the
language-identifier code: it computes a score for each language and
returns the language with the highest score.
Maybe you can modify the language-identifier to return the several
highest-scoring languages instead.
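
A rough sketch of that modification, assuming per-language scores are
available: instead of returning only the best language, return every language
whose score clears a threshold. The scoring below is a toy stop-word count
and is not how the actual language-identifier library scores text; the class
name, the word lists and the 0.1 threshold are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TopLanguages {

    // Hypothetical, tiny stop-word lists; a real identifier would use a
    // much richer model.
    static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of", "with")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "est", "avec")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "und", "ist", "mit")));
    }

    // Score per language: fraction of tokens that are stop words of that language.
    static Map<String, Double> score(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> lang : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (lang.getValue().contains(token)) {
                    hits++;
                }
            }
            scores.put(lang.getKey(), tokens.length == 0 ? 0.0 : (double) hits / tokens.length);
        }
        return scores;
    }

    // Returns all languages scoring at least minScore, best first, rather
    // than only the single top-scoring language.
    static List<String> identifyAll(String text, double minScore) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(score(text).entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> languages = new ArrayList<>();
        for (Map.Entry<String, Double> entry : ranked) {
            if (entry.getValue() >= minScore) {
                languages.add(entry.getKey());
            }
        }
        return languages;
    }

    public static void main(String[] args) {
        String mixed = "The cat is on the roof and le chat est sur le toit avec la souris";
        System.out.println(identifyAll(mixed, 0.1)); // prints [fr, en]
    }
}
```

A document identified as both English and French could then be indexed with
both language tags, or split into sections and identified section by section.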
Bye
Ernesto.
> Thanks for any guidance.