I'm trying to build a web search tool that works for multiple languages. I understand that Lucene ships with StandardAnalyzer plus German and Russian analyzers, with some more in the sandbox, and that indexing and searching should use the same analyzer.

Now let's say I have an index with documents in multiple languages, analyzed by an assortment of analyzers. When a user enters a query, which analyzer should be used? Should the user be asked for the language up front? What should I expect when the analyzer used for the query and the one used for the document don't match? Let's say the query is parsed using StandardAnalyzer. Would it match any documents indexed with the German analyzer at all, or would it just give poor results?

Also, is there a good way to find out the language used in a web page? There is a 'Content-Language' header in HTTP and a 'lang' attribute in HTML, but it looks like people don't really use them. How can we recognize the language?
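
For reference, here is a minimal sketch of the declared-language check I have in mind. The function name `declared_language` and the plain dict of response headers are my own assumptions for illustration, not any library's API. It only reads what the page declares, so it returns nothing useful when neither hint is present, which is exactly the case I'm asking about.

```python
import re
from typing import Optional

# Matches a lang="..." or xml:lang="..." attribute on the opening <html> tag.
_HTML_LANG = re.compile(
    r"<html\b[^>]*?\s(?:xml:)?lang\s*=\s*[\"']?([A-Za-z][A-Za-z0-9-]*)",
    re.IGNORECASE,
)


def declared_language(headers: dict, html: str) -> Optional[str]:
    """Return the primary language subtag the page declares, or None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("content-language", "").strip()

    # Fall back to the lang attribute of the <html> element.
    if not value:
        match = _HTML_LANG.search(html)
        value = match.group(1) if match else ""
    if not value:
        return None

    # Content-Language may list several tags ("de-DE, en"); take the first
    # tag and keep only its primary subtag ("de").
    first = value.split(",")[0].strip()
    return first.split("-")[0].lower() or None
```

The idea would be to use the returned code to pick an analyzer at indexing time, and to fall back to some other detection method when it returns None.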

Even more interesting is the case of multiple languages in one document, say half English and half French. Is there a good way to deal with those cases?

Thanks for any guidance.

