Otis Gospodnetic <[EMAIL PROTECTED]> writes:

> For indexing text that has multiple languages.... I don't know what to
> recommend. Well, I do - try the StandardAnalyzer and see if that
> produces satisfactory results, but you'd really need a smart analyzer
> that knows how to properly tokenize and filter words from multiple
> languages, and I haven't heard of anyone doing that here.
We have a collection of Reuters documents in 13 languages (mostly
European, but also Russian, Chinese, and Japanese) that we've indexed
successfully with our Lucene-based system. The text is all in standard,
modern encodings. Collection link:

http://trec.nist.gov/data/reuters/reuters.html

We had no problems whatsoever on the Lucene end. You do need to take
care about how you decode your text before you feed it to an analyzer,
and you need to do the same with queries. Obviously the standard Lucene
analyzer assumes words separated by punctuation and whitespace, which is
not so good for Asian-language retrieval performance, and of course it
does no stemming, if you want that. You're best off using
language-specific analyzer chains. If you don't know the language before
analysis, that's a harder problem.

Ian
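To make those two points concrete - explicit character decoding, and
picking an analyzer chain per language - here's a minimal sketch. It
assumes a Lucene release where the contributed language analyzers
(GermanAnalyzer, RussianAnalyzer, CJKAnalyzer) have no-argument
constructors; the language codes, class name, and file handling are
illustrative, not from our actual system:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.ru.RussianAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MultilingualIndexingSketch {

        // Pick an analyzer chain from a language code you already
        // know. The codes and choices here are just examples.
        static Analyzer analyzerFor(String lang) {
            if ("de".equals(lang)) return new GermanAnalyzer();
            if ("ru".equals(lang)) return new RussianAnalyzer();
            if ("zh".equals(lang) || "ja".equals(lang))
                return new CJKAnalyzer();
            return new StandardAnalyzer();  // fallback for the rest
        }

        // Decode the raw bytes with an explicit charset before
        // anything touches an analyzer; relying on the platform
        // default encoding is the usual way multilingual text
        // gets mangled.
        static Reader open(String path, String charset)
                throws IOException {
            return new BufferedReader(new InputStreamReader(
                    new FileInputStream(path), charset));
        }
    }

Whatever you do at index time, run the query text through the same
decoding and the same analyzer, or the tokens won't match.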