Otis Gospodnetic <[EMAIL PROTECTED]> writes:

> For indexing text that has multiple languages.... I don't know what to
> recommend. Well, I do - try the StandardAnalyzer and see if that
> produces satisfactory results, but you'd really need a smart analyzer
> that knows how to properly tokenize and filter words from multiple
> languages, and I haven't heard of anyone doing that here.
We have a collection of Reuters documents in 13 languages (mostly
European, but also Russian, Chinese, and Japanese) that we've indexed
successfully with our Lucene-based system. The text is all in standard,
modern encodings. Collection link:

http://trec.nist.gov/data/reuters/reuters.html

We had no problems whatsoever on the Lucene end. You do need to take
care about how you decode your text before you feed it to an analyzer,
and you need to do the same with queries. Obviously the standard Lucene
analyzer assumes words separated by punctuation and whitespace, which is
not so good for Asian-language retrieval performance, and of course it
does no stemming, if you want that. You're best off using
language-specific analyzer chains. If you don't know the language before
analysis, that's a harder problem.

Ian
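To make those two points concrete - explicit character decoding, and
picking an analyzer chain per language - here's a minimal sketch. It
assumes a Lucene release where the contributed language analyzers
(GermanAnalyzer, RussianAnalyzer, CJKAnalyzer) have no-argument
constructors; the language codes, class name, and file handling are
illustrative, not from our actual system:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.ru.RussianAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class MultilingualIndexingSketch {

        // Pick an analyzer chain from a language code you already
        // know. The codes and choices here are just examples.
        static Analyzer analyzerFor(String lang) {
            if ("de".equals(lang)) return new GermanAnalyzer();
            if ("ru".equals(lang)) return new RussianAnalyzer();
            if ("zh".equals(lang) || "ja".equals(lang))
                return new CJKAnalyzer();
            return new StandardAnalyzer();  // fallback for the rest
        }

        // Decode the raw bytes with an explicit charset before
        // anything touches an analyzer; relying on the platform
        // default encoding is the usual way multilingual text
        // gets mangled.
        static Reader open(String path, String charset)
                throws IOException {
            return new BufferedReader(new InputStreamReader(
                    new FileInputStream(path), charset));
        }
    }

Whatever you do at index time, run the query text through the same
decoding and the same analyzer, or the tokens won't match.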