Re: Analyzers and multiple languages

Mark Miller Fri, 13 Oct 2006 05:49:43 -0700

Generally, stemming is not a method for index size reduction, even though
that might be a side effect. It is very useful in search, however: you would
generally want a search for skiing to also hit ski and skier (I can't spell,
so don't get caught up on that). There are lots of examples like that. If you
are doing general search, stemming is great, if not quite as great as
lemmatization. Look at the Snowball stemmers in contrib. Martin Porter, the
stemming king, wrote them, I believe.
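
To make the skiing/ski/skier point concrete, here is a toy suffix stripper
in Java. It is only a sketch: the class name ToyStemmer, the suffix list and
the minimum stem length are all made up for illustration, and it is not the
Snowball or Porter algorithm, which apply far more careful rules. It just
shows how different surface forms can collapse to one index term.

```java
import java.util.Locale;

// Toy example only: NOT the Snowball/Porter algorithm.
final class ToyStemmer {
    // Hypothetical suffix list, checked in order.
    private static final String[] SUFFIXES = {"ing", "er", "s"};

    // Assumed minimum number of characters that must remain after stripping.
    private static final int MIN_STEM_LENGTH = 3;

    // Lower-cases the word and strips the first matching suffix, if the
    // remaining stem is long enough. Otherwise returns the lower-cased word.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return w.substring(0, stemLength);
            }
        }
        return w;
    }
}
```

If the same function is applied at index time and at query time, a query for
skiing ends up looking for the same term that ski and skier were indexed
under.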



Language recognition can be a pain in the ass. Do some Google searching and
check out this:
http://en.wikipedia.org/wiki/Language_recognition_chart
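
One crude way to guess a language, assuming the text has already been decoded
into a Java String (for example from ISO-8859-1), is to count hits against
small per-language stop word lists. The sketch below does that. The class
name, the three tiny word lists and the "unknown" fallback are assumptions
for illustration only; real detectors usually rely on character n-gram
statistics and do much better on short text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy heuristic only: picks the language whose stop words occur most often.
final class StopWordLanguageGuesser {
    // Hypothetical, deliberately tiny stop word lists keyed by language code.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "to")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "und", "das", "ist", "nicht")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "les", "est")));
    }

    // Returns the language code with the most stop word hits, or "unknown"
    // when no stop word from any list appears. Ties go to the earlier entry.
    static String guess(String text) {
        // Split on anything that is not a letter, so punctuation is ignored.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The guessed code could then be used to pick a stemmer and stop word set per
document, with "unknown" falling back to a language-neutral analyzer.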

- Mark

On 10/13/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:


Hello,

I'm new to Lucene and wanted some advice on analyzers, stemmers and
language analysis.  I've got LIA, so have read its chapters.

I am writing a framework that needs to be able to index documents from a
range of languages where just the character set of the document is known.
Has anyone looked at, or is anyone using, language analysis to determine
the language of a document in ISO-8859-1?

Is it worth doing, or does StandardAnalyzer cope well with most European
languages as long as it is provided with a suitable multi-lingual set of
stop words?

What about stemming?  I see Google now says it does stemming, but again
here language detection seems to be a stumbling block in the way of
choosing the right stemmer.  Does stemming provide much of an index size
reduction and is it actually useful in search?

Antony

