On Oct 13, 2006, at 3:42 AM, Antony Bowesman wrote:
I am writing a framework that needs to be able to index documents
from a range of languages where just the character set of the
document is known. Has anyone looked at, or is anyone using, language
analysis to determine the language of a document in ISO-8859-1?
There is a language identifier plugin in the Nutch codebase that
could surely be distilled (and there are plans to do so) into a
standalone library:
<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/languageidentifier/>
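
To give a feel for how this kind of identification can work, here is a
rough sketch of one common technique, character trigram matching. This
is not the Nutch code. The function names, the two tiny training
samples, and the scoring rule are all made up for illustration, and a
usable identifier would need profiles built from far more text per
language.

```python
# Toy language guesser based on character trigrams.
# Assumption: each language is represented by one short training sample;
# real profiles would be built from a large corpus per language.

def trigrams(text):
    """Return the list of character trigrams in text, padded with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_profile(sample):
    """A profile here is simply the set of trigrams seen in the sample."""
    return set(trigrams(sample))

def guess_language(text, profiles):
    """Pick the language whose profile covers the most trigrams of text."""
    grams = trigrams(text)
    if not grams:
        return None  # nothing to score, e.g. empty input

    def coverage(lang):
        return sum(g in profiles[lang] for g in grams) / len(grams)

    return max(profiles, key=coverage)

# Hypothetical training samples, far too small for real use.
PROFILES = {
    "en": build_profile(
        "the dog and the cat are in the house and this is the language of the document"
    ),
    "de": build_profile(
        "der hund und die katze sind im haus und das ist die sprache des dokuments"
    ),
}

print(guess_language("this is the document", PROFILES))
print(guess_language("das ist der hund", PROFILES))
```

Since both languages fit in ISO-8859-1, the character set alone cannot
tell them apart, but the trigram statistics usually can once the
profiles are large enough.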
What about stemming? I see Google now says it does stemming, but
again here language detection seems to be a stumbling block in the
way of choosing the right stemmer. Does stemming provide much of
an index size reduction and is it actually useful in search?
Stemming shouldn't be considered a way to reduce index size, but rather
a way to improve a user's experience of findability. It is quite useful
in the right situations, but it is not something that all projects
desire, so you'd have to see if it fits your needs specifically.
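
As a small illustration of the findability point, the sketch below
indexes the same three documents with and without stemming. The
stemmer is a deliberately crude suffix stripper that I made up for this
example; it is not Porter or Snowball, and the helper names are
hypothetical rather than any Lucene API.

```python
# Toy comparison of an unstemmed and a stemmed inverted index.
# Assumption: toy_stem is a crude English-only stand-in for a real stemmer.

def toy_stem(word):
    """Strip a few common English suffixes, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def identity(word):
    """No normalization: index the token exactly as written."""
    return word

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index

def search(index, query, normalize):
    """Look up a single-term query, normalized the same way as the index."""
    return index.get(normalize(query.lower()), set())

DOCS = {
    1: "indexing documents",
    2: "the document was indexed",
    3: "search engines",
}

plain = build_index(DOCS, identity)
stemmed = build_index(DOCS, toy_stem)

print(search(plain, "document", identity))
print(search(stemmed, "document", toy_stem))
```

Without stemming, a query for "document" misses the document that only
contains "documents"; with stemming, both forms collapse to the same
term and both documents are found. The stemmer has to match the
language of the text, which is why language identification comes first.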
Erik