Generally, stemming is not a method for index size reduction, even though that may be a side effect. It is very useful in search, however: you would generally want a search for "skiing" to also hit "ski" and "skier" (I can't spell, so don't get caught up on that). There are lots of examples like that. If you are doing general search, stemming is great, if not quite as great as lemmatization. Look at the Snowball stemmers in contrib. The stemming king wrote them, I believe.
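To make the skiing/ski/skier point concrete, here is a rough sketch of stemming at both index time and query time. It is a toy: `toy_stem`, `SUFFIXES`, `build_index` and `search` are names I made up for illustration, and the suffix list has nothing to do with the real Snowball rules or with Lucene's API. It only shows why the three word forms end up under one index term.

```python
from collections import defaultdict

# Toy suffix list, an assumption for illustration only.
# Real Snowball stemmers use much richer, language-specific rules.
SUFFIXES = ("ing", "ers", "er", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way the documents were stemmed."""
    return sorted(index.get(toy_stem(query), set()))


docs = {1: "ski", 2: "skier", 3: "skiing", 4: "snowboard"}
index = build_index(docs)

print(search(index, "skiing"))  # documents 1, 2 and 3 all match
print(len(index))               # 2 terms instead of 4 unstemmed ones
```

The smaller term count is the index size side effect mentioned above; the real win is that the query and the documents meet at the same stem.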
Language recognition can be a pain in the ass. Do some Google searching and check this out: http://en.wikipedia.org/wiki/Language_recognition_chart

- Mark

On 10/13/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:
Hello,

I'm new to Lucene and wanted some advice on analyzers, stemmers and language analysis. I've got LIA, so have read its chapters.

I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyone looked at, or is anyone using, language analysis to determine the language of a document in ISO-8859-1? Is it worth doing, or does StandardAnalyzer cope well with most European languages as long as it is provided with a suitable multilingual set of stop words?

What about stemming? I see Google now says it does stemming, but again language detection seems to be a stumbling block in the way of choosing the right stemmer. Does stemming provide much of an index size reduction, and is it actually useful in search?

Antony