Hello list,

We have a fairly large Lucene index for a forum with more than 30 million posts. Users post and search for all kinds of things. So that users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter.

Unfortunately, users sometimes run into words that get stemmed to something that is basically a stop word. Or, conversely, a very common word is stemmed to the same form as a rare word.

We do index stop words, so in theory users could still find their result. But when a rare word is stemmed to a form that yields a million hits, the search becomes practically unusable.

One example is the Dutch word 'van', the equivalent of 'of' in English. A user searched for the shoe brand 'vans', which gets stemmed to 'van' and therefore returns useless results.

I have already found the KeywordRepeatFilter, which would let us index and search both 'vans' and 'van', and the StemmerOverrideFilter, which could prevent these cases for words we know about. Are there any other solutions for this kind of problem?
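
To make the two options concrete, here is a rough plain-Java model of what each one does to the query tokens. It is only a sketch and does not use the Lucene API: the class StemOverrideSketch, its methods, and the toy stemmer (which just strips a trailing 's') are made up for illustration and are not the Snowball Dutch stemmer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are Lucene classes or methods.
final class StemOverrideSketch {

    // Hypothetical stand-in for a real stemmer: strips one trailing 's'
    // from tokens longer than three characters.
    static String toyStem(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    // Override idea: a token listed in the map gets the mapped form and
    // is never passed to the stemmer; every other token is stemmed.
    static List<String> withOverrides(List<String> tokens, Map<String, String> overrides) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String forced = overrides.get(token);
            out.add(forced != null ? forced : toyStem(token));
        }
        return out;
    }

    // Keyword-repeat idea: emit both the original token and its stem,
    // dropping the second copy when the stemmer changed nothing.
    static List<String> withKeywordRepeat(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            Set<String> forms = new LinkedHashSet<>();
            forms.add(token);
            forms.add(toyStem(token));
            out.addAll(forms);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> query = List.of("vans", "schoenen");
        // The override keeps 'vans' intact instead of reducing it to 'van'.
        System.out.println(withOverrides(query, Map.of("vans", "vans")));
        // Keyword repeat keeps 'vans' and also adds the stem 'van'.
        System.out.println(withKeywordRepeat(query));
    }
}
```

In this model the override only helps for words we list by hand, while the keyword-repeat variant covers every word but still lets 'van' match, so the exact form 'vans' would have to be ranked higher to be of any use.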

Best regards,

Arjen van der Meijden
