How to handle words that stem to stop words

Arjen van der Meijden Sun, 06 Jul 2014 11:48:16 -0700

Hello list,

We have a fairly large Lucene database for a 30+ million post forum. Users post and search for all kinds of things. To make sure users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter.

Unfortunately, users sometimes find words that get stemmed to a word that is basically a stop word. Or conversely, a very common word is stemmed so that it becomes the same as a rare word.

We do index stop words, so theoretically users could still find their result. But when a rare word is stemmed in such a way that it yields a million hits, the search becomes practically unusable.

One example is the Dutch word 'van', which is the equivalent of 'of' in English. A user tried to search for the shoe brand 'vans', which gets stemmed to 'van' and obviously gives useless results.

I have already found the KeywordRepeatFilter, which would let us index/search both 'vans' and 'van', and the StemmerOverrideFilter, which could be used to try and prevent these cases. Are there any other solutions for these kinds of problems?
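
To make the two options concrete, here is a minimal sketch in plain Java of what each idea amounts to. It deliberately does not use the Lucene classes: toyStem is a made-up stand-in for the Dutch Snowball stemmer (it only strips a trailing "s"), and OVERRIDES, stemWithOverrides and originalAndStem are hypothetical names chosen for illustration. The first method mimics an override table that protects 'vans' from stemming; the second mimics emitting both the original token and its stem.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StemOverrideSketch {
    // Hypothetical override table: a word listed here maps to the form
    // that should be indexed instead of whatever the stemmer would produce.
    static final Map<String, String> OVERRIDES = Map.of("vans", "vans");

    // Toy stand-in for a real stemmer (assumption, NOT the Snowball algorithm):
    // strips a single trailing "s" from words longer than three characters.
    static String toyStem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Override idea: consult the table first, fall back to the stemmer.
    static String stemWithOverrides(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        String override = OVERRIDES.get(lower);
        return override != null ? override : toyStem(lower);
    }

    // Keyword-repeat idea: keep the original token and add its stem,
    // dropping the duplicate when stemming changes nothing.
    static List<String> originalAndStem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(lower);
        tokens.add(toyStem(lower));
        return new ArrayList<>(tokens);
    }
}
```

With the override table, 'vans' stays 'vans' and no longer collides with 'van'. With the repeat approach, 'vans' is indexed as both 'vans' and 'van', so an exact match on 'vans' is still possible, though a query for the stem 'van' would continue to match those posts.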


Best regards,

Arjen van der Meijden

