Hello list,
We have a fairly large Lucene index for a forum with 30+ million posts.
Users post and search for all kinds of things. To make sure users don't
have to type exact matches, we combine a WordDelimiterFilter with a
(Dutch) SnowballFilter.
Unfortunately, users sometimes find words that get stemmed to a word
that's basically a stop word. Or, conversely, a very common word is
stemmed so that it becomes the same as a rare word.
We do index stop words, so in theory users could still find their
result. But when a rare word is stemmed in such a way that it yields a
million hits, the results are effectively unusable.
One example is the Dutch word 'van' which is the equivalent of 'of' in
English. A user tried to search for the shoe brand 'vans', which gets
stemmed to 'van' and obviously gives useless results.
I already found the KeywordRepeatFilter, which would let us index and
search both 'vans' and 'van', and the StemmerOverrideFilter, which could
keep specific words like this from being stemmed.
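To make the two options concrete, here is a rough sketch in plain Java of
what I understand each of them to do for a single token. It does not use any
Lucene classes: toyStem is only a stand-in for the Dutch Snowball stemmer (it
just strips a trailing 's'), and the OVERRIDES map and all class and method
names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StemGuardSketch {

    // Hypothetical override dictionary: token -> term to use instead of its stem.
    // Mapping a word to itself protects it from stemming.
    static final Map<String, String> OVERRIDES = Map.of("vans", "vans");

    // Stand-in for the real stemmer (an assumption, NOT the Snowball algorithm):
    // strips a single trailing 's' from tokens longer than three characters.
    static String toyStem(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    // Override idea: consult the dictionary first, stem only when there is no entry.
    static String stemWithOverride(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        String override = OVERRIDES.get(lower);
        return override != null ? override : toyStem(lower);
    }

    // Keyword-repeat idea: keep the original token next to its stem,
    // dropping the duplicate when stemming changes nothing.
    static List<String> originalPlusStem(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        Set<String> terms = new LinkedHashSet<>();
        terms.add(lower);
        terms.add(toyStem(lower));
        return new ArrayList<>(terms);
    }

    public static void main(String[] args) {
        System.out.println(toyStem("vans"));           // plain stemming
        System.out.println(stemWithOverride("vans"));  // protected by the override
        System.out.println(originalPlusStem("vans"));  // original and stem side by side
        System.out.println(originalPlusStem("van"));   // stem equals original
    }
}
```

With the override, 'vans' never collapses into 'van', but every problem word
has to be listed by hand. With the repeat approach nothing has to be listed,
but 'vans' is still indexed as 'van' too, so the exact form would have to be
ranked higher at query time to be of any use.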
Are there any other solutions for these kinds of problems?
Best regards,
Arjen van der Meijden