I'm reluctant to apply either solution:
Emitting both tokens will likely still provide the user with a very long
result list. Even though the results with 'vans' in it are likely to be
ranked at the top, it's still not very user-friendly because of the
overwhelmingly large number of results (nor
Hi Arjen,
This is kind of a spin on your last observation that your list of stop
words doesn't change frequently. You could have a custom filter that
attempts to stem the incoming token and, only if it stems to the same form
as a stopword, sets the keyword attribute on the original token.
That way
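
A minimal sketch of the idea above, in plain Python rather than the Lucene
API. Everything in it is an assumption made for illustration: toy_stem is a
crude plural stripper standing in for a real stemmer, STOPWORDS is a
made-up list, and the (token, is_keyword) pair stands in for the keyword
attribute on a token.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"van", "de", "het"}


def toy_stem(token):
    """Crude stand-in for a real stemmer: strip a single trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def mark_keywords(tokens, stopwords=STOPWORDS, stem=toy_stem):
    """Flag a token as keyword only if its stem collides with a stopword."""
    marked = []
    for tok in tokens:
        # A stopword itself is not flagged here; it is handled by the stop step.
        is_keyword = tok not in stopwords and stem(tok) in stopwords
        marked.append((tok, is_keyword))
    return marked


def analyze(tokens, stopwords=STOPWORDS, stem=toy_stem):
    """Drop stopwords, then stem every token that is not flagged as keyword."""
    out = []
    for tok, is_keyword in mark_keywords(tokens, stopwords, stem):
        if tok in stopwords:
            continue  # stop step: remove the stopword itself
        out.append(tok if is_keyword else stem(tok))
    return out


print(analyze(["vans", "van", "laptops"]))
```

Here 'vans' survives unstemmed because its stem is a stopword, 'van' is
dropped, and 'laptops' is stemmed as usual, so no separate exception list
has to be maintained by hand.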
Hi Sujit,
Thanks. I was thinking along those lines myself. Conversely, the same
list of stopwords could be used to mark the stopwords themselves as
keywords, to prevent them from collapsing with rare words.
Best regards,
Arjen
On 10-7-2014 22:30 Sujit Pal wrote:
Hi Arjen,
This is kind of
I think emitting two tokens for vans is the right (potentially only) way to
do it. You could
also control the dictionary of terms that require this special treatment.
Is there any reason you are not happy with this approach?
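
A rough sketch of the two-token approach with a controlled dictionary, in
plain Python rather than the Lucene API. The names are assumptions for
illustration: toy_stem is a crude stand-in for a real stemmer, DUAL_EMIT is
a hypothetical dictionary of terms that get the special treatment, and each
output pair is (term, position increment), where an increment of 0 means
"same position as the previous term".

```python
# Hypothetical dictionary of terms that should be indexed in both forms.
DUAL_EMIT = {"vans"}


def toy_stem(token):
    """Crude stand-in for a real stemmer: strip a single trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def emit_tokens(tokens, dual_emit=DUAL_EMIT, stem=toy_stem):
    """Emit (term, position_increment) pairs.

    Terms in the dictionary are emitted twice at the same position:
    the original form first, then the stemmed form with increment 0.
    All other terms are emitted once, stemmed.
    """
    out = []
    for tok in tokens:
        stemmed = stem(tok)
        if tok in dual_emit and stemmed != tok:
            out.append((tok, 1))      # original form, advances the position
            out.append((stemmed, 0))  # stemmed form, stacked on the same position
        else:
            out.append((stemmed, 1))
    return out


print(emit_tokens(["vans", "laptops"]))
```

Only the terms listed in the dictionary produce an extra token, so the
growth of the index and of the result lists stays limited to those terms.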
On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden acmmail...@tweakers.net
Some of these anomalous cases are best handled by simply suppressing
stemming: PatternKeywordMarkerFilter and SetKeywordMarkerFilter set the
keyword attribute on matching tokens, and most stemmers will then leave
those tokens unchanged.
You can create a list of words to ignore, like plurals of
Hi Arjen,
You could also mark a token as a keyword so the stemmer passes it through
unchanged. For example, per the Javadocs for PorterStemFilter:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html
Note: This filter is aware of the
Arjen,
An approach requiring less list maintenance could be more advanced
linguistic processing to distinguish the stop word from the content word,
such as lemmatization rather than stemming.
A commercial offering, Rosette Search Essentials from Basis
http://www.basistech.com/search-essentials/