protwords.txt support in stemmers

Robert Muir Tue, 30 Mar 2010 05:07:19 -0700

Hello Solr devs,

One thing we did recently in Lucene that I would like to expose in Solr is
adding support for "protected words" to all stemmers.


The way this works is that a TokenStream attribute, 'KeywordAttribute',
carries a boolean flag, and all the stemfilters know to leave tokens with
this flag set unchanged.
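
To make the flag idea concrete, here is a minimal sketch in plain Java. It
is a toy model, not the Lucene API: the Token class, its keyword field and
the "strip a trailing s" stemmer are all made up for illustration, with the
boolean standing in for KeywordAttribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the keyword flag; none of these names are Lucene classes. */
final class KeywordFlagSketch {

    /** A token: its text plus a boolean standing in for KeywordAttribute. */
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    /** Deliberately naive stemmer: strips one trailing "s" unless the token is flagged. */
    static List<Token> stem(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (t.keyword || t.text.length() < 2 || !t.text.endsWith("s")) {
                out.add(t); // flagged (or nothing to strip): pass through untouched
            } else {
                out.add(new Token(t.text.substring(0, t.text.length() - 1), false));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("cats", false),   // ordinary token, gets stemmed
                new Token("paris", true));  // flagged as keyword, left alone
        for (Token t : stem(tokens)) {
            System.out.println(t.text);     // prints "cat" then "paris"
        }
    }
}
```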

We also added two neat tokenfilters that make this easy to use:
* KeywordMarkerFilter: a tokenfilter that, given a set of input words, marks
them as keywords with this attribute so any later stemmer ignores them.
* StemmerOverrideFilter: a tokenfilter that, given a map of input
words->stems, stems them with the dictionary and marks them as keywords so
any later stemmer ignores them.
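
Roughly, the two filters chain in front of a stemmer like this. Again this
is a plain-Java sketch under my own assumptions, not the real classes:
markKeywords, overrideStems and naiveStem are hypothetical stand-ins, and
having the override step skip tokens that are already flagged is a modelling
choice made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the two filters; the names below are illustrative, not Lucene's API. */
final class KeywordFiltersSketch {

    /** A token: its text plus a boolean standing in for KeywordAttribute. */
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    /** Stand-in for KeywordMarkerFilter: flags every token found in the protected set. */
    static List<Token> markKeywords(List<Token> in, Set<String> protectedWords) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.text, t.keyword || protectedWords.contains(t.text)));
        }
        return out;
    }

    /** Stand-in for StemmerOverrideFilter: replaces a word by its dictionary stem and flags it. */
    static List<Token> overrideStems(List<Token> in, Map<String, String> stems) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            String stem = t.keyword ? null : stems.get(t.text);
            out.add(stem == null ? t : new Token(stem, true));
        }
        return out;
    }

    /** Deliberately naive stemmer: strips one trailing "s" unless the token is flagged. */
    static List<Token> naiveStem(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (t.keyword || t.text.length() < 2 || !t.text.endsWith("s")) {
                out.add(t);
            } else {
                out.add(new Token(t.text.substring(0, t.text.length() - 1), false));
            }
        }
        return out;
    }

    /** Runs marker -> override -> stemmer and returns the resulting token texts. */
    static List<String> analyze(List<String> words, Set<String> protectedWords,
                                Map<String, String> stems) {
        List<Token> tokens = new ArrayList<>();
        for (String w : words) {
            tokens.add(new Token(w, false));
        }
        tokens = naiveStem(overrideStems(markKeywords(tokens, protectedWords), stems));
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = analyze(
                Arrays.asList("cats", "news", "buses"),
                Collections.singleton("news"),            // protected word
                Collections.singletonMap("buses", "bus")); // word -> stem override
        System.out.println(result); // prints [cat, news, bus]
    }
}
```

Note that "bus" survives only because the override step also sets the flag;
without it the naive stemmer above would go on to strip the trailing "s".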

We have a few choices:
* We could treat this stuff as impl details, and add protwords.txt support
to all stemming factories. We could just wrap the filter with a
KeywordMarkerFilter internally.
* We could deprecate the explicit protwords.txt in the few factories that
support it, and instead create a factory for KeywordMarkerFilter.
* We could do something else, e.g. both.

So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
could do:

<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.SomeStemmer"/>

and get the same effect, instead of having to add support for protwords.txt
to every single stem factory.

I don't really have a personal preference as to how we do it, but it would
be cool to have a plan so we can add these factories and clean a few things
up.

In any event I think we should add a factory for the StemmerOverrideFilter,
so someone can have a text file with exceptions; the Dutch handling for
"fiets" comes to mind.

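For the exceptions file, a factory would need to turn the text into a
word->stem map. Here is one way that could look in plain Java. The file
format (one "word<TAB>stem" pair per line, "#" for comments) is purely an
assumption for this sketch, and so is the parse helper; the "fiets" entry
maps the word to itself only as a placeholder, since the right stem is for
whoever writes the dictionary to decide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of loading stem exceptions; the file format here is an assumption. */
final class StemExceptionsSketch {

    /** Parses "word<TAB>stem" lines, skipping blank lines and "#" comments. */
    static Map<String, String> parse(String contents) {
        Map<String, String> stems = new LinkedHashMap<>();
        for (String line : contents.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("\t");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected word<TAB>stem: " + line);
            }
            stems.put(parts[0].trim(), parts[1].trim());
        }
        return stems;
    }

    public static void main(String[] args) {
        String file = "# word<TAB>stem\nfiets\tfiets\n";
        System.out.println(parse(file)); // prints {fiets=fiets}
    }
}
```
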
Thanks

-- 
Robert Muir
rcm...@gmail.com
