Hello Solr devs,

One thing we did recently in Lucene that I would like to expose in Solr is adding support for "protected words" to all stemmers.
The way this works is that a TokenStream attribute, 'KeywordAttribute', is set on a token, and all the stem filters know to ignore tokens that have this boolean value set.

We also added two neat token filters that make this easy to use:

* KeywordMarkerFilter: a token filter that, given a set of input words, marks them as keywords with this attribute so any later stemmer ignores them.
* StemmerOverrideFilter: a token filter that, given a map of input words -> stems, stems them with the dictionary and marks them as keywords so any later stemmer ignores them.

We have a few choices:

* We could treat this stuff as implementation details and add protwords.txt support to all stemming factories. Internally, each factory could just wrap its filter with a KeywordMarkerFilter.
* We could deprecate the explicit protwords.txt in the few factories that support it, and instead create a factory for KeywordMarkerFilter.
* We could do something else, e.g. both.

To illustrate, by adding a factory for the KeywordMarkerFilter, a user could do:

    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SomeStemmer"/>

and get the same effect, instead of us having to add support for protwords.txt to every single stem factory.

I don't really have a personal preference as to how we do it, but it would be cool to have a plan so we can add these factories and clean a few things up.

In any event, I think we should add a factory for the StemmerOverrideFilter, so someone can have a text file with exceptions; the Dutch handling for "fiets" comes to mind.

Thanks

--
Robert Muir
rcm...@gmail.com