Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

Tanguy Moal Wed, 16 May 2012 05:29:08 -0700

Any ideas, anyone?

I think this is important since this could produce weird results on
collections with numbers mixed in text.


From my understanding, there are a few options to address the issue:
1) Make *LightStemmer token-type aware so it doesn't try to stem things that
are not text (alpha/alphanum, whatever :)) : this puts the fix in the
faulty component but makes it dependent on the StandardTokenizer, which may
not be what people want...
2) Enable StandardTokenizer to mark NUM tokens with the keyword attribute so
that NUM tokens are not stemmed by contract (this could be a configuration
flag markNumbersWithKeywordAttribute=true, false by default)
3) Use a custom processor to mark NUM tokens as keywords (the solution I
chose since it doesn't require modifying Lucene/Solr's code base, it's a
very simple contrib module)

I chose solution #3.
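
For reference, the idea behind #3 can be sketched without any Lucene
dependency. This is a toy model in plain Java, not the contrib module
itself: Token, NumKeywordSketch and collapseRepeats are hypothetical names,
the "<NUM>" / "<ALPHANUM>" type strings are assumed to match what
StandardTokenizer assigns, and collapseRepeats is only a stand-in for the
stemmer's normalization of repeated characters in tokens longer than 4
characters.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these classes are Lucene/Solr classes.
final class Token {
    final String text;
    final String type;   // e.g. "<NUM>" or "<ALPHANUM>" (assumed tokenizer types)
    boolean keyword;     // when true, the stemmer must leave the token untouched

    Token(String text, String type) {
        this.text = text;
        this.type = type;
    }
}

final class NumKeywordSketch {
    // Option 3: flag every numeric token as a keyword before stemming.
    static void markNumbersAsKeywords(List<Token> tokens) {
        for (Token t : tokens) {
            if ("<NUM>".equals(t.type)) {
                t.keyword = true;
            }
        }
    }

    // Stand-in for the problematic normalization: for tokens longer than
    // 4 characters, collapse runs of the same character into one.
    static String collapseRepeats(String s) {
        if (s.length() <= 4) {
            return s;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i == 0 || c != s.charAt(i - 1)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Stand-in stemmer: skips keyword tokens, normalizes everything else.
    static List<String> stem(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.keyword ? t.text : collapseRepeats(t.text));
        }
        return out;
    }
}
```

In this model, calling markNumbersAsKeywords before stem leaves numeric
tokens intact while ordinary words are still normalized. In a real analysis
chain the same check would presumably live in a small TokenFilter placed
between the tokenizer and the stem filter, reading the token type and
setting the keyword attribute.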

Maybe #2 is the way to go since most people using
FrenchLightStemFilterFactory will also want to use StandardTokenizer...

Any advice is welcome

--
Tanguy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/FrenchLightStemFilterFactory-normalizing-tokens-longer-than-4-characters-and-having-repeated-charactt-tp3974148p3984080.html
Sent from the Solr - User mailing list archive at Nabble.com.
