Any ideas, anyone? I think this is important, since it could produce weird results on collections with numbers mixed into text.
From my understanding, there are a few options to address the issue:

1) Make *LightStemmer token-type aware so it doesn't try to stem things that are not text (alpha/alphanum, whatever :)): this pulls the fix into the faulty component, but makes it dependent on StandardTokenizer, which may not be what people want...

2) Enable StandardTokenizer to mark NUM tokens with the keyword attribute, so that NUM tokens are not stemmed by contract (this could be a configuration flag, markNumbersWithKeywordAttribute=true, false by default).

3) Use a custom processor to mark NUM tokens as keywords (the solution I chose, since it doesn't require modifying Lucene/Solr's code base; it's a very simple contrib module).

I chose solution #3. Maybe #2 is the way to go, since most people using FrenchLightStemFilterFactory will also want to use StandardTokenizer... Any advice is welcome.

--
Tanguy

--
View this message in context: http://lucene.472066.n3.nabble.com/FrenchLightStemFilterFactory-normalizing-tokens-longer-than-4-characters-and-having-repeated-charactt-tp3974148p3984080.html
Sent from the Solr - User mailing list archive at Nabble.com.
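For readers wondering what the custom processor in option #3 might look like: the post doesn't include Tanguy's actual code, but a minimal sketch of the idea is a TokenFilter that inspects the TypeAttribute set by StandardTokenizer and raises the KeywordAttribute on tokens typed `<NUM>`, so that downstream keyword-aware stemmers (such as FrenchLightStemFilter) skip them. The class name here is hypothetical, and this assumes a Lucene version whose StandardTokenizer emits the `<NUM>` token type:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * Hypothetical sketch of option #3: marks numeric tokens as keywords
 * so keyword-aware stemmers leave them untouched.
 */
public final class NumericKeywordMarkerFilter extends TokenFilter {

  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

  public NumericKeywordMarkerFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // StandardTokenizer types numeric runs as "<NUM>"; flag them as
    // keywords so stemmers honoring KeywordAttribute skip them.
    if ("<NUM>".equals(typeAtt.type())) {
      keywordAtt.setKeyword(true);
    }
    return true;
  }
}
```

In an analysis chain this would sit between the tokenizer and the stem filter (tokenizer → NumericKeywordMarkerFilter → FrenchLightStemFilter). Note this only works with a tokenizer that actually sets the `<NUM>` type, which is the dependency concern raised in option #1.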