Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

2012-05-16 Thread Tanguy Moal
> -----Original Message-----
> From: Tanguy Moal [mailto:tanguy.m...@gmail.com]
> Sent: Wednesday, May 16, 2012 8:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

RE: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

2012-05-16 Thread Steven A Rowe
ts the operation to chars 'k', 'p', and 't'.)

Thanks,
Steve

-----Original Message-----
From: Tanguy Moal [mailto:tanguy.m...@gmail.com]
Sent: Wednesday, May 16, 2012 8:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FrenchLightStemFilterFactory : normalizing to

Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

2012-05-16 Thread Robert Muir
On Wed, May 16, 2012 at 8:28 AM, Tanguy Moal wrote:
> Any ideas, anyone?
>
> I think this is important, since this could produce weird results on
> collections with numbers mixed in text.

I agree, I think we should just add '&& Character.isLetter(ch)' to the undoublet check? Thanks for bringing t
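For illustration, the guard suggested above could look roughly like the following. This is a standalone sketch, not the actual Lucene stemmer source: the class name `UndoubleSketch`, the method `undouble`, and the String-based signature are assumptions made for the example. Only the `Character.isLetter(ch)` condition and the "longer than 4 characters" threshold come from this thread.

```java
public class UndoubleSketch {

    // Hypothetical helper: collapses runs of a repeated character in tokens
    // longer than 4 characters, but only when the repeated character is a
    // letter, so numeric tokens are left untouched.
    static String undouble(String token) {
        if (token.length() <= 4) {
            return token; // short tokens are not normalized
        }
        StringBuilder out = new StringBuilder(token.length());
        char prev = token.charAt(0);
        out.append(prev);
        for (int i = 1; i < token.length(); i++) {
            char ch = token.charAt(i);
            // The proposed guard: skip a repeated char only if it is a letter.
            if (ch == prev && Character.isLetter(ch)) {
                continue;
            }
            out.append(ch);
            prev = ch;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(undouble("appelle")); // repeated letters collapse
        System.out.println(undouble("2000000")); // repeated digits are kept
    }
}
```

Without the `Character.isLetter` condition, a token such as `2000000` would have its repeated digits collapsed just like repeated letters, which is the kind of weird result on numbers mixed in text that the original report describes.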

Re: FrenchLightStemFilterFactory : normalizing tokens longer than 4 characters and having repeated characters in it

2012-05-16 Thread Tanguy Moal
Any ideas, anyone?

I think this is important, since this could produce weird results on collections with numbers mixed in text.

From my understanding, there are a few options to address the issue:

1) Make *LightStemmer token-type aware and don't try to stem things that are not text (alpha/alp