If you haven't seen it, this page is invaluable for this kind of question:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory

LetterTokenizerFactory may well be your friend here, followed by
LowerCaseFilterFactory. The problem is that it would split "nation's" into
"nation" and "s", so searching on "nations" wouldn't get a hit. But you
have equally ugly stuff with WhitespaceTokenizerFactory, as you're finding
out.

It's actually quite rare for simple tokenizers like these to be
satisfactory unless it's a field you can guarantee is indexed/searched in
a very controlled manner, say part numbers or words from a list.

In your example above, none of the three variants would get a hit if the
user searched for "nation". Is that what you want?

But no, Standard* don't have any stemming built in. And what do you mean
by "language specific functionality"? They do NOT fold accents, for
instance, if that's what you're getting at.

Could you explain a bit about *why* you want this behavior?

HTH
Erick

On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com> wrote:

> I'm hesitant to change Tokenizers at the moment because what we have is
> working so nicely - or so I thought.
>
> What I'm looking for is case-insensitive search for words and numbers
> without any of the stemming features turned on. The new requirement is
> that we take punctuation out of the mix.
>
> Right now when I search for "Obama" I'm not getting any hits on "Obama."
>
> So I'm basically looking to strip punctuation. The consequence would be
> that "nation's", "nations" and "nations," would all be represented the
> same way.
>
> Would the StandardTokenizerFactory accomplish this?
> Does it have any language specific functionality?
> Does it do anything with stemming?
>
> Thanks for everyone's input!
>
> -Dave
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Friday, January 15, 2010 12:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stripping Punctuation in a fieldType
>
> > I'm trying to find the best way to set up a fieldType that
> > strips punctuation.
>
> Use solr.StandardTokenizerFactory, which strips punctuation.
>
> Or if you do not care about alphanumeric or numeric queries use
> solr.LowerCaseTokenizerFactory, which uses LetterTokenizer.
>
> > I think the right way to do this is using a CharacterFilter
> > of some type, but I can't seem to find any examples of how
> > to set this up in a schema.xml file.
>
> If you want to use solr.MappingCharFilterFactory you need to write all
> punctuation characters to a text file manually, e.g. "," => ""
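
To make the trade-offs in this thread a bit more concrete, here is a rough Python sketch of how the three approaches treat the sample words. It is not Solr code: the function names are made up for illustration, and each function only approximates the named analysis chain (the regexes stand in for the real tokenizers and for a hand-written punctuation mapping file).

```python
import re

# Rough stand-ins for the Solr analysis chains discussed in this thread.
# These are approximations for illustration only, not the Lucene code.

def whitespace_lowercase(text):
    """Approximates WhitespaceTokenizerFactory + LowerCaseFilterFactory:
    split on whitespace only, so punctuation stays attached to the token."""
    return [tok.lower() for tok in text.split()]

def letter_lowercase(text):
    """Approximates LetterTokenizerFactory + LowerCaseFilterFactory:
    tokens are maximal runs of letters; digits and punctuation are dropped."""
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]

def strip_punct_then_whitespace(text):
    """Approximates a char filter that maps punctuation to "" (assumed
    mapping), followed by whitespace tokenization and lowercasing."""
    cleaned = re.sub(r"[^\w\s]", "", text)
    return [tok.lower() for tok in cleaned.split()]

if __name__ == "__main__":
    for sample in ["Obama.", "nation's", "nations,", "nations", "part 42"]:
        print(sample,
              whitespace_lowercase(sample),
              letter_lowercase(sample),
              strip_punct_then_whitespace(sample))
```

In this sketch only the third variant collapses "nation's", "nations" and "nations," to the same token while keeping numbers; the letter-only variant yields "nation" and "s" for "nation's" and drops "42" entirely.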