Re: Stripping Punctuation in a fieldType

Erick Erickson Fri, 15 Jan 2010 10:42:50 -0800

If you haven't seen it, this page is invaluable for this kind of question:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory

LetterTokenizerFactory may well be your friend here, followed by
LowerCaseFilterFactory. The catch is that it would split "nation's"
into "nation" and "s", so searching on "nations" wouldn't get a hit
on it.
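
To make the split concrete, here's a rough sketch in Python rather than
Solr itself. The function name is made up for illustration, and the regex
only approximates Lucene's letter test, so treat it as a picture of the
behavior, not the actual implementation:

```python
import re

def letter_lowercase_tokens(text):
    """Approximate a letter-only tokenizer followed by lowercasing."""
    # [^\W\d_]+ matches runs of word characters that are neither digits
    # nor underscores, i.e. (roughly) runs of letters.
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]

print(letter_lowercase_tokens("Obama."))      # ['obama']
print(letter_lowercase_tokens("nation's"))    # ['nation', 's']
print(letter_lowercase_tokens("nations,"))    # ['nations']
print(letter_lowercase_tokens("Boeing 747"))  # ['boeing']
```

Note the last line: a letter-only tokenizer drops digits entirely, which
matters if you need to search on numbers.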

But you have equally ugly stuff with WhitespaceTokenizerFactory,
as you're finding out.

It's actually quite rare for simple tokenizers like these to be
satisfactory unless it's a field you can guarantee is indexed/searched
in a very controlled manner, say part numbers or words from a list. In
your example above, none of the three variants would get a hit if the
user searched for "nation". Is that what you want?
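
Here's a minimal Python sketch of what I mean, assuming punctuation is
simply deleted before splitting on whitespace and lowercasing (roughly
what you describe wanting). The helper name is hypothetical and this is
not how Solr does it internally:

```python
import string

def strip_punct_tokens(text):
    """Delete ASCII punctuation, lowercase, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()

variants = ["nation's", "nations", "nations,"]
indexed = {tok for v in variants for tok in strip_punct_tokens(v)}

print(indexed)              # {'nations'}
print("nation" in indexed)  # False
```

All three variants collapse to "nations", so without stemming a query
for "nation" has nothing to match.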

But no, the Standard* factories don't have any stemming built in.
And what do you mean by "language specific functionality"?
They do NOT fold accents, for instance, if that's what
you're getting at.

Could you explain a bit about *why* you want this behavior?

HTH
Erick

On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com> wrote:

> I'm hesitant to change Tokenizers at the moment because what we have is
> working so nicely - or so I thought.
>
> What I'm looking for is case-insensitive search for words and numbers
> without any of the stemming features turned on. The new requirement is
> that we take punctuation out of the mix.
>
> Right now when I search for "Obama" I'm not getting any hits on "Obama."
>
> So I'm basically looking to strip punctuation. The consequence would be
> that "nation's", "nations" and "nations," would all be represented the
> same way.
>
> Would the StandardTokenizerFactory accomplish this?
> Does it have any language specific functionality?
> Does it do anything with stemming?
>
> Thanks for everyone's input!
>
> -Dave
>
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Friday, January 15, 2010 12:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stripping Punctuation in a fieldType
>
> > I'm trying to find the best way to set up a fieldType that
> > strips punctuation.
>
> Use solr.StandardTokenizerFactory, which strips punctuation.
>
> Or, if you do not care about alphanumeric or numeric queries, use
> solr.LowerCaseTokenizerFactory, which uses LetterTokenizer.
>
> > I think the right way to do this is using a CharacterFilter
> > of some type, but I can't seem to find any examples of how to set
> > this up in a schema.xml file.
>
> If you want to use solr.MappingCharFilterFactory you need to write all
> punctuation characters to a text file manually, e.g. "," => ""
>
>
>
>
