A couple of things...

First, you haven't provided any evidence that increasing the index size is
actually a concern. If your index isn't all that large, the extra terms
really don't matter, and conserving index size shouldn't drive your design.

WordDelimiterFilterFactory (WDFF) will handle the use cases you outlined
below, but don't get stuck on, for instance, having the '-' be a token
unless you can say for certain that it has benefits over indexing and
searching on just "123" followed by "4567", which is what would happen
with WDFF.

I recommend that you look at the analysis page (check the "verbose" box)
to see the effects of tokenization with various analysis chains before
making any firm decisions.
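
For what it's worth, the same-position indexing Tavi asks about below (two
token sets sharing positions in one field) can be simulated outside Solr.
This is a hypothetical plain-Python sketch, not the Lucene API: the
catenated token "123-4567" is placed at the same position as "4567", i.e.
with a position increment of zero, which is how WDFF's catenateWords option
places catenated terms, and then both phrase queries match:

```python
from collections import defaultdict

def build_index(tokens_with_positions):
    """Map position -> set of tokens occupying that position."""
    index = defaultdict(set)
    for token, pos in tokens_with_positions:
        index[pos].add(token)
    return index

def phrase_matches(index, phrase):
    """True if the query tokens occur at consecutive positions."""
    for start in sorted(index):
        if all(phrase[i] in index.get(start + i, set())
               for i in range(len(phrase))):
            return True
    return False

# "123-4567 apple": sub-tokens at positions 0, 1, 2; the catenated
# token "123-4567" shares position 2 with "4567".
index = build_index([
    ("123", 0), ("-", 1), ("4567", 2), ("123-4567", 2), ("apple", 3),
])

print(phrase_matches(index, ["123-4567", "apple"]))          # True
print(phrase_matches(index, ["123", "-", "4567", "apple"]))  # True
```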

Best
Erick

On Tue, Feb 8, 2011 at 6:24 PM, Tavi Nathanson <tavi.nathan...@gmail.com> wrote:

> Thanks for the suggestions! Using a new field makes sense, except it would
> double the size of the index. I'd like to add additional terms, at my
> discretion, only when there's ambiguity.
>
> More specifically, do you know of any way to put multiple *token sets* at
> the same position of the same field?
>
> If I can tokenize "123-4567 apple" as:
>
> [Token(123), Token(-), Token(4567), Token(apple)]
> or
> [Token(123-4567), Token(apple)]
>
> ...might there be a way to put [Token(123), Token(-), Token(4567)] *and*
> [Token(123-4567)]  in the index in such a way that the PhraseQuery
> "Token(123-4567) Token(apple)" would match the above string, *and* the
> PhraseQuery "Token(123) Token(-) Token(4567) Token(apple)" would also match
> it?
>
> Thanks!
> Tavi
>
> On Tue, Feb 8, 2011 at 10:34 AM, Em <mailformailingli...@yahoo.de> wrote:
>
> >
> > Hi Tavi,
> >
> > if you want to use multiple tokenization strategies (different
> > tokenizers, so to speak) you have to use different fieldTypes.
> >
> > You may have to create your own tokenizer to do what you want, or a
> > PatternTokenizer might help you.
> >
> > However, your examples for the different positions of specific terms
> > remind me of the WordDelimiterFilter (see
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
> > ).
> >
> > It does almost everything you described and is close to what you want,
> > I think. Have a look at it.
> >
> > Regards
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Tokenization-How-to-Allow-Multiple-Strategies-tp2452505p2453215.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
