Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Solr/Lucene does not appear to provide an easy mechanism for accounting for this fuzziness.
Let's take an example, where the document I'm indexing is "v1.1.0 mr. jones www.gmail.com". I may want to tokenize this as:

["v1.1.0", "mr", "jones", "www.gmail.com"]

...or I may want to tokenize it as:

["v1", "1.0", "mr", "jones", "www", "gmail.com"]

...or some other way entirely. I would think the best approach would be to index using multiple strategies at once, e.g.:

["v1.1.0", "v1", "1.0", "mr", "jones", "www.gmail.com", "www", "gmail.com"]

However, this would destroy phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to handle the case where you want to index a *sequence* of tokens in place of a single token; it's not even clear that makes sense positionally. For instance, I can't index the sequence ["www", "gmail.com"] at the same position as "www.gmail.com".

So:
- Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best?
- Is there a way to use multiple strategies *without* breaking phrase queries that I'm overlooking?

Thanks!
Tavi
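P.S. To make the collision concrete, here's a toy simulation. This is plain Python, not the Lucene API; `phrase_match` and the position layout are my own invention, just modeling "a phrase matches if its terms occur at consecutive positions" and what happens when split-strategy tokens are stacked onto the single-token positions.

```python
# Toy model (NOT Lucene): an index is a list of (term, position) pairs,
# and a phrase matches if its terms appear at consecutive positions.

def phrase_match(index, phrase):
    positions = {}
    for term, pos in index:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, set())
            for i, term in enumerate(phrase))
        for start in positions.get(phrase[0], set())
    )

# Single strategy: ["v1.1.0", "mr", "jones", "www.gmail.com"]
single = [("v1.1.0", 0), ("mr", 1), ("jones", 2), ("www.gmail.com", 3)]

# Naive stacking of the split strategy on top: "v1" can share position 0
# with "v1.1.0", but its continuation "1.0" has to land at position 1,
# colliding with "mr" (likewise "gmail.com" spills past "www.gmail.com").
stacked = single + [("v1", 0), ("1.0", 1), ("www", 3), ("gmail.com", 4)]

print(phrase_match(stacked, ["v1", "1.0"]))     # True -- the match we wanted
print(phrase_match(stacked, ["1.0", "jones"]))  # True -- a false positive!
print(phrase_match(single, ["1.0", "jones"]))   # False -- correct without stacking
```

So stacking buys the split-strategy phrase at the cost of phantom adjacencies, which is exactly the breakage I mean above.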