Hi Tavi,

solr-...@lucene.apache.org has been deprecated since the Lucene and Solr source 
trees merged last year.  Please use dev@lucene.apache.org instead.

However, your question is about *usage* of Lucene/Solr, rather than 
*development*, so you should be using solr-u...@lucene.apache.org or 
lucene-u...@lucene.apache.org.  Please repost your question to one of these 
lists.

Steve

> -----Original Message-----
> From: Tavi Nathanson [mailto:tavi.nathan...@gmail.com]
> Sent: Monday, February 07, 2011 12:12 PM
> To: solr-...@lucene.apache.org
> Subject: Tokenization and Fuzziness: How to Allow Multiple Strategies?
> 
> 
> Hey everyone,
> 
> Tokenization seems inherently fuzzy and imprecise, yet Lucene does not
> appear to provide an easy mechanism to account for this fuzziness.
> 
> Let's take an example, where the document I'm indexing is
> "v1.1.0 mr. jones da...@gmail.com"
> 
> I may want to tokenize this as follows: ["v1.1.0", "mr", "jones",
> "da...@gmail.com"]
> ...or I may want to tokenize this as follows: ["v1", "1.0", "mr", "jones",
> "david", "gmail.com"]
> ...or I may want to tokenize it another way.
> 
> I would think that the best approach would be indexing using multiple
> strategies, such as:
> 
> ["v1.1.0", "v1", "1.0", "mr", "jones", "da...@gmail.com", "david",
> "gmail.com"]
> 
> However, this would destroy phrase queries. And while Lucene lets you index
> multiple tokens at the same position, I haven't found a way to deal with
> cases where you want to index a set of tokens at one position; nor does that
> even make sense. For instance, I can't index ["david", "gmail.com"] in the
> same position as "da...@gmail.com".
> 
> So:
> 
> - Any thoughts, in general, about how you all approach this fuzziness? Do
> you just choose one tokenization strategy and hope for the best?
> - Might there be a way to use multiple strategies and *not* break phrase
> queries that I'm overlooking?
> 
> Thanks!
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-Allow-Multiple-Strategies-tp2444956p2444956.html
> Sent from the Solr - Dev mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
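
For anyone following the quoted question, the limitation it describes can be sketched with a toy positional index: stacking tokens at one position works, but a two-token alternative cannot share a single position with a one-token form. This is a simplified Python model written for illustration, not Lucene code. `build_positions` and `phrase_match` are hypothetical helpers, and the increments only mimic the idea of a position increment, where 0 places a token at the same position as the previous one.

```python
from collections import defaultdict


def build_positions(tokens):
    """Map each term to the set of positions it occupies.

    tokens is a list of (term, position_increment) pairs; an increment
    of 0 stacks the term on the previous token's position.
    """
    index = defaultdict(set)
    pos = -1
    for term, inc in tokens:
        pos += inc
        index[term].add(pos)
    return index


def phrase_match(index, phrase):
    """Return True if the terms of phrase occur at consecutive positions."""
    first, *rest = phrase
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(rest, 1))
        for start in index.get(first, ())
    )


# One alternative stacked on the original token: both phrase forms work.
stacked = build_positions([("v1.1.0", 1), ("v1", 0), ("mr", 1), ("jones", 1)])
print(phrase_match(stacked, ["v1.1.0", "mr"]))  # True
print(phrase_match(stacked, ["v1", "mr"]))      # True

# A two-token alternative ("v1", "1.0") needs two positions, which pushes
# "mr" one position further away from the single token "v1.1.0".
split = build_positions(
    [("v1.1.0", 1), ("v1", 0), ("1.0", 1), ("mr", 1), ("jones", 1)]
)
print(phrase_match(split, ["v1", "1.0", "mr"]))  # True
print(phrase_match(split, ["v1.1.0", "mr"]))     # False
```

In this model, giving "1.0" an increment of 0 instead would restore the "v1.1.0 mr" phrase but break "v1 1.0", which is the trade-off the question is asking about.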
