Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Solr/Lucene does not appear to provide an easy mechanism for accounting for this fuzziness.
Let's take an example, where the document I'm indexing is "v1.1.0 mr. jones www.gmail.com". I may want to tokenize this as:

["v1.1.0", "mr", "jones", "www.gmail.com"]

...or I may want to tokenize it as:

["v1", "1.0", "mr", "jones", "www", "gmail.com"]

...or some other way entirely. I would think the best approach would be to index using multiple strategies at once, e.g.:

["v1.1.0", "v1", "1.0", "mr", "jones", "www.gmail.com", "www", "gmail.com"]

However, this would destroy phrase queries. And while Lucene lets you index multiple tokens at the same position, I haven't found a way to handle the case where you want to index a *sequence* of tokens in place of a single token; it's not even clear that makes sense positionally. For instance, I can't index the sequence ["www", "gmail.com"] at the same position as "www.gmail.com".

So:
- Any thoughts, in general, about how you all approach this fuzziness? Do you just choose one tokenization strategy and hope for the best?
- Is there a way to use multiple strategies *without* breaking phrase queries that I'm overlooking?

Thanks!
Tavi
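P.S. To make the collision concrete, here's a toy simulation. This is plain Python, not the Lucene API; `phrase_match` and the position layout are my own invention, just modeling "a phrase matches if its terms occur at consecutive positions" and what happens when split-strategy tokens are stacked onto the single-token positions.

```python
# Toy model (NOT Lucene): an index is a list of (term, position) pairs,
# and a phrase matches if its terms appear at consecutive positions.

def phrase_match(index, phrase):
    positions = {}
    for term, pos in index:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + i in positions.get(term, set())
            for i, term in enumerate(phrase))
        for start in positions.get(phrase[0], set())
    )

# Single strategy: ["v1.1.0", "mr", "jones", "www.gmail.com"]
single = [("v1.1.0", 0), ("mr", 1), ("jones", 2), ("www.gmail.com", 3)]

# Naive stacking of the split strategy on top: "v1" can share position 0
# with "v1.1.0", but its continuation "1.0" has to land at position 1,
# colliding with "mr" (likewise "gmail.com" spills past "www.gmail.com").
stacked = single + [("v1", 0), ("1.0", 1), ("www", 3), ("gmail.com", 4)]

print(phrase_match(stacked, ["v1", "1.0"]))     # True -- the match we wanted
print(phrase_match(stacked, ["1.0", "jones"]))  # True -- a false positive!
print(phrase_match(single, ["1.0", "jones"]))   # False -- correct without stacking
```

So stacking buys the split-strategy phrase at the cost of phantom adjacencies, which is exactly the breakage I mean above.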