On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll <[email protected]> wrote:
>
> Unfortunately, I'm not following your reasons for doing it. I won't say I'm
> against it at this point, but I don't see a compelling reason to change it
> either, so if you could clarify that would be great. It's been around for
> quite some time in its current form and I think fits most people's
> expectations of ngrams.
Grant, I'm sorry, but I couldn't disagree more. There are many variations on n-gram tokenization (word-internal, word-spanning, skipgrams), as well as flexibility in what should count as a "word character" and what should not (e.g. punctuation), and in how the specifics of these are handled. But our n-gram tokenizer is *UNARGUABLY* completely broken, for these reasons:

1. It discards anything after the first 1024 code units of the document.
2. It uses partial characters (UTF-16 code units) as its fundamental unit, potentially creating lots of invalid Unicode.
3. It forms n-grams in the wrong order, which contributes to #1.

I explained this in LUCENE-1224. It's for these reasons that I suggested we completely rewrite it... people who are just indexing English documents with < 1024 chars per document and don't care about these things can use ClassicNGramTokenizer.
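
To make #2 concrete, here is a minimal, self-contained sketch (not taken from the tokenizer itself, just an illustration of the underlying problem): slicing a String by UTF-16 code units can cut a surrogate pair in half and emit grams that are not valid Unicode, whereas code-point-aware slicing keeps supplementary characters intact.

// Illustration only: shows how code-unit slicing breaks supplementary
// characters, and how code-point-aware slicing avoids it.
public class SurrogateSplitDemo {
    public static void main(String[] args) {
        // "a" + U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G, a surrogate pair) + "b"
        String s = "a\uD835\uDD0Ab";

        // Naive 2-gram by code units: the gram ends in the middle of the
        // surrogate pair, so it contains an unpaired high surrogate.
        String badGram = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(badGram.charAt(1))); // true -> invalid text

        // Code-point-aware slicing: advance by code points, not chars,
        // so the supplementary character is never split.
        int afterFirst = s.offsetByCodePoints(0, 1);          // index after "a"
        int afterSecond = s.offsetByCodePoints(afterFirst, 1); // index after U+1D50A
        String goodGram = s.substring(0, afterSecond);          // "a" + U+1D50A
        System.out.println(goodGram.codePointCount(0, goodGram.length())); // 2
    }
}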
