On Thu, Mar 3, 2011 at 2:06 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> On Mar 3, 2011, at 1:10 PM, Robert Muir wrote:
>
>> On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>
>>> Unfortunately, I'm not following your reasons for doing it. I won't say I'm
>>> against it at this point, but I don't see a compelling reason to change it
>>> either, so if you could clarify that would be great. It's been around for
>>> quite some time in its current form and I think fits most people's
>>> expectations of ngrams.
>>
>> Grant, I'm sorry, but I couldn't disagree more.
>>
>> There are many variations on ngram tokenization (word-internal,
>> word-spanning, skipgrams), besides allowing flexibility over what
>> should count as a "word character" and what should not (e.g.
>> punctuation), and how to handle the specifics of these.
>>
>> But our n-gram tokenizer is *UNARGUABLY* completely broken, for these reasons:
>> 1. it discards anything after the first 1024 code units of the document.
>> 2. it uses partial characters (UTF-16 code units) as its fundamental
>> measure, potentially creating lots of invalid Unicode.
>> 3. it forms n-grams in the wrong order, contributing to #1. I
>> explained this in LUCENE-1224.
>
> Sure, but those are ancillary to the whitespace question that was asked about.
>
Not really? It's the more general form of the whitespace question. I'm saying you should be able to say 'this is part of a word', but then also specify whether you want to fold runs of "non-characters" into a single thing (e.g. '_'), into nothing at all, or whatever. Additionally, NGramTokenizer should support an option to treat the "start" and "end" of the string as "non-characters"... in my opinion that should be the default, and not doing it is the root cause of Dave's issue?
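
To make that concrete, here's a rough standalone sketch of what I mean (plain Java, not a patch against NGramTokenizer; the '_' sentinel and the padBoundaries flag are just my own illustration, not existing options). It also walks the input by code point rather than by UTF-16 code unit, which is the fix for point 2 above:

// Standalone sketch only: grams are formed over code points, never raw
// UTF-16 code units, and the string boundaries can be padded with a
// sentinel so "start" and "end" behave like non-characters.
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
  public static List<String> ngrams(String text, int n, boolean padBoundaries) {
    // Collect code points so a supplementary character (surrogate pair)
    // can never be cut in half, which would produce invalid Unicode.
    List<Integer> cps = new ArrayList<>();
    final int sentinel = '_';               // illustrative boundary marker
    if (padBoundaries) cps.add(sentinel);
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      cps.add(cp);
      i += Character.charCount(cp);
    }
    if (padBoundaries) cps.add(sentinel);

    List<String> grams = new ArrayList<>();
    for (int start = 0; start + n <= cps.size(); start++) {
      StringBuilder sb = new StringBuilder();
      for (int j = 0; j < n; j++) {
        sb.appendCodePoint(cps.get(start + j));
      }
      grams.add(sb.toString());
    }
    return grams;
  }

  public static void main(String[] args) {
    System.out.println(ngrams("dave", 3, true));
  }
}

With padding on, "dave" at n=3 yields _da, dav, ave, ve_, so a gram anchored at a word edge can still match; without padding you only get dav and ave, which is exactly the boundary behavior I think Dave is running into.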