On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll <[email protected]> wrote:
>
> Unfortunately, I'm not following your reasons for doing it. I won't say I'm
> against it at this point, but I don't see a compelling reason to change it
> either, so if you could clarify that would be great. It's been around for
> quite some time in its current form and I think fits most people's
> expectations of ngrams.
Grant, I'm sorry, but I couldn't disagree more. There are many variations on n-gram tokenization (word-internal, word-spanning, skipgrams), as well as flexibility in what should count as a "word character" and what should not (e.g. punctuation), and in how the specifics of these are handled. But our n-gram tokenizer is *UNARGUABLY* completely broken, for these reasons:

1. It discards anything after the first 1024 code units of the document.
2. It uses partial characters (UTF-16 code units) as its fundamental unit, potentially creating lots of invalid Unicode.
3. It forms n-grams in the wrong order, which contributes to #1.

I explained this in LUCENE-1224. It's for these reasons that I suggested we completely rewrite it... people who are just indexing English documents with < 1024 chars per document and don't care about these things can use ClassicNGramTokenizer.
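
To make #2 concrete, here is a minimal, self-contained sketch (not taken from the tokenizer itself, just an illustration of the underlying problem): slicing a String by UTF-16 code units can cut a surrogate pair in half and emit grams that are not valid Unicode, whereas code-point-aware slicing keeps supplementary characters intact.

// Illustration only: shows how code-unit slicing breaks supplementary
// characters, and how code-point-aware slicing avoids it.
public class SurrogateSplitDemo {
    public static void main(String[] args) {
        // "a" + U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G, a surrogate pair) + "b"
        String s = "a\uD835\uDD0Ab";

        // Naive 2-gram by code units: the gram ends in the middle of the
        // surrogate pair, so it contains an unpaired high surrogate.
        String badGram = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(badGram.charAt(1))); // true -> invalid text

        // Code-point-aware slicing: advance by code points, not chars,
        // so the supplementary character is never split.
        int afterFirst = s.offsetByCodePoints(0, 1);          // index after "a"
        int afterSecond = s.offsetByCodePoints(afterFirst, 1); // index after U+1D50A
        String goodGram = s.substring(0, afterSecond);          // "a" + U+1D50A
        System.out.println(goodGram.codePointCount(0, goodGram.length())); // 2
    }
}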
