On Thu, Mar 3, 2011 at 2:06 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> On Mar 3, 2011, at 1:10 PM, Robert Muir wrote:
>
>> On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>
>>> Unfortunately, I'm not following your reasons for doing it.  I won't say I'm
>>> against it at this point, but I don't see a compelling reason to change it
>>> either, so if you could clarify that would be great.  It's been around for
>>> quite some time in its current form and I think it fits most people's
>>> expectations of ngrams.
>>
>> Grant, I'm sorry, but I couldn't disagree more.
>>
>> There are many variations on ngram tokenization (word-internal,
>> word-spanning, skipgrams), besides allowing flexibility for what
>> should be a "word character" and what should not be (e.g.
>> punctuation), and how to handle the specifics of these.
>>
>> But our n-gram tokenizer is *UNARGUABLY* completely broken for these reasons:
>> 1. it discards anything after the first 1024 code units of the document.
>> 2. it uses partial characters (UTF-16 code units) as its fundamental
>> measure, potentially creating lots of invalid unicode.
>> 3. it forms n-grams in the wrong order, contributing to #1. I
>> explained this in LUCENE-1224
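
[To make #2 concrete, here is a standalone sketch, not the Lucene tokenizer
itself: forming grams by UTF-16 code units can cut a surrogate pair in half,
leaving a lone surrogate that is not valid Unicode.]

```java
// Standalone sketch (not Lucene code): measuring "grams" in UTF-16 code
// units can split a supplementary character across two grams.
public class CodeUnitGrams {
    public static void main(String[] args) {
        // 'a' + U+1F600 (ONE code point, TWO code units) + 'b'
        String s = "a\uD83D\uDE00b";
        // A 2-gram measured in code units:
        String gram = s.substring(0, 2);
        // The gram ends with an unpaired high surrogate -- invalid Unicode.
        System.out.println(Character.isHighSurrogate(gram.charAt(1))); // true
    }
}
```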
>
> Sure, but those are ancillary to the whitespace question that was asked about.
>

Not really? It's the more general form of the whitespace question.

I'm saying you should be able to say 'this is part of a word', but
then also specify whether you want to fold runs of "non-characters"
into a single thing (e.g. '_'), into nothing at all, or whatever.

Additionally, NGramTokenizer should also support an option to treat the
"start" and "end" of the string as "non-characters"... in my opinion this
should be the default, and it is the root cause of Dave's issue?
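
[A sketch of what I mean, using a hypothetical helper rather than any
actual Lucene option: collapse runs of non-word characters into a single
'_', treat the start and end of the string as non-characters too, then
form bigrams over the result, stepping by code point so surrogate pairs
stay intact.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed behavior -- not an existing Lucene API.
public class FoldedPaddedGrams {
    // Collapse runs of non-letter/digit characters into one '_',
    // and pad both ends, so word boundaries show up in the grams.
    static String fold(String text) {
        return "_" + text.replaceAll("[^\\p{L}\\p{N}]+", "_") + "_";
    }

    static List<String> bigrams(String text) {
        String t = fold(text);
        List<String> grams = new ArrayList<>();
        int i = 0;
        while (i < t.length()) {
            // Step by code point, never by code unit.
            int next = i + Character.charCount(t.codePointAt(i));
            if (next >= t.length()) break;
            grams.add(t.substring(i, next + Character.charCount(t.codePointAt(next))));
            i = next;
        }
        return grams;
    }

    public static void main(String[] args) {
        // Boundary grams like "_c" and "t_" make word edges searchable.
        System.out.println(bigrams("red cat"));
        // [_r, re, ed, d_, _c, ca, at, t_]
    }
}
```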

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
