Grant,

To explain the advantage:
Trigrams for "foo bar":   'foo', 'oo ', 'o b', ' ba', 'bar'
Trigrams for "bar foo":   'bar', 'ar ', 'r f', ' fo', 'foo'

Only two of the eight unique trigrams match.

Trigrams for " foo bar ": ' fo', 'foo', 'oo ', 'o b', ' ba', 'bar', 'ar '
Trigrams for " bar foo ": ' ba', 'bar', 'ar ', 'r f', ' fo', 'foo', 'oo '

Six of the eight unique trigrams match. (A short sketch at the end of this
message computes these overlaps.)

I can't do the character replacement up front, because foo and bar might
already contain whitespace themselves. Anyway, it's a hack, and if my
arbitrary character ever gets introduced into the data I am in trouble.
Not only is this behavior undocumented, it seems unintentional if you look
at the comments in the code.

FYI, I opened an issue regarding this: http://bit.ly/eqhTO1

On Mar 3, 2011 1:00 PM, "Grant Ingersoll" <[email protected]> wrote:
>
> On Mar 3, 2011, at 9:36 AM, David Byrne wrote:
>
>> I have a minor quibble about Lucene's NGramTokenizer.
>>
>> Before I tokenize my strings, I am padding them with whitespace:
>>
>>     String foobar = " " + foo + " " + bar + " ";
>>
>> When constructing term vectors from n-grams, this strategy has a couple
>> of benefits. First, it places special emphasis on the start and end of
>> a word. Second, it improves the similarity between phrases with swapped
>> words: " foo bar " matches " bar foo " more closely than "foo bar"
>> matches "bar foo".
>
> I'm not following this argument. What does the extra whitespace give you
> here?
>
>> The problem is that Lucene's NGramTokenizer trims whitespace. This
>> forces me to do some preprocessing on my strings before I can tokenize
>> them:
>>
>>     foobar = foobar.replace(' ', '$'); // arbitrary char not in my data
>
> I'm confused. If you are padding them up front, then why don't you just
> do the arbitrary-char trick then? Where is the extra processing?
>
>> This is undocumented, so users won't realize their strings are being
>> trim()'ed unless they look through the source or examine the tokens
>> manually.
>
> It may be undocumented, but I think it is pretty standard as to what
> users expect out of a tokenizer.
>
>> I am proposing that NGramTokenizer should be changed to respect
>> whitespace. Is there a compelling reason against this?
>
> Unfortunately, I'm not following your reasons for doing it. I won't say
> I'm against it at this point, but I don't see a compelling reason to
> change it either, so if you could clarify that would be great. It's been
> around for quite some time in its current form, and I think it fits most
> people's expectations of n-grams.
>
> -Grant
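P.S. For concreteness, here is a minimal, self-contained Java sketch of
the overlap computation quoted above. This is plain Java, not Lucene; the
class and method names are my own, purely for illustration:

    import java.util.HashSet;
    import java.util.Set;

    public class TrigramOverlap {

        // Collect every character trigram of s into a set.
        static Set<String> trigrams(String s) {
            Set<String> grams = new HashSet<String>();
            for (int i = 0; i + 3 <= s.length(); i++) {
                grams.add(s.substring(i, i + 3));
            }
            return grams;
        }

        // Fraction of the unique trigrams across both strings that match.
        static double overlap(String a, String b) {
            Set<String> ga = trigrams(a);
            Set<String> gb = trigrams(b);
            Set<String> union = new HashSet<String>(ga);
            union.addAll(gb);
            ga.retainAll(gb); // ga now holds only the matching trigrams
            return (double) ga.size() / union.size();
        }

        public static void main(String[] args) {
            System.out.println(overlap("foo bar", "bar foo"));     // 2/8 = 0.25
            System.out.println(overlap(" foo bar ", " bar foo ")); // 6/8 = 0.75
        }
    }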
