On Mar 3, 2011, at 9:36 AM, David Byrne wrote:
> I have a minor quibble about Lucene's NGramTokenizer.
>
> Before I tokenize my strings, I am padding them with white space:
>
> String foobar = " " + foo + " " + bar + " ";
>
> When constructing term vectors from ngrams, this strategy has a couple of
> benefits. First, it places special emphasis on the start and end of a
> word. Second, it improves the similarity between phrases with swapped words.
> " foo bar " matches " bar foo " more closely than "foo bar" matches "bar
> foo".
>
>
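(For concreteness, the overlap you are describing can be sketched with a hand-rolled bigram extractor in plain Java, ignoring Lucene's tokenizer entirely; on the strings above, the padded pair shares all eight of its bigrams, while the unpadded pair shares only four of its six.)

import java.util.LinkedHashSet;
import java.util.Set;

public class PaddedBigramSketch {

    // Collect the character bigrams of a string into a set.
    static Set<String> bigrams(String s) {
        Set<String> grams = new LinkedHashSet<String>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Count the bigrams two strings have in common.
    static int shared(String a, String b) {
        Set<String> common = bigrams(a);
        common.retainAll(bigrams(b));
        return common.size();
    }

    public static void main(String[] args) {
        // Padded: both strings yield the same 8 bigrams
        // (" f", "fo", "oo", "o ", " b", "ba", "ar", "r "), so all 8 are shared.
        System.out.println(shared(" foo bar ", " bar foo ")); // 8
        // Unpadded: each string yields 6 bigrams, only 4 of which are shared
        // ("fo", "oo", "ba", "ar").
        System.out.println(shared("foo bar", "bar foo"));     // 4
    }
}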
I'm not following this argument. What does the extra whitespace give you here?
> The problem is that Lucene's NGramTokenizer trims whitespace. This forces me
> to do some preprocessing on my strings before I can tokenize them:
>
> foobar = foobar.replace(' ', '$'); // arbitrary char not in my data
>
>
I'm confused. If you are padding them up front, why not just use the
arbitrary character at that point? Where is the extra processing?
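I.e., something along these lines, just as a sketch (reusing the '$' placeholder and the foo/bar variables from your snippets above; the gram size of 2 is arbitrary):

// Build the padded string with the sentinel character directly, rather than
// padding with spaces and then running a replace over the whole string.
// ('$' is, as before, assumed not to occur in the data.)
String foobar = "$" + foo + "$" + bar + "$";

// Feed it straight to org.apache.lucene.analysis.ngram.NGramTokenizer
// (the constructor that takes a Reader plus min/max gram sizes).
NGramTokenizer tokenizer = new NGramTokenizer(new java.io.StringReader(foobar), 2, 2);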
> This is undocumented, so users won't realize their strings are being
> trim()'ed unless they look through the source or examine the tokens
> manually.
>
>
It may be undocumented, but I think it matches what most users expect from a
tokenizer.
> I am proposing NGramTokenizer should be changed to respect whitespace. Is
> there a compelling reason against this?
>
>
Unfortunately, I'm not following your reasons for doing it. I won't say I'm
against it at this point, but I don't see a compelling reason to change it
either, so if you could clarify that would be great. It's been around for quite
some time in its current form, and I think it fits most people's expectations of
ngrams.
-Grant