Grant,

To explain the advantage:
Trigrams for "foo bar":   'foo', 'oo ', 'o b', ' ba', 'bar'
Trigrams for "bar foo":   'bar', 'ar ', 'r f', ' fo', 'foo'

Only two of the eight unique trigrams match.

Trigrams for " foo bar ": ' fo', 'foo', 'oo ', 'o b', ' ba', 'bar', 'ar '
Trigrams for " bar foo ": ' ba', 'bar', 'ar ', 'r f', ' fo', 'foo', 'oo '

Six of the eight unique trigrams match. (A short sketch at the end of this
message computes these overlaps.)

I can't do the character replacement up front, because foo and bar might
already contain whitespace themselves. Anyway, it's a hack, and if my
arbitrary character ever gets introduced into the data I am in trouble.
Not only is this behavior undocumented, it seems unintentional if you look
at the comments in the code.

FYI, I opened an issue regarding this: http://bit.ly/eqhTO1

On Mar 3, 2011 1:00 PM, "Grant Ingersoll" <[email protected]> wrote:
>
> On Mar 3, 2011, at 9:36 AM, David Byrne wrote:
>
>> I have a minor quibble about Lucene's NGramTokenizer.
>>
>> Before I tokenize my strings, I am padding them with whitespace:
>>
>>     String foobar = " " + foo + " " + bar + " ";
>>
>> When constructing term vectors from n-grams, this strategy has a couple
>> of benefits. First, it places special emphasis on the start and end of
>> a word. Second, it improves the similarity between phrases with swapped
>> words: " foo bar " matches " bar foo " more closely than "foo bar"
>> matches "bar foo".
>
> I'm not following this argument. What does the extra whitespace give you
> here?
>
>> The problem is that Lucene's NGramTokenizer trims whitespace. This
>> forces me to do some preprocessing on my strings before I can tokenize
>> them:
>>
>>     foobar = foobar.replace(' ', '$'); // arbitrary char not in my data
>
> I'm confused. If you are padding them up front, then why don't you just
> do the arbitrary-char trick then? Where is the extra processing?
>
>> This is undocumented, so users won't realize their strings are being
>> trim()'ed unless they look through the source or examine the tokens
>> manually.
>
> It may be undocumented, but I think it is pretty standard as to what
> users expect out of a tokenizer.
>
>> I am proposing that NGramTokenizer should be changed to respect
>> whitespace. Is there a compelling reason against this?
>
> Unfortunately, I'm not following your reasons for doing it. I won't say
> I'm against it at this point, but I don't see a compelling reason to
> change it either, so if you could clarify that would be great. It's been
> around for quite some time in its current form, and I think it fits most
> people's expectations of n-grams.
>
> -Grant
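P.S. For concreteness, here is a minimal, self-contained Java sketch of
the overlap computation quoted above. This is plain Java, not Lucene; the
class and method names are my own, purely for illustration:

    import java.util.HashSet;
    import java.util.Set;

    public class TrigramOverlap {

        // Collect every character trigram of s into a set.
        static Set<String> trigrams(String s) {
            Set<String> grams = new HashSet<String>();
            for (int i = 0; i + 3 <= s.length(); i++) {
                grams.add(s.substring(i, i + 3));
            }
            return grams;
        }

        // Fraction of the unique trigrams across both strings that match.
        static double overlap(String a, String b) {
            Set<String> ga = trigrams(a);
            Set<String> gb = trigrams(b);
            Set<String> union = new HashSet<String>(ga);
            union.addAll(gb);
            ga.retainAll(gb); // ga now holds only the matching trigrams
            return (double) ga.size() / union.size();
        }

        public static void main(String[] args) {
            System.out.println(overlap("foo bar", "bar foo"));     // 2/8 = 0.25
            System.out.println(overlap(" foo bar ", " bar foo ")); // 6/8 = 0.75
        }
    }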
