On Mar 3, 2011, at 9:36 AM, David Byrne wrote:

> I have a minor quibble about Lucene's NGramTokenizer.
>
> Before I tokenize my strings, I am padding them with whitespace:
>
> String foobar = " " + foo + " " + bar + " ";
>
> When constructing term vectors from ngrams, this strategy has a couple of
> benefits. First, it places special emphasis on the start and end of a word.
> Second, it improves the similarity between phrases with swapped words:
> " foo bar " matches " bar foo " more closely than "foo bar" matches
> "bar foo".
>
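(For concreteness, the overlap effect David describes can be checked with a small self-contained sketch. This is plain Java rather than Lucene's tokenizer, and the gram size of 3 and the example strings are assumptions for illustration only:)

    import java.util.HashSet;
    import java.util.Set;

    public class PaddingDemo {

        // Collect the set of character n-grams of length n from s.
        static Set<String> ngrams(String s, int n) {
            Set<String> grams = new HashSet<String>();
            for (int i = 0; i + n <= s.length(); i++) {
                grams.add(s.substring(i, i + n));
            }
            return grams;
        }

        // Count the distinct n-grams shared by a and b.
        static int overlap(String a, String b, int n) {
            Set<String> shared = ngrams(a, n);
            shared.retainAll(ngrams(b, n));
            return shared.size();
        }

        public static void main(String[] args) {
            // Padded: boundary grams like " fo", "oo ", " ba", "ar " survive the swap.
            System.out.println(overlap(" foo bar ", " bar foo ", 3)); // prints 6
            // Unpadded: only grams entirely inside "foo" or "bar" survive.
            System.out.println(overlap("foo bar", "bar foo", 3));     // prints 2
        }
    }

(With the padding, the two phrases share six trigrams; without it they share only "foo" and "bar", which is the extra similarity the padding buys.)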
I'm not following this argument. What does the extra whitespace give you here?

> The problem is that Lucene's NGramTokenizer trims whitespace. This forces
> me to do some preprocessing on my strings before I can tokenize them:
>
> foobar = foobar.replace(" ", "$"); // arbitrary char not in my data
>
I'm confused. If you are padding them up front, then why don't you just do the arbitrary-char trick then? Where is the extra processing?

> This is undocumented, so users won't realize their strings are being
> trim()'ed unless they look through the source or examine the tokens
> manually.
>
It may be undocumented, but I think it is pretty standard as to what users expect out of a tokenizer.

> I am proposing that NGramTokenizer should be changed to respect whitespace.
> Is there a compelling reason against this?
>
Unfortunately, I'm not following your reasons for doing it. I won't say I'm against it at this point, but I don't see a compelling reason to change it either, so if you could clarify, that would be great. It's been around for quite some time in its current form, and I think it fits most people's expectations of ngrams.

-Grant
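(For reference, the masking workaround under discussion would look roughly like this. This is an untested sketch against the Lucene 3.x contrib API, where TermAttribute is the term accessor; it assumes '$' never occurs in the data, as David notes:)

    import java.io.StringReader;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class SentinelDemo {
        public static void main(String[] args) throws Exception {
            // Pad, then mask the spaces so the tokenizer's trim() can't strip them.
            String masked = " foo bar ".replace(' ', '$');

            // Emit character trigrams over the masked string.
            NGramTokenizer tok = new NGramTokenizer(new StringReader(masked), 3, 3);
            TermAttribute term = tok.addAttribute(TermAttribute.class);
            while (tok.incrementToken()) {
                // Unmask when reading the grams back out.
                System.out.println("[" + term.term().replace('$', ' ') + "]");
            }
            tok.end();
            tok.close();
        }
    }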