On Mar 3, 2011, at 9:36 AM, David Byrne wrote:

> I have a minor quibble about Lucene's NGramTokenizer.
> Before I tokenize my strings, I am padding them with white space:
> String foobar = " " + foo + " " + bar + " ";
> When constructing term vectors from ngrams, this strategy has a couple of
> benefits.  First, it places special emphasis on the starting and ending of a
> word.  Second, it improves the similarity between phrases with swapped words.
> " foo bar " matches " bar foo " more closely than "foo bar" matches "bar foo".
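
To make the padding claim concrete, here is a minimal sketch that does not use Lucene at all. The class PaddedNGrams and its methods are hypothetical names, and Jaccard overlap of distinct character bigrams is only an assumed stand-in for whatever term-vector similarity is actually being used:

```java
import java.util.HashSet;
import java.util.Set;

public class PaddedNGrams {
    // Distinct character n-grams of s. Nothing is trimmed, so whitespace is kept.
    public static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard overlap: size of the intersection divided by size of the union.
    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        int n = 2; // bigrams, chosen only for illustration
        // Unpadded: 4 shared bigrams out of 8 distinct ones, prints 0.5
        System.out.println(jaccard(ngrams("foo bar", n), ngrams("bar foo", n)));
        // Padded: both strings yield the same 8 bigrams, prints 1.0
        System.out.println(jaccard(ngrams(" foo bar ", n), ngrams(" bar foo ", n)));
    }
}
```

With bigrams the padded strings produce identical gram sets, because the leading and trailing spaces supply the word-boundary grams (" f", "r ", and so on) that the unpadded strings only get at the single interior space.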

I'm not following this argument.  What does the extra whitespace give you here? 

> The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
> to do some preprocessing on my strings before I can tokenize them:
> foobar.replaceAll(" ","$"); //arbitrary char not in my data

I'm confused.  If you are padding them up front, then why don't you just do the 
arbitrary char trick then?  Where is the extra processing?
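
Something like the following is what I have in mind: a rough sketch only, where the class name, the pad() helper, and the choice of '$' as the placeholder are all assumptions for illustration. It uses String.replace(char, char) rather than replaceAll, since replaceAll treats a '$' in the replacement string as a group reference:

```java
public class PlaceholderPadding {
    // Hypothetical placeholder, assumed never to occur in the real data.
    static final char PAD = '$';

    // Pad and substitute in one step, so the tokenizer never sees whitespace.
    public static String pad(String foo, String bar) {
        String padded = " " + foo + " " + bar + " ";
        // Strings are immutable, so the result of replace() must be returned or assigned.
        return padded.replace(' ', PAD);
    }

    public static void main(String[] args) {
        System.out.println(pad("foo", "bar")); // prints $foo$bar$
    }
}
```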

> This is undocumented, so users won't realize their strings are being 
> trim()'ed, unless they look through the source, or examine the tokens 
> manually.

It may be undocumented, but I think trimming is pretty standard behavior and 
is what users expect out of a tokenizer.

> I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
> there a compelling reason against this?

Unfortunately, I'm not following your reasons for doing it.  I won't say I'm 
against it at this point, but I don't see a compelling reason to change it 
either, so if you could clarify, that would be great.  It's been around for 
quite some time in its current form and I think fits most people's expectations of 

