On Mar 3, 2011, at 9:36 AM, David Byrne wrote:

> I have a minor quibble about Lucene's NGramTokenizer.
> 
> Before I tokenize my strings, I am padding them with whitespace:
> 
> String foobar = " " + foo + " " + bar + " ";
> 
> When constructing term vectors from ngrams, this strategy has a couple 
> benefits.  First, it places special emphasis on the starting and ending of a 
> word.  Second, it improves the similarity between phrases with swapped words. 
>  " foo bar " matches " bar foo " more closely than "foo bar" matches "bar 
> foo".
> 
> 

I'm not following this argument.  What does the extra whitespace give you here?

> The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
> to do some preprocessing on my strings before I can tokenize them:
> 
> foobar = foobar.replace(" ", "$"); // arbitrary char not in my data
> 
> 

I'm confused.  If you are padding the strings up front anyway, why not just 
insert the arbitrary char at that point instead?  Where is the extra processing?
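
In other words, something like this rough, untested sketch (I'm assuming the 
contrib NGramTokenizer(Reader, minGram, maxGram) constructor and the 
TermAttribute API here; '$' is just a stand-in for whatever boundary character 
you pick):

import java.io.StringReader;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class PaddedNGrams {
    public static void main(String[] args) throws Exception {
        String foo = "foo", bar = "bar";
        // Build the string with the substitute boundary char up front,
        // instead of padding with spaces and replacing them afterwards.
        String padded = "$" + foo + "$" + bar + "$";

        NGramTokenizer tokenizer = new NGramTokenizer(new StringReader(padded), 2, 2);
        TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.term()); // "$f", "fo", "oo", "o$", ...
        }
        tokenizer.end();
        tokenizer.close();
    }
}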

> This is undocumented, so users won't realize their strings are being 
> trim()'ed, unless they look through the source, or examine the tokens 
> manually.
> 
> 

It may be undocumented, but I think it is pretty standard in terms of what 
users expect out of a tokenizer.

> I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
> there a compelling reason against this?
> 
> 

Unfortunately, I'm not following your reasons for doing it.  I won't say I'm 
against it at this point, but I don't see a compelling reason to change it 
either, so if you could clarify, that would be great.  It's been around for 
quite some time in its current form, and I think it fits most people's 
expectations of
ngrams.

-Grant
