[
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002019#comment-13002019
]
Robert Muir commented on LUCENE-2947:
-------------------------------------
Hi Dave, in my opinion there are a lot of problems with our current
NGramTokenizer (yours is just one), and it would be a good idea to consider
creating a new one. We could rename the old one to ClassicNGramTokenizer or
something for people who need backwards compatibility.
A lot of the problems already have open JIRA issues; I gave my opinion on some
of the brokenness in LUCENE-1224. The largest problem is that these
tokenizers only examine the first 1024 chars of the document. They shouldn't
just discard anything after 1024 chars. There is no need to load the 'entire
document' into memory... n-gram tokenization can work on a "sliding window"
across the document.
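To make the "sliding window" idea concrete, here's a rough sketch in plain Java
(not wired into the Tokenizer API, and the class/method names are just
illustrative) that emits character n-grams from a Reader while only ever
buffering gramSize chars, so there is no 1024-char cap anywhere:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: sliding-window character n-grams over a Reader.
// Only gramSize chars are buffered at any time, so input length is unbounded.
public class SlidingNGrams {
  public static List<String> ngrams(Reader in, int gramSize) throws IOException {
    List<String> grams = new ArrayList<String>();
    StringBuilder window = new StringBuilder();
    int c;
    while ((c = in.read()) != -1) {
      window.append((char) c);
      if (window.length() > gramSize) {
        window.deleteCharAt(0);        // slide the window forward by one char
      }
      if (window.length() == gramSize) {
        grams.add(window.toString());  // emit the current n-gram
      }
    }
    return grams;
  }

  public static void main(String[] args) throws IOException {
    // "abcd" with gramSize=2 -> [ab, bc, cd], no matter how long the input is
    System.out.println(ngrams(new StringReader("abcd"), 2));
  }
}

A real tokenizer would emit tokens incrementally rather than collecting a List,
but the point is that the whole document never needs to be in memory at once.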
In my opinion, part of n-gram character tokenization is being able to configure
what is a token character and what is not. (Note I don't mean a Java character
here, but a character in the more abstract sense; e.g. a character might have
diacritics and be treated as a single unit.) For some applications maybe this is
just 'alphabetic letters'; for other apps perhaps even punctuation could be
considered 'relevant'. So it should somehow be flexible. Furthermore, in the
case of word-spanning n-grams, you should be able to collapse runs of
non-characters into a single marker (e.g. _), and usually you would want to
do this for the start and end of the string too.
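For example, one rough way the collapsing could work (the isTokenChar() policy
and the class name here are just illustrative stand-ins for whatever an
application configures):

import java.util.ArrayList;
import java.util.List;

// Sketch: collapse runs of non-token chars to a single '_' marker,
// add the marker at the start and end of the string, then emit n-grams
// that can span word boundaries.
public class MarkerNGrams {
  // Stand-in for whatever "token character" policy an application wants.
  static boolean isTokenChar(char c) {
    return Character.isLetter(c);
  }

  static String normalize(String s) {
    StringBuilder out = new StringBuilder("_");      // start-of-string marker
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (isTokenChar(c)) {
        out.append(c);
      } else if (out.charAt(out.length() - 1) != '_') {
        out.append('_');                             // collapse the whole run
      }
    }
    if (out.charAt(out.length() - 1) != '_') {
      out.append('_');                               // end-of-string marker
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String norm = normalize("foo,  bar!");           // -> "_foo_bar_"
    List<String> trigrams = new ArrayList<String>();
    for (int i = 0; i + 3 <= norm.length(); i++) {
      trigrams.add(norm.substring(i, i + 3));
    }
    System.out.println(trigrams); // [_fo, foo, oo_, o_b, _ba, bar, ar_]
  }
}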
Here's a visual representation of how things should look when you use these
tokenizers, in my opinion:
http://www.csee.umbc.edu/~nicholas/601/SIGIR08-Poster.pdf
> NGramTokenizer shouldn't trim whitespace
> ----------------------------------------
>
> Key: LUCENE-2947
> URL: https://issues.apache.org/jira/browse/LUCENE-2947
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Affects Versions: 3.0.3
> Reporter: David Byrne
> Priority: Minor
>
> Before I tokenize my strings, I am padding them with white space:
> String foobar = " " + foo + " " + bar + " ";
> When constructing term vectors from n-grams, this strategy has a couple of
> benefits. First, it places special emphasis on the start and end of a
> word. Second, it improves the similarity between phrases with swapped words:
> " foo bar " matches " bar foo " more closely than "foo bar" matches "bar
> foo".
> The problem is that Lucene's NGramTokenizer trims whitespace. This forces me
> to do some preprocessing on my strings before I can tokenize them:
> foobar = foobar.replace(" ", "$"); // literal replacement; arbitrary char not in my data
> This is undocumented, so users won't realize their strings are being
> trim()'ed unless they look through the source or examine the tokens
> manually.
> I am proposing that NGramTokenizer be changed to respect whitespace. Is
> there a compelling reason against this?