[
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002019#comment-13002019
]
Robert Muir commented on LUCENE-2947:
-------------------------------------
Hi Dave, in my opinion there are a lot of problems with our current
NGramTokenizer (yours is just one), and it would be a good idea to consider
creating a new one. We could rename the old one to ClassicNGramTokenizer or
something for people who need backwards compatibility.
A lot of the problems already have open JIRA issues; I gave my opinion on some
of the brokenness in LUCENE-1224. The largest problem is that these
tokenizers only examine the first 1024 chars of the document. They shouldn't
just discard anything after 1024 chars. There is no need to load the 'entire
document' into memory... n-gram tokenization can work on a "sliding window"
across the document.
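To make the "sliding window" idea concrete, here's a rough sketch in plain Java
(not wired into the Tokenizer API, and the class/method names are just
illustrative) that emits character n-grams from a Reader while only ever
buffering gramSize chars, so there is no 1024-char cap anywhere:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: sliding-window character n-grams over a Reader.
// Only gramSize chars are buffered at any time, so input length is unbounded.
public class SlidingNGrams {
  public static List<String> ngrams(Reader in, int gramSize) throws IOException {
    List<String> grams = new ArrayList<String>();
    StringBuilder window = new StringBuilder();
    int c;
    while ((c = in.read()) != -1) {
      window.append((char) c);
      if (window.length() > gramSize) {
        window.deleteCharAt(0);        // slide the window forward by one char
      }
      if (window.length() == gramSize) {
        grams.add(window.toString());  // emit the current n-gram
      }
    }
    return grams;
  }

  public static void main(String[] args) throws IOException {
    // "abcd" with gramSize=2 -> [ab, bc, cd], no matter how long the input is
    System.out.println(ngrams(new StringReader("abcd"), 2));
  }
}

A real tokenizer would emit tokens incrementally rather than collecting a List,
but the point is that the whole document never needs to be in memory at once.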
In my opinion, part of n-gram character tokenization is being able to configure
what is a token character and what is not. (Note I don't mean a Java character
here, but a character in the more abstract sense; e.g. a character might have
diacritics and be treated as a single unit.) For some applications maybe this is
just 'alphabetic letters'; for other apps perhaps even punctuation could be
considered 'relevant'. So it should somehow be flexible. Furthermore, in the
case of word-spanning n-grams, you should be able to collapse runs of
non-characters into a single marker (e.g. _), and usually you would want to
do this for the start and end of the string too.
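For example, one rough way the collapsing could work (the isTokenChar() policy
and the class name here are just illustrative stand-ins for whatever an
application configures):

import java.util.ArrayList;
import java.util.List;

// Sketch: collapse runs of non-token chars to a single '_' marker,
// add the marker at the start and end of the string, then emit n-grams
// that can span word boundaries.
public class MarkerNGrams {
  // Stand-in for whatever "token character" policy an application wants.
  static boolean isTokenChar(char c) {
    return Character.isLetter(c);
  }

  static String normalize(String s) {
    StringBuilder out = new StringBuilder("_");      // start-of-string marker
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (isTokenChar(c)) {
        out.append(c);
      } else if (out.charAt(out.length() - 1) != '_') {
        out.append('_');                             // collapse the whole run
      }
    }
    if (out.charAt(out.length() - 1) != '_') {
      out.append('_');                               // end-of-string marker
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String norm = normalize("foo,  bar!");           // -> "_foo_bar_"
    List<String> trigrams = new ArrayList<String>();
    for (int i = 0; i + 3 <= norm.length(); i++) {
      trigrams.add(norm.substring(i, i + 3));
    }
    System.out.println(trigrams); // [_fo, foo, oo_, o_b, _ba, bar, ar_]
  }
}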
Here's a visual representation of how things should look when you use these
tokenizers, in my opinion:
http://www.csee.umbc.edu/~nicholas/601/SIGIR08-Poster.pdf
> NGramTokenizer shouldn't trim whitespace
> ----------------------------------------
>
> Key: LUCENE-2947
> URL: https://issues.apache.org/jira/browse/LUCENE-2947
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Affects Versions: 3.0.3
> Reporter: David Byrne
> Priority: Minor
>
> Before I tokenize my strings, I am padding them with white space:
> String foobar = " " + foo + " " + bar + " ";
> When constructing term vectors from n-grams, this strategy has a couple of
> benefits. First, it places special emphasis on the start and end of a
> word. Second, it improves the similarity between phrases with swapped words:
> " foo bar " matches " bar foo " more closely than "foo bar" matches "bar
> foo".
> The problem is that Lucene's NGramTokenizer trims whitespace. This forces me
> to do some preprocessing on my strings before I can tokenize them:
> foobar = foobar.replace(" ", "$"); // literal replacement; arbitrary char not in my data
> This is undocumented, so users won't realize their strings are being
> trim()'ed unless they look through the source or examine the tokens
> manually.
> I am proposing that NGramTokenizer be changed to respect whitespace. Is
> there a compelling reason against this?