[
https://issues.apache.org/jira/browse/LUCENE-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677930#comment-13677930
]
Simon Willnauer commented on LUCENE-5042:
-----------------------------------------
Hey Adrien, this looks very cool! I have a couple of minor comments:
* Can we factor out the toCodepoints calculation into a method in, for instance,
CharacterUtils? I think we use this elsewhere in a similar way, and you
might want to reuse it in the future as well.
* Can we have a comment on NGramTokenizer saying that every method should be
final except for isTokenChar?
* If you can think of a hard limit for the while(true) loop in NGramTokenizer,
can we add an assert that makes sure we always make progress, i.e. never walk
backwards or fail to consume anything? Not sure if it is possible.
* Can you use more parentheses for readability, like in:
{code}
if (gramSize > maxGram || bufferStart + gramSize > bufferEnd)
// vs.
if (gramSize > maxGram || (bufferStart + gramSize) > bufferEnd)
{code}
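To make the first and third points more concrete, here is a rough sketch of what I have in mind. It is not taken from the patch: the class name NGramSketch, both method names and their signatures are hypothetical and are not the existing CharacterUtils API. The progress check assumes grams are emitted per start position in increasing size, so the pair (start, gramSize) should only ever move forward.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not Lucene's actual code. */
final class NGramSketch {

  private NGramSketch() {}

  /**
   * Converts src[srcOff, srcOff + srcLen) into code points written to dest,
   * starting at destOff. Returns the number of code points written.
   * A surrogate pair becomes a single entry in dest.
   */
  static int toCodePoints(char[] src, int srcOff, int srcLen, int[] dest, int destOff) {
    final int end = srcOff + srcLen;
    int written = 0;
    for (int i = srcOff; i < end; ) {
      final int cp = Character.codePointAt(src, i, end);
      dest[destOff + written++] = cp;
      i += Character.charCount(cp); // 2 for a surrogate pair, 1 otherwise
    }
    return written;
  }

  /** Emits all grams of minGram..maxGram code points, ordered by start position. */
  static List<String> ngrams(String text, int minGram, int maxGram) {
    if (minGram < 1 || minGram > maxGram) {
      throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
    }
    final char[] chars = text.toCharArray();
    final int[] cps = new int[chars.length]; // never more code points than chars
    final int len = toCodePoints(chars, 0, chars.length, cps, 0);

    final List<String> out = new ArrayList<>();
    int start = 0;
    int gramSize = minGram;
    // State of the previously emitted gram, used only by the progress check.
    int prevStart = 0;
    int prevGramSize = minGram - 1;

    while (true) {
      if (gramSize > maxGram || (start + gramSize) > len) {
        // Current start position is exhausted: move on to the next one.
        start++;
        gramSize = minGram;
        if ((start + gramSize) > len) {
          break; // not enough input left for even the smallest gram
        }
      }
      // Progress check: (start, gramSize) must advance lexicographically.
      assert start > prevStart || (start == prevStart && gramSize > prevGramSize)
          : "no progress at start=" + start + ", gramSize=" + gramSize;
      prevStart = start;
      prevGramSize = gramSize;

      out.add(new String(cps, start, gramSize)); // String(int[] codePoints, offset, count)
      gramSize++;
    }
    return out;
  }
}
```

With minGram=1 and maxGram=2, NGramSketch.ngrams("abc", 1, 2) gives a, ab, b, bc, c. Because the grams are built from code points rather than chars, a surrogate pair is never split in the middle. The assert only fires when assertions are enabled (-ea), so it costs nothing in production.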
> Improve NGramTokenizer
> ----------------------
>
> Key: LUCENE-5042
> URL: https://issues.apache.org/jira/browse/LUCENE-5042
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Fix For: 5.0, 4.4
>
> Attachments: LUCENE-5042.patch
>
>
> Now that we fixed NGramTokenizer and NGramTokenFilter to not produce corrupt
> token streams, the only way to have "true" offsets for n-grams is to use the
> tokenizer (the filter emits the offsets of the original token).
> Yet, our NGramTokenizer has a few flaws, in particular:
> - it doesn't have the ability to pre-tokenize the input stream, for example
> on whitespace,
> - it doesn't play nicely with surrogate pairs.
> Since we already broke backward compatibility for it in 4.4, I'd like to also
> fix these issues before we release.