[ 
https://issues.apache.org/jira/browse/LUCENE-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792429#comment-13792429
 ] 

Robert Muir commented on LUCENE-5269:
-------------------------------------

{quote}
This is so crazy! Why did we never hit this combination before?
{quote}

This combination is especially good at finding the bug. Here's why:
{code}
Tokenizer tokenizer = new EdgeNGramTokenizer(TEST_VERSION_CURRENT, reader, 2, 94);
TokenStream stream = new ShingleFilter(tokenizer, 5);
stream = new NGramTokenFilter(TEST_VERSION_CURRENT, stream, 55, 83);
{code}

The edge-ngram tokenizer has min=2 and max=94, so it is basically brute-forcing
every token size. The shingle filter then makes tons of tokens with
positionIncrement=0. That makes it easy for the (previously buggy)
NGramTokenFilter, whose length filter was wrong, to misclassify tokens, since
its logic expects code points, and to emit an initial token with posinc=0:

{code}
if ((curPos + curGramSize) <= curCodePointCount) {
...
          posIncAtt.setPositionIncrement(curPosInc);
{code}
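
To illustrate why a length check in chars and a bounds check in code points can disagree, here is a minimal standalone sketch. It is not the Lucene code: the class name, the gramFits helper and the sample string are hypothetical, and only the shape of the comparison is borrowed from the snippet above.

```java
public class CodePointLengthSketch {

    // Hypothetical helper mirroring the shape of the guard above:
    // does a gram of gramSize units starting at pos fit within count units?
    static boolean gramFits(int pos, int gramSize, int count) {
        return pos + gramSize <= count;
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E (one code point, two UTF-16 chars), then 'b'.
        String term = "a\uD834\uDD1Eb";

        int charLength = term.length();                             // UTF-16 code units
        int codePointCount = term.codePointCount(0, term.length()); // Unicode code points

        System.out.println("chars=" + charLength + " codePoints=" + codePointCount);

        // The same question gets different answers depending on the unit used.
        System.out.println("fits by chars:       " + gramFits(0, 4, charLength));
        System.out.println("fits by code points: " + gramFits(0, 4, codePointCount));
    }
}
```

Here the term is 4 chars long but only 3 code points, so a filter that accepts it by char length can hand it to logic that then finds no gram of that size when counting code points.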


> TestRandomChains failure
> ------------------------
>
>                 Key: LUCENE-5269
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5269
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: 4.5.1, 4.6, 5.0
>
>         Attachments: LUCENE-5269.patch, LUCENE-5269.patch, LUCENE-5269.patch, 
> LUCENE-5269_test.patch, LUCENE-5269_test.patch, LUCENE-5269_test.patch
>
>
> One of EdgeNGramTokenizer, ShingleFilter, NGramTokenFilter is buggy, or 
> possibly only the combination of them conspiring together.



