[ https://issues.apache.org/jira/browse/LUCENE-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792429#comment-13792429 ]
Robert Muir commented on LUCENE-5269:
-------------------------------------

{quote}
This is so crazy! Why did we never hit this combination before?
{quote}

This combination is especially good at finding the bug. Here's why:

{code}
Tokenizer tokenizer = new EdgeNGramTokenizer(TEST_VERSION_CURRENT, reader, 2, 94);
TokenStream stream = new ShingleFilter(tokenizer, 5);
stream = new NGramTokenFilter(TEST_VERSION_CURRENT, stream, 55, 83);
{code}

The edge n-gram tokenizer has min=2 and max=94, so it's basically brute-forcing every token size. The shingle filter then makes tons of tokens with positionIncrement=0. That makes it easy for the (previously buggy) NGramTokenFilter, which used the wrong length filter, to misclassify tokens with its logic expecting code points and emit an initial token with posinc=0:

{code}
if ((curPos + curGramSize) <= curCodePointCount) {
  ...
  posIncAtt.setPositionIncrement(curPosInc);
{code}

> TestRandomChains failure
> ------------------------
>
>                 Key: LUCENE-5269
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5269
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: 4.5.1, 4.6, 5.0
>
>         Attachments: LUCENE-5269.patch, LUCENE-5269.patch, LUCENE-5269.patch,
>                      LUCENE-5269_test.patch, LUCENE-5269_test.patch, LUCENE-5269_test.patch
>
>
> One of EdgeNGramTokenizer, ShingleFilter, NGramTokenFilter is buggy, or
> possibly only the combination of them conspiring together.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
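
For readers following the explanation in the comment, here is a minimal sketch of the kind of mismatch it describes. It assumes the "wrong length filter" measured token length in UTF-16 chars while the n-gram logic counted code points. This is plain Java, not Lucene code: the class and method names (`LengthVsCodePoints`, `passesCharLength`, `passesCodePointCount`, `supplementaryTerm`) are hypothetical and only illustrate the two ways of measuring the same term.

```java
public class LengthVsCodePoints {

    // Hypothetical stand-in for a filter that measures length in UTF-16 chars.
    static boolean passesCharLength(String term, int min) {
        return term.length() >= min;
    }

    // Hypothetical stand-in for a filter that measures length in code points.
    static boolean passesCodePointCount(String term, int min) {
        return term.codePointCount(0, term.length()) >= min;
    }

    // Builds a term of `count` copies of U+1D538, a supplementary character
    // that occupies two chars (a surrogate pair) in a Java String.
    static String supplementaryTerm(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(0x1D538);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int minGram = 55; // same minimum as the NGramTokenFilter in the failing chain
        String term = supplementaryTerm(30);

        System.out.println("chars=" + term.length());
        System.out.println("codePoints=" + term.codePointCount(0, term.length()));
        System.out.println("passes char-length check: " + passesCharLength(term, minGram));
        System.out.println("passes code-point check: " + passesCodePointCount(term, minGram));
    }
}
```

A term of 30 supplementary characters is 60 chars long but only 30 code points, so it passes a char-based minimum of 55 and then fails a check such as `(curPos + curGramSize) <= curCodePointCount`. Under the assumption above, a token like this gets past the filter but produces no grams.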