[ https://issues.apache.org/jira/browse/LUCENE-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092349#comment-15092349 ]
ASF subversion and git services commented on LUCENE-2229:
---------------------------------------------------------

Commit 1724096 from [~jpountz] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1724096 ]

LUCENE-2229: Move CHANGES entry to 5.4.1.

> SimpleSpanFragmenter fails to start a new fragment
> --------------------------------------------------
>
>                 Key: LUCENE-2229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2229
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: Elmer Garduno
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 5.5, 5.4.1
>
>         Attachments: LUCENE-2229.patch, LUCENE-2229.patch, LUCENE-2229.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> SimpleSpanFragmenter fails to identify a new fragment when there is more than
> one stop word after a span is detected. This problem can be observed when the
> Query contains a PhraseQuery.
> The problem is that the span extends toward the end of the TokenGroup. After a
> span is detected, {{waitForPos = positionSpans.get(i).end + 1;}} is set, but
> {{position += posIncAtt.getPositionIncrement();}} can advance {{position}} by
> more than one when stop words are skipped. {{position}} then jumps past
> {{waitForPos}}, so {{(waitForPos == position)}} never matches.
> {code:title=SimpleSpanFragmenter.java}
>   public boolean isNewFragment() {
>     position += posIncAtt.getPositionIncrement();
>     if (waitForPos == position) {
>       waitForPos = -1;
>     } else if (waitForPos != -1) {
>       return false;
>     }
>     WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());
>     if (wSpanTerm != null) {
>       List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();
>       for (int i = 0; i < positionSpans.size(); i++) {
>         if (positionSpans.get(i).start == position) {
>           waitForPos = positionSpans.get(i).end + 1;
>           break;
>         }
>       }
>     }
>     ...
> {code}
> The test case gives an example: in the following Document, the query
> *"all tokens"* matches a phrase that is followed by the stop words _of a_.
> {panel:title=Document}
> "Attribute instances are reused for *all tokens* _of a_ document. Thus, a
> TokenStream/-Filter needs to update the appropriate Attribute(s) in
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the
> data in the Attributes and then calls incrementToken() again until it retuns
> false, which indicates that the end of the stream was reached. This means
> that in each call of incrementToken() a TokenStream/-Filter can safely
> overwrite the data in the Attribute instances."
> {panel}
> {code:title=HighlighterTest.java}
>   public void testSimpleSpanFragmenter() throws Exception {
>     ...
>     doSearching("\"all tokens\"");
>     maxNumFragmentsRequired = 2;
>
>     scorer = new QueryScorer(query, FIELD_NAME);
>     highlighter = new Highlighter(this, scorer);
>     for (int i = 0; i < hits.totalHits; i++) {
>       String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
>       TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
>       highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
>       String result = highlighter.getBestFragments(tokenStream, text, maxNumFragmentsRequired, "...");
>       System.out.println("\t" + result);
>     }
>   }
> {code}
> {panel:title=Result}
> are reused for <B>all</B> <B>tokens</B> of a document. Thus, a
> TokenStream/-Filter needs to update the appropriate Attribute(s) in
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the
> data in the Attributes and then calls incrementToken() again until it retuns
> false, which indicates that the end of the stream was reached. This means
> that in each call of incrementToken() a TokenStream/-Filter can safely
> overwrite the data in the Attribute instances.
> {panel}
> {panel:title=Expected Result}
> for <B>all</B> <B>tokens</B> of a document
> {panel}
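
The gate described in the issue can be replayed without Lucene. The following is a minimal sketch, not the committed patch: the class name WaitForPosSketch, the method firstUnblockedToken and the exactMatch flag are hypothetical, and the position increments are an assumption about what a stop-word-removing analyzer would report for "... reused for all tokens of a document. Thus, a TokenStream ...". It compares the {{==}} check from the issue with a {{<=}} check as one possible remedy.

```java
class WaitForPosSketch {

    /**
     * Replays the gate at the top of isNewFragment() over a token stream that is
     * described only by its position increments (assumed values, not read from Lucene).
     * Returns the index of the first token after the span start for which the gate
     * opens again, or -1 if the gate stays closed until the end of the stream.
     */
    static int firstUnblockedToken(int[] posIncrements, int spanStart, int spanEnd,
                                   boolean exactMatch) {
        int position = -1;
        int waitForPos = -1;
        boolean spanSeen = false;
        for (int i = 0; i < posIncrements.length; i++) {
            position += posIncrements[i];
            // exactMatch mirrors the check quoted in the issue; the other branch
            // is the hypothetical "reached or passed" variant.
            boolean reached = exactMatch ? (waitForPos == position)
                                         : (waitForPos <= position);
            if (reached) {
                waitForPos = -1;
            } else if (waitForPos != -1) {
                continue; // still waiting: isNewFragment() would return false here
            }
            if (spanSeen) {
                return i; // gate reopened after the span
            }
            if (position == spanStart) {
                waitForPos = spanEnd + 1; // same assignment as in the issue
                spanSeen = true;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed positions: attribute=0, instances=1, reused=3, all=5, tokens=6,
        // document=9, thus=10, tokenstream=12 (stop words removed, gaps preserved).
        int[] increments = {1, 1, 2, 2, 1, 3, 1, 2};
        System.out.println(firstUnblockedToken(increments, 5, 6, true));  // prints -1
        System.out.println(firstUnblockedToken(increments, 5, 6, false)); // prints 5
    }
}
```

With the exact comparison, waitForPos is 7 after the span while position jumps from 6 to 9, so the gate never reopens and every later token is blocked. With the {{<=}} comparison the gate reopens at "document" (index 5 in this sketch).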