[ https://issues.apache.org/jira/browse/LUCENE-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092350#comment-15092350 ]

ASF subversion and git services commented on LUCENE-2229:
---------------------------------------------------------

Commit 1724097 from [~jpountz] in branch 'dev/branches/lucene_solr_5_4'
[ https://svn.apache.org/r1724097 ]

LUCENE-2229: Fix SimpleSpanFragmenter bug with adjacent stop-words

> SimpleSpanFragmenter fails to start a new fragment
> --------------------------------------------------
>
>                 Key: LUCENE-2229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2229
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>            Reporter: Elmer Garduno
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 5.5, 5.4.1
>
>         Attachments: LUCENE-2229.patch, LUCENE-2229.patch, LUCENE-2229.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> SimpleSpanFragmenter fails to identify a new fragment when more than one stop 
> word follows a detected span. The problem can be observed when the Query 
> contains a PhraseQuery.
> The result is that the span extends toward the end of the TokenGroup. After a 
> span is matched, {{waitForPos = positionSpans.get(i).end + 1;}} records the 
> position at which a new fragment may start, but because stop words are removed 
> from the token stream, {{position += posIncAtt.getPositionIncrement();}} can 
> advance {{position}} past {{waitForPos}} in a single step, so the check 
> {{(waitForPos == position)}} never matches.
> {code:title=SimpleSpanFragmenter.java}
>   public boolean isNewFragment() {
>     position += posIncAtt.getPositionIncrement();
>     if (waitForPos == position) {
>       waitForPos = -1;
>     } else if (waitForPos != -1) {
>       return false;
>     }
>     WeightedSpanTerm wSpanTerm = queryScorer.getWeightedSpanTerm(termAtt.term());
>     if (wSpanTerm != null) {
>       List<PositionSpan> positionSpans = wSpanTerm.getPositionSpans();
>       for (int i = 0; i < positionSpans.size(); i++) {
>         if (positionSpans.get(i).start == position) {
>           waitForPos = positionSpans.get(i).end + 1;
>           break;
>         }
>       }
>     }
>    ...
> {code}
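> A minimal sketch of one possible correction (an illustration of the idea, not 
> necessarily the committed patch) is to treat {{waitForPos}} as a threshold 
> rather than requiring an exact match, so that a position increment that skips 
> over removed stop words cannot jump past it:
> {code:title=Possible fix (sketch)}
>   public boolean isNewFragment() {
>     position += posIncAtt.getPositionIncrement();
>     // Stop words are not emitted by the analyzer, so the increment above can be
>     // greater than 1 and "position" may land beyond "waitForPos". Comparing with
>     // <= instead of == prevents the end of the span from being missed.
>     if (waitForPos <= position) {
>       waitForPos = -1;
>     } else if (waitForPos != -1) {
>       return false;
>     }
>     ...
> {code}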
> An example is provided in the test case below for the following Document and 
> the phrase query *"all tokens"*, which is immediately followed in the text by 
> the stop words _of a_.
> {panel:title=Document}
> "Attribute instances are reused for *all tokens* _of a_ document. Thus, a 
> TokenStream/-Filter needs to update the appropriate Attribute(s) in 
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the 
> data in the Attributes and then calls incrementToken() again until it returns 
> false, which indicates that the end of the stream was reached. This means 
> that in each call of incrementToken() a TokenStream/-Filter can safely 
> overwrite the data in the Attribute instances."
> {panel}
> {code:title=HighlighterTest.java}
>  public void testSimpleSpanFragmenter() throws Exception {
>     ...
>     doSearching("\"all tokens\"");
>     maxNumFragmentsRequired = 2;
>     
>     scorer = new QueryScorer(query, FIELD_NAME);
>     highlighter = new Highlighter(this, scorer);
>     for (int i = 0; i < hits.totalHits; i++) {
>       String text = searcher.doc(hits.scoreDocs[i].doc).get(FIELD_NAME);
>       TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, new StringReader(text));
>       highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 20));
>       String result = highlighter.getBestFragments(tokenStream, text,
>           maxNumFragmentsRequired, "...");
>       System.out.println("\t" + result);
>     }
>   }
> {code}
> {panel:title=Result}
> are reused for <B>all</B> <B>tokens</B> of a document. Thus, a 
> TokenStream/-Filter needs to update the appropriate Attribute(s) in 
> incrementToken(). The consumer, commonly the Lucene indexer, consumes the 
> data in the Attributes and then calls incrementToken() again until it returns 
> false, which indicates that the end of the stream was reached. This means 
> that in each call of incrementToken() a TokenStream/-Filter can safely 
> overwrite the data in the Attribute instances.
> {panel}
> {panel:title=Expected Result}
> for <B>all</B> <B>tokens</B> of a document
> {panel}
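> The position gap that triggers the bug can be made visible by dumping the 
> position increments produced by the analyzer. The snippet below is only a 
> sketch: it assumes an {{analyzer}} configured with an English stop set, reuses 
> the {{FIELD_NAME}} and {{text}} names from the test above, and is written 
> against the attribute-based TokenStream consumer API. The token after the 
> removed stop words _of a_ arrives with a position increment greater than 1:
> {code:title=Position increments across stop words (sketch)}
>   try (TokenStream ts = analyzer.tokenStream(FIELD_NAME, text)) {
>     CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
>     PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
>     ts.reset();
>     int position = -1;
>     while (ts.incrementToken()) {
>       int inc = posIncAtt.getPositionIncrement();
>       position += inc;
>       // Tokens that follow removed stop words report an increment > 1,
>       // which is how "position" can skip past "waitForPos".
>       System.out.println(position + "\t+" + inc + "\t" + termAtt);
>     }
>     ts.end();
>   }
> {code}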



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
