[ 
https://issues.apache.org/jira/browse/LUCENE-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709615#comment-15709615
 ] 

David Smiley commented on LUCENE-7578:
--------------------------------------

_Disclaimer: I'm merely filing this issue for now; I don't have time to work on it._

This could be a separate issue, or be done here if that would be less work 
overall: instead of PhraseHelper filtering a provided PostingsEnum, I think it 
should produce one OffsetsEnum per top-level SpanQuery.  A redesigned, 
half-rewritten PhraseHelper that uses the SpanCollector API could do this in 
about the same amount of code, whereas changing the current design to do it 
would add a lot of complexity, I think.  The outcome should improve passage 
relevancy for position-sensitive clauses.  It could be further tweaked so that 
_some_ SpanQueries (namely those converted from a PhraseQuery) yield one 
virtual position (with the earliest startOffset and the last endOffset) instead 
of exposing each word position separately.  That would eliminate intra-phrase 
highlight delimiters, and it would probably indirectly improve passage 
relevancy too.  The reported freq() would be the smallest freq of the provided 
terms.  The move to this design would also eliminate the position-span caching 
going on in PhraseHelper.
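
To make the "one virtual position" idea concrete, here is a minimal sketch in 
plain Java.  It is not Lucene code: the class VirtualPhrasePosition, the Offset 
type, and the methods merge() and phraseFreq() are all hypothetical names made 
up for illustration, and the offsets in main() are invented.  It only shows the 
arithmetic described above: take the earliest startOffset and the last 
endOffset of the words in a phrase match, and report the smallest term freq.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types are Lucene classes.
class VirtualPhrasePosition {

    /** Character offsets of one matched word: start inclusive, end exclusive. */
    static final class Offset {
        final int start;
        final int end;

        Offset(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Collapses the word offsets of one phrase match into a single virtual position. */
    static Offset merge(List<Offset> wordOffsets) {
        if (wordOffsets.isEmpty()) {
            throw new IllegalArgumentException("a phrase match needs at least one word");
        }
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (Offset o : wordOffsets) {
            start = Math.min(start, o.start); // earliest startOffset
            end = Math.max(end, o.end);       // last endOffset
        }
        return new Offset(start, end);
    }

    /** Frequency reported for the phrase: the smallest freq among its terms. */
    static int phraseFreq(int... termFreqs) {
        if (termFreqs.length == 0) {
            throw new IllegalArgumentException("a phrase needs at least one term");
        }
        return Arrays.stream(termFreqs).min().getAsInt();
    }

    public static void main(String[] args) {
        // Three words of one phrase match, with made-up offsets and term freqs.
        Offset merged = merge(Arrays.asList(
                new Offset(4, 9), new Offset(10, 15), new Offset(16, 19)));
        System.out.println(merged + " freq=" + phraseFreq(12, 3, 7)); // prints [4,19) freq=3
    }
}
```

With a single merged offset per match, a passage formatter would wrap the whole 
phrase in one pair of highlight delimiters rather than one pair per word.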

> UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API
> -----------------------------------------------------------------
>
>                 Key: LUCENE-7578
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7578
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>
> The PhraseHelper of the UnifiedHighlighter currently collects position-spans 
> per SpanQuery (and it knows which terms are in which SpanQuery), and then it 
> filters PostingsEnum based on that.  It's similar to how the original 
> Highlighter's WeightedSpanTermExtractor (WSTE) works.  The main problem with 
> this approach is that it can be inaccurate for some nested span queries -- 
> LUCENE-2287, LUCENE-5455 (has the clearest example), LUCENE-6796.  Non-nested 
> SpanQueries (e.g. those converted from a PhraseQuery or MultiPhraseQuery) are 
> _not_ a problem.


