[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

ASF GitHub Bot (JIRA) Wed, 02 Nov 2016 05:49:46 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15628861#comment-15628861 ]


ASF GitHub Bot commented on LUCENE-7526:
----------------------------------------

Github user dsmiley commented on the issue:

    https://github.com/apache/lucene-solr/pull/105
  
    _(I wrote this as you made your last comments, but rather than delete it 
I'll comment anyway)_
    
    The documentation for `PostingsEnum.nextPosition()` states that calling it 
more than `freq()` times is undefined.  Thus it's quite valid to throw an 
IllegalStateException.
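
To make that contract concrete, here is a minimal sketch. `BoundedPositions` and `BoundedPositionsDemo` are hypothetical stand-ins written for illustration, not Lucene classes; they only mimic the `freq()` / `nextPosition()` relationship, and failing fast is just one behavior that an "undefined" contract permits.

```java
// Hypothetical stand-in for a per-document positions iterator. This is NOT
// Lucene's PostingsEnum; it only mimics the freq()/nextPosition() contract.
final class BoundedPositions {
    private final int[] positions; // term positions within one document
    private int consumed = 0;      // how many positions have been returned so far

    BoundedPositions(int... positions) {
        this.positions = positions.clone();
    }

    /** Number of occurrences of the term in the current document. */
    int freq() {
        return positions.length;
    }

    /**
     * Returns the next position. Calls beyond freq() are treated as a caller
     * bug here, so this sketch fails fast instead of returning garbage.
     */
    int nextPosition() {
        if (consumed >= positions.length) {
            throw new IllegalStateException(
                "nextPosition() called more than freq()=" + positions.length + " times");
        }
        return positions[consumed++];
    }
}

final class BoundedPositionsDemo {
    public static void main(String[] args) {
        BoundedPositions p = new BoundedPositions(3, 7, 12);
        // Correct usage: call nextPosition() exactly freq() times.
        for (int i = 0; i < p.freq(); i++) {
            System.out.println(p.nextPosition());
        }
        try {
            p.nextPosition(); // one call too many
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A caller that always loops `freq()` times never reaches the exception, so throwing only affects code that was already outside the documented contract.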
    
    > Btw, we've seen other needs for something like a CompositePostingsEnum 
that abstracts over a set of terms, but since this is still internal, dropping 
the house-keeping will also make this code do less. 
    
    I don't think I quite get what you're saying.  By "other needs" do you mean 
needs internal to Bloomberg?  If so, how would that relate to this one inside 
the UH?  Are you advocating a general purpose Multi-PostingsEnum?  On the 
latter... a highlighter wouldn't be the place to introduce such a thing.  There 
is an `org.apache.lucene.index.MultiPostingsEnum`, which I looked at during the 
Lucene hackday code sprint because it piqued my curiosity.  Unfortunately, it 
doesn't seem quite general purpose enough for us to use -- it demands a 
MultiTermsEnum parent.  Perhaps it could be improved to be useful without that 
requirement... but that seems like a separate issue.


> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
>                 Key: LUCENE-7526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7526
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies 
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
> MemoryIndex for producing Offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
> MemoryIndex.  Can only be used if the query distills down to terms and 
> automata.
> * TokenStream removal 
>   ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
> the memory index and then once consumed a new one was generated by 
> uninverting the MemoryIndex back into a TokenStream if there were automata 
> (wildcard/mtq queries) involved.  Now this is avoided, which should save 
> memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
> generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for 
> wildcard/mtq queries.  This should improve relevancy by providing unified 
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated 
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
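
The aggregation described in the CompositePostingsEnum bullet above can be sketched roughly as follows. This is not the class from the patch: `MergedPositions` and its demo are hypothetical names, plain `int[]` arrays stand in for the underlying PostingsEnums, and each array is assumed to be sorted ascending. The sketch only shows the idea of presenting several terms matched by one wildcard as a single stream with a combined frequency.

```java
import java.util.PriorityQueue;

// Hypothetical sketch, NOT the CompositePostingsEnum from the patch. It merges
// the sorted position lists of several terms into one ordered stream.
final class MergedPositions {
    // One cursor per underlying term's position list.
    private static final class Cursor {
        final int[] positions;
        int index = 0;

        Cursor(int[] positions) {
            this.positions = positions;
        }

        int current() {
            return positions[index];
        }
    }

    // Orders cursors by the position each one is currently pointing at.
    private final PriorityQueue<Cursor> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
    private final int totalFreq;

    MergedPositions(int[]... perTermPositions) {
        int total = 0;
        for (int[] positions : perTermPositions) {
            total += positions.length;
            if (positions.length > 0) {
                queue.add(new Cursor(positions));
            }
        }
        this.totalFreq = total;
    }

    /** Combined frequency: the sum of the underlying per-term frequencies. */
    int freq() {
        return totalFreq;
    }

    /** Returns the smallest position not yet returned, across all terms. */
    int nextPosition() {
        Cursor top = queue.poll();
        if (top == null) {
            throw new IllegalStateException("nextPosition() called more than freq() times");
        }
        int position = top.current();
        top.index++;
        if (top.index < top.positions.length) {
            queue.add(top); // re-insert so the cursor is re-ordered by its new position
        }
        return position;
    }
}

final class MergedPositionsDemo {
    public static void main(String[] args) {
        // Positions of two terms that a wildcard such as ba* might match.
        MergedPositions merged = new MergedPositions(new int[] {2, 9}, new int[] {5});
        for (int i = 0; i < merged.freq(); i++) {
            System.out.println(merged.nextPosition()); // prints 2, then 5, then 9
        }
    }
}
```

Because the merged stream reports one frequency for the whole wildcard, a passage scorer sees the wildcard as a single term rather than several rare ones, which is the "unified metrics" effect the description mentions.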



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
