[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

David Smiley (JIRA) Sun, 13 Nov 2016 17:48:16 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15662505#comment-15662505
 ]


David Smiley commented on LUCENE-7526:
--------------------------------------

RE Benchmarks:

Yeah, it seems hard to get consistent results. I even upped the iteration 
count to 2000 (from 500) and still saw more variability than I'd like.  
Nonetheless, your analysis looks good to me; it's how I would sum up my 
observations today.

I ran my benchmarks with "-Xms4G -Xmx4G -XX:NewRatio=1" (to reduce the effects 
of GC).  My machine is a MacBook Pro Retina "Late 2013", 2 GHz i7, 8GB RAM.  
The index is on an external spinning disk, but there is no disk activity after 
warm-up because it's all in the OS cache.

As I looked over things carefully, I made a few more changes, all pretty 
minor: https://github.com/dsmiley/lucene-solr/commits/uh_Tim  
A few were in reaction to "ant precommit" (the test & some javadoc issues).  
I also did a little bit of refactoring that I hope you'll be good with.  In a 
couple of cases I merely moved a method up or down so that the code flows top 
to bottom better.  Regarding Passage: I'll create a separate issue expressly 
for making Passage usable by anyone customizing the highlighter.  It's not as 
simple as making addMatch public; there are other methods involved.  

_What's there now is in good shape for committing._

A couple of ideas occurred to me; _feel free to punt to another issue or never_:
* MultiValueTokenStream isn't needed for MemoryIndexOffsetStrategy if we 
change it to loop over content.split(separatorChar); MemoryIndex.addField has 
an overload that takes the position increment gap.  MultiValueTokenStream 
could then become an inner class of TokenStreamOffsetStrategy, where it 
wouldn't generally be used (as that strategy is no longer the default).  
That'd be nice -- it keeps the complexity over there, and 
MultiValueTokenStream is a bit of a hack anyway.
* We had made OffsetsEnum & TokenStreamPostingsEnum implement Closeable to 
limit the fallout when the text analysis code throws an exception, e.g. due 
to a bug.  The only beneficiary of this now is TokenStreamOffsetStrategy, 
which isn't the default anymore.  Closeable could be removed to simplify 
things.  But then again, perhaps it could be useful for those implementing 
custom OffsetStrategies.  I guess it should stay; there's very little to it 
after all.
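
To illustrate the first idea, here is a minimal, stdlib-only sketch of the 
split loop.  It is not Lucene code: ValueSlice and splitValues are 
hypothetical names, and '|' merely stands in for the real separator char.  It 
uses indexOf rather than String.split so that each value's start offset in 
the joined content is kept.  In the highlighter, each slice would presumably 
be analyzed and handed to the MemoryIndex.addField overload that takes a 
position increment gap; that call is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitValuesSketch {

    /** One value of a multi-valued field, plus where it starts in the joined content. */
    static final class ValueSlice {
        final String text;
        final int startOffset;

        ValueSlice(String text, int startOffset) {
            this.text = text;
            this.startOffset = startOffset;
        }
    }

    /** Splits the joined content on separatorChar, remembering each value's start offset. */
    static List<ValueSlice> splitValues(String content, char separatorChar) {
        List<ValueSlice> slices = new ArrayList<>();
        int start = 0;
        while (true) {
            int end = content.indexOf(separatorChar, start);
            if (end < 0) {
                // Last (or only) value: runs to the end of the content.
                slices.add(new ValueSlice(content.substring(start), start));
                break;
            }
            slices.add(new ValueSlice(content.substring(start, end), start));
            start = end + 1; // skip the separator itself
        }
        return slices;
    }

    public static void main(String[] args) {
        // '|' is an assumed separator, chosen only for readability.
        for (ValueSlice slice : splitValues("quick fox|lazy dog", '|')) {
            // Real code would analyze slice.text and add it to the MemoryIndex here.
            System.out.println(slice.startOffset + ": " + slice.text);
        }
        // prints "0: quick fox" then "10: lazy dog"
    }
}
```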

Proposed CHANGES.txt in "Improvements":
* Enhanced UnifiedHighlighter's passage relevancy for queries with wildcards 
and sometimes just terms. Added shouldPreferPassageRelevancyOverSpeed() which 
can be overridden to return false to eke out more speed in some cases. (Tim 
Rodriguez, David Smiley)

> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
>                 Key: LUCENE-7526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7526
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies 
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
> MemoryIndex for producing Offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
> MemoryIndex.  Can only be used if the query distills down to terms and 
> automata.
> * TokenStream removal 
>   ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
> the memory index and then once consumed a new one was generated by 
> uninverting the MemoryIndex back into a TokenStream if there were automata 
> (wildcard/mtq queries) involved.  Now this is avoided, which should save 
> memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
> generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for 
> wildcard/mtq queries.  This should improve relevancy by providing unified 
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated 
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
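
The aggregation idea behind CompositePostingsEnum in the description above 
can be pictured with a small stdlib-only sketch.  This is an assumption-laden 
illustration, not the actual implementation: Cursor and mergeOffsets are 
hypothetical names, and plain int arrays stand in for the PostingsEnums of 
the terms a wildcard matched.  It merges the per-term start offsets into one 
ordered stream, so the wildcard is seen as a single "term" whose frequency is 
the combined count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class CompositeMergeSketch {

    /** Cursor over one matched term's start offsets, assumed sorted ascending. */
    private static final class Cursor {
        final int[] offsets;
        int index = 0;

        Cursor(int[] offsets) {
            this.offsets = offsets;
        }

        int current() {
            return offsets[index];
        }
    }

    /** K-way merge of the per-term offsets into one ascending list. */
    static List<Integer> mergeOffsets(List<int[]> perTermOffsets) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
        for (int[] offsets : perTermOffsets) {
            if (offsets.length > 0) {
                queue.add(new Cursor(offsets));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();      // cursor with the smallest current offset
            merged.add(top.current());
            top.index++;
            if (top.index < top.offsets.length) {
                queue.add(top);             // re-insert with its next offset
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up offsets for three terms that one wildcard might match.
        List<int[]> perTerm =
            Arrays.asList(new int[] {0, 40}, new int[] {12}, new int[] {25, 90});
        List<Integer> merged = mergeOffsets(perTerm);
        System.out.println(merged);         // prints [0, 12, 25, 40, 90]
        System.out.println(merged.size());  // combined frequency: prints 5
    }
}
```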



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

