[
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15662505#comment-15662505
]
David Smiley commented on LUCENE-7526:
--------------------------------------
RE Benchmarks:
Yeah it seems hard to get consistency. I even upped the iterations counts to
2000 (from 500) and saw more variability than I'd like. Nonetheless your
analysis looks good to me; it's how I would sum up my observations today.
I ran my benchmarks with "-Xms4G -Xmx4G -XX:NewRatio=1" (to reduce the effects
of GC). My machine is a MacBook Pro Retina "Late 2013", 2 GHz i7, 8GB RAM.
This index is on an external spinning disk but there is no disk activity after
warm-up because it's all in the O.S. Cache.
As I looked over things carefully, I made a few more changes; all pretty minor.
A few were in reaction to "ant precommit".
https://github.com/dsmiley/lucene-solr/commits/uh_Tim -- the test & some
javadoc issues. I also did a little bit of refactoring I hope you'll be good
with. In a couple cases I merely moved a method up or down so that the code
flows top to bottom better. Regarding Passage: I'll create a separate issue
expressly for making Passage usable by anyone customizing the highlighter.
It's not as simple as addMatch being public; there are other methods.
_What's there now is in good shape for committing._
A couple ideas occurred to me; _feel free to punt to another issue or never_:
* MultiValueTokenStream isn't needed for MemoryIndexOffsetStrategy, albeit with
a change to loop over content.split(separatorChar). MemoryIndex.addField is
overloaded to take the position increment gap. Then, MultiValueTokenStream
could move to an inner class of TokenStreamOffsetStrategy, and it wouldn't
generally be used (as it's no longer by default). That'd be nice -- keeping
the complexity over there, and it's a bit of a hack too.
* We had made OffsetsEnum & TokenStreamPostingsEnum implement Closeable to
ameliorate the ramifications of the text analysis code throwing an exception,
i.e. due to a bug. The only beneficiary of this now is
TokenStreamOffsetStrategy, which isn't the default anymore. It could be removed
to simplify things. But then again, perhaps it could be useful for those
implementing custom OffsetStrategies. I guess it should stay; there's very
little to this after all.
Proposed CHANGES.txt in "Improvements":
* Enhanced UnifiedHighlighter's passage relevancy for queries with wildcards
and sometimes just terms. Added shouldPreferPassageRelevancyOverSpeed() which
can be overridden to return false to eek out more speed in some cases. (Tim
Rodriguez, David Smiley)
> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
> Priority: Minor
> Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
> ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a
> MemoryIndex for producing Offsets
> ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a
> MemoryIndex. Can only be used if the query distills down to terms and
> automata.
> * TokenStream removal
> ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill
> the memory index and then once consumed a new one was generated by
> uninverting the MemoryIndex back into a TokenStream if there were automata
> (wildcard/mtq queries) involved. Now this is avoided, which should save
> memory and avoid a second pass over the data.
> ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid
> generating a TokenStream if automata are involved.
> ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for
> wildcard/mtq queries. This should improve relevancy by providing unified
> metrics for a wildcard across all it's term matches
> * Added a HighlightFlag for enabling the newly separated
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]