[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660328#comment-15660328 ]
Timothy M. Rodriguez commented on LUCENE-7526:
----------------------------------------------
I've merged with the changes from LUCENE-7544 and also ran some benchmarks.
(Thanks [~dsmiley] for the fix on LUCENE-7546!)
Original:
||Impl||Terms||Phrases||Wildcards||
|(search)|1.14|1.43|2.44|
|SH_A|7.36|7.49|16.37|
|UH_A|5.32|4.55|9.24|
|SH_V|4.12|4.42|8.47|
|FVH_V|3.46|2.98|7.13|
|UH_V|3.7|3.45|6.61|
|PH_P|3.76|3.45|9.6|
|UH_P|3.34|2.91|9.33|
|UH_PV|3.26|2.8|6.72|
With improvements from LUCENE-7526:
||Impl||Terms||Phrases||Wildcards||
|(search)|1.18|1.38|2.52|
|SH_A|7.98|7.53|16.62|
|UH_A|5.46|4.6|9.43|
|SH_V|4.13|4.42|8.26|
|FVH_V|3.45|3.05|6.93|
|UH_V|3.79|3.43|6.62|
|PH_P|3.82|3.47|9.4|
|UH_P|3.33|3.03|9.46|
|UH_PV|3.24|2.81|6.92|
If you disable the new option to prefer passage relevancy over speed, you'll
get the following for analysis:
||Impl||Terms||Phrases||Wildcards||
|(search)|1.1|1.43|2.44|
|UH_A|5.31|4.66|9.14|
I wasn't able to get very consistent times from the benchmarks, but it looks
like the changes keep performance close to the original while simplifying the
code and improving relevancy in the Analysis case (unless
preferPassageRelevancyOverSpeed is disabled). If that option is disabled, the
timings line up pretty closely with the originals, which is a minor speed
boost over leaving it enabled. There should also be memory savings from
avoiding re-creation of TokenStreams. That was difficult to measure, but it
could prove beneficial under memory pressure.
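To put rough numbers on "close performance", the two full tables above can be
compared row by row. This is only a minimal Python sketch: the timings are
copied from the "Original" and "With improvements" tables, and the helper
names (percent_change, compare) are made up for illustration. Assuming the
patch only touches UnifiedHighlighter code paths, the non-UH rows give a rough
idea of the run-to-run noise.

```python
# Timings copied from the tables above. Columns: Terms, Phrases, Wildcards.
ORIGINAL = {
    "SH_A": (7.36, 7.49, 16.37),
    "UH_A": (5.32, 4.55, 9.24),
    "SH_V": (4.12, 4.42, 8.47),
    "FVH_V": (3.46, 2.98, 7.13),
    "UH_V": (3.7, 3.45, 6.61),
    "PH_P": (3.76, 3.45, 9.6),
    "UH_P": (3.34, 2.91, 9.33),
    "UH_PV": (3.26, 2.8, 6.72),
}
IMPROVED = {
    "SH_A": (7.98, 7.53, 16.62),
    "UH_A": (5.46, 4.6, 9.43),
    "SH_V": (4.13, 4.42, 8.26),
    "FVH_V": (3.45, 3.05, 6.93),
    "UH_V": (3.79, 3.43, 6.62),
    "PH_P": (3.82, 3.47, 9.4),
    "UH_P": (3.33, 3.03, 9.46),
    "UH_PV": (3.24, 2.81, 6.92),
}


def percent_change(before, after):
    """Relative change in percent; positive means the second run was slower."""
    return 100.0 * (after - before) / before


def compare(original, improved):
    """Per-implementation percent change for each query type, rounded to 0.1."""
    return {
        impl: tuple(
            round(percent_change(b, a), 1)
            for b, a in zip(original[impl], improved[impl])
        )
        for impl in original
    }


if __name__ == "__main__":
    for impl, deltas in compare(ORIGINAL, IMPROVED).items():
        print(impl, deltas)
```

Given how inconsistent the timings were, differences of a few percent in
either direction should probably be read as noise rather than as a real
regression or speedup.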
I performed these benchmarks on a machine with the following configuration:
Processor: AMD Phenom II X4 960T 3.0GHz
Memory: 24GB DDR3
Disk: Crucial CT256MX SSD
OS: Windows 10
Java: Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
All of the benchmark runs reported above included the changes from
LUCENE-7544.
[~dsmiley] It looks like my older processor took significantly longer to
highlight across the board than in your initial run for LUCENE-7438. I'd be
curious how this set of changes performs on your machine now.
> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
> Priority: Minor
> Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
> ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a
> MemoryIndex for producing Offsets
> ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a
> MemoryIndex. Can only be used if the query distills down to terms and
> automata.
> * TokenStream removal
> ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill
> the memory index and then once consumed a new one was generated by
> uninverting the MemoryIndex back into a TokenStream if there were automata
> (wildcard/mtq queries) involved. Now this is avoided, which should save
> memory and avoid a second pass over the data.
> ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid
> generating a TokenStream if automata are involved.
> ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for
> wildcard/mtq queries. This should improve relevancy by providing unified
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
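
As an illustration of the CompositePostingsEnum idea in the description above,
here is a minimal Python sketch. It is not the Lucene implementation: it
assumes each term matched by a wildcard contributes a list of (start, end)
character offsets already sorted in document order, and the function name and
sample terms are hypothetical. The per-term lists are merged into one ordered
stream, and the wildcard gets a single combined frequency instead of one
frequency per matched term.

```python
import heapq


def merge_term_offsets(per_term_offsets):
    """Merge sorted per-term (start, end) offset lists into one sorted list.

    Returns the merged offsets and the combined frequency, i.e. the total
    number of matches across all terms the wildcard expanded to.
    """
    # heapq.merge does a k-way merge of inputs that are each already sorted.
    merged = list(heapq.merge(*per_term_offsets.values()))
    return merged, len(merged)


if __name__ == "__main__":
    # Hypothetical expansion of a wildcard such as "highlight*" in one document.
    offsets = {
        "highlight": [(0, 9), (40, 49)],
        "highlighter": [(20, 31)],
    }
    merged, freq = merge_term_offsets(offsets)
    print(merged, freq)
```

Treating the wildcard as one unit this way is what allows passage scoring to
see unified metrics for it, rather than separate statistics per matched term.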
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)