[jira] [Comment Edited] (LUCENE-7438) UnifiedHighlighter

David Smiley (JIRA) Wed, 07 Sep 2016 20:21:00 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470848#comment-15470848
 ]


David Smiley edited comment on LUCENE-7438 at 9/8/16 3:19 AM:
--------------------------------------------------------------

_(dsmiley: edited formatting only)_

Some additional information:

h2. Missing features & possible future improvements:
Despite this highlighter's offset source flexibility and accuracy options, the 
other highlighters still have some unique features.  The following features are 
in the standard Highlighter (and possibly FastVectorHighlighter) but are not in 
the UnifiedHighlighter (and thus not in the PostingsHighlighter either, since UH 
is derived from PH):
* Being able to disable “requireFieldMatch” so that query terms are highlighted 
regardless of which fields the query mentions.
* Using boosts in the query to weight passages.
* Regex-based passage delineation, though I’m unsure if anyone cares given the 
existing BreakIterator options available.

Aside from addressing the feature gaps listed above, there are a couple of known 
things that would be nice to add:
* The phrase highlighting (implemented by PhraseHelper) could be made more 
accurate, and probably faster too, by using techniques in [~romseygeek]'s 
[Luwak 
system|https://github.com/flaxsearch/luwak/blob/master/luwak/src/main/java/uk/co/flax/luwak/matchers/HighlightingMatcher.java]
 that uses the Lucene SpanCollector API introduced in Lucene 5.3. It wasn’t 
done this way to begin with because this highlighter was developed originally 
for Lucene 4.10.
* Wildcard queries usually use TokenStreamFromTermVector, which uninverts the 
terms out of a Terms index.  Instead, we now think it would be better to create 
a PostingsEnum for each matching term. This would bring about some 
simplifications and efficiencies, and can lead to better passage relevancy. A 
bonus would be aggregating terms matching the same automaton into a merged 
PostingsEnum that has a freq() based on the sum of the underlying matching 
terms.
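
The merged-PostingsEnum idea in the last bullet can be sketched roughly as 
follows. This is only an illustration in plain Java, not Lucene code: 
TermOffsets and MergedOffsets are hypothetical stand-ins (they are not Lucene 
classes), and each term's occurrences in one document are assumed to be given 
as a sorted array of start offsets.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for one term's occurrences in a single document. */
final class TermOffsets {
    final String term;
    final int[] startOffsets; // assumed sorted ascending, one entry per occurrence

    TermOffsets(String term, int... startOffsets) {
        this.term = term;
        this.startOffsets = startOffsets;
    }
}

/** Merges the occurrences of all terms matched by one wildcard into one stream. */
final class MergedOffsets {
    private final List<Integer> merged = new ArrayList<>();

    MergedOffsets(List<TermOffsets> matchingTerms) {
        // Each queue entry is {termIndex, indexWithinTerm}, ordered by the
        // start offset that the entry currently points at (a k-way merge).
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            Comparator.comparingInt((int[] e) -> matchingTerms.get(e[0]).startOffsets[e[1]]));
        for (int i = 0; i < matchingTerms.size(); i++) {
            if (matchingTerms.get(i).startOffsets.length > 0) {
                queue.add(new int[] {i, 0});
            }
        }
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            int[] offsets = matchingTerms.get(head[0]).startOffsets;
            merged.add(offsets[head[1]]);
            if (head[1] + 1 < offsets.length) {
                queue.add(new int[] {head[0], head[1] + 1});
            }
        }
    }

    /** Frequency of the merged stream: the sum of the underlying term frequencies. */
    int freq() {
        return merged.size();
    }

    /** All start offsets in ascending order, regardless of which term produced them. */
    List<Integer> offsets() {
        return merged;
    }
}
```

With a merged view like this, a passage scorer would see one pseudo-term per 
wildcard, with a single frequency, instead of many rare terms.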

h2. Changes from the PostingsHighlighter 
* The UH is more stateful
** Holds the IndexSearcher instead of asking most methods to pass it through.
** Options now have simple setters, and the per-field getters return these. 
This means the common case of a setting being non-specific to a field doesn’t 
require subclassing.
* Multi-valued field handling is improved to ensure that a passage will never 
span across values, plus it honors the positionIncrementGap for an analyzed 
offset source. See MultiValueTokenStream and SplittingBreakIterator.
* The PH caches all content to be highlighted for all docs and then highlights 
it all.  The UH has a limit on this, which led to a batching approach.  But if 
all fields use an Analyzer or if more than one uses term vectors, then 
highlighting instead happens one doc at a time since the up-front content 
caching is not helpful.
* No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one 
doc to the next. Dropping the re-use really simplified some code; it didn’t 
seem worth the complexity.
* MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it 
to guard against ramifications of exceptions being thrown during highlighting 
(e.g. a BreakIterator bug or TokenStream bug). Nasty to debug!
* (from standard Highlighter) TokenStreamFromTermVector: optimizations to 
uninvert filtered (thus sparse) Terms.
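
To illustrate the multi-valued field bullet above, here is a minimal sketch of 
keeping passages inside a single value. It is not the actual 
SplittingBreakIterator; ValueBoundedPassages is a hypothetical class, and it 
assumes the field's values were joined into one string with a separator 
character chosen by the caller. A sentence BreakIterator from the JDK is run 
over each value separately and its boundaries are shifted back to offsets in 
the joined string.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: sentence passages that never cross a value separator. */
final class ValueBoundedPassages {

    /** Returns [start, end) offsets into {@code joined}; none contains the separator. */
    static List<int[]> passages(String joined, char separator) {
        List<int[]> result = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        int valueStart = 0;
        while (valueStart <= joined.length()) {
            int valueEnd = joined.indexOf(separator, valueStart);
            if (valueEnd < 0) {
                valueEnd = joined.length(); // last value runs to the end of the text
            }
            // Run the delegate over this value only, then shift to global offsets.
            sentences.setText(joined.substring(valueStart, valueEnd));
            int start = sentences.first();
            for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
                result.add(new int[] {valueStart + start, valueStart + end});
                start = end;
            }
            valueStart = valueEnd + 1; // skip over the separator itself
        }
        return result;
    }
}
```

Because the delegate never sees text from two values at once, a sentence that 
lacks closing punctuation at the end of one value cannot run on into the next 
value.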

h2. Non-Core Dependencies
* MemoryIndex: For Analyzer based highlighting when phrases need to be 
highlighted accurately.
* Standard Highlighter things:
** TokenStreamFromTermVector: For most multi-term queries. The UH actually has 
its own derived copy that has been optimized to handle filtered (thus sparse) 
Terms. With further work, we could switch to a different approach and remove it 
(as indicated earlier).  For as long as it stays, it’s also possible to replace 
the existing one with this if we want to do that.
** WeightedSpanTermExtractor: For highlighting phrases accurately, re-using 
its SpanQuery conversion and rewrite-detecting abilities.  Perhaps these parts 
of WSTE could move to general SpanQuery utilities.
** TermVectorLeafReader: When highlighting offsets from term vectors.
* PostingsHighlighter things:
** Technically nothing; however, the UH has copies of some classes that have 
not been modified: Passage, PassageScorer, PassageFormatter, 
DefaultPassageFormatter.
** Note: Utility BreakIterators are of use to the PH, UH, and even the FVH: 
WholeBreakIterator, CustomSeparatorBreakIterator.  Maybe they should move to a 
utils package that isn’t in any of these highlighters?


was (Author: timothy055):
_(same text as above, before the formatting-only edit)_


> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.


