[
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470848#comment-15470848
]
David Smiley edited comment on LUCENE-7438 at 9/8/16 3:19 AM:
--------------------------------------------------------------
_(dsmiley: edited formatting only)_
Some additional information:
h2. Missing features & possible future improvements:
Despite the offset source flexibility and accuracy options of this highlighter,
some of the other highlighters still have unique features. The following
features are in the standard Highlighter (and possibly FastVectorHighlighter)
but are not in the UnifiedHighlighter (nor in the PostingsHighlighter, since
the UH is derived from the PH):
* Being able to disable “requireFieldMatch”, so that query terms are
highlighted regardless of which fields the query names.
* Using boosts in the query to weight passages.
* Regex-based passage delineation, though I’m unsure if anyone cares given the
existing BreakIterator options available.
Aside from addressing the feature gaps listed above, there are a couple of
known things that would be nice to add:
* The phrase highlighting (implemented by PhraseHelper) could be made more
accurate, and probably faster too, by using techniques in [~romseygeek]'s
[Luwak
system|https://github.com/flaxsearch/luwak/blob/master/luwak/src/main/java/uk/co/flax/luwak/matchers/HighlightingMatcher.java]
that uses the Lucene SpanCollector API introduced in Lucene 5.3. It wasn’t
done this way to begin with because this highlighter was developed originally
for Lucene 4.10.
* Wildcard queries usually use TokenStreamFromTermVector, which uninverts the
terms out of a Terms index. Instead, we now think it would be better to create
a PostingsEnum for each matching term. This would bring about some
simplifications and efficiencies, and can lead to better passage relevancy. A
bonus would be aggregating the terms matching the same automaton into a merged
PostingsEnum whose freq() is the sum of the freq() of the underlying matching
terms.
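To make the merged-postings idea above concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: TermOffsets and MergedOffsets are hypothetical stand-ins for a per-term PostingsEnum and the proposed merged one, and the offsets are made-up values. It only illustrates that the merged view interleaves the offsets of every matching term and reports a freq() equal to the sum of the per-term frequencies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for one term's postings in one document (not a Lucene class). */
final class TermOffsets {
    final String term;
    final int[] startOffsets; // start offsets of each occurrence, ascending

    TermOffsets(String term, int... startOffsets) {
        this.term = term;
        this.startOffsets = startOffsets;
    }

    /** Number of occurrences of this term in the document. */
    int freq() {
        return startOffsets.length;
    }
}

/** Hypothetical merged view over all terms matched by one automaton. */
final class MergedOffsets {
    private final List<Integer> merged = new ArrayList<>();

    MergedOffsets(List<TermOffsets> matchingTerms) {
        // Collect the offsets of every matching term, then order them by position.
        for (TermOffsets t : matchingTerms) {
            for (int offset : t.startOffsets) {
                merged.add(offset);
            }
        }
        Collections.sort(merged);
    }

    /** Sum of the frequencies of the underlying terms. */
    int freq() {
        return merged.size();
    }

    List<Integer> startOffsets() {
        return Collections.unmodifiableList(merged);
    }
}
```

For a wildcard such as highlight*, a passage scorer would then see one entry with the combined frequency instead of one entry per expanded term.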
h2. Changes from the PostingsHighlighter
* The UH is more stateful
** Holds the IndexSearcher instead of asking most methods to pass it through.
** Options now have simple setters, and the per-field getters return these.
This means the common case of a setting being non-specific to a field doesn’t
require subclassing.
* Multi-valued field handling is improved to ensure that a passage will never
span across values, plus it honors the positionIncrementGap for an analyzed
offset source. See MultiValueTokenStream and SplittingBreakIterator.
* The PH caches all content to be highlighted for all docs and then highlights
it all. The UH has a limit on this which led to a batching approach. But if
all fields use an Analyzer or if more than one uses term vectors, then instead
highlighting happens one doc at a time since the up-front content caching is
not helpful.
* No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one
doc to the next. The re-use didn’t seem worth it, and dropping it really
simplified some code.
* MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it
to guard against ramifications of exceptions being thrown during highlighting
(e.g. a BreakIterator bug or TokenStream bug). Nasty to debug!
* (from standard Highlighter) TokenStreamFromTermVector: optimizations to
uninvert filtered (thus sparse) Terms.
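The "simple setters plus per-field getters" design mentioned in the list above can be sketched as follows. This is an illustration in plain Java, not the UnifiedHighlighter API: the class names, the maxPassages option, and the field name "body" are all assumptions chosen for the example.

```java
/** Hypothetical options holder; names are illustrative, not the real UH API. */
class HighlighterOptions {
    private int maxPassages = 1;

    /** Simple setter: covers the common case of one setting for all fields. */
    public void setMaxPassages(int maxPassages) {
        this.maxPassages = maxPassages;
    }

    /** Per-field getter: by default returns the value from the setter. */
    public int getMaxPassages(String field) {
        return maxPassages;
    }
}

/** Subclassing is only needed when a field really needs its own value. */
class PerFieldOptions extends HighlighterOptions {
    @Override
    public int getMaxPassages(String field) {
        // Assumed field name "body" gets a field-specific value.
        return "body".equals(field) ? 3 : super.getMaxPassages(field);
    }
}
```

With this shape, calling setMaxPassages(2) on the base class affects every field, while the subclass overrides the value for one field and falls back to the setter for the rest.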
h2. Non-Core Dependencies
* MemoryIndex: For Analyzer-based highlighting when phrases need to be
highlighted accurately.
* Standard Highlighter things:
** TokenStreamFromTermVector: For most multi-term queries. The UH actually has
its own derived copy that has been optimized to handle filtered (thus sparse)
Terms. With further work, we could switch to a different approach and remove it
(as indicated earlier). For as long as it stays, the optimized copy could also
replace the existing one if we want to do that.
** WeightedSpanTermExtractor: For highlighting phrases accurately, to re-use
its SpanQuery conversion and rewrite-detecting abilities. Perhaps these parts
of WSTE could move to general SpanQuery utilities.
** TermVectorLeafReader: When highlighting offsets from term vectors.
* PostingsHighlighter things:
** Technically nothing; however, the UH has its own copies of some things that
have not been modified: Passage, PassageScorer, PassageFormatter,
DefaultPassageFormatter.
** Note: Utility BreakIterators are of use to the PH, UH, and even the FVH:
WholeBreakIterator, CustomSeparatorBreakIterator. Maybe they should move to a
utils package that isn’t in any of these highlighters?
was (Author: timothy055):
> UnifiedHighlighter
> ------------------
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 6.2
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is
> able to highlight using offsets in either postings, term vectors, or from
> analysis (a TokenStream). Lucene’s existing highlighters are mostly
> demarcated along offset source lines, whereas here it is unified -- hence
> this proposed name. In this highlighter, the offset source strategy is
> separated from the core highlighting functionality. The UnifiedHighlighter
> further improves on the PostingsHighlighter’s design by supporting accurate
> phrase highlighting using an approach similar to the standard highlighter’s
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset
> source strategy that utilizes postings and “light” term vectors (i.e. just the
> terms) for highlighting multi-term queries (wildcards) without resorting to
> analysis. Phrase highlighting and wildcard highlighting can both be disabled
> if you’d rather highlight a little faster albeit not as accurately reflecting
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the
> other highlighters and the results were exciting! It’s tempting to share
> those results but it’s definitely due for another benchmark, so we’ll work on
> that. Performance was the main motivator for creating the UnifiedHighlighter,
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy
> requirements) wasn’t fast enough, even with term vectors along with several
> improvements we contributed back, and even after we forked it to highlight in
> multiple threads.
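The separation of the offset source strategy from the core highlighting, as described in the issue text, can be illustrated with a small plain-Java sketch. Nothing here is the UnifiedHighlighter's actual code: OffsetSource, AnalysisOffsetSource, StoredOffsetSource, and CoreHighlighter are hypothetical names, the regex tokenizer is only a stand-in for a TokenStream, and the stored map is only a stand-in for offsets kept in postings or term vectors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical strategy: where the match offsets come from. */
interface OffsetSource {
    /** Returns [start, end) character offsets of matching terms, sorted by start. */
    List<int[]> offsets(String content, Set<String> terms);
}

/** Re-analyzes the text at highlight time (stand-in for a TokenStream). */
final class AnalysisOffsetSource implements OffsetSource {
    private static final Pattern WORD = Pattern.compile("\\w+");

    @Override
    public List<int[]> offsets(String content, Set<String> terms) {
        List<int[]> out = new ArrayList<>();
        Matcher m = WORD.matcher(content);
        while (m.find()) {
            if (terms.contains(m.group().toLowerCase(Locale.ROOT))) {
                out.add(new int[] {m.start(), m.end()});
            }
        }
        return out;
    }
}

/** Reads offsets recorded at index time (stand-in for postings or term vectors). */
final class StoredOffsetSource implements OffsetSource {
    private final Map<String, List<int[]>> stored;

    StoredOffsetSource(Map<String, List<int[]>> stored) {
        this.stored = stored;
    }

    @Override
    public List<int[]> offsets(String content, Set<String> terms) {
        List<int[]> out = new ArrayList<>();
        for (String term : terms) {
            out.addAll(stored.getOrDefault(term, Collections.<int[]>emptyList()));
        }
        out.sort((a, b) -> Integer.compare(a[0], b[0]));
        return out;
    }
}

/** Core highlighting logic: identical no matter which OffsetSource is plugged in. */
final class CoreHighlighter {
    static String highlight(String content, Set<String> terms, OffsetSource source) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] o : source.offsets(content, terms)) {
            sb.append(content, pos, o[0]).append("<b>").append(content, o[0], o[1]).append("</b>");
            pos = o[1];
        }
        return sb.append(content.substring(pos)).toString();
    }
}
```

The formatting code never needs to know whether the offsets were stored at index time or recomputed by analysis, which is the point of keeping the two concerns apart.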