[
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473810#comment-15473810
]
David Smiley commented on LUCENE-7438:
--------------------------------------
I think we can avoid some duplication confusion as follows:
For the internal classes that users don't normally use:
* *MultiTermHighlighting*: transfer most of the changes I did in
MultiTermHighlighting to the copy in the {{postingshighlight}} package --
particularly to anything that already existed there. Then make that class
public and mark it {{@lucene.internal}} so it can be accessed. That is very
low-impact on the PH. As for the couple of methods added,
{{uninvertAndFilterTerms}} and {{makeStringMatchAutomata}}, I think we can add
these to FieldOffsetStrategy and AnalysisOffsetStrategy respectively, with
comments noting that they would logically go in MultiTermHighlighting (MTH)
but don't, since that class is in a different highlighter.
* *TokenStreamFromTermVector*: I think we can replace the one in the
{{highlighter}} with this one, as the sparseness ratio is configurable in the
constructor.
For the surface classes users use (Passage, PassageScorer, PassageFormatter,
DefaultPassageFormatter): I don't think it's good to have users use parts of
another highlighter ({{postingshighlight}}); that would be weird for them. I
propose copying these with a leading 'U', e.g. {{UPassage}}. That said, if
others think that's a worse trade-off, it's no big deal to me. Once
{{o.a.l.s.ph.Passage}}'s constructor is public, it's possible to do that.
RE benchmarks... not sure when we'll have those ready but I would hope by the
end of this month. I figure using our benchmark module on wikipedia is a fine
way to go. I've used that to benchmark enhancements to the standard
highlighter before.
Thoughts (esp. from other committers)? [~rcmuir], I figure you'll have some
valuable feedback as you did most (all?) of the herculean work on the
PostingsHighlighter, which was an ideal starting point for this UH. I know
some folks who want to provide feedback are on vacation or at another
conference right now, so I'm in no hurry to commit anything.
> UnifiedHighlighter
> ------------------
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 6.2
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is
> able to highlight using offsets in either postings, term vectors, or from
> analysis (a TokenStream). Lucene’s existing highlighters are mostly
> demarcated along offset source lines, whereas here it is unified -- hence
> this proposed name. In this highlighter, the offset source strategy is
> separated from the core highlighting functionality. The UnifiedHighlighter
> further improves on the PostingsHighlighter’s design by supporting accurate
> phrase highlighting using an approach similar to the standard highlighter’s
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset
> source strategy that utilizes postings and “light” term vectors (i.e. just the
> terms) for highlighting multi-term queries (wildcards) without resorting to
> analysis. Phrase highlighting and wildcard highlighting can both be disabled
> if you’d rather highlight a little faster albeit not as accurately reflecting
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the
> other highlighters and the results were exciting! It’s tempting to share
> those results but it’s definitely due for another benchmark, so we’ll work on
> that. Performance was the main motivator for creating the UnifiedHighlighter,
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy
> requirements) wasn’t fast enough, even with term vectors along with several
> improvements we contributed back, and even after we forked it to highlight in
> multiple threads.
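
To make the offset source separation in the issue description above more concrete, here is a minimal Java sketch of how a highlighter might pick an offset source per field. Everything in it is an assumption for illustration: the class {{OffsetSourceSketch}}, the enum {{OffsetSource}}, the {{FieldCaps}} flags, and the selection order are hypothetical and are not taken from the patch or from any existing Lucene API.

```java
// Hypothetical sketch: none of these names come from the patch or from Lucene.
final class OffsetSourceSketch {

    /** Possible sources of highlight offsets, following the issue description. */
    enum OffsetSource { POSTINGS, POSTINGS_WITH_LIGHT_TERM_VECTORS, TERM_VECTORS, ANALYSIS }

    /** Assumed per-field capabilities; a real highlighter would read these from the index. */
    static final class FieldCaps {
        final boolean offsetsInPostings;  // offsets were indexed in the postings
        final boolean hasTermVectors;     // term vectors exist (possibly terms only)
        final boolean termVectorOffsets;  // term vectors also carry offsets

        FieldCaps(boolean offsetsInPostings, boolean hasTermVectors, boolean termVectorOffsets) {
            this.offsetsInPostings = offsetsInPostings;
            this.hasTermVectors = hasTermVectors;
            this.termVectorOffsets = termVectorOffsets;
        }
    }

    /** Picks the cheapest source that can serve the query; analysis is the fallback. */
    static OffsetSource choose(FieldCaps caps, boolean hasMultiTermQuery) {
        if (caps.offsetsInPostings) {
            if (!hasMultiTermQuery) {
                return OffsetSource.POSTINGS;
            }
            // Wildcards need the field's terms; "light" term vectors supply them.
            if (caps.hasTermVectors) {
                return OffsetSource.POSTINGS_WITH_LIGHT_TERM_VECTORS;
            }
        }
        if (caps.hasTermVectors && caps.termVectorOffsets) {
            return OffsetSource.TERM_VECTORS;
        }
        return OffsetSource.ANALYSIS; // re-analyze the stored text as a TokenStream
    }

    public static void main(String[] args) {
        // Postings offsets plus terms-only term vectors, with a wildcard query.
        System.out.println(choose(new FieldCaps(true, true, false), true));
    }
}
```

The point of the sketch is only that the choice of offset source is made in one place, so the passage scoring and formatting code does not need to know where the offsets came from.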