[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473810#comment-15473810
 ] 

David Smiley commented on LUCENE-7438:
--------------------------------------

I think we can avoid some duplication confusion as follows:

For the internal classes that users don't normally use:

* *MultiTermHighlighting*: transfer most of the changes I made in 
MultiTermHighlighting to the copy in the {{postingshighlight}} package -- 
particularly to anything that already existed there. Then make that class 
public and mark it lucene.internal so it can be accessed.  That is very 
low-impact on the PH. As for the couple of methods added -- 
{{uninvertAndFilterTerms}} and {{makeStringMatchAutomata}} -- I think we can 
add these to FieldOffsetStrategy and AnalysisOffsetStrategy respectively, with 
comments mentioning that they would logically go in MTH but don't, since 
that's in a different highlighter.
* *TokenStreamFromTermVector*: I think we can replace the one in the 
{{highlighter}} with this one, as the sparseness ratio is configurable in the 
constructor.

For the surface classes users use: Passage, PassageScorer, PassageFormatter, 
DefaultPassageFormatter.  I don't think it's good to have users use parts of 
another highlighter ({{postingshighlight}}); that would be weird for them.  I 
propose copying these with a leading 'U', i.e. {{UPassage}} etc.  That said, if 
others think that's a worse trade-off, it's no big deal to me.  Once 
{{o.a.l.s.ph.Passage}}'s constructor is public, it's possible to do that.

RE benchmarks... not sure when we'll have those ready but I would hope by the 
end of this month.  I figure using our benchmark module on wikipedia is a fine 
way to go.  I've used that to benchmark enhancements to the standard 
highlighter before.

Thoughts (esp. from other committers)?  [~rcmuir], I figure you'll have some 
valuable feedback as you did most (all?) of the herculean work on the 
PostingsHighlighter, which was an ideal starting point for this UH.   I know 
some folks who want to provide feedback are on vacation or at another 
conference right now, so I'm in no hurry to commit anything.

> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.
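
To illustrate the idea in the description that the offset source is separated from the core highlighting, here is a minimal Java sketch. It is not the code in the patch: every name in it ({{OffsetStrategySketch}}, {{OffsetStrategy}}, {{AnalysisStrategy}}, {{StoredStrategy}}, {{highlight}}) is hypothetical, it uses no Lucene classes, and plain substring scanning stands in for real analysis. It only shows that the marking-up code can stay the same while the source of offsets varies.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetStrategySketch {

    /** Hypothetical strategy: supplies {start, end} character offsets of matches. */
    public interface OffsetStrategy {
        List<int[]> offsets(String text, String term);
    }

    /** Stand-in for the analysis source: re-scans the text at highlight time. */
    public static class AnalysisStrategy implements OffsetStrategy {
        @Override
        public List<int[]> offsets(String text, String term) {
            List<int[]> out = new ArrayList<>();
            if (term.isEmpty()) {
                return out; // nothing to match; also avoids an endless loop
            }
            // Step past each match so the reported offsets never overlap.
            for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + term.length())) {
                out.add(new int[] {i, i + term.length()});
            }
            return out;
        }
    }

    /** Stand-in for postings or term vectors: offsets were recorded at index time. */
    public static class StoredStrategy implements OffsetStrategy {
        private final List<int[]> stored;

        public StoredStrategy(List<int[]> stored) {
            this.stored = stored;
        }

        @Override
        public List<int[]> offsets(String text, String term) {
            return stored; // no re-analysis of the text is needed
        }
    }

    /** Core highlighting: the same code path whichever strategy supplies offsets. */
    public static String highlight(String text, String term, OffsetStrategy strategy) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] o : strategy.offsets(text, term)) {
            sb.append(text, pos, o[0]);                       // text before the match
            sb.append("<b>").append(text, o[0], o[1]).append("</b>");
            pos = o[1];
        }
        return sb.append(text.substring(pos)).toString();    // trailing text
    }
}
```

In this sketch, switching from {{AnalysisStrategy}} to {{StoredStrategy}} changes only where the offsets come from; {{highlight}} itself is untouched, which is the separation the description refers to.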



