[jira] [Comment Edited] (LUCENE-7438) UnifiedHighlighter

Ryan Pedela (JIRA) Mon, 26 Sep 2016 13:27:33 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524080#comment-15524080
 ]


Ryan Pedela edited comment on LUCENE-7438 at 9/26/16 8:27 PM:
--------------------------------------------------------------

I am very happy to see this. I use Elasticsearch, and I currently use the 
[experimental highlighter 
plugin|https://github.com/wikimedia/search-highlighter] for three reasons.

1. It uses either term vectors or postings to increase performance.
2. It has fragment and sentence modes.
3. The sentence mode produces significantly better highlights than the postings 
highlighter in my experience.

I would prefer to use an official highlighter, and I am happy to see that the 
UnifiedHighlighter will take care of #1 and #2. Now I would like to talk about 
#3.

I don't know the specifics of the algorithm, but the experimental highlighter 
appears to take proximity and a keyword's position in the document into 
account. One example from memory: I had a medical research paper about 
warfarin, and the highlight returned by the postings highlighter for the search 
"warfarin" came from the references. The experimental highlighter, however, 
returned a highlight near the beginning of the paper that was a pretty good 
summary of the paper.

Both the experimental and postings highlighters also have room for 
improvement. They appear to use the same sentence fragmenter, which does not do 
a good job with abbreviations and decimal points. Would something like Stanford 
CoreNLP help?
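
One way to see how a sentence fragmenter treats abbreviations and decimal 
points is to run sample text through the JDK's sentence BreakIterator and print 
the pieces. This is only a sketch: it assumes the fragmenter is backed by 
java.text.BreakIterator (as far as I know that is the postings highlighter's 
default, but treat it as an assumption), and the class name and sample text are 
made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical demo class; not part of Lucene or Elasticsearch. */
final class SentenceSplitDemo {

    /** Splits text at the boundaries reported by the JDK sentence BreakIterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Made-up sample containing a decimal point and two abbreviations.
        String text = "Warfarin 2.5 mg was given daily. Dr. Smith et al. reported fewer events.";
        for (String s : split(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Each bracketed line in the output is one fragment candidate, so a break in the 
middle of a real sentence shows up immediately.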

I would also like a highlighter that tries to get as many keywords as possible 
into the highlight, at least as a config option. That is hard when returning 
only a single sentence or fragment. However, I often want three fragments, and 
I would like the union of the three fragments to contain all of the keywords, 
or as many as possible. For example, I am working on a search engine for SEC 
filings, and a user searched "BPL hedgings" during a user test. BPL is the 
stock ticker for Buckeye Partners, and stock tickers are pretty unique within 
the SEC filings. The experimental highlighter returned three fragments with 
"BPL" but no "hedgings" (the fast vector highlighter produced similar 
fragments). The user was very confused because they didn't see the word 
"hedgings" in the highlight and thought it wasn't found, even though it was. To 
fix this, I retrieve the top 100 fragments and post-process them to find the 
best 3 fragments, which collectively contain the most keywords. The 
post-processing is quite naive since it does not understand proximity, 
stemming, etc. I would prefer that Lucene or ES did it, because they could be 
much smarter about it.
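
The post-processing described above can be sketched as a greedy 
keyword-coverage pass. This is not my actual code, just a minimal illustration 
under stated assumptions: fragments arrive in ranked order as plain strings, 
keywords are matched exactly and case-insensitively (no stemming or proximity), 
and FragmentPicker and pickBest are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; not part of Lucene or Elasticsearch. */
final class FragmentPicker {

    /** Greedily picks up to maxFragments fragments that together cover the most keywords. */
    static List<String> pickBest(List<String> fragments, Set<String> keywords, int maxFragments) {
        Set<String> uncovered = new HashSet<>();
        for (String k : keywords) {
            uncovered.add(k.toLowerCase(Locale.ROOT));
        }
        List<String> remaining = new ArrayList<>(fragments);
        List<String> chosen = new ArrayList<>();

        while (chosen.size() < maxFragments && !remaining.isEmpty()) {
            int bestIndex = 0;
            int bestGain = -1;
            for (int i = 0; i < remaining.size(); i++) {
                int gain = countCovered(remaining.get(i), uncovered);
                // Strict '>' keeps the earliest (highest-ranked) fragment on ties.
                if (gain > bestGain) {
                    bestGain = gain;
                    bestIndex = i;
                }
            }
            String best = remaining.remove(bestIndex);
            chosen.add(best);
            uncovered.removeAll(tokens(best));
        }
        return chosen;
    }

    /** Lowercases the fragment and splits it on anything that is not a letter or digit. */
    private static Set<String> tokens(String fragment) {
        Set<String> out = new HashSet<>();
        for (String t : fragment.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Counts how many still-uncovered keywords appear in the fragment. */
    private static int countCovered(String fragment, Set<String> uncovered) {
        Set<String> hit = tokens(fragment);
        hit.retainAll(uncovered);
        return hit.size();
    }
}
```

Calling pickBest(top100, keywords, 3) returns the fragments in the order they 
were picked rather than in ranked order. Greedy selection is a common 
approximation for this kind of coverage problem and does not guarantee the best 
possible set of three.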


> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>         Attachments: LUCENE_7438_UH_benchmark.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



