Unfortunately, all of Lucene's highlighters are "approximate" in this
regard: there is no guarantee that the shown snippets, if they were a
single little document, would have matched the query.

Even the newest highlighter, PostingsHighlighter, doesn't look at
positions, so e.g. a PhraseQuery highlight could be "wrong", though
"typically" the snippets with all terms from the phrase will score
higher and be more likely to be picked in practice.
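
To make the positions point concrete, here is a minimal sketch in
plain Java. It does not use any Lucene API: the class and method names
are hypothetical, tokens are split on whitespace, and the text is the
one from the span example quoted below. It contrasts marking every
term of a phrase with marking only the tokens that sit inside an
exact, in-order occurrence of the phrase.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
public class PhraseHighlightSketch {

    /** Term-only: marks every token that appears anywhere in the phrase. */
    static String highlightTermsOnly(String text, String phrase) {
        String[] tokens = text.split("\\s+");
        Set<String> terms = new HashSet<>(Arrays.asList(phrase.split("\\s+")));
        boolean[] hit = new boolean[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            hit[i] = terms.contains(tokens[i]);
        }
        return render(tokens, hit);
    }

    /** Position-aware: marks only tokens inside an exact, in-order occurrence. */
    static String highlightPhrase(String text, String phrase) {
        String[] tokens = text.split("\\s+");
        String[] words = phrase.split("\\s+");
        boolean[] hit = new boolean[tokens.length];
        for (int start = 0; start + words.length <= tokens.length; start++) {
            boolean match = true;
            for (int j = 0; j < words.length; j++) {
                if (!tokens[start + j].equals(words[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                Arrays.fill(hit, start, start + words.length, true);
            }
        }
        return render(tokens, hit);
    }

    /** Wraps every marked token in <b>...</b> and joins with single spaces. */
    static String render(String[] tokens, boolean[] hit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(hit[i] ? "<b>" + tokens[i] + "</b>" : tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "y z x y z a";
        // prints: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> a
        System.out.println(highlightTermsOnly(text, "x y z"));
        // prints: y z <b>x</b> <b>y</b> <b>z</b> a
        System.out.println(highlightPhrase(text, "x y z"));
    }
}
```

The first method is roughly what a highlighter that ignores positions
ends up doing; the second is the behaviour a "precise" highlighter
would need for a phrase.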

Net/net I think a "precise highlighter" would be a nice addition to
Lucene, but it is a challenge. You need to turn every leaf query into
a positional query, even queries like TermQuery that normally don't
touch positions. Then you need to follow the query tree while you
highlight, so that in your first example, a OR (b AND z), having
picked a snippet or two for a, you then also go and pick a snippet or
two for the b AND z clause, and then present them both together.
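
As a rough illustration of following the query tree, here is a minimal
sketch in plain Java. It is not Lucene code: the Node interface and
the term/and/or helpers are hypothetical stand-ins for a real query
tree, tokens are split on whitespace, and it tracks only terms, not
positions. Each node reports the terms that contributed to its match,
and an AND node reports nothing if any of its clauses failed, so terms
from a non-matching branch are never highlighted.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
public class PrecisePruningSketch {

    /** A query node; returns the terms behind its match, or an empty set if it does not match. */
    interface Node {
        Set<String> matchedTerms(Set<String> docTerms);
    }

    static Node term(String t) {
        return docTerms -> {
            if (docTerms.contains(t)) {
                return Collections.singleton(t);
            }
            return Collections.emptySet();
        };
    }

    static Node and(Node... clauses) {
        return docTerms -> {
            Set<String> all = new LinkedHashSet<>();
            for (Node clause : clauses) {
                Set<String> matched = clause.matchedTerms(docTerms);
                if (matched.isEmpty()) {
                    // One required clause failed: prune the whole branch.
                    return Collections.emptySet();
                }
                all.addAll(matched);
            }
            return all;
        };
    }

    static Node or(Node... clauses) {
        return docTerms -> {
            Set<String> all = new LinkedHashSet<>();
            for (Node clause : clauses) {
                all.addAll(clause.matchedTerms(docTerms));
            }
            return all;
        };
    }

    /** Highlights only the terms that survive pruning. */
    static String highlight(String text, Node query) {
        String[] tokens = text.split("\\s+");
        Set<String> matched = query.matchedTerms(new HashSet<>(Arrays.asList(tokens)));
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(matched.contains(token) ? "<b>" + token + "</b>" : token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node query = or(term("a"), and(term("b"), term("z")));
        // prints: <b>a</b> b c
        System.out.println(highlight("a b c", query));
    }
}
```

A real implementation would also have to carry positions and offsets
up the tree, which is where most of the difficulty lies.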

It's a hard problem but it would make a great addition.


Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 14, 2014 at 7:05 PM, Steve Davids <sdav...@gmail.com> wrote:
> Hello,
>
> I have recently been given a requirement to improve document highlights
> within our system. Unfortunately, the current functionality highlights a
> best guess at the matching terms rather than the terms that actually
> produced the match. A couple of examples of issues that were found:
>
> 1. Nested boolean clause with a term that doesn't exist ANDed with a term
>    that does highlights the ignored term in the query
>      Text: a b c
>      Logical Query: a OR (b AND z)
>      Result: <b>a</b> <b>b</b> c
>      Expected: <b>a</b> b c
> 2. Nested span query doesn't maintain the proper positions and offsets
>      Text: y z x y z a
>      Logical Query: ("x y z", a) span near 10
>      Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
>      Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>
>
> I am currently using the Highlighter with a QueryScorer and a 
> SimpleSpanFragmenter. While looking through the code it looks like the entire 
> query structure is dropped in the WeightedSpanTermExtractor by just grabbing 
> any positive TermQuery and flattening them all into a simple Map which is 
> then passed on to highlight all of those terms. I believe this
> oversimplification of term extraction is the crux of the issue and needs to
> be modified in order to produce more "exact" highlights.
>
> I was brainstorming with a colleague and thought perhaps we can spin up a 
> MemoryIndex to index that one document and start performing a depth-first 
> search of all queries within the overall Lucene query graph. At that point we 
> can start querying the MemoryIndex for leaf queries and start walking back up 
> the tree, pruning branches that don't result in a search hit, which results
> in a map of actual matched query terms. This approach seems pretty painful
> but will hopefully produce better matches. I would like to see what the
> experts on the mailing list have to say about this approach. Is there a
> better way to retrieve the query terms & positions that produced the match?
> Or perhaps there is a different Highlighter implementation that should be 
> used, though our user queries are extremely complex with a lot of nested 
> queries of various types.
>
> Thanks,
>
> -Steve

