RE: Only highlight terms that caused a search hit/match

Rose, Stuart J Sun, 16 Feb 2014 10:36:24 -0800

Hi Steve, 

We leveraged the SpanQuery and Highlighting APIs in 3.5 a couple of years ago 
to do this. In order to get accurate doc hits for the types of phrases that we 
needed to support search on, we defined a phrase query syntax and then 
implemented a span query parser to create a nested structure of span operations 
that embody the query.

The test output below gives the span structure that we generate and then the 
resulting highlights for each query. 

        spanOr([text:a, spanNear([text:b, text:z], 987654321, false)])
        <B>a</B> b c

        spanNear([spanNear([text:x, text:y, text:z], 0, true), text:a], 10, false)
        y z <B>x</B> <B>y</B> <B>z</B> <B>a</B>
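
For readers who want to see why the second case highlights only the last four tokens, here is a rough, self-contained sketch. It is a toy model and not our parser or Lucene's SpanQuery classes: Span, phrase, near, render and demo are hypothetical names, tokens are split on single spaces, and overlapping spans are simply skipped, which is a simplification of real span semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model only: hypothetical types, not Lucene's SpanQuery classes. */
final class SpanHighlightSketch {

    /** Token positions [start, end) covered by one match. */
    static final class Span {
        final int start, end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Exact, in-order phrase match (comparable to an ordered near with slop 0). */
    static List<Span> phrase(List<String> tokens, String... words) {
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i + words.length <= tokens.size(); i++) {
            if (tokens.subList(i, i + words.length).equals(Arrays.asList(words))) {
                spans.add(new Span(i, i + words.length));
            }
        }
        return spans;
    }

    /** Positions of every non-overlapping pair, in either order, whose gap is at most slop. */
    static TreeSet<Integer> near(List<Span> left, List<Span> right, int slop) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (Span a : left) {
            for (Span b : right) {
                int gap = a.end <= b.start ? b.start - a.end
                        : b.end <= a.start ? a.start - b.end
                        : -1; // overlapping spans are skipped in this simplification
                if (gap >= 0 && gap <= slop) {
                    for (int p = a.start; p < a.end; p++) hits.add(p);
                    for (int p = b.start; p < b.end; p++) hits.add(p);
                }
            }
        }
        return hits;
    }

    /** Wraps only the matched positions in <B> tags. */
    static String render(List<String> tokens, TreeSet<Integer> hits) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) out.append(' ');
            out.append(hits.contains(i) ? "<B>" + tokens.get(i) + "</B>" : tokens.get(i));
        }
        return out.toString();
    }

    /** Models ("x y z", a) near 10, unordered, against the given text. */
    static String demo(String text) {
        List<String> tokens = Arrays.asList(text.split(" "));
        return render(tokens, near(phrase(tokens, "x", "y", "z"), phrase(tokens, "a"), 10));
    }

    public static void main(String[] args) {
        System.out.println(demo("y z x y z a"));
    }
}
```

Because highlights are taken from the spans that took part in a match, the leading "y z" stays unmarked: those tokens are never part of an "x y z" span.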

I'll check to see if we can make it available as a starting point for what Mike 
is suggesting.

In the meantime, I recommend verifying that each span query is created as 
intended, keeping in mind that doc hits may be 'valid', but might have matched 
for the wrong reason and therefore have mismatched highlighting. 

Stuart

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Saturday, February 15, 2014 2:54 AM
To: Lucene Users
Cc: sdav...@gmail.com
Subject: Re: Only highlight terms that caused a search hit/match

Unfortunately, all Lucene's highlighters are "approximate" in this
regard: there is no guarantee that the shown snippets, if they were a single 
little document, would have matched the query.

Even the newest highlighter, PostingsHighlighter, doesn't look at positions, 
so e.g. a PhraseQuery highlight could be "wrong", though "typically" the 
snippets with all terms from the phrase will score higher and be more likely 
to be picked in practice.

Net/net I think a "precise highlighter" would be a nice addition to Lucene, 
but it is a challenge because you need to turn every leaf query into a 
positional query, even queries like TermQuery that normally don't touch 
positions. You then need to follow the query tree while you highlight, so 
that in your first example, a OR (b AND z), having picked a snippet or two for 
a, you also go and pick a snippet or two for the b AND z clause, and then 
present them both together.
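
To make the tree-following idea concrete, here is a minimal self-contained sketch. It is a toy model and not Lucene code: Node, term, and, or and highlight are hypothetical names, tokens are split on single spaces, and positions are plain token indexes. Each node returns the positions it matched, or null when it did not match, so a failed AND branch contributes nothing to the highlights.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model only: these types are hypothetical and are not Lucene classes. */
final class PreciseHighlightSketch {

    /** A query node reports the token positions it matched, or null if it did not match. */
    interface Node {
        TreeSet<Integer> match(List<String> tokens);
    }

    /** Leaf: matches every position holding exactly this token. */
    static Node term(String text) {
        return tokens -> {
            TreeSet<Integer> hits = new TreeSet<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (tokens.get(i).equals(text)) hits.add(i);
            }
            return hits.isEmpty() ? null : hits;
        };
    }

    /** AND: every clause must match, otherwise the whole branch is pruned. */
    static Node and(Node... clauses) {
        return tokens -> {
            TreeSet<Integer> hits = new TreeSet<>();
            for (Node clause : clauses) {
                TreeSet<Integer> sub = clause.match(tokens);
                if (sub == null) return null; // one failed clause drops all sibling hits
                hits.addAll(sub);
            }
            return hits;
        };
    }

    /** OR: keeps the positions of the clauses that matched. */
    static Node or(Node... clauses) {
        return tokens -> {
            TreeSet<Integer> hits = new TreeSet<>();
            boolean any = false;
            for (Node clause : clauses) {
                TreeSet<Integer> sub = clause.match(tokens);
                if (sub != null) {
                    any = true;
                    hits.addAll(sub);
                }
            }
            return any ? hits : null;
        };
    }

    /** Wraps only the positions that survived the walk over the query tree. */
    static String highlight(String text, Node query) {
        List<String> tokens = Arrays.asList(text.split(" "));
        TreeSet<Integer> hits = query.match(tokens);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) out.append(' ');
            boolean hit = hits != null && hits.contains(i);
            out.append(hit ? "<B>" + tokens.get(i) + "</B>" : tokens.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node query = or(term("a"), and(term("b"), term("z")));
        System.out.println(highlight("a b c", query)); // prints "<B>a</B> b c"
    }
}
```

The real difficulty is doing the same thing against index positions and offsets for every query type, and then choosing snippets per clause; the sketch only shows why b must stay unhighlighted when z is absent.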

It's a hard problem but it would make a great addition.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 7:05 PM, Steve Davids <sdav...@gmail.com> wrote:
> Hello,
>
> I have recently been given a requirement to improve document highlights 
> within our system. Unfortunately, the current functionality gives more of a 
> best guess at which terms to highlight rather than the terms that actually 
> produced the match. A couple of examples of issues that were found:
>
> Nested boolean clause in which a term that doesn't exist is ANDed with a 
> term that does: the term that should be ignored is still highlighted.
> Text: a b c
> Logical Query: a OR (b AND z)
> Result: <b>a</b> <b>b</b> c
> Expected: <b>a</b> b c
>
> Nested span query doesn't maintain the proper positions and offsets.
> Text: y z x y z a
> Logical Query: ("x y z", a) span near 10
> Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
> Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>
>
> I am currently using the Highlighter with a QueryScorer and a 
> SimpleSpanFragmenter. While looking through the code it looks like the entire 
> query structure is dropped in the WeightedSpanTermExtractor by just grabbing 
> any positive TermQuery and flattening them all into a simple Map which is 
> then passed on to highlight all of those terms. I believe this 
> oversimplification of term extraction is the crux of the issue and needs to be 
> modified in order to produce more "exact" highlights.
>
> I was brainstorming with a colleague and thought perhaps we can spin up a 
> MemoryIndex to index that one document and start performing a depth-first 
> search of all queries within the overall Lucene query graph. At that point we 
> can start querying the MemoryIndex for leaf queries and start walking back up 
> the tree, pruning branches that don't result in a search hit, which results in 
> a map of actual matched query terms. This approach seems pretty painful but 
> will hopefully produce better matches. I would like to hear what the experts 
> on the mailing list have to say about this approach. Is there a 
> better way to retrieve the query terms & positions that produced the match? 
> Or perhaps there is a different Highlighter implementation that should be 
> used, though our user queries are extremely complex with a lot of nested 
> queries of various types.
>
> Thanks,
>
> -Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
