Unfortunately, all of Lucene's highlighters are "approximate" in this regard: there is no guarantee that the shown snippets, if they were a single little document, would have matched the query.
Even the newest highlighter, PostingsHighlighter, doesn't look at positions, so e.g. a PhraseQuery highlight could be "wrong", though "typically" the snippets with all terms from the phrase will score higher and be more likely to be picked in practice.

Net/net I think a "precise highlighter" would be a nice addition to Lucene, but it is a challenge because you need to turn every leaf query into a positional query, even queries like TermQuery that normally don't touch positions. Then you need to follow the query tree while you highlight, so that in your first example, a OR (b AND z), having picked a snippet or two for a, you then also go and pick a snippet or two for the b AND z clause, and then present them both together.

It's a hard problem but it would make a great addition.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 7:05 PM, Steve Davids <sdav...@gmail.com> wrote:
> Hello,
>
> I have recently been given a requirement to improve document highlights
> within our system. Unfortunately, the current functionality gives more of a
> best guess at which terms to highlight than the terms that actually
> produced the match. A couple of examples of issues that were found:
>
> Nested boolean clause with a term that doesn't exist ANDed with a term that
> does highlights the ignored term in the query
> Text: a b c
> Logical Query: a OR (b AND z)
> Result: <b>a</b> <b>b</b> c
> Expected: <b>a</b> b c
>
> Nested span query doesn't maintain the proper positions and offsets
> Text: y z x y z a
> Logical Query: ("x y z", a) span near 10
> Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
> Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>
>
> I am currently using the Highlighter with a QueryScorer and a
> SimpleSpanFragmenter.
> While looking through the code, it looks like the entire
> query structure is dropped in the WeightedSpanTermExtractor by just grabbing
> any positive TermQuery and flattening them all into a simple Map which is
> then passed on to highlight all of those terms. I believe this
> oversimplification of term extraction is the crux of the issue and needs to be
> modified in order to produce more "exact" highlights.
>
> I was brainstorming with a colleague and thought perhaps we can spin up a
> MemoryIndex to index that one document and start performing a depth-first
> search of all queries within the overall Lucene query graph. At that point we
> can start querying the MemoryIndex for leaf queries and start walking back up
> the tree, pruning branches that don't result in a search hit, which results in
> a map of actual matched query terms. This approach seems pretty painful but
> will hopefully produce better matches. I would like to see what the experts
> on the mailing list have to say about this approach, or is there a
> better way to retrieve the query terms & positions that produced the match?
> Or perhaps there is a different Highlighter implementation that should be
> used, though our user queries are extremely complex with a lot of nested
> queries of various types.
>
> Thanks,
>
> -Steve
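
For what it's worth, the depth-first pruning idea in the quoted message can be sketched without any Lucene classes at all. The toy below is only an illustration under stated assumptions: Node, term, and, or and highlight are hypothetical names (not Lucene APIs), the "document" is a whitespace-split token list rather than a MemoryIndex, and only boolean clauses are modeled (no spans or phrases). Each node returns the positions that contributed to a match, or nothing if it did not match, so an AND branch with a missing term is pruned as a whole.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

final class PreciseHighlightSketch {

    /** Hypothetical query node (not a Lucene type). */
    interface Node {
        /** Positions that contributed to a match; empty Optional if this node did not match. */
        Optional<Set<Integer>> match(List<String> tokens);
    }

    /** Leaf: matches every position where the token equals the given text. */
    static Node term(String text) {
        return tokens -> {
            Set<Integer> hits = new TreeSet<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (tokens.get(i).equals(text)) {
                    hits.add(i);
                }
            }
            if (hits.isEmpty()) {
                return Optional.empty();
            }
            return Optional.of(hits);
        };
    }

    /** AND: if any clause fails, the whole branch is pruned and contributes nothing. */
    static Node and(Node... clauses) {
        return tokens -> {
            Set<Integer> hits = new TreeSet<>();
            for (Node clause : clauses) {
                Optional<Set<Integer>> m = clause.match(tokens);
                if (!m.isPresent()) {
                    return Optional.empty();
                }
                hits.addAll(m.get());
            }
            return Optional.of(hits);
        };
    }

    /** OR: collects positions only from the clauses that actually matched. */
    static Node or(Node... clauses) {
        return tokens -> {
            Set<Integer> hits = new TreeSet<>();
            boolean any = false;
            for (Node clause : clauses) {
                Optional<Set<Integer>> m = clause.match(tokens);
                if (m.isPresent()) {
                    any = true;
                    hits.addAll(m.get());
                }
            }
            if (!any) {
                return Optional.empty();
            }
            return Optional.of(hits);
        };
    }

    /** Wraps each matched token in <b>...</b>; tokens are split on single spaces. */
    static String highlight(String text, Node query) {
        List<String> tokens = Arrays.asList(text.split(" "));
        Set<Integer> none = new TreeSet<>();
        Set<Integer> hits = query.match(tokens).orElse(none);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            String token = tokens.get(i);
            out.append(hits.contains(i) ? "<b>" + token + "</b>" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first example from the thread: a OR (b AND z) against "a b c".
        Node query = or(term("a"), and(term("b"), term("z")));
        System.out.println(highlight("a b c", query)); // prints "<b>a</b> b c"
    }
}
```

In a real implementation the leaves would presumably have to be positional queries run against the single document, which is exactly the hard part described above; this sketch only shows why keeping the tree (rather than flattening the terms into a Map) gives the expected result for the boolean case.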