Hello,

I have recently been given a requirement to improve document highlights within 
our system. Unfortunately, the current functionality makes a best guess at 
which terms to highlight rather than highlighting the terms that actually 
produced the match. A couple of examples of the issues we found:

1. A nested boolean clause in which a term that doesn’t exist is ANDed with a 
   term that does exist highlights the term that should have been ignored.
   Text: a b c
   Logical Query: a OR (b AND z)
   Result: <b>a</b> <b>b</b> c
   Expected: <b>a</b> b c

2. A nested span query doesn’t maintain the proper positions and offsets.
   Text: y z x y z a
   Logical Query: (“x y z”, a) span near 10
   Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
   Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>

I am currently using the Highlighter with a QueryScorer and a 
SimpleSpanFragmenter. Looking through the code, it appears that the 
WeightedSpanTermExtractor drops the entire query structure: it grabs every 
positive TermQuery and flattens them all into a simple Map, which is then 
passed on so that all of those terms get highlighted. I believe this 
oversimplification of term extraction is the crux of the issue and needs to be 
modified in order to produce more “exact” highlights.

I was brainstorming with a colleague, and we thought perhaps we could spin up a 
MemoryIndex to index that one document and then perform a depth-first 
traversal of all the queries within the overall Lucene query graph. We would 
query the MemoryIndex with each leaf query and then walk back up the tree, 
pruning branches that don’t produce a search hit, which would leave us with a 
map of the query terms that actually matched. This approach seems pretty 
painful but will hopefully produce better matches. I would like to hear what 
the experts on the mailing list think of this approach. Is there a better way 
to retrieve the query terms and positions that produced the match? Or perhaps 
there is a different Highlighter implementation that we should be using? Note 
that our user queries are extremely complex, with a lot of nested queries of 
various types.
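
To make the pruning idea concrete, here is a minimal sketch in plain Java. It 
is a toy model under stated assumptions and not Lucene code: Node, term, and, 
or, and matchedTerms are hypothetical names, the document is a whitespace-split 
set of tokens standing in for the MemoryIndex, and positions and span queries 
are left out entirely.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only: every type and method here is hypothetical, not a Lucene class.
final class MatchedTermsSketch {

    /** Returns the terms that matched, or null if this node did not match. */
    interface Node {
        Set<String> match(Set<String> docTerms);
    }

    /** Leaf: matches only if the document contains the term. */
    static Node term(String t) {
        return docTerms -> docTerms.contains(t) ? Collections.singleton(t) : null;
    }

    /** AND: every child must match; a single miss prunes the whole branch. */
    static Node and(Node... children) {
        return docTerms -> {
            Set<String> matched = new LinkedHashSet<>();
            for (Node child : children) {
                Set<String> m = child.match(docTerms);
                if (m == null) {
                    return null; // prune: discard terms collected from siblings
                }
                matched.addAll(m);
            }
            return matched;
        };
    }

    /** OR: keep only the terms from children that matched. */
    static Node or(Node... children) {
        return docTerms -> {
            Set<String> matched = new LinkedHashSet<>();
            boolean any = false;
            for (Node child : children) {
                Set<String> m = child.match(docTerms);
                if (m != null) {
                    any = true;
                    matched.addAll(m);
                }
            }
            return any ? matched : null;
        };
    }

    /** Depth-first evaluation of the query tree against one document. */
    static Set<String> matchedTerms(Node query, String text) {
        Set<String> docTerms = new LinkedHashSet<>(Arrays.asList(text.split("\\s+")));
        Set<String> m = query.match(docTerms);
        return m == null ? Collections.emptySet() : m;
    }

    public static void main(String[] args) {
        // First example from above: a OR (b AND z) against "a b c"
        Node query = or(term("a"), and(term("b"), term("z")));
        System.out.println(matchedTerms(query, "a b c")); // prints [a]
    }
}
```

In this sketch the (b AND z) branch is discarded as soon as z fails, so b never 
reaches the result, which is the behaviour expected in the first example. A 
real version would have to carry positions and offsets up the tree as well in 
order to handle the span case.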

Thanks,

-Steve
