Re: In-document highlighting DocValues?

Mike Sokolov Thu, 13 Oct 2011 07:24:11 -0700

Is there some reason you don't want to leverage Highlighter to do this work? It has all the necessary code for using the analyzed version of your query so it will only match tokens that really contribute to the search match.

You might also be interested in LUCENE-2878 (which is still under development on a branch though). It aims to provide first-class access to payloads and positions during scoring, and this will be very useful for complex highlighting tasks.

Another possible solution to the OCR problem could be: generate an XML file with a tag for each word encoding its x,y coords, like: <word x="3" y="10">This</word>; index that file using XmlCharFilter or HTMLStripCharFilter. Then when you search, use the Solr highlighter to highlight the entire document, and process it using XML tools to find the locations of the matches.
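
The last step of the XML approach could look roughly like the sketch below. It assumes the highlighter returns the stored XML with each match wrapped in <em>...</em> inside its <word> element; the <page> root element, the sample text and the match_locations helper are made up for illustration and are not part of Solr.

```python
import xml.etree.ElementTree as ET


def match_locations(highlighted_xml):
    """Return (text, x, y) for every <word> element that contains an <em> highlight.

    Assumes the highlighter wrapped each matched word in <em>...</em>
    inside its <word x=".." y=".."> element (an assumption, not confirmed).
    """
    root = ET.fromstring(highlighted_xml)
    hits = []
    for word in root.iter("word"):
        em = word.find("em")
        if em is not None:
            hits.append((em.text, int(word.get("x")), int(word.get("y"))))
    return hits


# Hypothetical highlighter output for a query matching "This" and "scanned".
sample = (
    '<page>'
    '<word x="3" y="10"><em>This</em></word> '
    '<word x="40" y="10">is</word> '
    '<word x="60" y="10"><em>scanned</em></word>'
    '</page>'
)

for text, x, y in match_locations(sample):
    print(text, x, y)
```

If the highlighter is configured with different pre/post tags, only the element name passed to find() would need to change.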


-Mike

On 10/10/2011 10:19 AM, Jan Høydahl wrote:

Hi,

We index structured documents, with numbered chapters, paragraphs and 
sentences. After doing a (rather complex) search, we may get multiple matches 
in each result doc. We want to highlight those matches in our front-end and 
currently we do a simple string match of the query words against the raw text.

However, this highlights some words that do not satisfy the original query, and 
also does not highlight other words where the match was in a stem, or synonym 
or wildcard. We thus need to improve this, and my plan was to utilize DocValues 
(Payloads). Would the following work?

1. For each term in the field "text", index DocValues with info about chapter#, 
paragraph#, sentence# and word#.
    This can be done in our application code, e.g. "foo|1,2,3,4" for chapter 1, 
paragraph 2, sentence 3 and word 4.

2. Then, for a specific document in the result list, retrieve a list of all matches in 
field "text", and for each match,
    retrieve the associated DocValues.

3. The client application can now use this information to highlight matches, as well as 
"jump to next match" etc,
    and would highlight the correct words only, e.g. it would be able to highlight 
"colour" even if the match was on the
    synonym "color".
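
The application-side part of steps 1 and 3 might be sketched as follows. Only the "foo|1,2,3,4" format comes from the description above; the nested-list document layout, whitespace tokenization and both function names are assumptions for illustration. Retrieving the matches in step 2 is not covered here.

```python
def encode_tokens(chapters):
    """Build "word|chapter,paragraph,sentence,word#" tokens (step 1).

    chapters is assumed to be a list of chapters, each a list of
    paragraphs, each a list of sentence strings. Numbering starts at 1.
    """
    tokens = []
    for c, paragraphs in enumerate(chapters, start=1):
        for p, sentences in enumerate(paragraphs, start=1):
            for s, sentence in enumerate(sentences, start=1):
                # Naive whitespace split; a real analyzer would differ.
                for w, word in enumerate(sentence.split(), start=1):
                    tokens.append(f"{word}|{c},{p},{s},{w}")
    return tokens


def decode_token(token):
    """Split a token back into the word and its position tuple (step 3)."""
    # rpartition keeps any "|" inside the word itself intact.
    word, _, payload = token.rpartition("|")
    chapter, paragraph, sentence, position = (int(n) for n in payload.split(","))
    return word, (chapter, paragraph, sentence, position)


doc = [[["red colour", "foo bar"]]]  # one chapter, one paragraph, two sentences
print(encode_tokens(doc))
print(decode_token("foo|1,2,3,4"))
```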

Another use case for this technique would be OCR applications where we store 
with each term its x,y offsets for where it occurs in
the original TIFF image scan.

What is already in place and what code needs to be written? I don't 
currently see how to get a complete list of matches for a particular document.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
