Re: In-document highlighting DocValues?

Michael Sokolov Sun, 16 Oct 2011 13:07:36 -0700

On 10/14/2011 7:20 PM, Jan Høydahl wrote:

Hi,


The Highlighter is way too slow for this customer's particular use case - which 
is very large documents. We don't need highlighted snippets for now, but we 
need to accurately decide what words (offsets) in the real HTML display of the 
resulting page to highlight. For this we only need offset info, not the 
snippets/fragments from the stored field.

But I have not looked at the Highlighter code. Perhaps we could fork it into a 
new search component which pulls out only the necessary meta info and payloads 
for us and returns it to client?

Jan, I've looked into this, and I believe the slowness of Highlighter doesn't have to do with constructing the snippets as much as with the analysis that is required to find the locations of matching terms in the document text, so I think your problem is basically the same as highlighting.

There seem to be basically two approaches right now. One is Highlighter, which, as you point out, is a bit slow because it basically has to re-analyze the entire document, but it does have the virtue of exactly matching the semantics of the original query. The other is FastVectorHighlighter, which works by doing some cheap mimicry of the original query: it extracts terms from the query (and also intersects them with the document, if you have a MultiTermQuery) and finds the offsets of those terms (which have to be stored in the index). It is smart enough to respect phrase boundaries, but does not support every kind of Query; however, it might be good enough, and it is quite a bit faster than Highlighter (5-10x, I think?).
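To make the contrast concrete, here is a rough sketch in plain Python. It is not Lucene code and does not use either highlighter's API: the regex "analyzer", the dictionary standing in for stored term offsets, and all function names are made up for illustration. It only shows why looking up stored offsets avoids the per-query analysis cost, and it ignores phrases and every other Query type.

```python
import re
from collections import defaultdict

# Hypothetical toy analyzer: lowercase word tokens plus character offsets.
TOKEN = re.compile(r"\w+")


def analyze(text):
    """Return (term, start, end) for every token in the text."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]


def offsets_by_reanalysis(text, query_terms):
    """Highlighter-style: re-analyze the whole document at query time."""
    wanted = {t.lower() for t in query_terms}
    return [(start, end) for term, start, end in analyze(text) if term in wanted]


def build_offset_index(text):
    """Index-time step: remember each term's offsets.

    This dictionary is only a stand-in for offsets stored in the index.
    """
    index = defaultdict(list)
    for term, start, end in analyze(text):
        index[term].append((start, end))
    return index


def offsets_by_lookup(index, query_terms):
    """FVH-style: no analysis at query time, only lookups of stored offsets."""
    hits = []
    for term in {t.lower() for t in query_terms}:
        hits.extend(index.get(term, []))
    return sorted(hits)


if __name__ == "__main__":
    text = "Fast highlighting needs offsets. Offsets make highlighting fast."
    index = build_offset_index(text)  # paid once, at index time
    print(offsets_by_reanalysis(text, ["offsets"]))  # scans the full text
    print(offsets_by_lookup(index, ["offsets"]))     # touches only matching terms
```

Both functions return the same (start, end) pairs, which is all a client needs to mark up the displayed HTML; the difference is that the first one does work proportional to the document size on every query.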

The work in LUCENE-2878 is the only thing I know of that could represent an improvement. I did some tests there, including storing character offsets as payloads, and got some additional speedup (maybe another 2x?) beyond FVH. There doesn't seem to be a lot of energy going into pushing that ahead right now though, and it requires some fundamental changes to the way that searching is done.


-Mike
