On 10/14/2011 7:20 PM, Jan Høydahl wrote:
Hi,

The Highlighter is way too slow for this customer's particular use case - which 
is very large documents. We don't need highlighted snippets for now, but we 
need to accurately decide what words (offsets) in the real HTML display of the 
resulting page to highlight. For this we only need offset info, not the 
snippets/fragments from the stored field.

But I have not looked at the Highlighter code. Perhaps we could fork it into a 
new search component which pulls out only the necessary meta info and payloads 
for us and returns it to client?

Jan

I've looked into this, and I believe the slowness of Highlighter has less to do with constructing the snippets than with the analysis required to find the locations of matching terms in the document text, so I think your problem is basically the same as highlighting.

There seem to be basically two approaches right now. One is Highlighter, which, as you point out, is a bit slow because it basically has to re-analyze the entire document, but it does have the virtue of matching the semantics of the original query exactly. The other is FastVectorHighlighter, which works by doing some cheap mimicry of the original query: it extracts terms from the query (and also intersects them with the document, if you have a MultiTermQuery) and finds the offsets of those terms (which have to be stored in the index). It is smart enough to respect phrase boundaries, but does not support every kind of Query; however, it might be good enough, and is quite a bit faster than Highlighter (5-10x I think?).
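To make the difference concrete, here is a rough sketch in plain Python (not Lucene code). The toy analyzer, the dictionary standing in for stored term-vector offsets, and all function names are assumptions made up for illustration; it only shows why looking up stored offsets avoids re-analyzing the text on every request.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): lowercase word tokens with character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def offsets_by_reanalysis(text, query_terms):
    """Highlighter-style: re-tokenize the whole document on every request."""
    wanted = {t.lower() for t in query_terms}
    return [(s, e) for term, s, e in analyze(text) if term in wanted]


def build_offset_index(text):
    """Index-time step: record each term's offsets once.

    This dict is a hypothetical stand-in for offsets stored in the index.
    """
    index = defaultdict(list)
    for term, s, e in analyze(text):
        index[term].append((s, e))
    return index


def offsets_from_index(index, query_terms):
    """FVH-style: look up stored offsets for the query terms only."""
    hits = []
    for term in {t.lower() for t in query_terms}:
        hits.extend(index.get(term, []))
    return sorted(hits)


text = "Fast highlighting of large documents needs stored offsets, not fast guesses."
index = build_offset_index(text)
```

Both routes return the same offsets for simple term queries; the second one pays the analysis cost once at index time instead of on every request. Phrase handling and MultiTermQuery expansion are left out of this sketch.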

The work in LUCENE-2878 is the only thing I know of that could represent an improvement. I did some tests there, including storing character offsets as payloads, and got some additional speedup (maybe another 2x?) beyond FVH. There doesn't seem to be a lot of energy going into pushing that ahead right now though, and it requires some fundamental changes to the way that searching is done.
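As a rough idea of what "character offsets as payloads" could look like, here is a minimal Python sketch. The fixed 8-byte layout and the function names are assumptions for illustration only; the encoding actually used in those tests is not described here.

```python
import struct


def encode_offsets(start, end):
    """Pack start/end character offsets into 8 bytes (assumed layout:
    two big-endian unsigned 32-bit integers)."""
    return struct.pack(">II", start, end)


def decode_offsets(payload):
    """Unpack a payload produced by encode_offsets back into (start, end)."""
    return struct.unpack(">II", payload)
```

With something like this attached to each term position, a client that only needs offsets could read them straight from the matching positions without touching the stored field.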

-Mike
