On 10/14/2011 7:20 PM, Jan Høydahl wrote:
Hi,
The Highlighter is way too slow for this customer's particular use case - which
is very large documents. We don't need highlighted snippets for now, but we
need to accurately decide what words (offsets) in the real HTML display of the
resulting page to highlight. For this we only need offset info, not the
snippets/fragments from the stored field.
But I have not looked at the Highlighter code. Perhaps we could fork it into a
new search component which pulls out only the necessary meta info and payloads
for us and returns it to client?
Jan, I've looked into this, and I believe the slowness of Highlighter
doesn't have to do with constructing the snippets as much as with the
analysis that is required to find the locations of matching terms in the
document text, so I think your problem is basically the same as
highlighting.
There seem to be two approaches right now. One is Highlighter, which,
as you point out, is a bit slow because it has to re-analyze the entire
document, but it does have the virtue of exactly matching the semantics
of the original query.
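
To make the cost concrete, here is a rough Python sketch of the
re-analysis idea. It is not Lucene code: the regex tokenizer and the
names analyze and match_offsets_by_reanalysis are made up for
illustration, and real query semantics are reduced to a plain set of
terms. The point is only that every token of the document gets visited
on every request.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption): lowercased word tokens with character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def match_offsets_by_reanalysis(text, query_terms):
    """Re-tokenize the whole text and keep offsets of tokens matching a query term."""
    wanted = {t.lower() for t in query_terms}
    return [(start, end)
            for term, start, end in analyze(text)
            if term in wanted]

doc = "Lucene highlights matching terms. Highlighting large documents is slow."
offsets = match_offsets_by_reanalysis(doc, ["documents", "lucene"])
matched = [doc[s:e] for s, e in offsets]
```

The work here grows with the length of the document rather than with
the number of matches, which is why very large documents hurt.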
The other is FastVectorHighlighter, which works by doing some cheap
mimicry of the original query: it extracts terms from the query
(intersecting them with the document's terms if you have a
MultiTermQuery) and looks up the offsets of those terms, which have to
be stored in the index. It is smart enough to respect phrase boundaries,
but does not support every kind of Query; however it might be good
enough, and is quite a bit faster than Highlighter (5-10x I think?).
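
For comparison, a rough Python sketch of the stored-offsets idea. Again
this is not the FastVectorHighlighter implementation: the dictionary
standing in for a term vector and the names build_term_vector,
term_offsets and phrase_offsets are assumptions for illustration. The
offsets are computed once at index time, and at query time only the
query's terms are looked up; a phrase matches when its terms sit at
consecutive positions.

```python
import re
from collections import defaultdict

def build_term_vector(text):
    """Index time: map each term to its (position, start, end) entries."""
    vector = defaultdict(list)
    for pos, m in enumerate(re.finditer(r"\w+", text)):
        vector[m.group().lower()].append((pos, m.start(), m.end()))
    return vector

def term_offsets(vector, term):
    """Query time: look up the stored character offsets of a single term."""
    return [(s, e) for _, s, e in vector.get(term.lower(), [])]

def phrase_offsets(vector, phrase_terms):
    """Offsets of spans where the phrase terms occur at consecutive positions."""
    terms = [t.lower() for t in phrase_terms]
    if not terms:
        return []
    # One {position: (start, end)} lookup table per phrase term.
    by_pos = [{pos: (s, e) for pos, s, e in vector.get(t, [])} for t in terms]
    spans = []
    for pos, (start, _) in sorted(by_pos[0].items()):
        if all(pos + i in by_pos[i] for i in range(len(terms))):
            end = by_pos[-1][pos + len(terms) - 1][1]
            spans.append((start, end))
    return spans

doc = "fast vector highlighter is fast"
vector = build_term_vector(doc)
single = [doc[s:e] for s, e in term_offsets(vector, "fast")]
phrase = [doc[s:e] for s, e in phrase_offsets(vector, ["vector", "highlighter"])]
```

The query-time cost now depends on the number of query terms and their
occurrences, not on re-reading the document text.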
The work in LUCENE-2878 is the only thing I know of that could represent
an improvement. I did some tests there including storing character
offsets as payloads and got some additional speedup (maybe another 2x?)
beyond FVH. There doesn't seem to be a lot of energy going into pushing
that ahead right now though, and it requires some fundamental changes to
the way that searching is done.
-Mike