I don't think you would see much of a gain. Shoving the TokenStream into
the MemoryIndex is actually pretty fast, and I wouldn't be surprised if
it were much faster than reading from disk. Most of the computational
time is spent reconstructing the TokenStream, whether you use
term vectors or re-analyze. Also, if the Query does not have any
position-sensitive clauses, no MemoryIndex is created, so no worries there.
The great speed challenge of the current method (other than needing a
TokenStream created) is that it runs over every Token and stitches the
document together a piece at a time. This doesn't scale well on huge
docs. There are ways to cut this down and analyze just the pertinent
Tokens, as a different patch does. However, you'd need to have
TermVectors stored, and the concept doesn't fit with the current
Highlighter framework, which already has some significant functionality
and robustness.
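
To illustrate the "just the pertinent Tokens" idea, here is a minimal
sketch in plain Java. It is not the other patch and uses no Lucene
classes. It assumes the character offsets of the matching terms are
already known (the kind of data stored term vectors with offsets could
supply), so the text is stitched together once per match rather than
once per Token. The class name, the method name, and the <B> markup are
all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model only: int[] {start, end} pairs stand in for stored term offsets. */
final class OffsetHighlightSketch {

    /** Wraps each [start, end) character range in <B>...</B>; ranges must not overlap. */
    static String highlight(String text, List<int[]> matchOffsets) {
        // Sort by start offset so the text can be copied in a single left-to-right pass.
        List<int[]> sorted = new ArrayList<>(matchOffsets);
        sorted.sort(Comparator.comparingInt((int[] range) -> range[0]));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] range : sorted) {
            out.append(text, cursor, range[0]);            // untouched text before the match
            out.append("<B>").append(text, range[0], range[1]).append("</B>");
            cursor = range[1];
        }
        out.append(text, cursor, text.length());           // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Offsets of "fox" and "quick", deliberately given out of order.
        List<int[]> offsets = Arrays.asList(new int[] {16, 19}, new int[] {4, 9});
        System.out.println(highlight("the quick brown fox", offsets));
        // prints: the <B>quick</B> brown <B>fox</B>
    }
}
```

The loop runs once per match, not once per Token, which is where the
saving on huge docs would come from under these assumptions.
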
- Mark
Marjan Celikik wrote:
Mark Miller wrote:
That is why the original contrib does not work with PhraseQuerys. It
simply matches Tokens from the query against those in the TokenStream.
LUCENE-794 takes the TokenStream and shoves it into a MemoryIndex.
Then, after converting the query to a SpanQuery approximation,
getSpans is called on the index for the query. The Spans provide a
bound on what positions should be Highlighted. Everything else is
done exactly like the original Highlighter (This is a patch that fits
into the original Highlighter framework that was developed, thereby
retaining all of its richness :) ).
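
The difference between the two approaches can be sketched with a toy
model in plain Java. This is not the LUCENE-794 code and uses no Lucene
classes: a List of strings stands in for the TokenStream, a simple
in-order scan stands in for the MemoryIndex plus getSpans, and all
class and method names are hypothetical. It only shows why term-only
matching over-highlights a phrase query while span-bounded matching
does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: plain collections stand in for TokenStream, MemoryIndex and Spans. */
final class PhraseHighlightSketch {

    /** Term-only matching: every token whose text equals a query term is a hit. */
    static Set<Integer> termOnlyHits(List<String> tokens, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            if (phrase.contains(tokens.get(pos))) {
                hits.add(pos);
            }
        }
        return hits;
    }

    /** Finds [start, end) token-position ranges where the phrase occurs exactly, in order. */
    static List<int[]> phraseSpans(List<String> tokens, List<String> phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.isEmpty()) {
            return spans;
        }
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                spans.add(new int[] {start, start + phrase.size()});
            }
        }
        return spans;
    }

    /** Span-bounded matching: keep only the term hits that fall inside some span. */
    static Set<Integer> spanBoundedHits(List<String> tokens, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        List<int[]> spans = phraseSpans(tokens, phrase);
        for (int pos : termOnlyHits(tokens, phrase)) {
            for (int[] span : spans) {
                if (pos >= span[0] && pos < span[1]) {
                    hits.add(pos);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("quick", "fox", "saw", "a", "fox", "quick");
        List<String> phrase = Arrays.asList("quick", "fox");
        System.out.println(termOnlyHits(tokens, phrase));    // prints [0, 1, 4, 5]
        System.out.println(spanBoundedHits(tokens, phrase)); // prints [0, 1]
    }
}
```

Positions 4 and 5 contain the query terms but in the wrong order, so
only the span-bounded version leaves them unhighlighted.
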
Mark, thanks for your patience! I have one final (conceptual,
high-level) question concerning the use of the MemoryIndex
over the TokenStream. Is it a good idea to store the precomputed
MemoryIndex (conceptually speaking) as a field in each document at
indexing time and then just load this precomputed index from disk
(as you do with TermVector), so that you save extra computation
for the highlighting?
Marjan.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]