Do we have a best practice for going from, say, the doc/position
information of a SpanQuery match to the actual content at that range
of positions in the Document? Is it just to reanalyze the Document
using the appropriate Analyzer and start recording once you hit the
positions you are interested in? Seems like Term Vectors _could_
help, but even my new Mapper approach patch (LUCENE-868) doesn't
really help, because they are stored in a term-centric manner. I
guess what I am after is a position-centric approach. That is, given
a Document, get a term vector (note, not a TermFreqVector) back that
is ordered by position (thus, there may be duplicate entries for a
given term that occurs in multiple positions).
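To make the position-ordered idea concrete, here is a rough sketch of inverting term-centric data into a position-centric view. It does not use any Lucene API: the Map of term to positions stands in for whatever a term vector with positions would hand back, and PositionCentricView and invert are names I made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
class PositionCentricView {

    // Turns term -> positions into position -> terms, sorted by position.
    // A term that occurs at several positions shows up once per position.
    static TreeMap<Integer, List<String>> invert(Map<String, int[]> termPositions) {
        TreeMap<Integer, List<String>> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : termPositions.entrySet()) {
            for (int position : entry.getValue()) {
                byPosition.computeIfAbsent(position, k -> new ArrayList<>())
                          .add(entry.getKey());
            }
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // Assumed input: terms in lexicographic order, as a term vector stores them.
        Map<String, int[]> termVector = new LinkedHashMap<>();
        termVector.put("brown", new int[] {2});
        termVector.put("fox", new int[] {3});
        termVector.put("quick", new int[] {1});
        termVector.put("the", new int[] {0, 4});
        System.out.println(invert(termVector));
    }
}
```

The cost is that the whole vector has to be read and inverted before any one position can be looked up, which is exactly the work I would like to avoid.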
I feel like I am missing something obvious. I would suspect the
highlighter needs to do this, but it seems to take the reanalyze
approach as well (I admit, though, that I have little experience with
the highlighter.)
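For reference, the reanalyze approach amounts to something like the sketch below. This is not the highlighter's code: a whitespace split with lowercasing stands in for the real Analyzer, every token is assumed to advance the position by one, and the end position is treated as exclusive. ReanalyzeSketch and termsInRange are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Lucene code.
class ReanalyzeSketch {

    // Re-tokenizes the stored text and records the tokens whose position
    // falls in [start, end). The split below is a stand-in for an Analyzer.
    static List<String> termsInRange(String text, int start, int end) {
        List<String> hits = new ArrayList<>();
        int position = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token
            }
            if (position >= end) {
                break; // past the range, no need to analyze further
            }
            if (position >= start) {
                hits.add(token.toLowerCase(Locale.ROOT));
            }
            position++;
        }
        return hits;
    }
}
```

Every lookup pays for analyzing the text from the beginning up to the end of the range, so the cost grows with how deep into the Document the match sits.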
I am wondering if it would be useful to have an alternative Term
Vector storage mechanism that was position-centric. Because we
couldn't take advantage of the lexicographic compression, it would
take up more disk space, but it would be a lot faster for these kinds
of things. With this kind of approach, you could easily index into
an array based on the result of Spans.start(), Spans.end(), etc. Of
course, you would have to have a data structure that handled the case
of multiple terms at one position (e.g. tokens added with a position
increment of 0), but I don't think that would be too hard, correct?
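As a rough in-memory picture of what I mean (the on-disk format is a separate question), each position could be a slot holding a list of terms, so that several terms can share a position. PositionVector, add and range are hypothetical names, and I am assuming the end of the range is exclusive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a position-centric vector, not part of Lucene.
class PositionVector {

    // slots.get(p) holds every term recorded at position p.
    private final List<List<String>> slots = new ArrayList<>();

    void add(int position, String term) {
        // Grow with empty slots so gaps in the positions stay addressable.
        while (slots.size() <= position) {
            slots.add(new ArrayList<>());
        }
        slots.get(position).add(term);
    }

    // Returns the slots for positions in [start, end), clamped to what is stored.
    List<List<String>> range(int start, int end) {
        int from = Math.max(0, start);
        int to = Math.min(slots.size(), end);
        if (from >= to) {
            return Collections.emptyList();
        }
        return new ArrayList<>(slots.subList(from, to));
    }
}
```

With that in place, a span match turns into a direct lookup, range(start, end), with no reanalysis and no scan over unrelated terms.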
Just thinking out loud...
Cheers,
Grant