Do we have a best practice for going from, say, a SpanQuery's doc/position information to the actual range of positions of content in the Document? Is it just to reanalyze the Document using the appropriate Analyzer and start recording once you hit the positions you are interested in? It seems like Term Vectors _could_ help, but even my new Mapper approach patch (LUCENE-868) doesn't really help, because the vectors are stored in a term-centric manner. I guess what I am after is a position-centric approach. That is, given a Document, get back a term vector (note, not a TermFreqVector) that is ordered by position (thus, there may be duplicate entries for a term that occurs at multiple positions).
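
To make the "ordered by position" idea concrete, here is a minimal sketch in plain Java of inverting a term-centric view (term -> positions) into a position-ordered one. It does not use any Lucene API: the class name PositionOrderedVector, the invert method, and the sample terms are all made up for illustration, and the input map only stands in for whatever a term vector would hand back.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
class PositionOrderedVector {

    /** Inverts term -> positions into position -> terms, sorted by position. */
    static TreeMap<Integer, List<String>> invert(Map<String, int[]> termPositions) {
        TreeMap<Integer, List<String>> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : termPositions.entrySet()) {
            for (int position : entry.getValue()) {
                // A term that occurs at several positions is listed once per position.
                byPosition.computeIfAbsent(position, k -> new ArrayList<>())
                          .add(entry.getKey());
            }
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // Term-centric input: terms in lexicographic order, each with its positions.
        Map<String, int[]> termCentric = new LinkedHashMap<>();
        termCentric.put("brown", new int[] {2});
        termCentric.put("fox", new int[] {3});
        termCentric.put("quick", new int[] {1});
        termCentric.put("the", new int[] {0, 4});

        TreeMap<Integer, List<String>> byPosition = invert(termCentric);

        // Terms covered by a made-up span with start 1 (inclusive) and end 4 (exclusive).
        System.out.println(byPosition.subMap(1, 4));
    }
}
```

The inversion has to visit every term and every position of the document, so doing it per hit at query time is probably not much cheaper than reanalyzing.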

I feel like I am missing something obvious. I would suspect the highlighter needs to do this, but it seems to take the reanalyze approach as well (I admit, though, that I have little experience with the highlighter).

I am wondering if it would be useful to have an alternative Term Vector storage mechanism that is position-centric. Because we couldn't take advantage of the lexicographic compression, it would take up more disk space, but it would be a lot faster for these kinds of lookups. With this kind of approach, you could easily index into an array based on the result of Spans.start(), etc. Of course, you would need a data structure that handles multiple terms per position, but I don't think that would be too hard, correct?
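
As a rough sketch of the in-memory side of this (again plain Java, not a proposed Lucene API or file format), something like the following would handle multiple terms per position. The names PositionCentricVector, add, and range are hypothetical, and I am assuming the span end is exclusive, i.e. one past the last matching position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical structure, not part of Lucene.
class PositionCentricVector {
    // slots.get(p) holds every term recorded at position p (possibly none).
    private final List<List<String>> slots = new ArrayList<>();

    /** Records a term at a position; terms sharing a position share a slot. */
    void add(int position, String term) {
        while (slots.size() <= position) {
            slots.add(new ArrayList<>()); // pad gaps, e.g. removed stop words
        }
        slots.get(position).add(term);
    }

    /** Returns the slots for positions start (inclusive) to end (exclusive). */
    List<List<String>> range(int start, int end) {
        int from = Math.max(0, start);
        int to = Math.min(slots.size(), end);
        if (from >= to) {
            return Collections.emptyList();
        }
        return slots.subList(from, to);
    }

    public static void main(String[] args) {
        PositionCentricVector vector = new PositionCentricVector();
        vector.add(0, "the");
        vector.add(1, "quick");
        vector.add(1, "fast"); // e.g. a synonym injected at the same position
        vector.add(3, "fox");  // position 2 left empty on purpose

        // Made-up span with start 1 and end 4.
        System.out.println(vector.range(1, 4));
    }
}
```

Lookup is a direct slice of the list, which is the whole appeal; the cost is that every occurrence of a term is stored again, which is the extra disk space mentioned above.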

Just thinking out loud...

Cheers,
Grant
