Best Practices for getting Strings from a position range

Grant Ingersoll Sun, 15 Jul 2007 19:18:54 -0700

Do we have a best practice for going from, say, a SpanQuery's doc/position information to retrieving the actual range of positions of content from the Document? Is it just to reanalyze the Document using the appropriate Analyzer and start recording once you hit the positions you are interested in? Seems like Term Vectors _could_ help, but even my new Mapper approach patch (LUCENE-868) doesn't really help, because they are stored in a term-centric manner. I guess what I am after is a position-centric approach. That is, given a Document, get a term vector (note, not a TermFreqVector) back that is ordered by position (thus, there may be duplicate entries for a given term that occurs in multiple positions).
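
To make the term-centric vs. position-centric distinction concrete, here is a minimal sketch in plain Java (no Lucene calls). It assumes you have already pulled a term -> positions map out of a term vector somehow; the class name PositionOrderedTerms and the method invert are hypothetical, not anything in Lucene or in the LUCENE-868 patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: turns a term-centric view
// (term -> positions) into a position-ordered view (position -> terms).
final class PositionOrderedTerms {

    // Assumption: termPositions holds, for each term, every position at
    // which that term occurs in a single field of a single document.
    static TreeMap<Integer, List<String>> invert(Map<String, int[]> termPositions) {
        TreeMap<Integer, List<String>> byPosition = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : termPositions.entrySet()) {
            for (int position : entry.getValue()) {
                // A term occurring at several positions shows up once per position.
                byPosition.computeIfAbsent(position, p -> new ArrayList<>())
                          .add(entry.getKey());
            }
        }
        return byPosition; // TreeMap iterates in ascending position order
    }
}
```

The catch is that this inversion touches every term in the document before any position can be looked up, which is exactly the cost a position-centric storage format would avoid.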

I feel like I am missing something obvious. I would suspect the highlighter needs to do this, but it seems to take the reanalyze approach as well (I admit, though, that I have little experience with the highlighter).

I am wondering if it would be useful to have an alternative Term Vector storage mechanism that was position-centric. Because we couldn't take advantage of the lexicographic compression, it would take up more disk space, but it would be a lot faster for these kinds of things. With this kind of approach, you could easily index into an array based on the result of a SpanQuery.start(), etc. Of course, you would have to have a data structure that handled the multiple terms per position option, but I don't think that would be too hard, correct?
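
For illustration only, the in-memory side of such a structure might look roughly like the sketch below. It is plain Java with no Lucene dependency; the class PositionCentricVector and its methods are made up for this example, and it assumes the span's start is inclusive and its end is exclusive (one past the last matching position).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical position-centric vector: slot i holds every term at position i.
final class PositionCentricVector {
    private final List<List<String>> termsAt = new ArrayList<>();

    // Record a term at a position. Several terms may share a position
    // (e.g. synonyms), and skipped positions are left as empty slots.
    void add(int position, String term) {
        while (termsAt.size() <= position) {
            termsAt.add(new ArrayList<>());
        }
        termsAt.get(position).add(term);
    }

    // Terms for positions in [start, end), clamped to the recorded range.
    // Assumption: start is inclusive and end is exclusive.
    List<List<String>> range(int start, int end) {
        int from = Math.max(0, start);
        int to = Math.min(end, termsAt.size());
        if (from >= to) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(termsAt.subList(from, to));
    }
}
```

With "quick" and "fast" both added at position 1 and "fox" at position 2, range(1, 3) returns two slots: one holding both position-1 terms and one holding "fox". The lookup is a direct index into the list, with no re-analysis of the Document.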


Just thinking out loud...

Cheers,
Grant
