On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
: Do we have a best practice for going from, say a SpanQuery doc/
: position information and retrieving the actual range of positions of
: content from the Document? Is it just to reanalyze the Document
: using the appropriate Analyzer and start recording once you hit the
: positions you are interested in? Seems like Term Vectors _could_
: help, but even my new Mapper approach patch (LUCENE-868) doesn't
: really help, because they are stored in a term-centric manner. I
: guess what I am after is a position centric approach. That is, give
this is kind of what i was suggesting in the last message i sent to the
java-user thread about payloads and SpanQueries (which i'm guessing is
what prompted this thread as well)...

http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
This is one use case; the other is related to the new patch I
submitted for LUCENE-960. In that case, I have a SpanQueryFilter
that identifies a bunch of docs and positions ahead of time. Then
the user enters a new SpanQuery, and I want to relate the matches from
the user query to the positions of matches in the filter and then
show that window.
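A minimal sketch of that intersection step, using plain int arrays in
place of the real Spans/filter objects (the class and method names here
are invented for illustration; this is not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

// Given the filter's precomputed position windows for a doc and the
// positions of the user-query matches, keep each window that contains
// a query match -- that window is what would be shown to the user.
class SpanWindowIntersector {
    // windows and queryMatches are {start, end} pairs, end exclusive
    static List<int[]> intersect(int[][] windows, int[][] queryMatches) {
        List<int[]> hits = new ArrayList<>();
        for (int[] m : queryMatches) {
            for (int[] w : windows) {
                if (m[0] >= w[0] && m[1] <= w[1]) {
                    hits.add(w);   // report the window containing this match
                    break;
                }
            }
        }
        return hits;
    }
}
```
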
my point was that currently, to retrieve a payload you need a
TermPositions instance, which is designed for iterating in the
order of...
seek(term)
skipTo(doc)
nextPosition()
getPayload()
...which is great for getting the payload of every instance
(ie: position) of a specific term in a given document (or in every
document) but without serious changes to the Spans API, the ideal
payload API would let you say...

skipTo(doc)
advance(startPosition)
getPayload()
while (nextPosition() < endPosition)
  getPayload()

but this seems like a nearly impossible API to implement given the
nature of the inverted index and the fact that terms aren't ever
stored in position order.
there's a lot i really don't know/understand about the lucene term
position internals ... but as i recall, the datastructure written to
disk isn't actually a tree-structure inverted index, it's a long
sequence of tuples, correct? so in theory you could scan along the
tuples until you find the doc you are interested in, ignoring all of
the term info along the way, then for whatever term you happen to be
on at the moment, you could scan along all of the positions until you
find one in the range you are interested in -- assuming you do, then
you record the current Term (and read your payload data if interested)
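A toy model of that scan, treating the postings as a flat sequence of
&lt;term, doc, positions[]&gt; records (the names and layout are invented
for illustration; this is not the actual Lucene file format):

```java
import java.util.ArrayList;
import java.util.List;

// Scan a flat sequence of <term, doc, positions[]> records: skip to
// the target doc ignoring which term each record belongs to, then
// check each record's (sorted) positions against the wanted range.
class FlatPostingsScan {
    static class Posting {
        final String term; final int doc; final int[] positions;
        Posting(String term, int doc, int[] positions) {
            this.term = term; this.doc = doc; this.positions = positions;
        }
    }

    // Returns the terms that have a position in [start, end) in doc.
    static List<String> termsInRange(List<Posting> postings,
                                     int doc, int start, int end) {
        List<String> found = new ArrayList<>();
        for (Posting p : postings) {
            if (p.doc != doc) continue;          // skip other docs
            for (int pos : p.positions) {
                if (pos >= end) break;           // positions are sorted
                if (pos >= start) { found.add(p.term); break; }
            }
        }
        return found;
    }
}
```
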
I think the main issue I see in both the payloads case and the matching
case above is that they require a document-centric approach. And then,
for each Document, you ideally want to be able to just index into an
array so that you can go directly to the position that is needed,
based on Span.getStart().
if i remember correctly, the first part of this is easy, and relatively
fast -- i think skipTo(doc) on a TermDocs or TermPositions will happily
scan for the first <term,doc> pair with the correct docId, regardless
of the term ... the only thing i'm not sure about is how efficient it
is to loop over nextPosition() for every term you find to see if any
of them are in your range ... the best case scenario is that the first
position returned is above the high end of your range, in which case
you can stop immediately and seek to the next term -- but the worst
case is that you call nextPosition() over and over a lot of times
before you get a position in (or above) your range ... an
advancePosition(pos) that worked like seek or skipTo might be helpful
here.
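A sketch of what such an advancePosition(pos) could do over one term's
sorted position list -- binary search to the first position at or above
the target instead of repeated nextPosition() calls (the class and
method are hypothetical; TermPositions has no such call):

```java
import java.util.Arrays;

// Hypothetical advancePosition: jump to the first position >= target
// in a sorted position list, analogous to skipTo over docs.
class AdvancePosition {
    // Returns the index of the first position >= target, or
    // positions.length if every position is below target.
    static int advance(int[] positions, int target) {
        int i = Arrays.binarySearch(positions, target);
        return i >= 0 ? i : -i - 1;   // insertion point when not found
    }
}
```

This turns the worst case Hoss describes (many nextPosition() calls
below the range) into a logarithmic jump, at least within one term's
position list.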
: I feel like I am missing something obvious. I would suspect the
: highlighter needs to do this, but it seems to take the reanalyze
: approach as well (I admit, though, that I have little experience
: with the highlighter.)
as i understand it the default case is to reanalyze, but if you have
TermFreqVector info stored with positions (ie: a TermPositionVector)
then it can use that to construct a TokenStream by iterating over all
terms and writing them into a big array in position order (see the
TokenSources class in the highlighter)
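A rough model of that inversion step, with the term vector reduced to a
plain map (a simplification for illustration, not the TokenSources
implementation itself): the stored data is term-centric (term ->
positions), and it gets flipped into one position-ordered array.

```java
import java.util.Map;

// Invert a term-centric vector (term -> positions) into a
// position-ordered array of terms, as the highlighter needs.
class PositionOrderedTokens {
    static String[] invert(Map<String, int[]> termToPositions, int maxPos) {
        String[] byPosition = new String[maxPos + 1];
        for (Map.Entry<String, int[]> e : termToPositions.entrySet()) {
            for (int pos : e.getValue()) {
                byPosition[pos] = e.getKey();   // one term per position here
            }
        }
        return byPosition;
    }
}
```
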
Ah, I see that now. Thanks.
this makes sense when highlighting because it doesn't know what kind
of fragmenter is going to be used so it needs the whole TokenStream,
but it seems less than ideal when you are only interested in a small
number of position ranges that you know in advance.
: I am wondering if it would be useful to have an alternative Term
: Vector storage mechanism that was position centric. Because we
: couldn't take advantage of the lexicographic compression, it would
: take up more disk space, but it would be a lot faster for these
: kinds
i'm not sure if it's really necessary to store the data in a
position-centric manner, assuming we have a way to "seek" by position
like i described above -- but then again i don't really know that what
i described above is all that possible/practical/performant.
I suppose I could use my Mapper approach to organize things in a
position-centric way, now that I think about it more. It just means
some unpacking and repacking. Still, it would probably perform well
enough since I can set up the correct structure on the fly. I will
give this a try, and maybe even add a Mapper to do this.
-Grant