Re: Best Practices for getting Strings from a position range

Grant Ingersoll Tue, 07 Aug 2007 19:38:32 -0700

Hi Peter,

Give https://issues.apache.org/jira/browse/LUCENE-975 a try. It provides a TermVectorMapper that loads by position.

Still not ideally what you want, but I haven't had time to scope that one out yet.


-Grant
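
For readers following the thread, here is a minimal sketch of what "loading by position" amounts to. It is plain Java and does not use the LUCENE-975 API; the class and method names (PositionOrderSketch, byPosition, window) are hypothetical, and the input map merely stands in for a term vector stored with positions. It repacks term-centric data into an array indexed by position, so a range of positions can be read directly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Lucene or LUCENE-975.
final class PositionOrderSketch {

    // Repack term -> positions (term-centric) into an array indexed by position.
    // Positions that no term occupies stay null. If two terms share a position,
    // whichever is visited last overwrites the other (a simplification).
    static String[] byPosition(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        String[] terms = new String[max + 1];
        for (Map.Entry<String, int[]> entry : termPositions.entrySet()) {
            for (int p : entry.getValue()) {
                terms[p] = entry.getKey();
            }
        }
        return terms;
    }

    // Terms in the half-open range [start, end), clamped to the array bounds.
    static String[] window(String[] terms, int start, int end) {
        int from = Math.max(0, start);
        int to = Math.min(terms.length, end);
        return from < to ? Arrays.copyOfRange(terms, from, to) : new String[0];
    }

    public static void main(String[] args) {
        Map<String, int[]> vector = new HashMap<>();
        vector.put("the", new int[] {0, 3});
        vector.put("quick", new int[] {1});
        vector.put("fox", new int[] {2});
        vector.put("den", new int[] {4});
        String[] terms = byPosition(vector);
        System.out.println(Arrays.toString(window(terms, 1, 4))); // [quick, fox, the]
    }
}
```

The repacking costs one pass over every position in the document, which is the "unpacking and repacking" trade-off discussed further down the thread.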

On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:

Hi Grant,
No problem - I know you are very busy. I just wanted to get a sense for the
timing because I'd like to use this for a release this Fall. If I can get a
prototype working in the coming weeks AND the performance is great :) , this
would be terrific. If not, I'll have to fall back on a more complex design
that handles the query outside of Lucene :(

In the meantime, I'll try playing with LUCENE-868.

Thanks for the update.
Peter

On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Sorry, Peter, I haven't had a chance to work on it.  I don't see it
happening this week, but maybe next.

I do think the Mapper approach via TermVectors will work.  It will
require implementing a new mapper that orders by position, but I
don't think that is too hard.   I started on one on the LUCENE-868
patch (version 4) but it is not complete. Maybe you want to pick it up?
With this approach, you would iterate your spans, when you come to a
new doc, you would load the term vector using the PositionMapper, and
then you could index into the positions for the matches in the document.
I realize this does not cover the issue of just wanting to get the
Payload at the match.  Maybe next week...

Cheers,
Grant
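
The loop described in the message above can be sketched as follows. This is an illustration under stated assumptions, not the LUCENE-868 patch: Match and extract are hypothetical names, the loader function stands in for "load the term vector with a position-ordered mapper", matches are assumed to arrive grouped by document (as span iteration does), and the end position is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical illustration only; no Lucene classes are used here.
final class SpanWindowSketch {

    // Stand-in for one span match: document id plus [start, end) positions.
    static final class Match {
        final int doc;
        final int start;
        final int end;

        Match(int doc, int start, int end) {
            this.doc = doc;
            this.start = start;
            this.end = end;
        }
    }

    // Walk the matches; load the position-ordered terms once per document,
    // then index straight into that array for every match in the document.
    static List<String> extract(List<Match> matches, IntFunction<String[]> loader) {
        List<String> windows = new ArrayList<>();
        int currentDoc = -1; // document ids are assumed to be non-negative
        String[] terms = new String[0];
        for (Match m : matches) {
            if (m.doc != currentDoc) {
                terms = loader.apply(m.doc); // the expensive step, done on doc change only
                currentDoc = m.doc;
            }
            int from = Math.max(0, m.start);
            int to = Math.min(terms.length, m.end);
            windows.add(from < to
                    ? String.join(" ", Arrays.copyOfRange(terms, from, to))
                    : "");
        }
        return windows;
    }

    public static void main(String[] args) {
        IntFunction<String[]> loader = doc ->
                doc == 0 ? new String[] {"a", "b", "c", "d"} : new String[] {"x", "y", "z"};
        List<Match> matches = Arrays.asList(
                new Match(0, 0, 2), new Match(0, 2, 4), new Match(2, 1, 3));
        System.out.println(extract(matches, loader)); // [a b, c d, y z]
    }
}
```

If the matches were not grouped by document, the same document would be reloaded each time it reappears, so a small cache keyed by document id would be needed instead of the single currentDoc check.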

On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:

> Any idea on when this might be available (days, weeks...)?
>
> Peter
>
> On 7/16/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>>
>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>>
>> >
>> > : Do we have a best practice for going from, say a SpanQuery doc/
>> > : position information and retrieving the actual range of positions of
>> > : content from the Document? Is it just to reanalyze the Document
>> > : using the appropriate Analyzer and start recording once you hit the
>> > : positions you are interested in?  Seems like Term Vectors _could_
>> > : help, but even my new Mapper approach patch (LUCENE-868) doesn't
>> > : really help, because they are stored in a term-centric manner. I
>> > : guess what I am after is a position centric approach.  That is, give
>> >
>> > this is kind of what i was suggesting in the last message i sent
>> > to the java-user thread about payloads and SpanQueries (which i'm
>> > guessing is what prompted this thread as well)...
>> >
>> > http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
>>
>>
>> This is one use case, the other is related to the new patch I
>> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
>> that identifies a bunch of docs and positions ahead of time.  Then
>> the user enters a new Span Query and I want to relate the matches from
>> the user query with the positions of matches in the filter and then
>> show that window.
>>
>> >
>> > my point was that currently, to retrieve a payload you need a
>> > TermPositions instance, which is designed for iterating in the
>> > order of...
>> >     seek(term)
>> >       skipTo(doc)
>> >          nextPosition()
>> >             getPayload()
>> > ...which is great for getting the payload of every instance
>> > (ie: position) of a specific term in a given document (or in every
>> > document) but without serious changes to the Spans API, the ideal
>> > payload API would let you say...
>> >     skipTo(doc)
>> >        advance(startPosition)
>> >          getPayload()
>> >        while (nextPosition() < endPosition)
>> >          getPosition()
>> >
>> > but this seems like a nearly impossible API to implement given the
>> > nature of the inverted index and the fact that terms aren't ever
>> > stored in position order.
>> >
>> > there's a lot i really don't know/understand about the lucene term
>> > position internals ... but as i recall, the datastructure written
>> > to disk isn't actually a tree structure inverted index, it's a long
>> > sequence of tuples correct?  so in theory you could scan along the
>> > tuples until you find the doc you are interested in, ignoring all of
>> > the term info along the way, then whatever term you happen to be on at
>> > the moment, you could scan along all of the positions until you find
>> > one in the range you are interested in -- assuming you do, then you
>> > record the current Term (and read your payload data if interested)
>>
>> I think the main issue I see in both the payloads and the matching
>> case above is that they require a document centric approach.  And
>> then, for each Document, you ideally want to be able to just index
>> into an array so that you can go directly to the position that is
>> needed based on Span.getStart()
>>
>> >
>> > if i remember correctly, the first part of this is easy, and
>> > relatively fast -- i think skipTo(doc) on a TermDocs or TermPositions
>> > will happily scan for the first <term,doc> pair with the correct
>> > docId, regardless of the term ... the only thing i'm not sure about
>> > is how efficient it is to loop over nextPosition() for every term you
>> > find to see if any of them are in your range ... the best case
>> > scenario is that the first position returned is above the high end of
>> > your range, in which case you can stop immediately and seek to the
>> > next term -- but the worst case is that you call nextPosition() over
>> > and over a lot of times before you get a position in (or above) your
>> > range .... an advancePosition(pos) that worked like seek or skipTo
>> > might be helpful here.
>> >
>> > : I feel like I am missing something obvious. I would suspect the
>> > : highlighter needs to do this, but it seems to take the reanalyze
>> > : approach as well (I admit, though, that I have little experience
>> > : with the highlighter.)
>> >
>> > as i understand it the default case is to reanalyze, but if you have
>> > TermFreqVector info stored with positions (ie: a TermPositionVector)
>> > then it can use that to construct a TokenStream by iterating over all
>> > terms and writing them into a big array in position order (see the
>> > TokenSources class in the highlighter)
>>
>>
>> Ah, I see that now.  Thanks.
>> >
>> > this makes sense when highlighting because it doesn't know what kind
>> > of fragmenter is going to be used so it needs the whole TokenStream,
>> > but it seems less than ideal when you are only interested in a small
>> > number of position ranges that you know in advance.
>> >
>> > : I am wondering if it would be useful to have an alternative Term
>> > : Vector storage mechanism that was position centric. Because we
>> > : couldn't take advantage of the lexicographic compression, it would
>> > : take up more disk space, but it would be a lot faster for these
>> > : kinds
>> >
>> > i'm not sure if it's really necessary to store the data in a position
>> > centric manner, assuming we have a way to "seek" by position like i
>> > described above -- but then again i don't really know that what i
>> > described above is all that possible/practical/performant.
>> >
>>
>> I suppose I could use my Mapper approach to organize things in a
>> position centric way now that I think about it more. Just means some
>> unpacking and repacking. Still, probably would perform well enough
>> since I can set up the correct structure on the fly. I will give this
>> a try.  Maybe even add a Mapper to do this.
>>
>>
>> -Grant
>>
>>---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


