Re: Best Practices for getting Strings from a position range

Peter Keegan Fri, 10 Aug 2007 15:27:54 -0700

ok, glad we're on the same page.

I did some performance testing with span queries and, unfortunately, the
results are discouraging for my intended use. When I added a simple
SpanNearQuery to existing queries, the throughput decreased by a factor of
10+. I figured spans would be expensive, but not that much. I haven't done
profiling yet, but here's a typical thread stack during execution:


at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:137)
at org.apache.lucene.util.PriorityQueue.adjustTop(PriorityQueue.java:101)
at org.apache.lucene.search.spans.NearSpansUnordered.next(NearSpansUnordered.java:128)
at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(SpanScorer.java:83)
at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57)
at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:56)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.Searcher.search(Searcher.java:118)

Peter


On 8/10/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Sorry for the confusion.  I thought you just wanted access to the
> term info per position.  I think we will have to add something to
> the Spans like we talked about before.
>
> -Grant
>
> On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
>
> > Grant,
> >
> > I'm afraid I don't understand how to use this mapper in the context of a
> > SpanQuery. It seems like I would have to modify SpanScorer to fetch payload
> > data and provide a new method to access the payloads while iterating through
> > the documents. If this can be accomplished without modifying Spans, could
> > you provide a bit more detail?
> >
> > Thanks,
> > Peter
> >
> > On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi Grant,
> >>
> >> I'm hoping to check this out soon.
> >>
> >> Thanks,
> >> Peter
> >>
> >> On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> >>> provides a TermVectorMapper that loads by position.
> >>>
> >>> Still not ideally what you want, but I haven't had time to scope
> >>> that one out yet.
> >>>
> >>> -Grant
> >>>
> >>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >>>
> >>>> Hi Grant,
> >>>>
> >>>> No problem - I know you are very busy.  I just wanted to get a sense
> >>>> for the timing because I'd like to use this for a release this Fall.
> >>>> If I can get a prototype working in the coming weeks AND the
> >>>> performance is great :) , this would be terrific. If not, I'll have to
> >>>> fall back on a more complex design that handles the query outside of
> >>>> Lucene :(
> >>>>
> >>>> In the meantime, I'll try playing with LUCENE-868.
> >>>>
> >>>> Thanks for the update.
> >>>> Peter
> >>>>
> >>>> On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> >>>>>
> >>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
> >>>>> happening this week, but maybe next.
> >>>>>
> >>>>> I do think the Mapper approach via TermVectors will work.  It will
> >>>>> require implementing a new mapper that orders by position, but I
> >>>>> don't think that is too hard.  I started on one on the LUCENE-868
> >>>>> patch (version 4) but it is not complete.  Maybe you want to pick
> >>>>> it up?
> >>>>>
> >>>>> With this approach, you would iterate your spans; when you come to a
> >>>>> new doc, you would load the term vector using the PositionMapper, and
> >>>>> then you could index into the positions for the matches in the
> >>>>> document.
> >>>>>
> >>>>> I realize this does not cover the "just wanting to get the Payload
> >>>>> at the match" issue.  Maybe next week...
> >>>>>
> >>>>> Cheers,
> >>>>> Grant
> >>>>>
> >>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> >>>>>
> >>>>>> Any idea on when this might be available (days, weeks...)?
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> On 7/16/07, Grant Ingersoll < [EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> : Do we have a best practice for going from, say a SpanQuery doc/
> >>>>>>>> : position information and retrieving the actual range of positions of
> >>>>>>>> : content from the Document?  Is it just to reanalyze the Document
> >>>>>>>> : using the appropriate Analyzer and start recording once you hit the
> >>>>>>>> : positions you are interested in?  Seems like Term Vectors _could_
> >>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868) doesn't
> >>>>>>>> : really help, because they are stored in a term-centric manner.  I
> >>>>>>>> : guess what I am after is a position centric approach.  That is, give
> >>>>>>>>
> >>>>>>>> this is kind of what i was suggesting in the last message i sent
> >>>>>>>> to the java-user thread about payloads and SpanQueries (which i'm
> >>>>>>>> guessing is what prompted this thread as well)...
> >>>>>>>>
> >>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
> >>>>>>>
> >>>>>>>
> >>>>>>> This is one use case; the other is related to the new patch I
> >>>>>>> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> >>>>>>> that identifies a bunch of docs and positions ahead of time.  Then
> >>>>>>> the user enters a new Span Query and I want to relate the matches
> >>>>>>> from the user query with the positions of matches in the filter and
> >>>>>>> then show that window.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> my point was that currently, to retrieve a payload you need a
> >>>>>>>> TermPositions instance, which is designed for iterating in the
> >>>>>>>> order of...
> >>>>>>>>     seek(term)
> >>>>>>>>       skipTo(doc)
> >>>>>>>>          nextPosition()
> >>>>>>>>             getPayload()
> >>>>>>>> ...which is great for getting the payload of every instance
> >>>>>>>> (ie:position) of a specific term in a given document (or in every
> >>>>>>>> document) but without serious changes to the Spans API, the ideal
> >>>>>>>> payload API would let you say...
> >>>>>>>>     skipTo(doc)
> >>>>>>>>        advance(startPosition)
> >>>>>>>>          getPayload()
> >>>>>>>>        while (nextPosition() < endPosition)
> >>>>>>>>          getPosition()
> >>>>>>>>
> >>>>>>>> but this seems like a nearly impossible API to implement given the
> >>>>>>>> nature of the inverted index and the fact that terms aren't ever
> >>>>>>>> stored in position order.
> >>>>>>>>
> >>>>>>>> there's a lot i really don't know/understand about the lucene term
> >>>>>>>> position internals ... but as i recall, the data structure written
> >>>>>>>> to disk isn't actually a tree structure inverted index, it's a long
> >>>>>>>> sequence of tuples correct?  so in theory you could scan along the
> >>>>>>>> tuples until you find the doc you are interested in, ignoring all of
> >>>>>>>> the term info along the way, then whatever term you happen to be on
> >>>>>>>> at the moment, you could scan along all of the positions until you
> >>>>>>>> find one in the range you are interested in -- assuming you do, then
> >>>>>>>> you record the current Term (and read your payload data if
> >>>>>>>> interested)
> >>>>>>>
> >>>>>>> I think the main issue I see in both the payloads and the matching
> >>>>>>> case above is that they require a document centric approach.  And
> >>>>>>> then, for each Document, you ideally want to be able to just index
> >>>>>>> into an array so that you can go directly to the position that is
> >>>>>>> needed based on Span.getStart()
> >>>>>>>
> >>>>>>>>
> >>>>>>>> if i remember correctly, the first part of this is easy, and
> >>>>>>>> relatively fast -- i think skipTo(doc) on a TermDocs or
> >>>>>>>> TermPositions will happily scan for the first <term,doc> pair with
> >>>>>>>> the correct docId, regardless of the term ... the only thing i'm not
> >>>>>>>> sure about is how efficient it is to loop over nextPosition() for
> >>>>>>>> every term you find to see if any of them are in your range ... the
> >>>>>>>> best case scenario is that the first position returned is above the
> >>>>>>>> high end of your range, in which case you can stop immediately and
> >>>>>>>> seek to the next term -- but the worst case is that you call
> >>>>>>>> nextPosition() over and over a lot of times before you get a
> >>>>>>>> position in (or above) your range .... an advancePosition(pos) that
> >>>>>>>> worked like seek or skipTo might be helpful here.
> >>>>>>>>
> >>>>>>>> : I feel like I am missing something obvious.  I would suspect the
> >>>>>>>> : highlighter needs to do this, but it seems to take the reanalyze
> >>>>>>>> : approach as well (I admit, though, that I have little experience with
> >>>>>>>> : the highlighter.)
> >>>>>>>>
> >>>>>>>> as i understand it the default case is to reanalyze, but if you have
> >>>>>>>> TermFreqVector info stored with positions (ie: a TermPositionVector)
> >>>>>>>> then it can use that to construct a TokenStream by iterating over
> >>>>>>>> all terms and writing them into a big array in position order (see
> >>>>>>>> the TokenSources class in the highlighter)
> >>>>>>>
> >>>>>>>
> >>>>>>> Ah, I see that now.  Thanks.
> >>>>>>>>
> >>>>>>>> this makes sense when highlighting because it doesn't know what
> >>>>>>>> kind of fragmenter is going to be used so it needs the whole
> >>>>>>>> TokenStream, but it seems less than ideal when you are only
> >>>>>>>> interested in a small number of position ranges that you know in
> >>>>>>>> advance.
> >>>>>>>>
> >>>>>>>> : I am wondering if it would be useful to have an alternative Term
> >>>>>>>> : Vector storage mechanism that was position centric.  Because we
> >>>>>>>> : couldn't take advantage of the lexicographic compression, it would
> >>>>>>>> : take up more disk space, but it would be a lot faster for these kinds
> >>>>>>>>
> >>>>>>>> i'm not sure if it's really necessary to store the data in a
> >>>>>>>> position centric manner, assuming we have a way to "seek" by
> >>>>>>>> position like i described above -- but then again i don't really
> >>>>>>>> know that what i described above is all that
> >>>>>>>> possible/practical/performant.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I suppose I could use my Mapper approach to organize things in a
> >>>>>>> position centric way now that I think about it more.  Just means
> >>>>>>> some unpacking and repacking.  Still, it probably would perform well
> >>>>>>> enough since I can set up the correct structure on the fly.  I will
> >>>>>>> give this a try.  Maybe even add a Mapper to do this.
> >>>>>>>
> >>>>>>>
> >>>>>>> -Grant
> >>>>>>>
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> ------------------------------------------------------
> >>>>> Grant Ingersoll
> >>>>> http://www.grantingersoll.com/
> >>>>> http://lucene.grantingersoll.com
> >>>>> http://www.paperoftheweek.com/
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>> http://lucene.grantingersoll.com
> >>>
> >>> Lucene Helpful Hints:
> >>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >>> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>
> >>>
> >>>
> >>>
> >>
>
>
>
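
For anyone following the thread, the position-centric idea Grant describes
(load the term vector once per document, reorder it by position, then index
into it with the start and end of each span match) could be sketched roughly
as below. This is plain Java with no Lucene calls. The class name
PositionCentricSketch, the methods invertByPosition and termsInRange, and the
Map<String, int[]> used as a stand-in for a term vector with positions are all
assumptions made for illustration; none of them come from the LUCENE-868 or
LUCENE-975 patches. The end of the range is treated as exclusive here, which
is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of Lucene or of any patch named in the thread.
class PositionCentricSketch {

    // Turns a term-centric view (term -> positions) into a position-centric
    // array where slot i holds the term seen at position i, or null for a gap.
    // Limitation: if two terms share a position, the later map entry wins.
    static String[] invertByPosition(Map<String, int[]> termToPositions) {
        int maxPos = -1;
        for (int[] positions : termToPositions.values()) {
            for (int p : positions) {
                maxPos = Math.max(maxPos, p);
            }
        }
        String[] byPosition = new String[maxPos + 1];
        for (Map.Entry<String, int[]> entry : termToPositions.entrySet()) {
            for (int p : entry.getValue()) {
                byPosition[p] = entry.getKey();
            }
        }
        return byPosition;
    }

    // Returns the terms at positions start (inclusive) to end (exclusive),
    // clamped to the array bounds and skipping gaps.
    static List<String> termsInRange(String[] byPosition, int start, int end) {
        List<String> terms = new ArrayList<>();
        int from = Math.max(0, start);
        int to = Math.min(end, byPosition.length);
        for (int i = from; i < to; i++) {
            if (byPosition[i] != null) {
                terms.add(byPosition[i]);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Term-centric input for the text "the quick brown fox the".
        Map<String, int[]> termVector = new LinkedHashMap<>();
        termVector.put("brown", new int[] {2});
        termVector.put("fox", new int[] {3});
        termVector.put("quick", new int[] {1});
        termVector.put("the", new int[] {0, 4});

        // Build the array once per document, then reuse it for every match.
        String[] byPosition = invertByPosition(termVector);
        System.out.println(termsInRange(byPosition, 1, 4)); // [quick, brown, fox]
    }
}
```

The inversion costs one pass over every position in the document, so it only
pays off when it is done once per document and reused for all matches in that
document, which is the "unpacking and repacking" trade-off mentioned above.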
