Re: Best Practices for getting Strings from a position range

Peter Keegan Wed, 15 Aug 2007 07:47:05 -0700

Grant,

I built an index as described here:
http://www.nabble.com/SpanQuery-and-database-join-tf4262902.html


Many documents have only 1 or 2 rows, some have dozens.
Here is a typical query without spans:

+((+contents:quaker +contents:cereal) (+boost50:quaker +boost50:cereal))
+literals:co$us), sort=<custom:"feedbabe":
[EMAIL PROTECTED]>,"dateactiveR"!


Here is a typical query with spans:

+spanNear([adliterals:jb$1, adliterals:co$us], 8, false)
+(+((+contents:quaker +contents:cereal) (+boost50:quaker +boost50:cereal))
+literals:co$us), sort=<custom:"feedbabe":
[EMAIL PROTECTED]>,"dateactiveR"!

The addition of the spanNear clause caused the 10X decrease in throughput. I
could probably change the way rows are indexed and use ordered terms, which
seems to be a bit faster (only 5X decrease)

Peter


On 8/14/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Hi Peter,
>
> Could you give more details on this test?  What are you comparing,
> etc.?  Sample queries would be good.  I would like to write up a
> contrib/Benchmark algorithm to begin investigating this and see if
> there is anything that can be done.
>
> Thanks,
> Grant
>
> On Aug 10, 2007, at 6:27 PM, Peter Keegan wrote:
>
> > ok, glad we're on the same page.
> >
> > I did some performance testing with span queries and,
> > unfortunately,  the
> > results are discouraging for my intended use. When I added a simple
> > SpanNearQuery to existing queries, the throughput decreased by a
> > factor of
> > 10+. I figured spans would be expensive, but not that much. I
> > haven't done
> > profiling yet, but here's a typical thread stack during execution:
> >
> > at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:
> > 137)
> > at org.apache.lucene.util.PriorityQueue.adjustTop
> > (PriorityQueue.java:101)
> > at org.apache.lucene.search.spans.NearSpansUnordered.next(
> > NearSpansUnordered.java:128)
> > at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(
> > SpanScorer.java:83)
> > at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57)
> > at org.apache.lucene.search.ConjunctionScorer.next
> > (ConjunctionScorer.java
> > :56)
> > at org.apache.lucene.search.BooleanScorer2.score
> > (BooleanScorer2.java:327)
> > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:
> > 146)
> > at org.apache.lucene.search.Searcher.search(Searcher.java:118)
> >
> > Peter
> >
> >
> > On 8/10/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>
> >> Sorry for the confusion.  I thought you just wanted access to the
> >> term info per position.    I think we will have to add something to
> >> the Spans like we talked about before.
> >>
> >> -Grant
> >>
> >> On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
> >>
> >>> Grant,
> >>>
> >>> I'm afraid I don't understand how to use this mapper in the context
> >>> of a
> >>> SpanQuery. It seems like I would have to modify SpanScorer to fetch
> >>> payload
> >>> data and provide a new method to access the payloads while
> >>> iterating through
> >>> the documents. If this can be accomplished without modifying Spans,
> >>> could
> >>> you provide a bit more detail?
> >>>
> >>> Thanks,
> >>> Peter
> >>>
> >>> On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>> Hi Grant,
> >>>>
> >>>> I'm hoping to check this out soon.
> >>>>
> >>>> Thanks,
> >>>> Peter
> >>>>
> >>>> On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> >>>>>
> >>>>> Hi Peter,
> >>>>>
> >>>>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> >>>>> provides a TermVectorMapper that loads by position.
> >>>>>
> >>>>> Still not what ideally what you want, but I haven't had time to
> >>>>> scope
> >>>>> that one out yet.,
> >>>>>
> >>>>> -Grant
> >>>>>
> >>>>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >>>>>
> >>>>>> Hi Grant,
> >>>>>>
> >>>>>> No problem - I know you are very busy.  I just wanted to get a
> >>>>>> sense for the
> >>>>>> timing because I'd like to use this for a release this Fall. If I
> >>>>>> can get a
> >>>>>> prototype working in the coming weeks AND the performance is
> >>>>>> great :) , this
> >>>>>> would be terrific. If not, I'll have to fall back on a more
> >>>>>> complex
> >>>>>> design
> >>>>>> that handles the query outside of Lucene :(
> >>>>>>
> >>>>>> In the meantime, I'll try playing with LUCENE-868.
> >>>>>>
> >>>>>> Thanks for the update.
> >>>>>> Peter
> >>>>>>
> >>>>>> On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> >>>>>>>
> >>>>>>> Sorry, Peter, I haven't had a chance to work on it.  I don't
> >>>>>>> see it
> >>>>>>> happening this week, but maybe next.
> >>>>>>>
> >>>>>>> I do think the Mapper approach via TermVectors will work.  It
> >>>>>>> will
> >>>>>>> require implementing a new mapper that orders by position, but I
> >>>>>>> don't think that is too hard.   I started on one on the
> >>>>>>> LUCENE-868
> >>>>>>> patch (version 4) but it is not complete.  Maybe you want to
> >>>>>>> pick
> >>>>>>> it up?
> >>>>>>>
> >>>>>>> With this approach, you would iterate your spans, when you come
> >>>>>>> to a
> >>>>>>> new doc, you would load the term vector using the
> >>>>>>> PositionMapper, and
> >>>>>>> then you could index into the positions for the matches in the
> >>>>>>> document.
> >>>>>>>
> >>>>>>> I realize this does not cover the just wanting to get the
> >>>>>>> Payload at
> >>>>>>> the match issue.  Maybe next week...
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Grant
> >>>>>>>
> >>>>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> >>>>>>>
> >>>>>>>> Any idea on when this might be available (days, weeks...)?
> >>>>>>>>
> >>>>>>>> Peter
> >>>>>>>>
> >>>>>>>> On 7/16/07, Grant Ingersoll < [EMAIL PROTECTED]> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> : Do we have a best practice for going from, say a SpanQuery
> >>>>>>> doc/
> >>>>>>>>>> : position information and retrieving the actual range of
> >>>>>>>>> positions of
> >>>>>>>>>> : content from the Document?  Is it just to reanalyze the
> >>>>>>> Document
> >>>>>>>>>> : using the appropriate Analyzer and start recording once you
> >>>>>>>>> hit the
> >>>>>>>>>> : positions you are interested in?    Seems like Term Vectors
> >>>>>>>>> _could_
> >>>>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868)
> >>>>>>> doesn't
> >>>>>>>>>> : really help, because they are stored in a term-centric
> >>>>>>> manner.  I
> >>>>>>>>>> : guess what I am after is a position centric approach.  That
> >>>>>>>>> is, give
> >>>>>>>>>>
> >>>>>>>>>> this is kind of what i was suggesting in the last message i
> >>>>>>>>>> sent
> >>>>>>>>>> to the java-user thread about paylods and SpanQueries (which
> >>>>>>>>>> i'm
> >>>>>>>>>> guessing is what prompted this thread as well)...
> >>>>>>>>>>
> >>>>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-
> >>>>>>>>>> tf3988826.html#a11551628
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This is one use case, the other is related to the new patch I
> >>>>>>>>> submitted for LUCENE-960.  In this case, I have a
> >>>>>>>>> SpanQueryFilter
> >>>>>>>>> that identifies a bunch of docs and positions ahead of time.
> >>>>>>>>> Then
> >>>>>
> >>>>>>>>> the user enters new Span Query and I want to relate the
> >>>>>>>>> matches
> >>>>>>> from
> >>>>>>>>> the user query with the positions of matches in the filter and
> >>>>>>> then
> >>>>>>>>> show that window.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> my point was that currently, to retrieve a payload you need a
> >>>>>>>>>> TermPositions instance, which is designed for iterating in
> >>>>>>>>>> the
> >>>>>>>>>> order of...
> >>>>>>>>>>     seek(term)
> >>>>>>>>>>       skipTo(doc)
> >>>>>>>>>>          nextPosition()
> >>>>>>>>>>             getPayload()
> >>>>>>>>>> ...which is great for getting the payload of every instance
> >>>>>>>>>> (ie:position) of a specific term in a given document (or in
> >>>>>>> every
> >>>>>>>>>> document) but without serious changes to the Spans API, the
> >>>>>>> ideal
> >>>>>>>>>> payload
> >>>>>>>>>> API would let you say...
> >>>>>>>>>>     skipTo(doc)
> >>>>>>>>>>        advance(startPosition)
> >>>>>>>>>>          getPayload()
> >>>>>>>>>>        while (nextPosition() < endPosition)
> >>>>>>>>>>          getPosition()
> >>>>>>>>>>
> >>>>>>>>>> but this seems like a nearly impossible API to implement
> >>>>>>> given the
> >>>>>>>>>> natore
> >>>>>>>>>> of hte inverted index and the fact that terms aren't ever
> >>>>>>> stored in
> >>>>>>>>>> position order.
> >>>>>>>>>>
> >>>>>>>>>> there's a lot i really don't know/understand about the lucene
> >>>>>>> term
> >>>>>>>>>> position internals ... but as i recall, the datastructure
> >>>>>>> written
> >>>>>>>>>> to disk
> >>>>>>>>>> isn't actually a tree structure inverted index, it's a long
> >>>>>>>>>> sequence of
> >>>>>>>>>> tuples correct?  so in theory you could scan along the tuples
> >>>>>>>>>> untill you
> >>>>>>>>>> find the doc you are interested in, ignoring all of the term
> >>>>>>> info
> >>>>>>>>>> along
> >>>>>>>>>> the way, then whatever term you happen be on at the
> >>>>>>>>>> moment, you
> >>>>>>>>>> could scan
> >>>>>>>>>> along all of the positions until you find one in the range
> >>>>>>> you are
> >>>>>>>>>> interested in -- assuming you do, then you record the current
> >>>>>>> Term
> >>>>>>>>>> (and
> >>>>>>>>>> read your payload data if interested)
> >>>>>>>>>
> >>>>>>>>> I think the main issue I see is in both the payloads and the
> >>>>>>> matching
> >>>>>>>>> case above is that they require a document centric approach.
> >>>>>>>>> And
> >>>>>>>>> then, for each Document,
> >>>>>>>>> you ideally want to be able to just index into an array so
> >>>>>>>>> that
> >>>>>>> you
> >>>>>>>>> can go directly to the position that is needed based on
> >>>>>>>>> Span.getStart()
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> if i remember correctly, the first part of this is easy, and
> >>>>>>>>>> relative fast
> >>>>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions will
> >>>>>>> happily
> >>>>>>>>>> scan for
> >>>>>>>>>> the first <term,doc> pair with the correct docId,
> >>>>>>> irregardless of
> >>>>>>>>>> the term
> >>>>>>>>>> ... the only thing i'm not sure about is how efficient it
> >>>>>>>>>> is to
> >>>>>>>>>> loop over
> >>>>>>>>>> nextPosition() for every term you find to see if any of them
> >>>>>>> are in
> >>>>>>>>>> your
> >>>>>>>>>> range ... the best case scenerio is that the first position
> >>>>>>>>>> returned is
> >>>>>>>>>> above the high end of your range, in which case you can stop
> >>>>>>>>>> immediately
> >>>>>>>>>> and seek to the next term -- butthe worst case is that you
> >>>>>>>>>> call
> >>>>>>>>>> nextPosition() over an over a lot of times before you get a
> >>>>>>>>>> position in
> >>>>>>>>>> (or above) your rnage .... an advancePosition(pos) that
> >>>>>>> wokred like
> >>>>>>>>>> seek
> >>>>>>>>>> or skipTo might be helpful here.
> >>>>>>>>>>
> >>>>>>>>>> : I feel like I am missing something obvious.  I would
> >>>>>>> suspect the
> >>>>>>>>>> : highlighter needs to do this, but it seems to take the
> >>>>>>> reanalyze
> >>>>>>>>>> : approach as well (I admit, though, that I have little
> >>>>>>> experience
> >>>>>>>>>> with
> >>>>>>>>>> : the highlighter.)
> >>>>>>>>>>
> >>>>>>>>>> as i understand it the default case is to reanalyze, but
> >>>>>>>>>> if you
> >>>>>>>>> have
> >>>>>>>>>> TermFreqVector info stored with positions (ie: a
> >>>>>>>>>> TermPositionVector) then
> >>>>>>>>>> it can use that to construct a TokenStream by iterating over
> >>>>>>>>>> all
> >>>>>>>>>> terms and
> >>>>>>>>>> writing them into a big array in position order (see the
> >>>>>>>>>> TermSources class
> >>>>>>>>>> in the highlighter)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Ah, I see that now.  Thanks.
> >>>>>>>>>>
> >>>>>>>>>> this makes sense when highlighting because it doesn't know
> >>>>>>>>>> what
> >>>>>>>>>> kind of
> >>>>>>>>>> fragmenter is going to be used so it needs the whole
> >>>>>>> TokenStream,
> >>>>>>>>>> but it
> >>>>>>>>>> seems less then ideal when you are only interested in a small
> >>>>>>>>>> number of
> >>>>>>>>>> position ranges that you know in advance.
> >>>>>>>>>>
> >>>>>>>>>> : I am wondering if it would be useful to have an alternative
> >>>>>>> Term
> >>>>>>>>>> : Vector storage mechanism that was position centric.
> >>>>>>> Because we
> >>>>>>>>>> : couldn't take advantage of the lexicographic
> >>>>>>>>>> compression, it
> >>>>>>>>> would
> >>>>>>>>>> : take up more disk space, but it would be a lot faster for
> >>>>>>> these
> >>>>>>>>>> kinds
> >>>>>>>>>>
> >>>>>>>>>> i'm not sure if it's really neccessary to store the data in a
> >>>>>>>>> position
> >>>>>>>>>> centric manner, assuming we have a way to "seek" by position
> >>>>>>> like i
> >>>>>>>>>> described above -- but then again i don't really know that
> >>>>>>> what i
> >>>>>>>>>> described above is all that possible/practical/performant.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I suppose I could use my Mapper approach to organize things
> >>>>>>>>> in a
> >>>>>>>>> position centric way now that I think about it more.  Just
> >>>>>>> means some
> >>>>>>>>> unpacking and repacking.  Still, probably would perform well
> >>>>>>> enough
> >>>>>>>>> since I can setup the correct structure on the fly.  I will
> >>>>>>> give this
> >>>>>>>>> a try.  Maybe even add a Mapper to do this.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -Grant
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>> ----------------------------------------------------------------
> >>>>>>> --
> >>>>>>> ---
> >>>>>
> >>>>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>>>>> For additional commands, e-mail: java-dev-
> >>>>>>>>> [EMAIL PROTECTED]
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>> ------------------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.grantingersoll.com/
> >>>>>>> http://lucene.grantingersoll.com
> >>>>>>> http://www.paperoftheweek.com/
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----------------------------------------------------------------
> >>>>>>> --
> >>>>>>> ---
> >>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> --------------------------
> >>>>> Grant Ingersoll
> >>>>> http://lucene.grantingersoll.com
> >>>>>
> >>>>> Lucene Helpful Hints:
> >>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>>>
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------------
> >>>>> --
> >>>>> -
> >>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>> For additional commands, e-mail: [EMAIL PROTECTED]
> >>>>>
> >>>>>
> >>>>
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://lucene.grantingersoll.com
> >>
> >> Lucene Helpful Hints:
> >> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: Best Practices for getting Strings from a position range

Reply via email to