ok, glad we're on the same page. I did some performance testing with span queries and, unfortunately, the results are discouraging for my intended use. When I added a simple SpanNearQuery to existing queries, the throughput decreased by a factor of 10+. I figured spans would be expensive, but not that much. I haven't done profiling yet, but here's a typical thread stack during execution:
at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:137) at org.apache.lucene.util.PriorityQueue.adjustTop(PriorityQueue.java:101) at org.apache.lucene.search.spans.NearSpansUnordered.next( NearSpansUnordered.java:128) at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc( SpanScorer.java:83) at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57) at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java :56) at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146) at org.apache.lucene.search.Searcher.search(Searcher.java:118) Peter On 8/10/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > Sorry for the confusion. I thought you just wanted access to the > term info per position. I think we will have to add something to > the Spans like we talked about before. > > -Grant > > On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote: > > > Grant, > > > > I'm afraid I don't understand how to use this mapper in the context > > of a > > SpanQuery. It seems like I would have to modify SpanScorer to fetch > > payload > > data and provide a new method to access the payloads while > > iterating through > > the documents. If this can be accomplished without modifying Spans, > > could > > you provide a bit more detail? > > > > Thanks, > > Peter > > > > On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote: > >> > >> Hi Grant, > >> > >> I'm hoping to check this out soon. > >> > >> Thanks, > >> Peter > >> > >> On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote: > >>> > >>> Hi Peter, > >>> > >>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try. It > >>> provides a TermVectorMapper that loads by position. > >>> > >>> Still not what ideally what you want, but I haven't had time to > >>> scope > >>> that one out yet., > >>> > >>> -Grant > >>> > >>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote: > >>> > >>>> Hi Grant, > >>>> > >>>> No problem - I know you are very busy. I just wanted to get a > >>>> sense for the > >>>> timing because I'd like to use this for a release this Fall. If I > >>>> can get a > >>>> prototype working in the coming weeks AND the performance is > >>>> great :) , this > >>>> would be terrific. If not, I'll have to fall back on a more complex > >>>> design > >>>> that handles the query outside of Lucene :( > >>>> > >>>> In the meantime, I'll try playing with LUCENE-868. > >>>> > >>>> Thanks for the update. > >>>> Peter > >>>> > >>>> On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote: > >>>>> > >>>>> Sorry, Peter, I haven't had a chance to work on it. I don't > >>>>> see it > >>>>> happening this week, but maybe next. > >>>>> > >>>>> I do think the Mapper approach via TermVectors will work. It will > >>>>> require implementing a new mapper that orders by position, but I > >>>>> don't think that is too hard. I started on one on the LUCENE-868 > >>>>> patch (version 4) but it is not complete. Maybe you want to pick > >>>>> it up? > >>>>> > >>>>> With this approach, you would iterate your spans, when you come > >>>>> to a > >>>>> new doc, you would load the term vector using the > >>>>> PositionMapper, and > >>>>> then you could index into the positions for the matches in the > >>>>> document. > >>>>> > >>>>> I realize this does not cover the just wanting to get the > >>>>> Payload at > >>>>> the match issue. Maybe next week... > >>>>> > >>>>> Cheers, > >>>>> Grant > >>>>> > >>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote: > >>>>> > >>>>>> Any idea on when this might be available (days, weeks...)? > >>>>>> > >>>>>> Peter > >>>>>> > >>>>>> On 7/16/07, Grant Ingersoll < [EMAIL PROTECTED]> wrote: > >>>>>>> > >>>>>>> > >>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote: > >>>>>>> > >>>>>>>> > >>>>>>>> : Do we have a best practice for going from, say a SpanQuery > >>>>> doc/ > >>>>>>>> : position information and retrieving the actual range of > >>>>>>> positions of > >>>>>>>> : content from the Document? Is it just to reanalyze the > >>>>> Document > >>>>>>>> : using the appropriate Analyzer and start recording once you > >>>>>>> hit the > >>>>>>>> : positions you are interested in? Seems like Term Vectors > >>>>>>> _could_ > >>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868) > >>>>> doesn't > >>>>>>>> : really help, because they are stored in a term-centric > >>>>> manner. I > >>>>>>>> : guess what I am after is a position centric approach. That > >>>>>>> is, give > >>>>>>>> > >>>>>>>> this is kind of what i was suggesting in the last message i > >>>>>>>> sent > >>>>>>>> to the java-user thread about paylods and SpanQueries (which > >>>>>>>> i'm > >>>>>>>> guessing is what prompted this thread as well)... > >>>>>>>> > >>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery- > >>>>>>>> tf3988826.html#a11551628 > >>>>>>> > >>>>>>> > >>>>>>> This is one use case, the other is related to the new patch I > >>>>>>> submitted for LUCENE-960. In this case, I have a > >>>>>>> SpanQueryFilter > >>>>>>> that identifies a bunch of docs and positions ahead of time. > >>>>>>> Then > >>> > >>>>>>> the user enters new Span Query and I want to relate the matches > >>>>> from > >>>>>>> the user query with the positions of matches in the filter and > >>>>> then > >>>>>>> show that window. > >>>>>>> > >>>>>>>> > >>>>>>>> my point was that currently, to retrieve a payload you need a > >>>>>>>> TermPositions instance, which is designed for iterating in the > >>>>>>>> order of... > >>>>>>>> seek(term) > >>>>>>>> skipTo(doc) > >>>>>>>> nextPosition() > >>>>>>>> getPayload() > >>>>>>>> ...which is great for getting the payload of every instance > >>>>>>>> (ie:position) of a specific term in a given document (or in > >>>>> every > >>>>>>>> document) but without serious changes to the Spans API, the > >>>>> ideal > >>>>>>>> payload > >>>>>>>> API would let you say... > >>>>>>>> skipTo(doc) > >>>>>>>> advance(startPosition) > >>>>>>>> getPayload() > >>>>>>>> while (nextPosition() < endPosition) > >>>>>>>> getPosition() > >>>>>>>> > >>>>>>>> but this seems like a nearly impossible API to implement > >>>>> given the > >>>>>>>> natore > >>>>>>>> of hte inverted index and the fact that terms aren't ever > >>>>> stored in > >>>>>>>> position order. > >>>>>>>> > >>>>>>>> there's a lot i really don't know/understand about the lucene > >>>>> term > >>>>>>>> position internals ... but as i recall, the datastructure > >>>>> written > >>>>>>>> to disk > >>>>>>>> isn't actually a tree structure inverted index, it's a long > >>>>>>>> sequence of > >>>>>>>> tuples correct? so in theory you could scan along the tuples > >>>>>>>> untill you > >>>>>>>> find the doc you are interested in, ignoring all of the term > >>>>> info > >>>>>>>> along > >>>>>>>> the way, then whatever term you happen be on at the moment, you > >>>>>>>> could scan > >>>>>>>> along all of the positions until you find one in the range > >>>>> you are > >>>>>>>> interested in -- assuming you do, then you record the current > >>>>> Term > >>>>>>>> (and > >>>>>>>> read your payload data if interested) > >>>>>>> > >>>>>>> I think the main issue I see is in both the payloads and the > >>>>> matching > >>>>>>> case above is that they require a document centric approach. > >>>>>>> And > >>>>>>> then, for each Document, > >>>>>>> you ideally want to be able to just index into an array so that > >>>>> you > >>>>>>> can go directly to the position that is needed based on > >>>>>>> Span.getStart() > >>>>>>> > >>>>>>>> > >>>>>>>> if i remember correctly, the first part of this is easy, and > >>>>>>>> relative fast > >>>>>>>> -- i think skipTo(doc) on a TermDoc or TermPositions will > >>>>> happily > >>>>>>>> scan for > >>>>>>>> the first <term,doc> pair with the correct docId, > >>>>> irregardless of > >>>>>>>> the term > >>>>>>>> ... the only thing i'm not sure about is how efficient it is to > >>>>>>>> loop over > >>>>>>>> nextPosition() for every term you find to see if any of them > >>>>> are in > >>>>>>>> your > >>>>>>>> range ... the best case scenerio is that the first position > >>>>>>>> returned is > >>>>>>>> above the high end of your range, in which case you can stop > >>>>>>>> immediately > >>>>>>>> and seek to the next term -- butthe worst case is that you call > >>>>>>>> nextPosition() over an over a lot of times before you get a > >>>>>>>> position in > >>>>>>>> (or above) your rnage .... an advancePosition(pos) that > >>>>> wokred like > >>>>>>>> seek > >>>>>>>> or skipTo might be helpful here. > >>>>>>>> > >>>>>>>> : I feel like I am missing something obvious. I would > >>>>> suspect the > >>>>>>>> : highlighter needs to do this, but it seems to take the > >>>>> reanalyze > >>>>>>>> : approach as well (I admit, though, that I have little > >>>>> experience > >>>>>>>> with > >>>>>>>> : the highlighter.) > >>>>>>>> > >>>>>>>> as i understand it the default case is to reanalyze, but if you > >>>>>>> have > >>>>>>>> TermFreqVector info stored with positions (ie: a > >>>>>>>> TermPositionVector) then > >>>>>>>> it can use that to construct a TokenStream by iterating over > >>>>>>>> all > >>>>>>>> terms and > >>>>>>>> writing them into a big array in position order (see the > >>>>>>>> TermSources class > >>>>>>>> in the highlighter) > >>>>>>> > >>>>>>> > >>>>>>> Ah, I see that now. Thanks. > >>>>>>>> > >>>>>>>> this makes sense when highlighting because it doesn't know what > >>>>>>>> kind of > >>>>>>>> fragmenter is going to be used so it needs the whole > >>>>> TokenStream, > >>>>>>>> but it > >>>>>>>> seems less then ideal when you are only interested in a small > >>>>>>>> number of > >>>>>>>> position ranges that you know in advance. > >>>>>>>> > >>>>>>>> : I am wondering if it would be useful to have an alternative > >>>>> Term > >>>>>>>> : Vector storage mechanism that was position centric. > >>>>> Because we > >>>>>>>> : couldn't take advantage of the lexicographic compression, it > >>>>>>> would > >>>>>>>> : take up more disk space, but it would be a lot faster for > >>>>> these > >>>>>>>> kinds > >>>>>>>> > >>>>>>>> i'm not sure if it's really neccessary to store the data in a > >>>>>>> position > >>>>>>>> centric manner, assuming we have a way to "seek" by position > >>>>> like i > >>>>>>>> described above -- but then again i don't really know that > >>>>> what i > >>>>>>>> described above is all that possible/practical/performant. > >>>>>>>> > >>>>>>> > >>>>>>> I suppose I could use my Mapper approach to organize things in a > >>>>>>> position centric way now that I think about it more. Just > >>>>> means some > >>>>>>> unpacking and repacking. Still, probably would perform well > >>>>> enough > >>>>>>> since I can setup the correct structure on the fly. I will > >>>>> give this > >>>>>>> a try. Maybe even add a Mapper to do this. > >>>>>>> > >>>>>>> > >>>>>>> -Grant > >>>>>>> > >>>>>>> > >>>>> ------------------------------------------------------------------ > >>>>> --- > >>> > >>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED] > >>>>>>> For additional commands, e-mail: [EMAIL PROTECTED] > >>>>>>> > >>>>>>> > >>>>> > >>>>> ------------------------------------------------------ > >>>>> Grant Ingersoll > >>>>> http://www.grantingersoll.com/ > >>>>> http://lucene.grantingersoll.com > >>>>> http://www.paperoftheweek.com/ > >>>>> > >>>>> > >>>>> > >>>>> ------------------------------------------------------------------ > >>>>> --- > >>>>> To unsubscribe, e-mail: [EMAIL PROTECTED] > >>>>> For additional commands, e-mail: [EMAIL PROTECTED] > >>>>> > >>>>> > >>> > >>> -------------------------- > >>> Grant Ingersoll > >>> http://lucene.grantingersoll.com > >>> > >>> Lucene Helpful Hints: > >>> http://wiki.apache.org/lucene-java/BasicsOfPerformance > >>> http://wiki.apache.org/lucene-java/LuceneFAQ > >>> > >>> > >>> > >>> -------------------------------------------------------------------- > >>> - > >>> To unsubscribe, e-mail: [EMAIL PROTECTED] > >>> For additional commands, e-mail: [EMAIL PROTECTED] > >>> > >>> > >> > > -------------------------- > Grant Ingersoll > http://lucene.grantingersoll.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >