Grant,

I built an index as described here:
http://www.nabble.com/SpanQuery-and-database-join-tf4262902.html
Many documents have only 1 or 2 rows, some have dozens.

Here is a typical query without spans:

+((+contents:quaker +contents:cereal) (+boost50:quaker +boost50:cereal)) +literals:co$us), sort=<custom:"feedbabe": [EMAIL PROTECTED]>,"dateactiveR"!

Here is a typical query with spans:

+spanNear([adliterals:jb$1, adliterals:co$us], 8, false) +(+((+contents:quaker +contents:cereal) (+boost50:quaker +boost50:cereal)) +literals:co$us), sort=<custom:"feedbabe": [EMAIL PROTECTED]>,"dateactiveR"!

The addition of the spanNear clause caused the 10X decrease in throughput. I
could probably change the way rows are indexed and use ordered terms, which
seems to be a bit faster (only 5X decrease).

Peter

On 8/14/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Hi Peter,
>
> Could you give more details on this test? What are you comparing,
> etc.? Sample queries would be good. I would like to write up a
> contrib/Benchmark algorithm to begin investigating this and see if
> there is anything that can be done.
>
> Thanks,
> Grant
>
> On Aug 10, 2007, at 6:27 PM, Peter Keegan wrote:
>
> > ok, glad we're on the same page.
> >
> > I did some performance testing with span queries and, unfortunately, the
> > results are discouraging for my intended use. When I added a simple
> > SpanNearQuery to existing queries, the throughput decreased by a factor of
> > 10+. I figured spans would be expensive, but not that much.
> > I haven't done profiling yet, but here's a typical thread stack during
> > execution:
> >
> > at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:137)
> > at org.apache.lucene.util.PriorityQueue.adjustTop(PriorityQueue.java:101)
> > at org.apache.lucene.search.spans.NearSpansUnordered.next(NearSpansUnordered.java:128)
> > at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(SpanScorer.java:83)
> > at org.apache.lucene.search.spans.SpanScorer.next(SpanScorer.java:57)
> > at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:56)
> > at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
> > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
> > at org.apache.lucene.search.Searcher.search(Searcher.java:118)
> >
> > Peter
> >
> >
> > On 8/10/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>
> >> Sorry for the confusion. I thought you just wanted access to the
> >> term info per position. I think we will have to add something to
> >> the Spans like we talked about before.
> >>
> >> -Grant
> >>
> >> On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
> >>
> >>> Grant,
> >>>
> >>> I'm afraid I don't understand how to use this mapper in the context of a
> >>> SpanQuery. It seems like I would have to modify SpanScorer to fetch payload
> >>> data and provide a new method to access the payloads while iterating through
> >>> the documents. If this can be accomplished without modifying Spans, could
> >>> you provide a bit more detail?
> >>>
> >>> Thanks,
> >>> Peter
> >>>
> >>> On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>> Hi Grant,
> >>>>
> >>>> I'm hoping to check this out soon.
> >>>>
> >>>> Thanks,
> >>>> Peter
> >>>>
> >>>> On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>> Hi Peter,
> >>>>>
> >>>>> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.
> >>>>> It provides a TermVectorMapper that loads by position.
> >>>>>
> >>>>> Still not ideally what you want, but I haven't had time to scope
> >>>>> that one out yet.
> >>>>>
> >>>>> -Grant
> >>>>>
> >>>>> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >>>>>
> >>>>>> Hi Grant,
> >>>>>>
> >>>>>> No problem - I know you are very busy. I just wanted to get a sense for the
> >>>>>> timing because I'd like to use this for a release this Fall. If I can get a
> >>>>>> prototype working in the coming weeks AND the performance is great :) , this
> >>>>>> would be terrific. If not, I'll have to fall back on a more complex design
> >>>>>> that handles the query outside of Lucene :(
> >>>>>>
> >>>>>> In the meantime, I'll try playing with LUCENE-868.
> >>>>>>
> >>>>>> Thanks for the update.
> >>>>>> Peter
> >>>>>>
> >>>>>> On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>> Sorry, Peter, I haven't had a chance to work on it. I don't see it
> >>>>>>> happening this week, but maybe next.
> >>>>>>>
> >>>>>>> I do think the Mapper approach via TermVectors will work. It will
> >>>>>>> require implementing a new mapper that orders by position, but I
> >>>>>>> don't think that is too hard. I started on one on the LUCENE-868
> >>>>>>> patch (version 4) but it is not complete. Maybe you want to pick
> >>>>>>> it up?
> >>>>>>>
> >>>>>>> With this approach, you would iterate your spans, when you come to a
> >>>>>>> new doc, you would load the term vector using the PositionMapper, and
> >>>>>>> then you could index into the positions for the matches in the
> >>>>>>> document.
> >>>>>>>
> >>>>>>> I realize this does not cover the just wanting to get the Payload at
> >>>>>>> the match issue. Maybe next week...
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Grant
> >>>>>>>
> >>>>>>> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> >>>>>>>
> >>>>>>>> Any idea on when this might be available (days, weeks...)?
> >>>>>>>>
> >>>>>>>> Peter
> >>>>>>>>
> >>>>>>>> On 7/16/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> : Do we have a best practice for going from, say a SpanQuery doc/
> >>>>>>>>>> : position information and retrieving the actual range of positions of
> >>>>>>>>>> : content from the Document? Is it just to reanalyze the Document
> >>>>>>>>>> : using the appropriate Analyzer and start recording once you hit the
> >>>>>>>>>> : positions you are interested in? Seems like Term Vectors _could_
> >>>>>>>>>> : help, but even my new Mapper approach patch (LUCENE-868) doesn't
> >>>>>>>>>> : really help, because they are stored in a term-centric manner. I
> >>>>>>>>>> : guess what I am after is a position centric approach. That is, give
> >>>>>>>>>>
> >>>>>>>>>> this is kind of what i was suggesting in the last message i sent
> >>>>>>>>>> to the java-user thread about payloads and SpanQueries (which i'm
> >>>>>>>>>> guessing is what prompted this thread as well)...
> >>>>>>>>>>
> >>>>>>>>>> http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This is one use case, the other is related to the new patch I
> >>>>>>>>> submitted for LUCENE-960. In this case, I have a SpanQueryFilter
> >>>>>>>>> that identifies a bunch of docs and positions ahead of time.
> >>>>>>>>> Then the user enters new Span Query and I want to relate the matches from
> >>>>>>>>> the user query with the positions of matches in the filter and then
> >>>>>>>>> show that window.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> my point was that currently, to retrieve a payload you need a
> >>>>>>>>>> TermPositions instance, which is designed for iterating in the
> >>>>>>>>>> order of...
> >>>>>>>>>> seek(term)
> >>>>>>>>>> skipTo(doc)
> >>>>>>>>>> nextPosition()
> >>>>>>>>>> getPayload()
> >>>>>>>>>> ...which is great for getting the payload of every instance
> >>>>>>>>>> (ie:position) of a specific term in a given document (or in every
> >>>>>>>>>> document) but without serious changes to the Spans API, the ideal
> >>>>>>>>>> payload API would let you say...
> >>>>>>>>>> skipTo(doc)
> >>>>>>>>>> advance(startPosition)
> >>>>>>>>>> getPayload()
> >>>>>>>>>> while (nextPosition() < endPosition)
> >>>>>>>>>> getPosition()
> >>>>>>>>>>
> >>>>>>>>>> but this seems like a nearly impossible API to implement given the
> >>>>>>>>>> nature of the inverted index and the fact that terms aren't ever
> >>>>>>>>>> stored in position order.
> >>>>>>>>>>
> >>>>>>>>>> there's a lot i really don't know/understand about the lucene term
> >>>>>>>>>> position internals ... but as i recall, the datastructure written to disk
> >>>>>>>>>> isn't actually a tree structure inverted index, it's a long sequence of
> >>>>>>>>>> tuples correct?
> >>>>>>>>>> so in theory you could scan along the tuples until you
> >>>>>>>>>> find the doc you are interested in, ignoring all of the term info along
> >>>>>>>>>> the way, then whatever term you happen to be on at the moment, you could
> >>>>>>>>>> scan along all of the positions until you find one in the range you are
> >>>>>>>>>> interested in -- assuming you do, then you record the current Term (and
> >>>>>>>>>> read your payload data if interested)
> >>>>>>>>>
> >>>>>>>>> I think the main issue I see in both the payloads and the matching
> >>>>>>>>> case above is that they require a document centric approach. And
> >>>>>>>>> then, for each Document, you ideally want to be able to just index
> >>>>>>>>> into an array so that you can go directly to the position that is
> >>>>>>>>> needed based on Span.getStart()
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> if i remember correctly, the first part of this is easy, and relatively
> >>>>>>>>>> fast -- i think skipTo(doc) on a TermDoc or TermPositions will happily
> >>>>>>>>>> scan for the first <term,doc> pair with the correct docId, regardless of
> >>>>>>>>>> the term ... the only thing i'm not sure about is how efficient it is to
> >>>>>>>>>> loop over nextPosition() for every term you find to see if any of them
> >>>>>>>>>> are in your range ... the best case scenario is that the first position
> >>>>>>>>>> returned is above the high end of your range, in which case you can stop
> >>>>>>>>>> immediately and seek to the next term -- but the worst case is that you
> >>>>>>>>>> call nextPosition() over and over a lot of times before you get a
> >>>>>>>>>> position in (or above) your range ....
> >>>>>>>>>> an advancePosition(pos) that worked like seek
> >>>>>>>>>> or skipTo might be helpful here.
> >>>>>>>>>>
> >>>>>>>>>> : I feel like I am missing something obvious. I would suspect the
> >>>>>>>>>> : highlighter needs to do this, but it seems to take the reanalyze
> >>>>>>>>>> : approach as well (I admit, though, that I have little experience with
> >>>>>>>>>> : the highlighter.)
> >>>>>>>>>>
> >>>>>>>>>> as i understand it the default case is to reanalyze, but if you have
> >>>>>>>>>> TermFreqVector info stored with positions (ie: a TermPositionVector) then
> >>>>>>>>>> it can use that to construct a TokenStream by iterating over all terms and
> >>>>>>>>>> writing them into a big array in position order (see the TermSources class
> >>>>>>>>>> in the highlighter)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Ah, I see that now. Thanks.
> >>>>>>>>>>
> >>>>>>>>>> this makes sense when highlighting because it doesn't know what kind of
> >>>>>>>>>> fragmenter is going to be used so it needs the whole TokenStream, but it
> >>>>>>>>>> seems less than ideal when you are only interested in a small number of
> >>>>>>>>>> position ranges that you know in advance.
> >>>>>>>>>>
> >>>>>>>>>> : I am wondering if it would be useful to have an alternative Term
> >>>>>>>>>> : Vector storage mechanism that was position centric.
> >>>>>>>>>> : Because we
> >>>>>>>>>> : couldn't take advantage of the lexicographic compression, it would
> >>>>>>>>>> : take up more disk space, but it would be a lot faster for these kinds
> >>>>>>>>>>
> >>>>>>>>>> i'm not sure if it's really necessary to store the data in a position
> >>>>>>>>>> centric manner, assuming we have a way to "seek" by position like i
> >>>>>>>>>> described above -- but then again i don't really know that what i
> >>>>>>>>>> described above is all that possible/practical/performant.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I suppose I could use my Mapper approach to organize things in a
> >>>>>>>>> position centric way now that I think about it more. Just means some
> >>>>>>>>> unpacking and repacking. Still, probably would perform well enough
> >>>>>>>>> since I can set up the correct structure on the fly. I will give this
> >>>>>>>>> a try. Maybe even add a Mapper to do this.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -Grant
> >>>>>>>>>
> >>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>>>>> For additional commands, e-mail: java-dev-[EMAIL PROTECTED]
> >>>>>>>
> >>>>>>> ------------------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.grantingersoll.com/
> >>>>>>> http://lucene.grantingersoll.com
> >>>>>>> http://www.paperoftheweek.com/
> >>>>>
> >>>>> --------------------------
> >>>>> Grant Ingersoll
> >>>>> http://lucene.grantingersoll.com
> >>>>>
> >>>>> Lucene Helpful Hints:
> >>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
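P.S. For anyone skimming the quoted thread, the "unpacking and repacking" into a position-centric structure that Grant mentions could look roughly like the sketch below. This is plain Java, not the LUCENE-868 mapper or any Lucene API: the class and method names (PositionIndexSketch, byPosition, window) are made up for illustration, the term -> positions map just stands in for what a term vector with positions would provide, and I am assuming the span end is exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from Lucene.
public class PositionIndexSketch {

    /** Repack a term-centric map (term -> positions) into an array indexed by position. */
    public static String[] byPosition(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        // Positions with no term (e.g. removed stop words) stay null.
        String[] terms = new String[max + 1];
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                // Simplification: if several terms share a position, the last one wins.
                terms[p] = e.getKey();
            }
        }
        return terms;
    }

    /** Terms covered by a match from start (inclusive) to end (assumed exclusive). */
    public static List<String> window(String[] terms, int start, int end) {
        List<String> out = new ArrayList<>();
        int to = Math.min(end, terms.length);
        for (int p = Math.max(0, start); p < to; p++) {
            if (terms[p] != null) {
                out.add(terms[p]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> termVector = new LinkedHashMap<>();
        termVector.put("cereal", new int[] {1, 5});
        termVector.put("oats", new int[] {3});
        termVector.put("quaker", new int[] {0, 4});

        String[] terms = byPosition(termVector);
        System.out.println(Arrays.toString(terms)); // [quaker, cereal, null, oats, quaker, cereal]
        System.out.println(window(terms, 3, 6));    // [oats, quaker, cereal]
    }
}
```

The repacking costs one pass over every position in the document, so it only pays off if several span matches per document are looked up afterwards; whether that is fast enough in practice is exactly the open question in this thread.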