eliminating scoring for the sake of efficiency
Hello, We don't need any scoring in our application domain, but efficiency is key because we are getting tens of thousands of hits for span queries, and all of these hits need to be collected. Is there a simple way to turn scoring off during indexing, during search, and while delivering document IDs, to save time? Best regards, Boris - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: eliminating scoring for the sake of efficiency
On Thursday 11 May 2006 22:42, Boris Galitsky wrote:
> Hello
>
> We don't need any scoring in our application domain, but
> efficiency is the key because we are getting tens thousand of hits for
> span queries; all these hits are necessary to collect.
> Is there a simple way to turn scoring off while indexing, while
> search and while delivering document IDs to save on time?

You could call getSpans() on the top-level SpanQuery, loop calling next() on the Spans, and ignore duplicate doc() values from the Spans in that loop. A counter in the loop would also give you the number of matching occurrences of the SpanQuery. Using the Spans directly in this way should be slightly more efficient than using a HitCollector, but don't hold your breath.

If you have ordered SpanQuerys without overlaps, the NearSpansOrdered here might be a bit faster than the NearSpans currently in Lucene: http://issues.apache.org/jira/browse/LUCENE-413 (you'll also need the patch to SpanNearQuery).

Regards,
Paul Elschot
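The loop suggested above can be sketched roughly as follows. This is only an illustration: DocSpans is a hypothetical stand-in for the two Spans methods named in the message (next() and doc()), not Lucene's own class, and the sketch assumes that all matches for one document arrive consecutively, so a duplicate can be detected by comparing with the previous doc() value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two Spans methods mentioned above (not the Lucene class).
interface DocSpans {
    boolean next(); // advance to the next match; false when there are no more
    int doc();      // document number of the current match
}

class SpanDocCollector {
    // Holds the distinct document IDs and the total number of span matches.
    static final class Result {
        final List<Integer> docIds = new ArrayList<>();
        int occurrences = 0;
    }

    // Assumption: matches for the same document are adjacent in the enumeration.
    static Result collect(DocSpans spans) {
        Result result = new Result();
        int lastDoc = -1;
        while (spans.next()) {
            result.occurrences++;          // counts every matching occurrence
            int doc = spans.doc();
            if (doc != lastDoc) {          // skip repeated doc() values
                result.docIds.add(doc);
                lastDoc = doc;
            }
        }
        return result;
    }

    // Array-backed DocSpans used only to exercise the loop without an index.
    static DocSpans fromArray(final int[] docs) {
        return new DocSpans() {
            private int pos = -1;
            public boolean next() { return ++pos < docs.length; }
            public int doc() { return docs[pos]; }
        };
    }

    public static void main(String[] args) {
        Result r = collect(fromArray(new int[] {2, 2, 5, 9, 9, 9}));
        System.out.println(r.docIds + " " + r.occurrences); // prints "[2, 5, 9] 6"
    }
}
```

With a real index, the array-backed helper would be replaced by the Spans obtained from the top-level SpanQuery; no scores are computed anywhere in this loop.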
accelerate hits.id(i) function: eliminating scoring for the sake of efficiency
Yes, thanks Paul. We are already doing what you describe: calling getSpans() on the top-level SpanQuery, looping with next() on the Spans, and ignoring duplicate doc() values from the Spans in that loop. I will look into whether NearSpansOrdered might be a bit faster than NearSpans. However, what significantly slows us down is the hits.id(i) function. Can we accelerate it somehow by "cleaning" the scoring out of the Lucene code itself? Best regards, Boris
Re: accelerate hits.id(i) function: eliminating scoring for the sake of efficiency
: However what significantly slows us down is the hits.id(i) function.
: Can we accelerate it somehow "cleaning" Lucene code itself from
: scoring?

You said in your last message...

: We don't need any scoring in our application domain, but
: efficiency is the key because we are getting tens thousand of hits for
: span queries; all these hits are necessary to collect.

If you are iterating over all of the matching documents for each query, and you are getting more than a few dozen matches per query, then you should not be using the Hits object at all. Hits is designed for the "common case" of paginated searches with 10-20 items per page, which rarely go past page 5 or 6 and don't mind if the high-numbered pages take a little longer. If you are iterating over all the matches, you want to be using a HitCollector.

If you use a Hits object and iterate past the first 100 results, it will run your search twice under the covers; past the 200th result, it will run it three times; past the 400th, four times; and so on.

-Hoss
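To put a number on that for tens of thousands of hits, here is a back-of-the-envelope sketch. It assumes the pattern described above simply continues (100 results from the first search, the fetched range doubling with each re-run); the class and method names are made up for illustration and are not part of Lucene.

```java
class HitsResearchCost {
    // Assumption: the first search fetches 100 results and each re-run
    // doubles that range (100, 200, 400, ...), as described in the message above.
    static int searchesNeeded(int resultsVisited) {
        int searches = 1;
        long fetched = 100;
        while (fetched < resultsVisited) {
            fetched *= 2;   // the next re-run covers twice as many results
            searches++;
        }
        return searches;
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 101, 201, 401, 50000}) {
            System.out.println(n + " results -> " + searchesNeeded(n) + " searches");
        }
    }
}
```

Under that assumption, walking through 50,000 hits via Hits would run the same search 10 times, whereas a HitCollector sees every match in a single pass.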