eliminating scoring for the sake of efficiency

2006-05-11 Thread Boris Galitsky

Hello

   We don't need any scoring in our application domain, but 
efficiency is the key because we are getting tens of thousands of hits for 
span queries, and we need to collect all of them.
   Is there a simple way to turn scoring off while indexing, while 
searching, and while delivering document IDs, to save time?


Best regards
Boris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: eliminating scoring for the sake of efficiency

2006-05-11 Thread Paul Elschot
On Thursday 11 May 2006 22:42, Boris Galitsky wrote:
> Hello
> 
> We don't need any scoring in our application domain, but 
> efficiency is the key because we are getting tens of thousands of hits for 
> span queries; all these hits are necessary to collect.
> Is there a simple way to turn scoring off while indexing, while 
> searching, and while delivering document IDs, to save time?

You could use getSpans() on the top level SpanQuery, and use a loop
calling next() on the Spans, and ignore duplicate doc() values from the Spans
in that loop.
A counter in the loop would also give you the number of matching occurrences
of the SpanQuery.

This way of using the Spans directly should be slightly more efficient than
using a HitCollector, but don't hold your breath.
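
As a rough illustration of that loop (not code from this thread): the sketch
below uses a hypothetical DocSpans interface as a stand-in for the
next()/doc() part of Lucene's Spans, so that it runs without Lucene on the
classpath. It assumes matches arrive in increasing document order, so
duplicate doc() values are always adjacent. The names DocSpans,
SpansLoopSketch, fromDocs and collect are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the next()/doc() part of Lucene's Spans.
// With Lucene itself, the Spans would come from spanQuery.getSpans(indexReader).
interface DocSpans {
    boolean next(); // advance to the next match; false when there are no more
    int doc();      // document number of the current match
}

class SpansLoopSketch {
    // In-memory DocSpans over doc numbers sorted ascending, one entry per match.
    static DocSpans fromDocs(final int[] docs) {
        return new DocSpans() {
            private int pos = -1;
            public boolean next() { return ++pos < docs.length; }
            public int doc() { return docs[pos]; }
        };
    }

    // Holds the distinct matching doc IDs and the total number of matches.
    static final class Result {
        final List<Integer> docIds = new ArrayList<>();
        int occurrences;
    }

    // The loop from the message: call next(), count every match,
    // and skip a doc() value that repeats the previous one.
    static Result collect(DocSpans spans) {
        Result result = new Result();
        int lastDoc = -1; // doc numbers are non-negative, so -1 means "none yet"
        while (spans.next()) {
            result.occurrences++;
            int doc = spans.doc();
            if (doc != lastDoc) {
                result.docIds.add(doc);
                lastDoc = doc;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = collect(fromDocs(new int[] {3, 3, 7, 12, 12, 12}));
        System.out.println(r.docIds + " " + r.occurrences); // prints "[3, 7, 12] 6"
    }
}
```

No score is computed anywhere in this loop; the only per-match work is the
comparison against the previous doc number.
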

In case you have ordered SpanQuerys without overlaps, the
NearSpansOrdered here might be a bit faster than the NearSpans
currently in Lucene:
http://issues.apache.org/jira/browse/LUCENE-413
(you'll also need the patch to SpanNearQuery).

Regards,
Paul Elschot




accelerate hits.id(i) function: eliminating scoring for the sake of efficiency

2006-05-11 Thread Boris Galitsky

Yes, thanks Paul.

 We are already using

> getSpans() on the top level SpanQuery, and use a loop
> calling next() on the Spans, and ignore duplicate doc() values from the Spans
> in that loop.
> A counter in the loop would also give you the number of matching occurrences
> of the SpanQuery.

I will look into

> NearSpansOrdered here might be a bit faster than the NearSpans


However, what significantly slows us down is the hits.id(i) function.
Can we accelerate it somehow, for example by stripping the scoring out of 
the Lucene code itself?


Best regards
Boris







Re: accelerate hits.id(i) function: eliminating scoring for the sake of efficiency

2006-05-11 Thread Chris Hostetter

: However what significantly slows us down is the hits.id(i) function.
: Can we accelerate it somehow "cleaning" Lucene code itself from
: scoring?

you said in your last message...

: We don't need any scoring in our application domain, but
: efficiency is the key because we are getting tens of thousands of hits for
: span queries; all these hits are necessary to collect.

if you are iterating over all of the matching documents for each query,
and you are getting more than a few dozen matches for each query, then you
should not be using the Hits object at all.

Hits is designed for the "common case" of paginated searches with
10-20 items per page, that rarely care about going past page 5 or 6, and
don't mind if the high numbered pages take a little longer.

If you are iterating over all the matches, then you want to be using a
HitCollector.  If you use a Hits object, and you iterate past the first
100 results, it will do your search twice under the covers; if you go past
the 200th result, it will do your search three times; past 400, it will do
it 4 times, etc...
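
to put rough numbers on that, here is a small model of the re-search
pattern described above. It assumes the cache starts at 100 results and
doubles on every re-search, which matches the 100/200/400 thresholds just
given. The class and method names are made up for illustration; this is a
sketch of the behaviour, not Lucene's actual Hits code.

```java
class HitsResearchSketch {
    // Number of times the query is executed when the first `hitsRead` hits
    // are read in order, assuming the first search caches 100 results and
    // each re-search doubles the cached range (100, 200, 400, ...).
    static int searchesNeeded(int hitsRead) {
        int searches = 1;   // the initial search
        long cached = 100;  // results available after the initial search
        while (hitsRead > cached) {
            cached *= 2;    // a re-search doubles the cached range
            searches++;
        }
        return searches;
    }

    public static void main(String[] args) {
        for (int n : new int[] {50, 150, 250, 450, 50000}) {
            System.out.println(n + " hits read -> " + searchesNeeded(n) + " searches");
        }
    }
}
```

under this model, reading 50,000 hits in order through Hits runs the query
10 times, whereas a HitCollector is handed every match during one single
search.
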



-Hoss

