The problem here is: how *could* a system return even the top
10,000 results without scoring them all? What if the millionth
hit turned out to be the very best match in the entire corpus?

That said, sorting may well be the issue here rather than scoring.
You can use a TopDocCollector to get the top N matches (unsorted)
and then use something like FieldSortedHitQueue to sort just
those N matches, leaving out all the rest of the matches. Note
that this assumes that when you say "sorting" you mean sorting
by something other than relevance.
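
To make the idea concrete, here is a rough, library-free sketch in plain
Java. It is not Lucene code: the Hit class and the topN/sortByField
methods are names I made up for illustration, and "date" stands in for
whatever field you actually sort on. It only shows the shape of the
approach: score every hit once, keep just the best N in a small heap,
then do the expensive field sort on those N alone.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNThenSort {
    /** Hypothetical stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        final int docId;
        final float score;
        final long date; // assumed sort field

        Hit(int docId, float score, long date) {
            this.docId = docId;
            this.score = score;
            this.date = date;
        }
    }

    /** Keeps the n best-scoring hits in a min-heap; the result is in heap order, not sorted. */
    static List<Hit> topN(Iterable<Hit> hits, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(n, Comparator.comparingDouble((Hit h) -> h.score));
        for (Hit h : hits) {
            if (heap.size() < n) {
                heap.add(h);
            } else if (h.score > heap.peek().score) {
                heap.poll(); // evict the current weakest hit
                heap.add(h);
            }
        }
        return new ArrayList<>(heap);
    }

    /** Sorts only the retained hits by the field, breaking ties by docId so paging stays stable. */
    static List<Hit> sortByField(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingLong((Hit h) -> h.date)
                .thenComparingInt(h -> h.docId));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(1, 0.90f, 300L));
        hits.add(new Hit(2, 0.10f, 100L));
        hits.add(new Hit(3, 0.80f, 200L));
        hits.add(new Hit(4, 0.50f, 50L));
        hits.add(new Hit(5, 0.95f, 250L));
        for (Hit h : sortByField(topN(hits, 3))) {
            System.out.println(h.docId + " score=" + h.score + " date=" + h.date);
        }
    }
}
```

With M hits and N kept, the heap pass costs roughly M log N comparisons
and the field sort only N log N, so the sort no longer scales with the
full hit count. Every hit is still scored, though, which is the point
of my first paragraph.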

Hope this helps
Erick

On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I have an interesting problem with my query traffic. Most of the queries
> run in a fairly short amount of time (< 100ms) but a few take over 1000ms.
> These queries are predominantly those with a huge number of hits (>1 million
> hits in a >100 million document index). The time taken (as far as I can
> tell) is for Lucene to sit there while it scores and sorts all these results.
>
> However it turns out these queries really don't have top results. That is,
> of the million documents, there are easily 10000 which are decent results
> (basically those above some threshold score). Frankly, just returning some
> consistent (so paging and reload work) but otherwise arbitrary ranking of
> these 10000 results would be more than good enough.
>
> It seems to me that a solution would be to impose some sort of
> pseudo-random filter (e.g. consider only every n-th document assuming they
> are uniformly distributed). I'm wondering if anyone else has experience with
> this sort of issue and what solutions they have found to work well in
> practice.
>
> Thanks,
>
> Tim
>
