Tim:
     How about implementing your own HitCollector and stop when you have
collected 100 docs with score above certain threshold?

     BTW, are there lotsa concurrent searches?

-John

On Thu, Dec 4, 2008 at 12:52 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:

> That makes sense. I should be more precise in that all I need is 100 of the
> 10000 "reasonable" results.
>
> The concern I would have with a TopDocCollector is that this is biased
> towards the top of the index which translates for me into a bias for older
> documents. I'd prefer no age bias or a newer document bias. So I'll see
> what
> I can do with a "BottomDocCollector" :-)
>
> Tim
>
>
> On 12/4/08 12:39 PM, "Erick Erickson" <[EMAIL PROTECTED]> wrote:
>
> > The problem here is how *could* a system return even the top
> > 10,000 results without scoring them all? What if the millionth
> > hit resulted in the very best match in the entire corpus?
> >
> > That said, sorting may well be the issue here rather than scoring.
> > You can use a TopDocCollector to get the top N matches (unsorted)
> > and then do something like use the FieldSortedHitQueue to sort
> > those N matches, leaving out all the rest of the matches. Note
> > this assumes that when you say "sorting" you mean sorting
> > by something other than relevance.....
> >
> > Hope this helps
> > Erick
> >
> > On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:
> >
> >> Hi all,
> >>
> >> I have an interesting problem with my query traffic. Most of the queries
> >> run
> >> in a fairly short amount of time (< 100ms) but a few take over 1000ms.
> >> These
> >> queries are predominantly those with a huge number of hits (>1 million
> hits
> >> in a >100 million document index). The time taken (as far as I can tell)
> is
> >> for lucene to sit there while it scores and sorts all these results.
> >>
> >> However it turns out these queries really don¹t have top results. That
> is,
> >> of the million documents, there are easily 10000 which are decent
> results
> >> (basically those above some threshold score). Frankly, just returning
> some
> >> consistent (so paging and reload work) but
> >> otherwise arbitrary ranking of these 10000 results would be more than
> good
> >> enough.
> >>
> >> It seems to me that a solution would be to impose some sort of
> >> pseudo-random
> >> filter (e.g. consider only every n-th document assuming they are
> uniformly
> >> distributed). I¹m wondering if anyone else has experience with this sort
> of
> >> issue and what solutions they have found to work well in practice.
> >>
> >> Thanks,
> >>
> >> Tim
> >>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Reply via email to