Tim: How about implementing your own HitCollector and stopping once you have collected 100 docs with a score above a certain threshold?
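A minimal sketch of that idea, written against a stand-in class so it runs without Lucene on the classpath. `HitCollectorStub` mirrors the `collect(int doc, float score)` callback of Lucene's `HitCollector`; `EnoughHitsException`, `ThresholdCollector` and `run` are hypothetical names invented for illustration. Throwing an unchecked exception from `collect` is one way to bail out of a search early; it is an assumption here, not something the thread prescribes.

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyStopSketch {

    /** Stand-in for Lucene's HitCollector callback (assumed signature). */
    abstract static class HitCollectorStub {
        abstract void collect(int doc, float score);
    }

    /** Hypothetical signal used only to abandon the search once we have enough hits. */
    static final class EnoughHitsException extends RuntimeException {
        EnoughHitsException() {
            super(null, null, false, false); // no message, no stack trace: it is control flow
        }
    }

    /** Keeps doc ids whose score exceeds minScore and stops after maxHits of them. */
    static final class ThresholdCollector extends HitCollectorStub {
        private final float minScore;
        private final int maxHits;
        private final List<Integer> docs = new ArrayList<Integer>();

        ThresholdCollector(float minScore, int maxHits) {
            this.minScore = minScore;
            this.maxHits = maxHits;
        }

        @Override
        void collect(int doc, float score) {
            if (score <= minScore) {
                return; // not a "decent" result, ignore it
            }
            docs.add(doc);
            if (docs.size() >= maxHits) {
                throw new EnoughHitsException(); // enough collected, stop scoring
            }
        }

        List<Integer> docs() {
            return docs;
        }
    }

    /**
     * Hypothetical driver standing in for the searcher: feeds (doc, score)
     * pairs to the collector in doc id order and swallows the stop signal.
     */
    public static List<Integer> run(float[] scores, float minScore, int maxHits) {
        ThresholdCollector collector = new ThresholdCollector(minScore, maxHits);
        try {
            for (int doc = 0; doc < scores.length; doc++) {
                collector.collect(doc, scores[doc]);
            }
        } catch (EnoughHitsException stop) {
            // expected: the collector has all the hits it wants
        }
        return collector.docs();
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.1f, 0.8f, 0.7f, 0.95f};
        System.out.println(run(scores, 0.5f, 2)); // prints [0, 2]
    }
}
```

With a real searcher the try/catch would wrap the search call instead of the loop. Note that a collector like this keeps whichever qualifying docs it is handed first, so if hits arrive in doc id order it has the same bias towards older documents that you mentioned for TopDocCollector.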
BTW, are there lotsa concurrent searches?

-John

On Thu, Dec 4, 2008 at 12:52 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:

> That makes sense. I should be more precise in that all I need is 100 of the
> 10000 "reasonable" results.
>
> The concern I would have with a TopDocCollector is that this is biased
> towards the top of the index which translates for me into a bias for older
> documents. I'd prefer no age bias or a newer document bias. So I'll see what
> I can do with a "BottomDocCollector" :-)
>
> Tim
>
> On 12/4/08 12:39 PM, "Erick Erickson" <[EMAIL PROTECTED]> wrote:
>
> > The problem here is how *could* a system return even the top
> > 10,000 results without scoring them all? What if the millionth
> > hit resulted in the very best match in the entire corpus?
> >
> > That said, sorting may well be the issue here rather than scoring.
> > You can use a TopDocCollector to get the top N matches (unsorted)
> > and then do something like use the FieldSortedHitQueue to sort
> > those N matches, leaving out all the rest of the matches. Note
> > this assumes that when you say "sorting" you mean sorting
> > by something other than relevance.....
> >
> > Hope this helps
> > Erick
> >
> > On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:
> >
> >> Hi all,
> >>
> >> I have an interesting problem with my query traffic. Most of the
> >> queries run in a fairly short amount of time (< 100ms) but a few take
> >> over 1000ms. These queries are predominantly those with a huge number
> >> of hits (>1 million hits in a >100 million document index). The time
> >> taken (as far as I can tell) is for lucene to sit there while it
> >> scores and sorts all these results.
> >>
> >> However it turns out these queries really don't have top results.
> >> That is, of the million documents, there are easily 10000 which are
> >> decent results (basically those above some threshold score). Frankly,
> >> just returning some consistent (so paging and reload work) but
> >> otherwise arbitrary ranking of these 10000 results would be more than
> >> good enough.
> >>
> >> It seems to me that a solution would be to impose some sort of
> >> pseudo-random filter (e.g. consider only every n-th document assuming
> >> they are uniformly distributed). I'm wondering if anyone else has
> >> experience with this sort of issue and what solutions they have found
> >> to work well in practice.
> >>
> >> Thanks,
> >>
> >> Tim
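
The "consider only every n-th document" idea from the original message can be sketched as a precomputed bit set of permitted doc ids. This is only an illustration under assumptions: `EveryNthDocSketch`, `everyNth` and the `offset` parameter are hypothetical names, and wiring the resulting `java.util.BitSet` into a search filter is left out. Because the selection depends only on the doc id, the same documents are chosen on every request, which gives the consistency needed for paging and reload as long as the index does not change.

```java
import java.util.BitSet;

public class EveryNthDocSketch {

    /**
     * Marks every n-th doc id in [0, maxDoc) as permitted, starting at
     * offset mod n. All names here are illustrative, not a Lucene API.
     */
    public static BitSet everyNth(int maxDoc, int n, int offset) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        BitSet permitted = new BitSet(Math.max(maxDoc, 0));
        // floorMod keeps the starting point in [0, n) even for negative offsets
        for (int doc = Math.floorMod(offset, n); doc < maxDoc; doc += n) {
            permitted.set(doc);
        }
        return permitted;
    }

    public static void main(String[] args) {
        System.out.println(everyNth(10, 3, 0)); // prints {0, 3, 6, 9}
        System.out.println(everyNth(10, 3, 1)); // prints {1, 4, 7}
    }
}
```

Sampling by stride spreads the permitted documents evenly across the index, so it does not favour older or newer documents, at the cost of silently dropping matches that fall between the sampled ids.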