Do you have a dataset and queries I can test on?

On Dec 10, 2007 1:16 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:

> Shai Erera wrote:
>
> > No - I didn't try to populate an index with real data and run real
> > queries (what is "real" after all?). I know from experience with
> > indexes of several million documents that some queries return several
> > hundred thousand results (one query even hit 2.5 M documents). This is
> > typical in search: users type on average 2.3 terms per query. In such
> > cases the chances of hitting a query with a huge result set are not
> > that small (I'm not saying this is the most common case though; I
> > agree that most searches don't process that many documents).
>
> Agreed: many queries do hit a great many results.  But I agree with
> Paul: it's not clear how this "typically" translates into the number
> of ScoreDocs that actually get created.
>
> > However, this change will improve performance from the algorithmic
> > point of view - you allocate at most numRequestedHits+1 ScoreDocs no
> > matter how many documents your query processes.
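
To make that allocation bound concrete, here is a minimal, self-contained
sketch of the idea. The class and method names are illustrative only, not
Lucene's actual HitQueue/collector API: the queue is pre-filled with sentinel
entries and a single spare entry is recycled on every offered hit, so exactly
numRequestedHits + 1 entry objects are allocated no matter how many documents
match.

// Illustrative sketch (not Lucene's real HitQueue): a fixed-size min-heap of
// doc/score entries, pre-filled with sentinels, that recycles one spare entry
// per insertion.  Assumes numRequestedHits >= 1; tie-breaking by doc id is
// omitted for brevity.
final class BoundedHitQueue {

  static final class Entry {
    int doc = -1;                            // -1 marks an unused sentinel
    float score = Float.NEGATIVE_INFINITY;   // loses to any real hit
  }

  private final Entry[] heap;  // 1-based binary min-heap ordered by score
  private Entry spare;         // single entry recycled on every insertion

  BoundedHitQueue(int numRequestedHits) {
    heap = new Entry[numRequestedHits + 1];
    for (int i = 1; i <= numRequestedHits; i++) {
      heap[i] = new Entry();                 // pre-fill with sentinels
    }
    spare = new Entry();                     // numRequestedHits + 1 entries, total
  }

  /** Offer one hit.  Returns true if it entered the queue; allocates nothing. */
  boolean collect(int doc, float score) {
    Entry worst = heap[1];
    if (score <= worst.score) {
      return false;                          // does not beat the current worst
    }
    spare.doc = doc;
    spare.score = score;
    heap[1] = spare;                         // replace the worst entry ...
    spare = worst;                           // ... and recycle it as the next spare
    downHeap();
    return true;
  }

  private void downHeap() {                  // standard sift-down on score
    int i = 1;
    Entry node = heap[i];
    int last = heap.length - 1;
    int child = 2;
    while (child <= last) {
      if (child < last && heap[child + 1].score < heap[child].score) {
        child++;
      }
      if (heap[child].score >= node.score) {
        break;
      }
      heap[i] = heap[child];
      i = child;
      child = i << 1;
    }
    heap[i] = node;
  }
}

Extracting the top hits would just pop the heap, skipping any leftover
sentinel entries (doc == -1) when the query matched fewer than
numRequestedHits documents.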
>
> It's definitely a good step forward: not creating extra garbage in hot
> spots is worthwhile, so I think we should make this change.  Still I'm
> wondering how much this helps in practice.
>
> I think benchmarking on "real" use cases (vs synthetic tests) is
> worthwhile: it keeps you focused on what really counts, in the end.
>
> In this particular case there are at least 2 things it could show us:
>
>   * How many ScoreDocs really get created, or, what percentage of hits
>     actually results in an insertion into the PQ?  (A rough way to count
>     this is sketched after this list.)
>
>   * How large is this saving as a percentage of the overall time spent
>     searching?
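
For the first measurement, a rough counting sketch, building on the
illustrative BoundedHitQueue above (whose collect method reports whether a
hit entered the queue); in real benchmark code the same two counters could
live inside whatever collector drives the search:

// Count hits seen vs. hits that actually displaced an entry in the PQ.
final class InsertionStats {
  long hitsSeen;
  long insertions;

  void record(boolean inserted) {
    hitsSeen++;
    if (inserted) {
      insertions++;
    }
  }

  double percentInserted() {
    return hitsSeen == 0 ? 0.0 : 100.0 * insertions / hitsSeen;
  }
}

// Inside the per-document scoring callback:
//   stats.record(queue.collect(doc, score));
// and after the search:
//   System.out.println(stats.percentInserted() + "% of hits entered the PQ");

The second measurement would simply compare wall-clock time for the same
query set with and without the change.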
>
> Mike
>


-- 
Regards,

Shai Erera
