I have access to TREC; I can try this. W.r.t. the large indexes - I don't have access to the data, just scenarios our customers ran into in the past. Does the benchmark package include code to crawl Wikipedia? If not, do you have such code? I don't want to write it from scratch for this specific task.
On Dec 10, 2007 1:50 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> I don't offhand. Working on the indexing side is so much easier :)
>
> You mentioned your experience with large indices & large result sets
> -- is that something you could draw on?
>
> There have also been discussions lately about finding real search
> logs we could use for exactly this reason, though I don't think
> that's come to a good solution yet.
>
> As a simple test you could break Wikipedia into smallish docs (~4K
> each = ~2.1 million docs), build the index, and make up a set of
> queries, or randomly pick terms for queries? Obviously the queries
> aren't "real", but it's at least a step closer.... progress not
> perfection.
>
> Or, if you have access to TREC...
>
> Mike
>
> Shai Erera wrote:
>
> > Do you have a dataset and queries I can test on?
> >
> > On Dec 10, 2007 1:16 PM, Michael McCandless
> > <[EMAIL PROTECTED]> wrote:
> >
> >> Shai Erera wrote:
> >>
> >>> No - I didn't try to populate an index with real data and run real
> >>> queries (what is "real" after all?). I know from my experience of
> >>> indexes with several millions of documents where there are queries
> >>> with several hundred thousand results (one query even hit 2.5 M
> >>> documents). This is typical in search: users type on average 2.3
> >>> terms in a query. The chances you'd hit a query with a huge result
> >>> set are not that small in such cases (I'm not saying this is the
> >>> most common case though; I agree that most searches don't process
> >>> that many documents).
> >>
> >> Agreed: many queries do hit a great many results. But I agree with
> >> Paul: it's not clear how this "typically" translates into how many
> >> ScoreDocs get created?
> >>
> >>> However, this change will improve performance from the algorithm
> >>> point of view - you allocate as many as numRequestedHits+1 no
> >>> matter how many documents your query processes.
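[For readers of the archive: the "numRequestedHits+1" point above is the bounded-allocation idea. Below is a minimal standalone sketch of it - not Lucene's actual TopDocCollector/PriorityQueue code; the ScoreDoc stand-in and class names here are simplified for illustration. Once the queue is full, the evicted bottom entry is reused for the next competitive hit instead of allocating a new object, so allocation is bounded by K no matter how many hits the query processes.]

```java
import java.util.PriorityQueue;

// Minimal stand-in for Lucene's ScoreDoc: a (doc id, score) pair.
class ScoreDoc {
    int doc;
    float score;
    ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
}

// Bounded top-K collector: allocates at most K ScoreDoc objects no
// matter how many hits it sees, by reusing the evicted bottom entry
// when a more competitive hit arrives. (The real Lucene collector
// differs in detail; this only shows the bounded-allocation idea.)
public class TopKCollector {
    final int k;
    final PriorityQueue<ScoreDoc> pq;  // min-heap ordered by score
    long allocated = 0;                // ScoreDoc allocations, for demonstration

    public TopKCollector(int k) {
        this.k = k;
        this.pq = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
    }

    public void collect(int doc, float score) {
        if (pq.size() < k) {
            pq.add(new ScoreDoc(doc, score));
            allocated++;
        } else if (score > pq.peek().score) {
            ScoreDoc bottom = pq.poll();  // reuse instead of new ScoreDoc(...)
            bottom.doc = doc;
            bottom.score = score;
            pq.add(bottom);
        }
        // Non-competitive hits allocate nothing at all.
    }

    public static void main(String[] args) {
        TopKCollector c = new TopKCollector(10);
        for (int i = 0; i < 1_000_000; i++) {
            c.collect(i, (i * 2654435761L % 1_000_003) / 1_000_003f);
        }
        // allocated stays at 10 even after a million hits
        System.out.println("hits=1000000 allocated=" + c.allocated);
    }
}
```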
> >>
> >> It's definitely a good step forward: not creating extra garbage in
> >> hot spots is worthwhile, so I think we should make this change.
> >> Still I'm wondering how much this helps in practice.
> >>
> >> I think benchmarking on "real" use cases (vs synthetic tests) is
> >> worthwhile: it keeps you focused on what really counts, in the end.
> >>
> >> In this particular case there are at least 2 things it could show us:
> >>
> >>   * How many ScoreDocs really get created, or, what percentage of
> >>     hits actually result in an insertion into the PQ?
> >>
> >>   * How much is this saving as a percentage of the overall time
> >>     spent searching?
> >>
> >> Mike
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >
> > --
> > Regards,
> >
> > Shai Erera
>

--
Regards,

Shai Erera
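[For readers of the archive: Mike's first question - what percentage of hits actually result in a PQ insertion - can be estimated with a back-of-the-envelope experiment like the hypothetical one below. It uses uniformly random scores, which is not a real query/score distribution, and all names are made up. For uniform random scores only roughly k*ln(n/k) of n hits are competitive in expectation, so the insertion ratio is tiny for large result sets; real score distributions may behave differently.]

```java
import java.util.PriorityQueue;
import java.util.Random;

// Counts what fraction of hits are "competitive", i.e. actually enter
// the top-K priority queue, for uniformly random scores. Illustrative
// only - not Lucene code.
public class InsertionRatio {
    public static double measure(int numHits, int k, long seed) {
        Random rng = new Random(seed);
        PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap of scores
        long insertions = 0;
        for (int i = 0; i < numHits; i++) {
            float score = rng.nextFloat();
            if (pq.size() < k) {
                pq.add(score);
                insertions++;
            } else if (score > pq.peek()) {
                pq.poll();
                pq.add(score);
                insertions++;
            }
        }
        return (double) insertions / numHits;
    }

    public static void main(String[] args) {
        // e.g. ~2.1 M hits (Mike's Wikipedia estimate), top 10 requested
        System.out.printf("insertion ratio: %.6f%n", measure(2_100_000, 10, 42));
    }
}
```

Under these assumptions the ratio comes out far below 1%, which suggests the PQ-insertion path is cold relative to the hit-collection path - exactly the kind of number Mike is asking the benchmark to produce.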