OK, sounds like a plan, thanks!

Yes, contrib/benchmark has EnwikiDocMaker to generate docs off the Wikipedia XML export file.

Mike

On Dec 10, 2007, at 7:03 AM, Shai Erera wrote:

I have access to TREC. I can try this.
W.r.t. the large indexes - I don't have access to the data, just scenarios
our customers ran into in the past.
Does the benchmark package include code to crawl Wikipedia? If not, do you
have such code? I don't want to write it from scratch for this specific task.

On Dec 10, 2007 1:50 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:


I don't offhand.  Working on the indexing side is so much easier :)

You mentioned your experience with large indices & large result sets
-- is that something you could draw on?

There have also been discussions lately about finding real search
logs we could use for exactly this reason, though I don't think
that's come to a good solution yet.

As a simple test you could break Wikipedia into smallish docs (~4K
each = ~2.1 million docs), build the index, and make up a set of
queries, or randomly pick terms for queries?  Obviously the queries
aren't "real", but it's at least a step closer.... progress not
perfection.

Or, if you have access to TREC...
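For the "randomly pick terms" idea, here is a rough sketch (not from this
thread; the class and method names are made up) of sampling existing terms
from an index to build synthetic term queries, using the Lucene 2.x
TermEnum API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RandomTermQueries {

  /** Collect every term in the given field, then sample numQueries of them. */
  public static List<Query> makeQueries(IndexReader reader, String field,
                                        int numQueries, long seed)
      throws IOException {
    List<Term> terms = new ArrayList<Term>();
    // Enumerate all terms in the field, starting from the empty string.
    TermEnum te = reader.terms(new Term(field, ""));
    try {
      do {
        Term t = te.term();
        if (t == null || !t.field().equals(field)) {
          break;
        }
        terms.add(t);
      } while (te.next());
    } finally {
      te.close();
    }

    // Uniformly sample terms and wrap each one in a TermQuery.
    Random random = new Random(seed);
    List<Query> queries = new ArrayList<Query>();
    for (int i = 0; i < numQueries && !terms.isEmpty(); i++) {
      queries.add(new TermQuery(terms.get(random.nextInt(terms.size()))));
    }
    return queries;
  }
}

Sampling uniformly over the term dictionary skews heavily toward rare terms,
so the result sets will tend to be small; weighting the sample by docFreq()
would get closer to the huge result sets being discussed.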

Mike

Shai Erera wrote:

Do you have a dataset and queries I can test on?

On Dec 10, 2007 1:16 PM, Michael McCandless <[EMAIL PROTECTED]>
wrote:

Shai Erera wrote:

No - I didn't try to populate an index with real data and run real queries
(what is "real" after all?). I do know, from my experience with indexes of
several million documents, of queries that return several hundred thousand
results (one query even hit 2.5M documents). This is typical in search:
users type on average 2.3 terms per query, so the chances you'd hit a query
with a huge result set are not that small (I'm not saying this is the most
common case though; I agree that most searches don't process that many
documents).

Agreed: many queries do hit a great many results.  But I agree with Paul:
it's not clear how this "typically" translates into how many ScoreDocs
actually get created.

However, this change will improve performance from the algorithmic point of
view - you allocate at most numRequestedHits+1 ScoreDocs no matter how many
documents your query processes.

It's definitely a good step forward: not creating extra garbage in hot
spots is worthwhile, so I think we should make this change.  Still, I'm
wondering how much this helps in practice.

I think benchmarking on "real" use cases (vs synthetic tests) is
worthwhile: it keeps you focused on what really counts, in the end.

In this particular case there are at least two things it could show us:

  * How many ScoreDocs really get created, i.e. what percentage of hits
    actually result in an insertion into the PQ? (See the sketch below.)

  * How much does this save as a percentage of the overall time spent
    searching?
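
For reference, here is a rough sketch of the kind of bounded collector being
discussed - it is not Lucene's actual TopDocCollector, and the class and
field names are made up. It keeps allocations bounded by the requested
number of hits rather than by the number of matching documents (the
numRequestedHits+1 point above), and it counts how many hits actually result
in an insertion into the PQ:

import java.util.Comparator;
import java.util.PriorityQueue;

import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.ScoreDoc;

public class BoundedTopDocsCollector extends HitCollector {

  private final int numHits;
  // Min-heap: the head is the "worst" hit currently kept.
  private final PriorityQueue<ScoreDoc> pq;

  long totalHits = 0;    // every matching doc seen by collect()
  long insertions = 0;   // hits that actually entered the queue

  public BoundedTopDocsCollector(int numHits) {
    this.numHits = numHits;
    this.pq = new PriorityQueue<ScoreDoc>(numHits, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        if (a.score != b.score) {
          return a.score < b.score ? -1 : 1;
        }
        return b.doc - a.doc; // equal scores: higher docID is worse
      }
    });
  }

  public void collect(int doc, float score) {
    totalHits++;
    if (pq.size() < numHits) {
      pq.add(new ScoreDoc(doc, score)); // at most numHits allocations
      insertions++;
    } else if (score > pq.peek().score) {
      // Competitive hit: reuse the evicted ScoreDoc instead of allocating.
      ScoreDoc bottom = pq.poll();
      bottom.doc = doc;
      bottom.score = score;
      pq.add(bottom);
      insertions++;
    }
    // Otherwise the hit is not competitive: no allocation, no insertion.
  }
}

One would pass this to Searcher.search(Query, HitCollector) and read off
insertions/totalHits after each query; the second question (time saved as a
percentage of overall search time) still needs a wall-clock comparison
against the current collector.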

Mike





--
Regards,

Shai Erera






--
Regards,

Shai Erera


