Fair points.
I've tried several index sizes and blends of query term frequencies now and 
the results swing only marginally between the two implementations.
Sometimes the "exiting early" logic is marginally faster and other times 
marginally slower. Using a larger index seemed to reduce the improvement I had 
seen in my initial results.
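
For reference, the change I've been benchmarking has roughly the shape below. 
It's a simplified, standalone sketch rather than the actual ConjunctionScorer 
source; the SimpleDocIterator and SimpleConjunction names are made up for 
illustration, and the round-robin loop is just in the spirit of doNext():

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

// Simplified stand-in for a doc-id iterator, purely for illustration.
abstract class SimpleDocIterator {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    public abstract int docID();                                 // current doc, idempotent
    public abstract int advance(int target) throws IOException; // next doc >= target
}

class SimpleConjunction {
    private final SimpleDocIterator[] iters;

    // Assumes each iterator is already positioned on its first doc;
    // sorting by current doc keeps the round-robin loop below correct.
    SimpleConjunction(SimpleDocIterator[] iters) {
        this.iters = iters.clone();
        Arrays.sort(this.iters, Comparator.comparingInt(SimpleDocIterator::docID));
    }

    // Round-robin alignment in the spirit of ConjunctionScorer.doNext().
    // The NO_MORE_DOCS test is the "exit early" variant under test: once any
    // iterator is exhausted there can be no further matches, so return at once
    // instead of advancing the remaining iterators to NO_MORE_DOCS as well
    // (fewer reads, at the cost of an extra int comparison in the hot loop).
    int doNext() throws IOException {
        int first = 0;
        int doc = iters[iters.length - 1].docID();
        while (iters[first].docID() < doc) {
            doc = iters[first].advance(doc);
            if (doc == SimpleDocIterator.NO_MORE_DOCS) {
                return doc; // exit early
            }
            first = (first + 1) % iters.length;
        }
        return doc;
    }
}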

So overall, not a clear improvement and not worth bothering with because, as 
you suggest, various disk caching strategies probably mitigate the cost of the 
added reads.

Your comments about the added int comparison cost in that "hot" loop made me 
wonder whether the abstract DocIdSetIterator.docID() method call could be 
questioned on the same basis.
It looks like all DocIdSetIterator subclasses maintain a doc variable that is 
mutated elsewhere in their advance() and nextDoc() calls, and docID() is meant 
to be idempotent, so presumably a shared variable in the base class could 
avoid a docID() method invocation?
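
Something like the hypothetical shape below is what I'm picturing (just a 
sketch of the idea, not the real API; the SharedDocIterator name is made up):

import java.io.IOException;

// Hypothetical sketch only, not the actual DocIdSetIterator: if every
// subclass keeps its own current-doc variable and docID() simply returns it,
// the field could live in the base class and docID() could be a final
// accessor, so callers pay a field read instead of a virtual call.
abstract class SharedDocIterator {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Shared current doc, mutated by subclasses inside nextDoc()/advance().
    protected int doc = -1;

    // Final and non-abstract: no per-call dynamic dispatch for docID().
    public final int docID() {
        return doc;
    }

    public abstract int nextDoc() throws IOException;

    public abstract int advance(int target) throws IOException;
}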
Anyhoo, the profiler didn't show that method up as any sort of hotspot, so I 
don't think it's an issue.


Thanks, Mike.




----- Original Message -----
From: Michael McCandless <luc...@mikemccandless.com>
To: dev@lucene.apache.org; mark harwood <markharw...@yahoo.co.uk>
Cc: 
Sent: Thursday, 1 March 2012, 14:18
Subject: Re: ConjunctionScorer.doNext() overstays?

On Thu, Mar 1, 2012 at 8:49 AM, mark harwood <markharw...@yahoo.co.uk> wrote:
> I would have assumed the many int comparisons would cost less than the 
> superfluous disk accesses? (I bow to your considerable experience in this 
> area!)
> What is the worst-case scenario on added disk reads? Could it be as bad 
> as numberOfSegments x numberOfOtherScorers before the query winds up?

Well, it depends -- the disk access is a one-time cost but the added
check is paid on every hit.  At some point it'll cross over...

I think likely the advance(NO_MORE_DOCS) will not usually hit disk:
our skipper impl fully pre-buffers (in RAM) the top skip lists I
think?  Even if we do go to disk it's likely the OS pre-cached those
bytes in its IO buffer.

> On the index I tried, it looked like an improvement - the spreadsheet I 
> linked to has the source for the benchmark on a second worksheet if you want 
> to give it a whirl on a different dataset.

Maybe try it on a more balanced case?  Ie, N high-freq terms whose
freq is "close-ish"?  And on slow queries (I think the results in your
spreadsheet are very fast queries right?  The slowest one was ~0.95
msec per query, if I'm reading it right?).

In general I think not slowing down the worst-case queries is much
more important than speeding up the super-fast queries.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
