Fair points. I've now tried several index sizes and blends of query term frequencies, and the results swing only marginally between the two implementations. Sometimes the "exit early" logic is marginally faster, other times marginally slower. Using a larger index seemed to reduce the improvement I had seen in my initial results.
So overall, not a clear improvement and not worth bothering with because, as you suggest, various disk caching strategies probably mitigate the cost of the added reads.

Your comments on the cost of the added int comparison in that "hot" loop made me wonder whether the abstract DocIdSetIterator.docID() method call could be questioned on the same basis. All DocIdSetIterator subclasses seem to maintain a doc variable that is mutated elsewhere, in advance() and next() calls, and docID() is meant to be idempotent, so presumably a shared variable in the base class could avoid a docID() method invocation? Anyhoo, the profiler did not show that method up as any sort of hotspot, so I don't think it's an issue.

Thanks, Mike.

----- Original Message -----
From: Michael McCandless <luc...@mikemccandless.com>
To: dev@lucene.apache.org; mark harwood <markharw...@yahoo.co.uk>
Cc:
Sent: Thursday, 1 March 2012, 14:18
Subject: Re: ConjunctionScorer.doNext() overstays?

On Thu, Mar 1, 2012 at 8:49 AM, mark harwood <markharw...@yahoo.co.uk> wrote:
> I would have assumed the many int comparisons would cost less than the
> superfluous disk accesses? (I bow to your considerable experience in this
> area!)
> What is the worst-case scenario on added disk reads? Could it be as bad
> as numberOfSegments x numberOfOtherScorers before the query winds up?

Well, it depends -- the disk access is a one-time cost but the added check happens per hit. At some point the two will cross over...

I think the advance(NO_MORE_DOCS) will likely not usually hit disk: our skipper impl fully pre-buffers the top skip lists in RAM, I think? Even if we do go to disk, it's likely the OS has pre-cached those bytes in its IO buffer.

> On the index I tried, it looked like an improvement - the spreadsheet I
> linked to has the source for the benchmark on a second worksheet if you want
> to give it a whirl on a different dataset.

Maybe try it on a more balanced case? Ie, N high-freq terms whose freqs are "close-ish"?
And on slow queries (I think the results in your spreadsheet are from very fast queries, right? The slowest one was ~0.95 msec per query, if I'm reading it right?). In general I think not slowing down the worst-case queries is much more important than speeding up the super-fast ones.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
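[For readers following along: the two ideas discussed above can be sketched in a toy form. This is a hedged illustration, not actual Lucene source -- the `IntArrayIterator` and `Conjunction` classes are made up for the example. It shows (a) the mutable `doc` field hoisted into the abstract base class so that `docID()` becomes a final, non-virtual field read, and (b) a leapfrog conjunction loop with the "exit early" check on NO_MORE_DOCS that the thread debates, which skips advancing the remaining scorers once any scorer is exhausted.]

```java
// Toy sketch (NOT real Lucene code): base class owns the doc cursor.
abstract class DocIdSetIterator {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Subclasses mutate this in advance(); docID() only reads it.
    protected int doc = -1;

    /** Idempotent and final: a plain field load, no virtual dispatch. */
    public final int docID() { return doc; }

    /** Advance to the first doc >= target and return it. */
    public abstract int advance(int target);
}

/** Trivial iterator over a sorted int[] "posting list". */
class IntArrayIterator extends DocIdSetIterator {
    private final int[] docs;
    private int i = -1;

    IntArrayIterator(int[] docs) { this.docs = docs; }

    @Override public int advance(int target) {
        while (doc < target) {
            doc = (++i < docs.length) ? docs[i] : NO_MORE_DOCS;
        }
        return doc;
    }
}

class Conjunction {
    /** Leapfrog all scorers onto the first doc they share, >= target. */
    static int doNext(DocIdSetIterator[] scorers, int target) {
        int doc = scorers[0].advance(target);
        outer:
        while (doc != DocIdSetIterator.NO_MORE_DOCS) {
            for (int i = 1; i < scorers.length; i++) {
                int other = scorers[i].advance(doc);
                if (other == DocIdSetIterator.NO_MORE_DOCS) {
                    // The "exit early" check under discussion: one extra int
                    // comparison per advance, but it avoids superfluously
                    // advancing the remaining scorers to NO_MORE_DOCS.
                    return other;
                }
                if (other > doc) {
                    // A scorer overshot the candidate: adopt the new
                    // candidate and restart the agreement check.
                    doc = scorers[0].advance(other);
                    continue outer;
                }
            }
            return doc; // every scorer is positioned on doc
        }
        return DocIdSetIterator.NO_MORE_DOCS;
    }
}
```

The trade-off the thread lands on is visible here: the extra comparison runs on every leapfrog step (per hit), while the `advance(NO_MORE_DOCS)` it saves is a one-time cost per scorer per segment, and often served from pre-buffered skip data anyway.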