On Thu, Oct 15, 2009 at 3:12 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> If I'm remembering it right... this (matching MultiSearcher's approach)
> was nearly the first thing we tried with LUCENE-1483.  But the CPU
> cost was higher in our tests.  I think we had tested unbalanced and
> balanced segments, but memory is definitely somewhat hazy at this
> point...
>

Yeah, from what I could tell, the way you used the PQ API - comparing a
candidate only against the bottom, and touching the heap only once it
actually beats the bottom, instead of always just calling
addWithOverflow - leads to a significant performance improvement, but
that same trick can be applied to the MultiSearcher / multiple-PQ
approach as well.  Like I said, I'm pretty sure the overall complexity
is the same (ie. the leading term, with the factor of numDocs = N * M
out front), and we're just talking about sub-leading behavior at this
point.
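
(To make sure we're talking about the same trick, here's a tiny
standalone sketch of the "check the bottom first" idiom - made-up names,
java.util.PriorityQueue instead of the Lucene one, nothing from the
actual 2.9 collector:)

import java.util.PriorityQueue;

// Sketch only: keep the top K scores in a min-heap, so peek() is the
// current "bottom" (weakest) hit in the queue.
final class BottomCheckSketch {

  static void collect(float[] scores, int k) {
    PriorityQueue<Float> pq = new PriorityQueue<>(k);   // min-heap

    for (float score : scores) {
      if (pq.size() == k && score <= pq.peek()) {
        // The cheap, highly predictable branch: most hits are rejected
        // here with a single compare and no heap reorganization at all.
        continue;
      }
      // Only hits that beat the bottom touch the heap; the "just call
      // addWithOverflow for every hit" style goes through the queue's
      // insert logic every time instead.
      pq.offer(score);
      if (pq.size() > k) {
        pq.poll();   // drop the old bottom
      }
    }
    System.out.println("bottom of the top-" + k + ": " + pq.peek());
  }

  public static void main(String[] args) {
    collect(new float[] {0.3f, 2.5f, 1.1f, 0.9f, 4.2f, 0.1f, 3.3f}, 3);
  }
}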


> I suspect even in the balanced case the cost will be higher with the
> "separate PQs then merge at the end" approach.  It's not only the int
> ord comparisons (which are cheap) that take CPU.  It's the if
> statements that are executed as a result of the comparison when
> re-heapifying the PQ.  Those are far more costly, especially if the CPU
> has trouble predicting which way the if will go, which it does here.


> Ie, it's the insertions into each PQ that are expensive; by tracking a
> separate PQ per reader you'll have many more insertions since
> each PQ must re-converge.
>

Insertions are expensive - totally agreed, and subtle branching tweaks
make a big difference at the sub-leading level.  So we need to do a
careful analysis of the sub-leading behavior (the terms that depend on K
and M but are O(1) in terms of N):

The multi-PQ approach requires more insertions - O(M * K log(K)) of
them - but each one can be made very fast, with careful branch tuning
similar to the tweaks currently in the 2.9 code, since all of the
comparisons are ints.  The final merge requires roughly K log(M) string
compares on top of that.
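
Roughly what I have in mind for the multi-PQ side, as a standalone
sketch (made-up types - SegmentView in particular is just a stand-in for
"per-segment ords plus a term dictionary lookup", not any real Lucene
interface):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the multi-PQ idea: one small queue per segment, filled with
// cheap int-ord compares; the only string compares happen in the final
// M-way merge.
final class MultiPQSketch {

  // Hypothetical per-segment view: each doc has an int ord into that
  // segment's sorted term dictionary, and lookup(ord) returns the term
  // string itself.
  interface SegmentView {
    int maxDoc();
    int ord(int doc);
    String lookup(int ord);
  }

  // Returns the K largest values across all segments (a descending sort,
  // just to keep the sketch simple).
  static List<String> topK(List<SegmentView> segments, int k) {

    // Phase 1: per-segment queues, int compares only.  Entry = {ord, doc},
    // min-heap on ord so peek() is the bottom of that segment's top K.
    List<PriorityQueue<int[]>> perSegment = new ArrayList<>();
    for (SegmentView seg : segments) {
      PriorityQueue<int[]> pq =
          new PriorityQueue<>(k, Comparator.comparingInt((int[] e) -> e[0]));
      for (int doc = 0; doc < seg.maxDoc(); doc++) {
        int ord = seg.ord(doc);
        if (pq.size() == k && ord <= pq.peek()[0]) {
          continue;                      // one predictable int compare
        }
        pq.offer(new int[] {ord, doc});  // O(log K) int compares
        if (pq.size() > k) pq.poll();
      }
      perSegment.add(pq);
    }

    // Phase 2: resolve each segment's survivors to strings (at most M * K
    // lookups), sort each list best-first, then do an M-way merge: K pops,
    // each costing ~log(M) string compares via the cursor heap.
    List<List<String>> sorted = new ArrayList<>();
    for (int i = 0; i < segments.size(); i++) {
      List<String> values = new ArrayList<>();
      for (int[] entry : perSegment.get(i)) {
        values.add(segments.get(i).lookup(entry[0]));
      }
      values.sort(Comparator.reverseOrder());
      sorted.add(values);
    }

    // cursor = {segment index, position in that segment's sorted list};
    // the head of this heap is the best remaining value across segments.
    PriorityQueue<int[]> cursors = new PriorityQueue<int[]>((a, b) ->
        sorted.get(b[0]).get(b[1]).compareTo(sorted.get(a[0]).get(a[1])));
    for (int i = 0; i < sorted.size(); i++) {
      if (!sorted.get(i).isEmpty()) cursors.offer(new int[] {i, 0});
    }

    List<String> result = new ArrayList<>();
    while (result.size() < k && !cursors.isEmpty()) {
      int[] best = cursors.poll();
      result.add(sorted.get(best[0]).get(best[1]));
      if (best[1] + 1 < sorted.get(best[0]).size()) {
        cursors.offer(new int[] {best[0], best[1] + 1});
      }
    }
    return result;
  }
}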

The single-PQ approach requires only O(K * log(K) * log(M)) insertions,
but each one takes O(log(K)) *string* compares instead of O(log(K)) int
compares, and there is a segment conversion on top of that.
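
And the single-PQ side, sketched the same way (again made-up types, not
the real FieldComparator plumbing - just the shape of the cost I'm
describing):

import java.util.PriorityQueue;

// Sketch of the single-PQ idea: one queue shared across all M segments.
// The entries have to be comparable across segments, so they carry the
// term string itself, which means every re-heapify is a string compare;
// the fast bottom check can stay an int ord compare, but only by
// converting the bottom into each new segment's ord space ("segment
// conversion").
final class SinglePQSketch {

  // Same hypothetical per-segment view as in the multi-PQ sketch, plus a
  // way to find where an arbitrary value would sort in this segment.
  interface SegmentView {
    int maxDoc();
    int ord(int doc);
    String lookup(int ord);
    int ordFor(String value);  // count of terms sorting before value,
                               // e.g. via binary search of the term dict
  }

  static PriorityQueue<String> topK(Iterable<SegmentView> segments, int k) {
    // min-heap on the string value: peek() is the bottom of the global top K
    PriorityQueue<String> pq = new PriorityQueue<>(k);

    for (SegmentView seg : segments) {
      // "segment conversion": re-anchor the current bottom in this
      // segment's ord space so the hot-loop check can stay an int compare.
      int bottomOrd = (pq.size() == k) ? seg.ordFor(pq.peek()) : -1;

      for (int doc = 0; doc < seg.maxDoc(); doc++) {
        int ord = seg.ord(doc);
        if (pq.size() == k && ord < bottomOrd) {
          continue;   // cheap int reject (strict <, so ties go to the queue)
        }
        // Accepted hits re-heapify the shared queue with O(log K) *string*
        // compares, since ords from different segments aren't comparable.
        pq.offer(seg.lookup(ord));
        if (pq.size() > k) {
          pq.poll();
          bottomOrd = seg.ordFor(pq.peek());   // new bottom, new ord
        }
      }
    }
    return pq;
  }
}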

Plugging in some numbers, with K = 100 and M = 20:

  multiPQ : 14,000 insertions which have 7 int compares each,
            followed by 400 string compares for the merge
vs
  singlePQ: 2,800 insertions which have maybe 4 string compares
            and 3 int compares each
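
Just to show where those numbers come from (log base 2, rounding
log2(100) to 7 and log2(20) to 4):

  multiPQ insertions:   M * K * log2(K)       = 20 * 100 * 7 = 14,000
  final merge compares: K * log2(M)           = 100 * 4      = 400
  singlePQ insertions:  K * log2(K) * log2(M) = 100 * 7 * 4  = 2,800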

---------

So yeah, executive summary in theory: *at sub-leading order*, Yonik is
right that singlePQ does fewer total operations (log(M) vs M), but it
does way more of the slow string compares (K log(K)^2 log(M) vs
K log(M)).


> But I agree it's not obvious which way it'll go... so why not simply
> build a comparator that does this and then let the computer tell us?
>

I agree completely - this is all talk until I see it run on an actual
index.  I know John was writing up some code to port bobo to 2.9, so
perhaps you can try comparing these two, with a couple of different
values of K, John?

I guess the bigger question, beyond the sub-leading behavior (if we're
talking 3.5 ms vs. 3.1 ms, this is really academic), is this: if the
overall performance is basically the same (which I'm guessing is the
case), then for developers, writing and understanding Comparators for
the old API is *way* easier than with the current one, and the multi-PQ
technique makes it easy to reuse the same logic and understanding that
already exists for MultiSearcher.

Of course, if my calculations above are wrong, or the sub-leading
behavior is dominated by the branching and string compares turn out to
be super fast, then difficulty of API be damned - we stick with
whatever performs best.

  -jake
