http://dpaste.dzfl.pl/05a82699acc8

So over the past few days I've been in the zone working on a smooth implementation of "Median of Medians" (https://en.wikipedia.org/wiki/Median_of_medians). Its performance is much better than that of the straightforward implementation. However, in practice it's very hard to get it to beat simple heuristics (such as median of five, random pivot, etc.). Those heuristics run about `n` comparisons for an input of length `n`, whereas MoM needs over twice as many, and it also does a fair amount of swapping of the data. Overall I got it within 2.5x of the heuristic-based topN for most data sizes up to tens of millions.

While thinking about MoM and the core reason it is slow (it adds nice structure to its input and then "forgets" most of it when recursing), I stumbled upon a different algorithm. It's much simpler, also deterministic, and faster than MoM for many (most?) inputs. But it's not guaranteed to be linear. After having pounded at this for many hours, it is clear that I am in need of some serious due destruction. I call it a "quick median of medians", or in short "quick mom".

Consider the algorithm defined as follows over a range `r` of length `n`. It returns an index to an element likely to be in the second tertile of r. (A tertile is a third of the range. So the expectation is that the returned index x is such that e <= r[x] for at least one third of the elements e, and r[x] <= e for at least another third.) A sketch in D follows the steps.

0. If n<=3, compute the median by rote and return its index.

1. Divide r into three equal adjacent subranges r0 = r[0 .. $/3], r1 = r[$/3 .. $*2/3], r2 = r[$*2/3 .. $].

2. Recurse to get the medians of these three subranges; call them m0, m1, m2.

3. Return the median of m0, m1, m2.
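
To make the steps concrete, here's a minimal D sketch of the estimator. The names and details are illustrative, not the actual code from the paste above:

```d
// Returns the index of an approximate median of r (steps 0-3 above).
// Illustrative sketch; the real implementation is at the dpaste link.
size_t quickMom(T)(T[] r)
{
    assert(r.length > 0);
    immutable n = r.length;
    if (n <= 3) // step 0: median by rote
    {
        if (n < 3) return 0; // 1 or 2 elements: any index will do
        return medianIndex(r, 0, 1, 2);
    }
    // Steps 1-2: medians of the three adjacent thirds, by recursion.
    immutable m0 = quickMom(r[0 .. n / 3]);
    immutable m1 = n / 3 + quickMom(r[n / 3 .. n * 2 / 3]);
    immutable m2 = n * 2 / 3 + quickMom(r[n * 2 / 3 .. n]);
    // Step 3: median of the three medians.
    return medianIndex(r, m0, m1, m2);
}

// Index of the median of r[a], r[b], r[c], in 2-3 comparisons.
size_t medianIndex(T)(T[] r, size_t a, size_t b, size_t c)
{
    if (r[a] < r[b])
    {
        if (r[b] < r[c]) return b;  // a < b < c
        return r[a] < r[c] ? c : a; // c between a and b, or at/below a
    }
    if (r[a] < r[c]) return a;      // b <= a < c
    return r[b] < r[c] ? c : b;     // c between b and a, or at/below b
}
```

Note that the recursion never moves data; it only reads and compares.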

Note that no data has been written yet; we only have an estimate of a pivot. On random inputs, the median thus obtained is expected to be greater than roughly 1/6 + 1/6 = 1/3 of the elements (half of its own tertile, plus half of the tertile whose median it dominates), and, by symmetry, less than the same fraction.

The algorithm completes in linear time: each call does constant work besides three recursive calls on thirds of the input, so its cost obeys T(n) = 3T(n/3) + O(1), which solves to O(n).

However, the pivot obtained is just an approximation. In the worst case it's possible that e.g. all recursions return the leftmost of the allowed indexes, so the guaranteed fraction deteriorates by a third with each level of recursion, or generally to (2/3)^^log3(n) -- only about n^^(log3(2)) ~ n^^0.63 elements are guaranteed on either side of the pivot, a vanishing fraction of n.

Not good! That's where quick mom pays attention: after partitioning the data around the pivot it estimated, the partitioning stage checks whether the pivot's final position falls within bounds (the middle tertile). If not, it runs a precise selection method (such as proper median of medians -- "thorough mom") to bring the pivot where it needs to be. The data patterns that cause quick mom to fail systematically are rather odd, but the worst case is what it is.
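
A sketch of that checked partition, building on the quickMom sketch above. Here Phobos' topN merely stands in for the "thorough mom" fallback, and the names are again illustrative:

```d
import std.algorithm.mutation : swap;
import std.algorithm.sorting : topN;

// Partition r around the quick mom pivot; if the pivot's final position
// misses the middle tertile, repair with one more exact selection pass.
size_t checkedPartition(T)(T[] r)
{
    immutable n = r.length;
    size_t p = partitionAround(r, quickMom(r));
    if (p < n / 3 || p > 2 * n / 3) // unlucky: pivot outside mid tertile
    {
        topN(r, n / 2); // stand-in for "thorough mom"
        p = n / 2;      // now the true median sits at n / 2
    }
    return p;
}

// Lomuto partition around r[pivotIndex]; returns the pivot's final slot.
size_t partitionAround(T)(T[] r, size_t pivotIndex)
{
    assert(r.length > 0);
    swap(r[pivotIndex], r[$ - 1]);
    auto pivot = r[$ - 1];
    size_t store = 0;
    foreach (i; 0 .. r.length - 1)
        if (r[i] < pivot)
            swap(r[i], r[store++]);
    swap(r[store], r[$ - 1]);
    return store;
}
```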

Overall the partitioning either succeeds with the quick mom pivot in linear time, or does one more linear pass if "unlucky". So overall partitioning is linear. (Micro-optimization: not all data needs to be repartitioned, only the part on the wrong side of the not-so-good pivot.) Unlike quick mom itself, the partition guarantees a pivot in the middle tertile.

After partitioning, the classic quickselect algorithm may be implemented.
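
For completeness, a sketch of quickselect on top of the checked partition (assuming the helpers above):

```d
// Reorders r so that r[k] holds its k-th smallest element. Each pivot
// lands in the middle tertile, so at least a third of the range is
// discarded per iteration; the pass lengths decay geometrically and
// total work stays linear.
void quickSelect(T)(T[] r, size_t k)
{
    assert(k < r.length);
    while (r.length > 1)
    {
        immutable p = checkedPartition(r);
        if (p == k) return;
        if (k < p)
        {
            r = r[0 .. p];
        }
        else
        {
            r = r[p + 1 .. $];
            k -= p + 1;
        }
    }
}
```

E.g. quickSelect(a, a.length / 2) leaves a median of a at a[$ / 2].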

There's one more _really_ juicy detail. In fact, quick mom may finish without looking at all the data. Look at the implementation -- there are two overloads of quickMom. The second takes bounds and works as follows: if you've already computed two of the three medians, the third can be the median of the trio only if it falls between the two already known. Otherwise, all you care about is whether it's smaller or larger than both. This allows the algorithm to finish certain recursion branches early.
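
Here's a sketch of the idea behind that second overload, under the same assumed names. The real overload also reports the median's position when the answer lands inside the bounds, which this sketch leaves out:

```d
import std.algorithm.comparison : max, min;

// Where a value sits relative to the closed interval [lo, hi].
enum Region { below, inside, above }

// Bounded overload: classifies the would-be result of quickMom(r)
// against [lo, hi] without necessarily reading all of r.
Region quickMom(T)(T[] r, T lo, T hi)
{
    immutable n = r.length;
    if (n <= 3)
    {
        auto m = r[quickMom(r)]; // exact small-case median
        return m < lo ? Region.below
             : m > hi ? Region.above : Region.inside;
    }
    immutable c0 = quickMom(r[0 .. n / 3], lo, hi);
    immutable c1 = quickMom(r[n / 3 .. n * 2 / 3], lo, hi);
    // Early exit: if two sub-medians fall on the same side of the
    // interval, the median of the three is forced to that side and the
    // third subrange is never examined. (When both are inside, a real
    // implementation must still locate the median's actual position,
    // so there is no shortcut in that case.)
    if (c0 == c1 && c0 != Region.inside) return c0;
    immutable c2 = quickMom(r[n * 2 / 3 .. n], lo, hi);
    // Medians commute with monotone maps, so the result's region is the
    // middle one of the three (ordering below < inside < above).
    immutable a = cast(int) c0, b = cast(int) c1, c = cast(int) c2;
    return cast(Region)(a + b + c - max(a, b, c) - min(a, b, c));
}
```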


Destroy!

Andrei
