------- Comment #5 from roger at eyesopen dot com 2008-04-24 15:01 -------

Well, I've now had time to read the Battiato, Hofri et al. 2002 paper, and the bad news is that such an approximate median selection algorithm can't be used to guarantee an O(N) worst-case std::nth_element implementation. It could be used in an implementation to guess a good pivot, but the quality of this median, i.e. how approximate it is, doesn't meet the criterion necessary to ensure an O(N) worst case. You'd still need a fallback method with guaranteed bounds, or an exact median, in order to achieve O(N). In other words, it could help improve average-case performance, but it doesn't help with the worst case.

For the mathematically inclined: in order to achieve O(N) worst-case performance, you need to guarantee that a constant fraction of the elements can be eliminated at each level of the recursion. In comment #4, Steven fixates on "just as long as N/2 elements are reduced each time round", but summing the geometric series shows that eliminating *any* constant fraction each round guarantees an O(N) worst case. Hence even if you can only guarantee eliminating 10% each round, you still achieve O(N) worst case. What's needed, then, is an approximate median that guarantees, even in the worst case, the elimination of say 10% of the elements from consideration. This is why approximate medians offer some utility over exact medians, provided they can be found faster.
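To spell out the geometric-series step: suppose each round costs time linear in the number of elements still under consideration, and every round is guaranteed to eliminate at least a fraction f of them (f = 1/10 for the 10% example). The total work over all rounds is then bounded by

  N + (1-f)N + (1-f)^2 N + ... = N sum_{k>=0} (1-f)^k = N/f = O(N)

so any constant f > 0 gives linear time; the size of f only affects the constant factor.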
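For concreteness, here is a minimal sketch (entirely mine, not proposed patch code, with invented helper names) of the kind of fallback with guaranteed bounds I mean: quickselect driven by the Blum/Floyd/Pratt/Rivest/Tarjan median-of-5-medians pivot discussed further below. A production version would operate on iterators and use three-way partitioning so that many equal keys can't degrade it to quadratic time.

#include <algorithm>
#include <cstdio>
#include <vector>

static std::size_t bfprt_select(std::vector<int>& a,
                                std::size_t lo, std::size_t hi, std::size_t k);

/* Sort a tiny group (at most 5 elements) and return the index of its median. */
static std::size_t median5(std::vector<int>& a, std::size_t lo, std::size_t hi)
{
    std::sort(a.begin() + lo, a.begin() + hi);    // O(1): at most 5 elements
    return lo + (hi - lo - 1) / 2;
}

/* Lomuto partition of a[lo,hi) around a[pivot]; returns the pivot's final index. */
static std::size_t lomuto_partition(std::vector<int>& a, std::size_t lo,
                                    std::size_t hi, std::size_t pivot)
{
    int pv = a[pivot];
    std::swap(a[pivot], a[hi - 1]);               // park the pivot at the end
    std::size_t store = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i)
        if (a[i] < pv)
            std::swap(a[i], a[store++]);
    std::swap(a[store], a[hi - 1]);               // pivot into its final slot
    return store;
}

/* Choose a pivot whose rank provably falls in the middle of a[lo,hi): pack
   the median of each group of 5 at the front of the range, then recursively
   select the true median of those group medians. */
static std::size_t pivot_index(std::vector<int>& a, std::size_t lo, std::size_t hi)
{
    std::size_t dst = lo;
    for (std::size_t i = lo; i < hi; i += 5)
        std::swap(a[median5(a, i, std::min(i + 5, hi))], a[dst++]);
    return bfprt_select(a, lo, dst, lo + (dst - lo - 1) / 2);
}

/* Place the k-th smallest element of a[lo,hi) at position k and return k
   (k is an absolute index, lo <= k < hi).  Every round is guaranteed to
   eliminate roughly 3N/10 elements or more, so by the geometric series
   above the total work is O(N) even in the worst case. */
static std::size_t bfprt_select(std::vector<int>& a,
                                std::size_t lo, std::size_t hi, std::size_t k)
{
    while (hi - lo > 5) {
        std::size_t p = lomuto_partition(a, lo, hi, pivot_index(a, lo, hi));
        if (k == p) return k;
        if (k < p) hi = p; else lo = p + 1;
    }
    std::sort(a.begin() + lo, a.begin() + hi);    // base case: tiny range
    return k;
}

int main()
{
    std::vector<int> v = {9, 1, 8, 2, 7, 3, 6, 4, 5, 0, 11, 10};
    std::size_t k = v.size() / 2;
    std::printf("rank %zu element is %d\n", k, v[bfprt_select(v, 0, v.size(), k)]);
    return 0;
}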
Unfortunately, the method of Battiato et al. referenced in comment #1 doesn't provide such a constant-fraction guarantee. Analysis shows that at each round it can only guarantee eliminating (2^n - 1)/3^n of the elements in the worst case, where n is log_3(N). By hand, naming the ranks 0..N-1: when N=3, the true median at rank 1 is selected. For N=9, the element returned as the median may lie anywhere between ranks 3 and 5, i.e. only 1/3 of the elements are guaranteed eliminated. For N=27, it may lie anywhere between ranks 7 and 19, i.e. only 7/27 guaranteed eliminated. As N (and with it n) tends to infinity, the guaranteed fraction (2^n - 1)/3^n tends to zero; there is no constant lower bound, so the larger the input, the less useful the worst-case median. The poor quality of the median is lamented by the authors themselves in the penultimate paragraph of section 4.1 of the paper. They then go on to show that statistically such worst cases are rare, but unfortunately even a rare worst case breaks the C++ standard library's O(N) requirement.

This Achilles heel is already well documented in the algorithmic complexity community. The Blum, Floyd, Pratt, Rivest and Tarjan paper [BFPRT73] and the Floyd and Rivest paper [FR75] analyse median-of-k-medians schemes and show that k >= 5 is the smallest group size that guarantees eliminating a constant fraction in the worst case (which is why the sketch above uses groups of 5). In other words, they already considered and rejected the k=3 scheme used in the cited work for the purpose of exact median finding.

Anyway, I hope you find this interesting. There will always be efficient methods for finding approximate medians; the question is how efficient vs. how approximate. Many quicksort implementations select the first element as the pivot, an O(1) method for selecting an extremely approximate median! Statistically, over all possible input orders, this first element will on average partition the input array at the median, with some variance.

It's not that the paper is wrong or incorrect; it does exactly what it describes, finding a statistically good approximate median very efficiently and with an excellent worst-case running time. Unfortunately, for the problem we need to solve, which is not the problem the paper's authors were attempting to solve, we need a better approximation, perhaps at the cost of a more complex implementation.

Anyway, thanks again for the reference. I'd not come across it before and really enjoyed reading it. Let me know if you spot a flaw in my reasoning above.

Dr Roger Sayle, Ph.D. Computer Science

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968