On Sunday, 17 January 2016 at 02:37:48 UTC, Timon Gehr wrote:
On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
On 1/16/16 8:00 PM, Timon Gehr wrote:
The implementation falls back to topNHeap whenever k is within the first
or last ~n/8 elements and therefore is Ω(n log n) on average.

Correct. The pedantic approach would be to only do that when k is up to
log(n). -- Andrei


Ivan's analysis suggests that even something significantly larger, like n/log(n)² might work as an upper bound for k. I don't think that meeting the documented runtime bounds amounts to pedantry. (Having linear average running time of course does not even imply linear expected running time, as is still written in the documentation.)

I'd suggest being able to switch between implementations.

Recall that, when sorting, we have SwapStrategy.stable which is, well, stable, but additionally guarantees n log n with a larger constant (MergeSort).

And we have SwapStrategy.unstable which uses Introsort, which, in turn, has smaller constant on sane inputs.

Here, Andrei's suggestion is to also have two implementations, let us call them TopNStrategy.quickSelect and TopNStrategy.heap.

The quickSelect one is O(n) on sane inputs but O(n^2) on an artificial worst case.

The heap implementation is also O(n) when k is close to 0 or n, but O(n log n) otherwise. What's important is that it is also O(n log n) worst case.

The current proposal switches between strategies based on a heuristic (k < n/8 or k > 7n/8 if I understand correctly), which may not be the best strategy in all cases.

So, what can be done is to introduce TopNStrategy.auto which is the default and uses the (current or better) heuristic to switch between strategies, but also leave a way to explicitly select one of the two strategies, just like one can do now with sort!(..., SwapStrategy.whatWeExplicitlyWant).

All the code for both strategies is already there. Each strategy has its benefits not covered by the other (quickSelect is faster for k~n and sane inputs, heap is faster for k close to boundary and special inputs). So, provide the default but let the user choose.

Ivan Kazmenko.

Reply via email to