On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu wrote:
On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
The average case is O(n + (k log n log k)) for small enough k. So, any k below roughly n / log^2 (n) makes the second summand less than the first.

I don't understand how you derived the average case complexity, and I don't understand how you derived the bound for k.

Regarding the complexity: for all k (small or large), it takes O(k) to build a heap. Then in the worst case we do (n - k) heap insertions, i.e. total complexity is O(k + (n - k) * log(k)). Is that correct?

Yeah, I have the same notion.
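
For concreteness, here is a rough sketch of that scheme in Python (the function name and the use of heapq are my own choices for illustration, not the actual implementation under discussion):

import heapq

def k_smallest(a, k):
    # Illustrative sketch of the scheme described above (not library code).
    # Build a max-heap of the first k elements: O(k).
    # heapq is a min-heap, so keys are negated to simulate a max-heap.
    heap = [-x for x in a[:k]]
    heapq.heapify(heap)
    # Each of the remaining n - k elements costs one comparison against the
    # current k-th smallest, plus an O(log k) replacement only when it is smaller.
    for x in a[k:]:
        if x < -heap[0]:
            heapq.heapreplace(heap, -x)
    return sorted(-x for x in heap)

So the worst case (every element triggers a replacement) is indeed O(k + (n - k) log k).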

If k is any fraction of n, the worst case complexity comes to O(n + n log n) i.e. O(n log n). By your calculation, the average complexity comes to O(n log^2 n), i.e. greater than the worst case. So there's a mistake somewhere.

The upper bound O(n + k log k log n) is correct, but it's tight only for small k. I'll explain below.

Could you please clarify matters? Thanks!

Here is a more verbose version.

Suppose for simplicity that the array consists of real numbers drawn independently from the same continuous distribution. (A discrete distribution would complicate the analysis a bit, since the probability of two elements being equal would become greater than zero.) Then, element a_i is the minimum of a_1, a_2, ..., a_i with probability 1/i. Moreover, it is the second minimum with probability 1/i, the third minimum with probability 1/i, ..., the maximum of a_1, a_2, ..., a_i with the same probability 1/i. To prove this, we can show that, if we look at the ordering permutation p_1, p_2, ..., p_i such that a_{p_1} < a_{p_2} < ... < a_{p_i}, each of the i! possible permutations is equally probable.
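
That claim is easy to check empirically; a quick throwaway simulation (mine, not part of the argument):

import random
from collections import Counter

# For i random reals, the rank of the last element among the first i
# should be uniform over 1..i.
i, trials = 5, 100_000
counts = Counter()
for _ in range(trials):
    a = [random.random() for _ in range(i)]
    rank = 1 + sum(x < a[-1] for x in a)   # 1 = minimum, i = maximum
    counts[rank] += 1
print({r: round(counts[r] / trials, 3) for r in sorted(counts)})
# Each of the i ranks should appear with frequency close to 1/i = 0.2.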

What immediately follows is that when we track the k smallest elements and consider element i > k, the probability that it changes the k smallest elements encountered so far is k/i. Summing that for i from k+1 to n, we get the expected number of heap insertions:
k/(k+1) + k/(k+2) + ... + k/n.

Now, this is just k multiplied by:
(1/1 + 1/2 + 1/3 + ... + 1/n) *minus*
(1/1 + 1/2 + 1/3 + ... + 1/k).
These are the harmonic numbers H_n and H_k, asymptotically equal to log n and log k respectively. To summarize, the expected number of heap insertions for an array of random reals is about k (log n - log k), and the complexity is O(n + k log k (log n - log k)).
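
For what it's worth, a quick Monte Carlo check of this formula (my own throwaway code, a sorted list standing in for the heap):

import random
from bisect import insort

def count_topk_changes(n, k, trials=100):
    # Average number of times an element beyond the first k enters the
    # running set of k smallest (i.e. the number of heap insertions).
    total = 0
    for _ in range(trials):
        a = [random.random() for _ in range(n)]
        top = sorted(a[:k])              # current k smallest, kept sorted
        for x in a[k:]:
            if x < top[-1]:              # changes the k smallest seen so far
                total += 1
                top.pop()                # drop the old k-th smallest
                insort(top, x)           # insert the new element
    return total / trials

n, k = 10_000, 100
H = lambda m: sum(1.0 / j for j in range(1, m + 1))
print(count_topk_changes(n, k), k * (H(n) - H(k)))
# Both values should be close to k * (ln n - ln k), about 460 here.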

What I did earlier, for simplicity, was just to omit the "minus log k" part, which gives the upper bound O(n + k log k log n).
For small enough k, that bound is tight.
For example, when k = sqrt(n), log n - log k = log(n/k) = (log n) / 2, so nothing is lost asymptotically. On the other hand, when k = Theta(n), log n - log k = log(n/k) = Theta(1), and the simplified bound overshoots by a factor of log n.
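
To put numbers on that, here is a tiny back-of-the-envelope comparison (my own; log base 2 for the per-insertion cost and natural log for the expectation, since the bases only affect constants):

from math import log, log2, isqrt

# Compare the two summands: the linear scan costs about n,
# the heap work about k * log2(k) * (ln n - ln k).
n = 1_000_000
for label, k in [("sqrt(n)", isqrt(n)),
                 ("n / log^2 n", int(n / log2(n) ** 2)),
                 ("n / 2", n // 2)]:
    heap_work = k * log2(k) * (log(n) - log(k))
    print(f"k = {label:<12} ({k:>7}): heap work ~ {heap_work:>12,.0f}, scan ~ {n:,}")

For the small choices of k the heap work stays well below n, while for k a constant fraction of n it dominates.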

Ivan Kazmenko.
