[Bug libstdc++/35968] nth_element fails to meet its complexity requirements

2008-04-28 Thread sjhowe at dial dot pipex dot com


--- Comment #7 from sjhowe at dial dot pipex dot com  2008-04-28 20:17 
---
Roger

I agree with your analysis. I am slightly crestfallen as I was suspicious that
Barriato, Hofri etc's paper never mentioned worst case only approximate
case.  And I have now seen BFPRT73 paper where it mentions for the number of
columns (c) Note, however, that we must have c = 5 for PICK to run in linear
time. So yes, median-of median of 5 seems best you can do for worst case.

The only point in shifting to a different algorithm for the worse case is if
(i)  It is cheaper in terms of total comparisons, swaps, assignments etc
(ii) It is still linear in N in the worse case

Cheers

Stephen Howe


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968



[Bug libstdc++/35968] nth_element fails to meet its complexity requirements

2008-04-21 Thread sjhowe at dial dot pipex dot com


--- Comment #4 from sjhowe at dial dot pipex dot com  2008-04-21 18:51 
---
Yes. You want a partition that is O(1) that each time round eliminates N/2
elements (bearing in mind the post-condition for nth_element where iterators
greater than the kth iterator have elements that are = the kth element and
iterators less than the kth iterator have elements that are = the kth element)
So median-of-3 or for large N is a must. And this is run for 3/2 * log2(N)
steps.

If it has not converged by end of steps (so it has been a bad run) or new N is
 some constant (making binary insertion sort worthwhile) then at that point
you want the cheapest approximate median algorithm that is _guaranteed_ O(N).
The algorithm is still O(N) as the choosing median and partitioning is occuring
in serial. In this case, it is minimising the constant factor that matters. 
The median-of-median of 5 is well known
But this approximate median is less well known.
So it is the combination of
   (i) guaranteed O(N) partitioning
   (ii) cheapest constant factor (so minimising iterator traversal, copy ctor,
swapping, comparing etc)
I have not yet checked median of median-of-5 against this, but I believe (ii)
works out cheaper.
And if it works that there exists an even cheaper guranteed O(N) partitioning
that finds an approximate median, gcc should use that. It does not matter if
the exact median is found each time round just as long as N/2 elements are
reduced each time round. 

I have also been alerted to a intraselect variation of nth_element() that looks
promising. It is this:
   If you wanted the iterator that was 20% in from the start and you sampled
elements to choose a partition element, instead of choosing an approximate
median, choose the element that is 20%. So if the sample was 5 elements, choose
the 2nd element. Now if partitioning was lop-sided but you reduced the number
of elements drastically leaving a tiny amount as candidates, you have massively
reduced the problem. This is brand new research in nth_element but I have not
yet seen much analysis.

Stephen Howe


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968



[Bug libstdc++/35968] New: nth_element fails to meet its complexity requirements

2008-04-17 Thread sjhowe at dial dot pipex dot com
nth_element in 4.3.0 fails to meet its complexity requirements.

This is basically intraselect by Dave Musser. 
The question is What algorithm should you switch to when quickselect fails?

In the codebase __heapselect() is called. But this in O(N * log N).
__heapselect() is O(middle-first) + O((last-middle) * log(last-first)) which is
O(N * log N) and the C++ standard requires nth_element to be O(N).
__heapselect() is no use at all here.
Better still is median-of-medians of groups of 5 (and plenty of proofs that it
is O(N)) or approximate median finding and that is still O(N)
See 
www.cs.wpi.edu/~hofri/medsel.pdf

In the best or average case, intraselect is
O(N) partitioning + O(1) finding a suitable partition element 
In the worse case, intraselect is
O(N) partitioning + O(N) finding a suitable partition element

Stephen Howe


-- 
   Summary: nth_element fails to meet its complexity requirements
   Product: gcc
   Version: 4.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: sjhowe at dial dot pipex dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968