[Bug libstdc++/35968] nth_element fails to meet its complexity requirements
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968

--- Comment #14 from Stephen Howe ---

(In reply to Anders Kaseorg from comment #13)
> (In reply to Patrick J. LoPresti from comment #12)
> > I am familiar with the usual algorithmic complexity definitions.
> >
> > So, just to be clear... Your assertion is that the C++ standards committee
> > adopted a specification that rules out a deterministic implementation?
>
> I should have been clearer: I’m saying it rules out quickselect with a
> deterministic pivot selection rule that doesn’t inspect Θ(n) elements.
> Quickselect with randomized Θ(1) pivot selection would satisfy the
> specification, as would quickselect with deterministic Θ(n) pivot selection
> by median-of-medians or similar, but not quickselect with deterministic Θ(1)
> pivot selection by taking the first element or similar.

I am the original bug reporter. The C++ standard is not strict enough for this
algorithm. In David Musser's original paper
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.5196&rep=rep1&type=pdf
he describes introsort. In the ISO C++ 2011 standard, the complexity
requirement for std::sort() was tightened from O(n * log n) on average to
O(n * log n) in the worst case, and introsort guarantees that: it uses
quicksort unless it detects that pivot selection is going bad for portions of
the array, in which case it switches to heapsort, which has a guaranteed
O(n * log n) worst case. So the worst-case performance is bounded.

Musser also wrote about introselect, which analogously uses quickselect, but
his paper did not say which algorithm should be swapped in if quickselect goes
bad. Median-of-medians (with groups of at least 5) is such an algorithm, with
O(n) worst-case performance; heapselect is not O(n). So the C++ standard could
require introselect, with guaranteed O(n) performance rather than O(n) only on
average. The big-O requirement could be tightened in the same way that
std::sort() was tightened in ISO C++ 2011.

> Quickselect with randomized Θ(1) pivot selection

No. It can be beaten: even randomized Θ(1) pivot selection is not good enough.
Doug McIlroy's antiqsort adversary can defeat it. See
https://www.cs.dartmouth.edu/~doug/aqsort.c

Cheers
Stephen Howe
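The introselect scheme Musser left open — quickselect with an effort budget, falling back to a guaranteed-O(n) selection once pivot choice is going bad — can be sketched roughly as follows. This is an illustrative Python sketch, not Musser's algorithm verbatim and not libstdc++ code; the function names and the budget of about 2·log2(n) rounds are my own assumptions.

```python
import random

def median_of_medians_select(a, k):
    """Deterministic selection: worst-case O(n) via medians of groups of 5."""
    a = list(a)
    while True:
        if len(a) <= 5:
            return sorted(a)[k]
        # Median of each group of 5, then recurse to choose a good pivot.
        medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
                   for i in range(0, len(a), 5)]
        pivot = median_of_medians_select(medians, len(medians) // 2)
        lo = [x for x in a if x < pivot]
        eq = [x for x in a if x == pivot]
        if k < len(lo):
            a = lo
        elif k < len(lo) + len(eq):
            return pivot
        else:
            k -= len(lo) + len(eq)
            a = [x for x in a if x > pivot]

def introselect(a, k, depth_factor=2):
    """Quickselect, but give up after ~depth_factor*log2(n) rounds and
    switch to the guaranteed-O(n) median-of-medians fallback."""
    a = list(a)
    budget = depth_factor * max(1, len(a)).bit_length()
    while len(a) > 1:
        if budget == 0:          # pivot selection is "going bad": fall back
            return median_of_medians_select(a, k)
        budget -= 1
        pivot = a[random.randrange(len(a))]
        lo = [x for x in a if x < pivot]
        eq = [x for x in a if x == pivot]
        if k < len(lo):
            a = lo
        elif k < len(lo) + len(eq):
            return pivot
        else:
            k -= len(lo) + len(eq)
            a = [x for x in a if x > pivot]
    return a[0]
```

The quickselect rounds dominate on typical inputs; the fallback bounds the worst case at O(n), which is the tightening argued for above.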
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968

--- Comment #13 from Anders Kaseorg ---

(In reply to Patrick J. LoPresti from comment #12)
> I am familiar with the usual algorithmic complexity definitions.
>
> So, just to be clear... Your assertion is that the C++ standards committee
> adopted a specification that rules out a deterministic implementation?

I should have been clearer: I’m saying it rules out quickselect with a
deterministic pivot selection rule that doesn’t inspect Θ(n) elements.
Quickselect with randomized Θ(1) pivot selection would satisfy the
specification, as would quickselect with deterministic Θ(n) pivot selection by
median-of-medians or similar, but not quickselect with deterministic Θ(1)
pivot selection by taking the first element or similar.
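The case ruled out here is easy to demonstrate by counting comparisons: quickselect that deterministically takes the first element as pivot degrades to Θ(n²) on an already-sorted input. A small hypothetical counting sketch in Python (not libstdc++ code):

```python
def quickselect_first_pivot(a, k):
    """Quickselect with the first element as pivot.
    Returns (k-th smallest element, number of comparisons performed)."""
    a = list(a)
    comparisons = 0
    while True:
        pivot = a[0]
        lo = [x for x in a[1:] if x < pivot]
        hi = [x for x in a[1:] if x >= pivot]
        comparisons += len(a) - 1      # pivot compared against every other element
        if k < len(lo):
            a = lo
        elif k == len(lo):
            return pivot, comparisons
        else:
            k -= len(lo) + 1
            a = hi

if __name__ == "__main__":
    # Sorted input, asking for the maximum: each round eliminates only the pivot.
    _, comparisons = quickselect_first_pivot(list(range(1000)), 999)
    print(comparisons)  # 999 + 998 + ... + 1 = 499500, i.e. Θ(n^2)
```

The same input costs only Θ(n) in expectation once the pivot is chosen at random, which is the distinction the comment draws.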
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968

--- Comment #12 from Patrick J. LoPresti ---

(In reply to Anders Kaseorg from comment #11)
> (In reply to Patrick J. LoPresti from comment #10)
> > Complexity: Linear on average.
> >
> > It is not obvious (to me) what distribution the "on average" refers to. All
> > permutations of input with equal probability, I suppose (?)
>
> Expected running time is typically measured over the random choices (if any)
> made internally by the algorithm, not over a random input distribution. For
> example, quickselect with random pivot selection would satisfy this, but not
> quickselect with deterministic pivot selection.

I am familiar with the usual algorithmic complexity definitions.

So, just to be clear... Your assertion is that the C++ standards committee
adopted a specification that rules out a deterministic implementation?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968

Anders Kaseorg changed:
           What    |Removed |Added
           CC      |        |andersk at mit dot edu

--- Comment #11 from Anders Kaseorg ---

(In reply to Patrick J. LoPresti from comment #10)
> Complexity: Linear on average.
>
> It is not obvious (to me) what distribution the "on average" refers to. All
> permutations of input with equal probability, I suppose (?)

Expected running time is typically measured over the random choices (if any)
made internally by the algorithm, not over a random input distribution. For
example, quickselect with random pivot selection would satisfy this, but not
quickselect with deterministic pivot selection.

Sometimes expected running time on average-case inputs is studied, but
referring to that definitely requires more words.
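The distinction can be illustrated with a sketch of quickselect using a uniformly random pivot: for every fixed input, the expected number of comparisons is O(n), where the expectation is over the algorithm's own coin flips rather than over a distribution of inputs. Illustrative Python, not libstdc++ code:

```python
import random

def randomized_quickselect(a, k, rng=random):
    """Quickselect with a uniformly random pivot.
    Returns (k-th smallest element, number of comparisons). Its *expected*
    cost is O(n) for every fixed input, because the expectation is taken
    over the algorithm's internal random pivot choices."""
    a = list(a)
    comparisons = 0
    while True:
        pivot = a[rng.randrange(len(a))]
        lo = [x for x in a if x < pivot]
        eq = [x for x in a if x == pivot]
        comparisons += len(a)          # pivot compared against every element
        if k < len(lo):
            a = lo
        elif k < len(lo) + len(eq):
            return pivot, comparisons
        else:
            k -= len(lo) + len(eq)
            a = [x for x in a if x > pivot]
```

On the sorted input that drives a first-element-pivot rule to ~n²/2 comparisons, this version averages only a small constant times n.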
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35968

Patrick J. LoPresti changed:
           What    |Removed |Added
           CC      |        |lopresti at gmail dot com

--- Comment #10 from Patrick J. LoPresti ---

Worth noting, perhaps, that the C++ standard does _not_ require O(n)
worst-case behavior for nth_element. The exact wording (C++11 section 25.4.2
[alg.nth.element] paragraph (3)) says:

    Complexity: Linear on average.

It is not obvious (to me) what distribution the "on average" refers to. All
permutations of input with equal probability, I suppose (?)

Anyway, just because GCC falls back to an O(N log N) algorithm under some
circumstances does not necessarily mean it violates the C++ specification, as
long as that fallback only happens in 1 in log N of the cases or thereabouts.
Not that I am up for actually performing this complexity analysis.
--- Comment #9 from paolo dot carlini at oracle dot com 2010-01-28 17:16 ---

Roger told me in private that he doesn't mean to actively work on this issue
any time soon. Unassigning.

paolo dot carlini at oracle dot com changed:
           What      |Removed                   |Added
           AssignedTo|roger at eyesopen dot com |unassigned at gcc dot gnu dot org
           Status    |ASSIGNED                  |NEW
--- Comment #8 from paolo dot carlini at oracle dot com 2009-12-15 17:19 ---

Roger, are you still actively working on this?
--- Comment #7 from sjhowe at dial dot pipex dot com 2008-04-28 20:17 ---

Roger

I agree with your analysis. I am slightly crestfallen, as I had been
suspicious that Battiato, Hofri et al.'s paper only ever discussed the
approximate case, never the worst case. And I have now seen the BFPRT73 paper,
where it says of the number of columns (c): "Note, however, that we must have
c >= 5 for PICK to run in linear time."

So yes, median-of-medians of 5 seems the best you can do for the worst case.
The only point in shifting to a different algorithm for the worst case is if
(i) it is cheaper in terms of total comparisons, swaps, assignments etc., and
(ii) it is still linear in N in the worst case.

Cheers
Stephen Howe
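The c >= 5 threshold can be made concrete by evaluating the two cost recurrences numerically. With groups of 5 the subproblem sizes sum to n/5 + 7n/10 = 9n/10 < n, so the per-level work forms a geometric series and the total is linear; with groups of 3 they sum to n/3 + 2n/3 = n, and the total grows like n log n. A rough numerical check (the floor divisions and base cases are my own simplifications, not from BFPRT73):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def t_groups_of_5(n):
    """Cost recurrence for median-of-medians with groups of 5:
    T(n) = T(n/5) + T(7n/10) + n, which solves to O(n)."""
    if n <= 1:
        return 1
    return n + t_groups_of_5(n // 5) + t_groups_of_5(7 * n // 10)

@lru_cache(maxsize=None)
def t_groups_of_3(n):
    """Cost recurrence for groups of 3: T(n) = T(n/3) + T(2n/3) + n.
    The subproblem sizes sum to n, so this solves to Θ(n log n)."""
    if n <= 1:
        return 1
    return n + t_groups_of_3(n // 3) + t_groups_of_3(2 * n // 3)
```

Evaluating both at large n shows the groups-of-5 cost staying within a constant factor of n while the groups-of-3 cost per element keeps growing, matching the quoted c >= 5 requirement.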
--- Comment #5 from roger at eyesopen dot com 2008-04-24 15:01 ---

Well, I've now had time to read the Battiato, Hofri et al. 2002 paper, and the
bad news is that such an approximate median selection algorithm can't be used
to guarantee an O(N) worst-case std::nth_element implementation. It could be
used in an implementation to guess a good pivot, but the quality of this
median, i.e. how approximate it is, doesn't meet the necessary criterion to
ensure an O(N) worst case. You'd still need a fallback method with guaranteed
bounds, or an exact median, in order to achieve O(N). i.e. it could help
improve the average-case performance, but doesn't help with the worst case.

For the mathematically inclined: in order to achieve O(N) worst-case
performance, you need to guarantee that a constant fraction of the elements
can be eliminated at each level of the recursion. In comment #4, Stephen
fixates on "just as long as N/2 elements are reduced each time round", but the
sum of a geometric series shows that eliminating *any* constant fraction
guarantees an O(N) worst case. Hence even if you can only guarantee
eliminating 10% each round, you still achieve an O(N) worst case. So you need
a method that provides an approximate median that can, in the worst case,
guarantee the elimination of, say, 10% of the elements from consideration.
This is why approximate medians offer some utility over exact medians, if
they can be found faster.

Unfortunately, the method of Battiato et al. referenced in comment #1 doesn't
provide such a constant-fraction guarantee. An analysis shows that at each
round it can only eliminate (2^n-1)/3^n of the elements in its worst case,
where n = log_3(N). By hand, naming the ranks 0..N-1: when N=3, the true
median at rank 1 is selected. For N=9, the elements at rank 3, 4 or 5 may be
returned as a median, i.e. 1/3 eliminated. For N=27, the elements between
ranks 8 and 20 may be returned as the median, i.e. 7/27 eliminated. In the
limit, as N (and hence n) tends to infinity, the eliminated fraction
(2^n-1)/3^n tends to zero. i.e. the larger the input size, the less useful
the worst-case guarantee on the median. The poor quality of the median is
lamented by the authors in the penultimate paragraph of section 4.1 of the
paper. They then go on to show that statistically such a worst case is rare,
but unfortunately even a rare worst case breaks the C++ standard library's
O(N) constraint.

This Achilles heel is already well documented in the algorithmic complexity
community. The Blum, Floyd, Pratt, Rivest and Tarjan paper [BFPRT73] and the
Floyd and Rivest paper [FR75] analyse the issues with median-of-k-medians, and
show that k=5 is the lowest value capable of a guaranteed fractional worst
case, i.e. they already considered and rejected the algorithm given in the
cited work (k=3) for the purpose of exact median finding.

Anyway, I hope you find this interesting. There will always be efficient
methods for finding approximate medians; the question is how efficient vs.
how approximate. Many quicksort implementations select the first element as
the pivot, an O(1) method for selecting an extremely approximate median!
Statistically, over all possible input orders, this first element will on
average partition the input array at the median, with some variance. It's not
that the paper is wrong or incorrect; it does what it describes, finding a
statistically good approximate median very efficiently with excellent
worst-case performance. Unfortunately, for the problem we need to solve, which
is not the problem the paper's authors were attempting to solve, we need a
better approximation, perhaps using a more complex implementation.

Anyway, thanks again for the reference. I'd not come across it before and
really enjoyed reading it. Let me know if you spot a flaw in my reasoning
above.

Dr Roger Sayle, Ph.D. Computer Science
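The (2^n-1)/3^n bound quoted above is easy to tabulate: the guaranteed fraction matches the hand analysis for N = 9 and N = 27, and it falls below any fixed constant as N grows, which is exactly why the geometric-series argument fails for this method. A small check, assuming N is an exact power of 3:

```python
from fractions import Fraction

def guaranteed_eliminated_fraction(N):
    """Worst-case fraction of elements the Battiato et al. approximate
    median is guaranteed to eliminate for N = 3^n, i.e. (2^n - 1) / 3^n."""
    n = 0
    while 3 ** n < N:
        n += 1
    assert 3 ** n == N, "N must be a power of 3"
    return Fraction(2 ** n - 1, 3 ** n)
```

Since (2/3)^n tends to 0, no constant fraction is guaranteed, so no geometric series bounds the total work, and the O(N) worst case is lost.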
--- Comment #6 from bkoz at gcc dot gnu dot org 2008-04-24 23:39 ---

Can I re-classify this as an enhancement?
--- Comment #3 from pcarlini at suse dot de 2008-04-21 08:24 ---

Many thanks Roger for your further help on nth_element, excellent news.
--- Comment #4 from sjhowe at dial dot pipex dot com 2008-04-21 18:51 ---

Yes. You want a pivot selection that is O(1) and that each time round
eliminates about N/2 elements (bearing in mind the post-condition for
nth_element: iterators after the nth iterator refer to elements >= the nth
element, and iterators before the nth iterator refer to elements <= the nth
element). So median-of-3, or better for large N, is a must. And this is run
for 3/2 * log2(N) steps.

If it has not converged by the end of those steps (so it has been a bad run),
or the new N is below some constant (making binary insertion sort worthwhile),
then at that point you want the cheapest approximate median algorithm that is
_guaranteed_ O(N). The overall algorithm is still O(N), as choosing the median
and partitioning occur in series; in this case it is minimising the constant
factor that matters. The median-of-medians of 5 is well known, but this
approximate median is less well known. So what you want is the combination of
(i) guaranteed O(N) partitioning and (ii) the cheapest constant factor
(minimising iterator traversal, copy construction, swapping, comparing etc.).
I have not yet checked median-of-medians of 5 against this, but I believe (ii)
works out cheaper. And if it turns out that there exists an even cheaper
guaranteed-O(N) partitioning that finds an approximate median, gcc should use
that. It does not matter whether the exact median is found each time round,
just as long as N/2 elements are eliminated each time round.

I have also been alerted to an introselect variation of nth_element() that
looks promising. It is this: if you wanted the iterator that was 20% in from
the start, and you sampled elements to choose a partition element, then
instead of choosing an approximate median, choose the element that is 20% in.
So if the sample was 5 elements, choose the 2nd element. Now even if
partitioning is lop-sided, you have drastically reduced the number of
candidate elements, massively reducing the problem.

This is brand new research on nth_element, and I have not yet seen much
analysis.

Stephen Howe
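The "20% in" sampling idea can be sketched as follows. This is an illustrative Python sketch of the variation described, not a published algorithm or libstdc++ code; the sample size of 5 and the helper names are my own choices.

```python
import random

def proportional_pivot(a, k, sample_size=5, rng=random):
    """Pick a pivot whose rank within a small random sample is proportional
    to the rank k sought in the whole range (the '20% in' idea): if k is 20%
    of the way in, take the element 20% of the way into the sorted sample."""
    sample = sorted(rng.sample(a, min(sample_size, len(a))))
    i = min(len(sample) - 1, (k * len(sample)) // len(a))
    return sample[i]

def quickselect_proportional(a, k, rng=random):
    """Quickselect driven by the proportional sample pivot above."""
    a = list(a)
    while True:
        pivot = proportional_pivot(a, k, rng=rng)
        lo = [x for x in a if x < pivot]
        eq = [x for x in a if x == pivot]
        if k < len(lo):
            a = lo
        elif k < len(lo) + len(eq):
            return pivot
        else:
            k -= len(lo) + len(eq)
            a = [x for x in a if x > pivot]
```

With k at 20% and a sample of 5, `proportional_pivot` picks the sample's 2nd element, matching the example in the comment; even a lop-sided partition around such a pivot tends to discard the large side far from the target rank.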
--- Comment #1 from pcarlini at suse dot de 2008-04-20 17:30 ---

Roger, can you have a look? Thanks in advance.

pcarlini at suse dot de changed:
           What    |Removed |Added
           CC      |        |roger at eyesopen dot com
--- Comment #2 from roger at eyesopen dot com 2008-04-21 03:22 ---

Yep, now that we're back in stage1, it's about time I got around to submitting
the O(n) worst-case nth_element implementation that I mentioned last year. For
Stephen's benefit, the implementation I've already coded up uses the
median-of-medians in groups of five strategy as a fallback to a modified
quickselect. [Though I'll need to quickly read the paper you cite.]

The trick for libstdc++ is to attempt to make the typical case as fast as or
faster than the existing implementation. Whilst the standards now require an
O(n) worst case, the perceived performance for g++'s users is the average
case, and changing to an O(n) implementation that has a large constant
coefficient may upset some folks.

roger at eyesopen dot com changed:
           What            |Removed                           |Added
           AssignedTo      |unassigned at gcc dot gnu dot org |roger at eyesopen dot com
           Status          |UNCONFIRMED                       |ASSIGNED
           Ever Confirmed  |0                                 |1
           Last reconfirmed|-00-00 00:00:00                   |2008-04-21 03:22:22