[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?

Dann Corbit Sat, 09 Mar 2013 12:52:59 -0800

-----Original Message-----
From: gsst...@gmail.com [mailto:gsst...@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 11:39 AM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort 
variants?

On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <dcor...@connx.com> wrote:
> There is no such thing as a quicksort that never goes quadratic.  It was 
> formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

>>
No.  It does make O(n*n)  far less probable, but it does not eliminate it.  If 
it were possible, then introspective sort would be totally without purpose. 
 And yet introspective sort is the algorithm of choice for the ANSI/ISO C++ 
standard directly because it will never go quadratic.
Bentley and Mcilroy's  "Engineering a Sort Function" says this about their 
final qsort source code model chosen for their implementation in the footnote:
"Of course, quadratic behavior is still possible. One can generate fiendish 
inputs by bugging Quicksort: Consider
key values to be unknown initially. In the code for selecting a partition 
element, assign values in increasing
order as unknown keys are encountered. In the partitioning code, make unknown 
keys compare high."

Interestingly, some standard library qsort routines will go quadratic if simply 
given a set that contains random 0 and 1 values for the input set.

It has been formally proven that no matter what method used to choose the 
median (other than calculating the exact median of every partition which would 
make quicksort slower than heapsort) there exists an "adversary" distribution 
that makes quicksort go quadratic. (unfortunately, I can't find the proof or I 
would give you the citation).  Median selection that is randomized does not 
select the true median unless it operates over all the elements of the entire 
partition.  With some small probability, it makes the worst possible choice for 
the median.  It also does not matter if you choose the first, last and middle 
elements for a median of three (or other evenly selected points for larger 
sample sets) or if you select the sample with randomization.  Both methods have 
adversaries.  Now, the quickselect algorithm can select the median in O(n) but 
that is again an average.  There are also adversary distributions for 
quickselect.

On the other hand, various safeguards can make O(n*n) extremely improbable 
{even probabilities approaching zero}.  There is a tradeoff with more and more 
complicated safeguards.  If I choose large subsets of a partition and calculate 
the true median of the subset, it will make the routine safer from quadratic 
behavior but it will also make it slower.  At some point, you end up with a 
routine no faster than heapsort, while also being much more complex and 
therefore more difficult to maintain.  Hence, the necessity of introspective 
sort.

On the other hand, there are also data specific sorts that have very special 
properties.  There are cache conscious versions of burstsort that can sort 
strings much faster than quicksort or even radix sort according to 
measurements.  There is a special version of LSD radix sort that will sort an 
array in one counting pass and N distribution passes, where N is the radix.  
So, for instance, if you are sorting 4 byte integers and your radix is 256 {one 
byte} then you can sort in 5 passes.  One pass to count the bytes in each 
integer by position, and 4 passes to distribute the data.  This can also be 
done in place.  MSD radix sort can also be used as an outer sort container 
which sorts by distribution until the partitions are small enough and then 
switches to introspective sort (which eventually switches to insertion sort).  
Radix sorts can also be made totally generic via callback routines like 
quicksort.  Instead of a comparison operator, the callback gets request for the 
radix bits for a specific bin number and then returns the radix count bits for 
that specific bin.  For unsigned char and radix 256, this is trivially 
character [i] from the string for bin [i], for example.

I guess improving sort routines all boils down to how much energy you want to 
put into it.  Donald Knuth once stated that according to measurement, about 40% 
of business related mainframe computer cycles are involved in ordering data.  
So it is clearly an important problem.
<<

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: Why do we still perform a check for pre-sorted input within qsort variants?

Reply via email to