Re: the classic "which is the fastest way to sort" question

Shlomi Fish Mon, 09 Nov 2009 10:38:20 -0800

On Monday 09 Nov 2009 19:40:29 Uri Guttman wrote:
> >>>>> "MA" == Michael Alipio <daem0n...@yahoo.com> writes:
> 
>   MA> i'm planning to sort an input file (which was File::Slurp'ed, most
>   MA> likely megabyte-sized file) in various ways. I did some readings
>   MA> and learned several methods that people have come up with in
>   MA> recent years. So to summarize, the default sort is fast (uses
>   MA> quick sort), explicit (using sub) is a bit slower, other method
>   MA> that uses caching is faster. Then there's Schwartzian Transform
>   MA> and a packed version by Guttman. Seems like everything is
>   MA> clear. Guttman is the fastest, until I went to cpan. Found
>   MA> Sort::Key, which claims to be the fastest, even faster that ST,
>   MA> GRT. Now, before someone says, why not try each one and see for
>   MA> yourself (which doing such could be another subject for me to
>   MA> learn), my question is this: if such faster sorting algorithms
>   MA> exist, why don't they just replace the default sort function in
>   MA> Perl?
> 
> you need to understand how those techniques work and some sort theory as
> well. perl's internal sort is a generic sort on strings (using the
> quicksort algorithm).


Just a small correction. Starting from perl-5.8.x, Perl's sort operator is 
using the Merge sort algorithm, which was shown to perform somewhat better 
than Quicksort in the benchmarks that were conducted:

http://en.wikipedia.org/wiki/Merge_sort#Comparison_with_other_sort_algorithms

Here's a little theory:

1. Merge sort has a *worst-case* complexity of O(N*log(N)), but it was 
believed to be less performant than Quicksort.

2. Quick Sort has a worst-case complexity of O(N**2) and a provably average 
case complexity of O(N*log(N)) , and was usually believed to be more 
performant than Merge sort. The standard C library (libc, the Windows C RTL, 
etc.) contains a qsort() function that implements Quicksort, but suffers from 
several philosophical limitations. perl 5 (the Perl 5 implementation) used to 
use qsort() up to version 5.005 or so, when it was replaced with a custom 
function.

Some of the traditional Quick Sort's worst (O(N**2)) performance is when it is 
used to sort in-order or close-to-in-order data, though some of the variations 
of it can be avoided.

3. Merge sort can also be efficiently used for sorting linked-lists as opposed 
to random-access arrays.

4. Generally, with small data sets, it is a better idea to use a naïve O(N**2) 
sorting algorithm such as Insertion Sort (which is generally better than 
Bubble Sort and saner).

5. Given some known assumptions on the data, one can sometimes sort at a 
better asymptotic complexity than O(N*log(N)):

* http://en.wikipedia.org/wiki/Counting_sort

* http://en.wikipedia.org/wiki/Radix_sort

--------------

Oh and "general sort on strings" should be "general sort on scalars", though I 
suppose that's what you meant. Using perl's "||" operator one can conviently 
sort based on several aspects. E.g:

<<<<<<<
my @sorted_records =
(sort 
{ 
        ($a->{'last_name'} cmp $b->{'last_names})
                ||
        ($a->{'first_name'} cmp $b->{'first_name'})
                ||
        ($a->{'age'} <=> $b->{'age'})
}
@records)
>>>>>>>

(Untested.)

Too bad the || operator is magical in Perl and cannot be easily extended in 
user-land, though you can do something similar with or(sub { ... }, sub { ... 
}, sub { ... }, sub { ... }).

Regards,

        Shlomi Fish

-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
Original Riddles - http://www.shlomifish.org/puzzles/

Chuck Norris read the entire English Wikipedia in 24 hours. Twice.

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/

Re: the classic "which is the fastest way to sort" question

Reply via email to