[issue34561] Replace list sorting merge_collapse()?

Tim Peters Fri, 21 Sep 2018 14:59:12 -0700


Tim Peters <[email protected]> added the comment:


Thank you, Vincent!  I very much enjoyed - and appreciated - your paper I 
referenced at the start.  Way back when, I thought I had a proof of O(N log N), 
but never wrote it up because some details weren't convincing - even to me ;-) 
.  Then I had to move on to other things, and never got back to it.  I was 
probably mistaken.  So it was delightful to see your elegant proof of that, and 
even more.

I'm attaching a new runstack.py that includes code modeling your new adaptive 
Shivers strategy.  Like all these approximations, it has its own "bad cases" on 
smallish inputs.  Based on the specific cases this code runs, powersort remains 
in a category of its own, with timsort, twomerge, and shivers roughly tied on 
_average_ merge cost.

Part of the reason, I suspect:  powersort is the only strategy here that's 
always optimal for 3-run cases.  Which it buys at the cost of never merging the 
run just discovered (unless it's the last run in the array).  For example, in

    31, 16, 1

twomerge and shivers merge 31 and 16 first, before seeing 1, which is "far 
from" optimal.  timsort and powersort merge 16 and 1 first, although that's by 
design in powersort and by luck in timsort.  In

    16, 31, 1
    
only powersort does the best thing (note: runstack.py doesn't model peeksort; 
I'm sure it would merge 31 and 1 first on that too).

I care about small-number-of-runs because real-life Python lists aren't 
asymptotic in nature ;-)  People deliberately construct lists with a small 
number of runs now before sorting, because they know Python's sort can exploit 
that.  For many other cases, all runs are artificially forced to length 
`minrun`, and then any scheme at all that approximates balanced merging is 
about as good as any other.

What I can't know before we time things is whether order-of-merging makes a 
measurable difference in real life, and whether powersort's possible delay in 
merging the most recent run has bad cache effects that overwhelm its smaller 
"merge cost".

In any case, I'm keen to replace timsort's merge-order strategy with 
_something_ that's much easier to analyze.  The makes your Shivers variant and 
powersort the only two real contenders now.  It seems quite remarkable that the 
Shivers variant has such good asymptotics!  It's certainly the easiest to 
modify Python's C code to implement (straightforward edits in a single 
function).

----------
Added file: https://bugs.python.org/file47818/runstack.py

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34561>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue34561] Replace list sorting merge_collapse()?

Reply via email to