Brian Hurt wrote:

While we're blue skying things, I've had an idea for a sorting algorithm kicking around for a couple of years that might be interesting. It's a variation on heapsort to make it significantly more block-friendly. I have no idea if the idea would work, or how well it'd work, but it might be worthwhile kicking around.

Now, the core idea of heapsort is that the array is put into heap order: basically, a[i] >= a[2i+1] and a[i] >= a[2i+2] (using the 0-based array version here). The problem is that, if a is larger than memory, then a[2i+1] is likely to be on a different page or block than a[i]. That means every time you bubble down a new element, you end up reading O(log N) blocks, and that's *per element*.
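For reference, the classic element-wise bubble down looks something like this (Python used purely for illustration); note how each step jumps from index i to roughly 2i, so on an array much larger than memory almost every level of the descent lands on a different block:

```python
# Classic element-wise bubble down for a 0-based max-heap. On a huge array,
# each iteration roughly doubles the index, so each level of the descent
# likely touches a different page/block: O(log N) block reads per element.
def sift_down(a, i, n):
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < n and a[left] > a[largest]:
            largest = left
        if right < n and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest  # descend one level: likely a different block

def heapsort(a):
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):  # heapify, O(N)
        sift_down(a, i, n)
    for end in range(n - 1, 0, -1):      # extract the max, one element at a time
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)
    return a
```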

The variation is to work with blocks instead, so you have blocks of entries b[i], and you change the definition of heap order so that min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during bubble down, you need to be careful to only change the minimum value of one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the algorithm works as normal. The advantage of doing it this way is that while each bubble down still touches O(log N) blocks, you get an entire block worth of results for your effort. Make your blocks large enough (say, 1/4 the size of work_mem) and you greatly reduce N, the number of blocks you have to deal with, and get much better I/O (when you're reading, you're reading megabytes at a shot).
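To make the idea concrete, here is a hypothetical in-memory sketch (Python for illustration only; a real implementation would keep each block as a sorted run on disk and use a linear merge rather than sorted()). One simplification relative to the description above: instead of moving a single minimum value at a time, this restores the invariant by merging the parent block with its worst-violating child, keeping the larger half in the parent:

```python
# Sketch of the block variant. Invariant: min(blocks[i]) >= max(blocks[2i+1])
# and min(blocks[i]) >= max(blocks[2i+2]), so the root block dominates the
# whole heap and can be emitted wholesale, a block of results at a time.
def sift_down_block(blocks, i):
    while True:
        bad = [j for j in (2 * i + 1, 2 * i + 2)
               if j < len(blocks) and blocks[j][-1] > blocks[i][0]]
        if not bad:
            return
        j = max(bad, key=lambda k: blocks[k][-1])  # worst violator first
        merged = sorted(blocks[i] + blocks[j])     # really a merge of 2 sorted runs
        cut = len(blocks[j])
        blocks[j] = merged[:cut]                   # child keeps the small half
        blocks[i] = merged[cut:]                   # parent keeps the large half
        sift_down_block(blocks, j)                 # pushed-down values may violate deeper

def block_heapsort(items, block_size):
    blocks = [sorted(items[k:k + block_size])
              for k in range(0, len(items), block_size)]
    for i in range(len(blocks) // 2 - 1, -1, -1):  # heapify, bottom-up
        sift_down_block(blocks, i)
    out = []
    while blocks:
        out.extend(reversed(blocks[0]))  # an entire block of results at once
        last = blocks.pop()
        if blocks:
            blocks[0] = last
            sift_down_block(blocks, 0)
    return out  # descending order
```

With large blocks, each sift_down_block touches O(log N) blocks where N is the (now much smaller) number of blocks, and every trip to the root yields a whole block of output.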

Now, there are boatloads of complexities I'm glossing over here. This is more of a sketch of the idea. But it's something to consider.

Following up to myself (my apologies), but three more advantages of this proposal have occurred to me since:

1) Of the two child blocks b[2i+1] and b[2i+2], the one with the larger minimum element is the one we might replace. In other words, if min(b[2i+1]) > min(b[2i+2]) and min(b[i]) < min(b[2i+1]), then we know we're going to want blocks b[4i+3] and b[4i+4] (the children of b[2i+1]) before we're done with blocks b[2i+1] and b[2i+2]. The point here is that this would work wonders with the posix_fadvise/asyncio ideas kicking around. It'd be easy for the code to keep two large writes and two large reads going pretty much constantly.
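The prediction only needs the per-block minimums, which would presumably be cached in memory. A hypothetical helper (names are mine, not from any real API) that picks the read-ahead targets before the bubble down reaches them, so the blocks can be handed to posix_fadvise(POSIX_FADV_WILLNEED) or an async read:

```python
# Hypothetical helper: given only the cached per-block minimums, predict which
# grandchild blocks a bubble down at node i will need next, so large reads for
# them can be issued early (posix_fadvise / asyncio style).
def blocks_to_prefetch(mins, i):
    n = len(mins)
    left, right = 2 * i + 1, 2 * i + 2
    if left >= n:
        return []                # leaf: nothing below to prefetch
    if right >= n or mins[left] > mins[right]:
        j = left                 # larger-minimum child is the one we may replace
    else:
        j = right
    return [c for c in (2 * j + 1, 2 * j + 2) if c < n]
```

For example, with block minimums [9, 7, 5, 4, 3, 2, 1] (a 7-block heap), a bubble down at the root would descend toward block 1, so blocks 3 and 4 are the prefetch candidates.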

2) There is some easy parallelization available. I'm not sure how much this is worth, but the bubble-down code is fairly easy to parallelize. If we have two bubble-downs going on in parallel, once they go down different branches (one thread goes to block b[2i+1] while the other goes to b[2i+2]) they no longer interact. Blocks near the root of the heap would be contended over, and multiple threads means smaller blocks to keep the total memory footprint the same. Personally, I think the asyncio idea above is more likely to be worthwhile.

3) It's possible to perform the sort lazily. You still pay the initial O(N) pass over the list, but after that each block of results costs only O(log N) block operations to produce. If it's likely that only the first part of the result is needed, much of the work can be avoided.
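The shape of the laziness is easy to show with an element-level stand-in (Python's heapq here; the block version would look the same but yield a block per pop):

```python
import heapq
from itertools import islice

# Element-level stand-in for the lazy behaviour: heapify is the single O(N)
# up-front pass, and each further result costs O(log N) only when the
# consumer actually asks for it.
def lazy_sort(items):
    heap = list(items)
    heapq.heapify(heap)            # the O(N) pass over the list
    while heap:
        yield heapq.heappop(heap)  # O(log N), performed lazily per element

# Only the first few results are ever computed if the caller stops early:
first_three = list(islice(lazy_sort([5, 1, 4, 2, 3]), 3))  # [1, 2, 3]
```

This is exactly the case an ORDER BY ... LIMIT query hits: most of the heap is never touched.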

Brian

