Just wanted to review a few thoughts and ideas around improving external sorts, as Jim Nasby recently encouraged me to do.
Current issues/opportunities are these:

ISSUES

a) Memory is always in short supply, so using what we have more effectively is going to be welcome.

b) Heap sort has a reasonably strong anti-memory effect, meaning that there is an optimum amount of memory for any sort. This shows itself as CPU time increasing during run forming, making that stage of the sort CPU bound.

c) Many sorts are performed prior to aggregation. It might be possible to aggregate prior to writing to disk, as a way of reducing the overall I/O cost. Benefit would occur when the total CPU cost was the same no matter when aggregation occurred; that would not apply in all cases, so we would need to sense when benefit was possible.

d) Generally reducing the I/O cost of sorting may help the merging stages of a sort.

SOLUTIONS

The ideas that Greg Stark, Jim Nasby, Heikki and myself have discussed to date were the following:

1. Sort I/O Compression
2. Aggregation during Sort
3. Memory Pools
4. Dynamic Heap Management
5. Dynamic Run Handling

I've added (5) to the list as well, which hasn't yet been discussed.

1. SORT I/O COMPRESSION

This idea is not dead yet; it just needs a full set of tests to confirm that there is benefit in all cases. If there isn't benefit in all cases, we may be able to work out which cases do benefit, so we know when to use it. (A toy sketch of what the write path might look like is included further down.)

2. AGGREGATION DURING SORT

Many sorts are preliminary steps before aggregation. Aggregation during run forming would potentially reduce the size of the heap and reduce the number of comparisons. For many types of aggregate this would not theoretically increase the number of ops, since sum(), avg(), min() and max() are all commutative with respect to their inputs (they can be built up from partial results in any order). We would probably need to add another option to Aggregate Functions to indicate that an aggregate can be calculated this way, since some aggregates might rely on the current behaviour of receiving all their inputs at once, in sorted order. (Windowed aggregates are unlikely to be this way.) A sketch of how the run-forming side might fold tuples together is further down as well.
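For (1), just to make the idea concrete, here's roughly what the write side could look like if we squeezed each block of run data through zlib before it hits the temp file. This is only a sketch under my own assumptions, not how logtape.c/tuplesort.c are structured today; write_compressed_block and the way lengths are stored are made up for illustration.

/* Sketch only: compress one block of run data with zlib before it is
 * written to the temp file, recording the compressed length so the read
 * side knows how much to inflate.  A real version would fall back to raw
 * storage when compression doesn't help, and would live in logtape.c. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Returns the number of bytes written to the file, or -1 on error. */
static long
write_compressed_block(FILE *tapefile, const char *block, size_t len)
{
    uLongf  clen = compressBound(len);
    Bytef  *cbuf = malloc(clen);

    if (cbuf == NULL)
        return -1;
    if (compress(cbuf, &clen, (const Bytef *) block, len) != Z_OK)
    {
        free(cbuf);
        return -1;
    }
    /* compressed length first, then the compressed payload */
    fwrite(&clen, sizeof(clen), 1, tapefile);
    fwrite(cbuf, 1, clen, tapefile);
    free(cbuf);
    return (long) (sizeof(clen) + clen);
}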
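For (2), a minimal sketch of the run-forming side, assuming int64 keys and a sum/count transition state (so avg() can be finished later). The names here (PartialAgg, absorb_tuple) are invented for illustration and are not existing executor or tuplesort APIs; partial states for equal keys coming from different runs would be combined the same way at merge time.

/* Sketch only: fold tuples with equal group keys into a single partial
 * aggregate state while a run is written out, instead of writing every
 * input tuple.  Tuples arrive here in the order they leave the heap,
 * i.e. sorted within the run, so equal keys are adjacent. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct PartialAgg
{
    int64_t key;        /* GROUP BY key for this run entry */
    int64_t sum;        /* partial sum() */
    int64_t count;      /* partial count(); sum/count gives avg() later */
    bool    valid;
} PartialAgg;

/* Equal keys are folded into the current state; a key change flushes the
 * previous state to the run file.  The caller must flush the final state
 * when the run ends. */
static void
absorb_tuple(PartialAgg *cur, int64_t key, int64_t value, FILE *runfile)
{
    if (cur->valid && cur->key == key)
    {
        cur->sum += value;      /* transition step, no extra tuple stored */
        cur->count++;
        return;
    }
    if (cur->valid)
        fprintf(runfile, "%lld %lld %lld\n",
                (long long) cur->key, (long long) cur->sum,
                (long long) cur->count);
    cur->key = key;
    cur->sum = value;
    cur->count = 1;
    cur->valid = true;
}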
3. MEMORY POOLS

Solving a) could be done by sensible management and allocation of resources. Discussed before, so not rehashed here.

4. DYNAMIC HEAP MANAGEMENT

The size of the active heap required to produce the fewest runs varies as the sort progresses. For example, sorting an already sorted input needs a trivial heap size. Larger heap sizes simply avoid forming more runs, which is not necessarily a bad thing. More runs only become a bad thing when we go beyond our ability to perform a single final merge (see Dynamic Run Handling below). Smaller heap sizes reduce the number of comparisons required, plus increase the L2+ cache efficiencies. Those two things are the cause of the anti-memory effect. Because of b), optimising the size of the heap could potentially be a good thing. This can make a considerable difference for nearly sorted data (measurements required...).

When we have M amount of memory available to us, we don't start by using it all. We start with m memory and only increase up to M if required. Runs are built with memory set at m. If a tuple arrives that would force the formation of a new run, we assess:

i) do we care if another run is formed? Use our knowledge of the likely amount of data coming our way, compared with the number of runs formed so far, and see if we really care. If we don't care, allow the new run to be formed and carry on with a heap size of just m (see Dynamic Run Handling later).

ii) if we do care about the number of runs, then allow the heap to grow by increments up to the full size of M. Increments would be at least x2 and possibly x4; that way we always have work space to rearrange the heap. (A rough sketch of this growth decision is included further down, after (5).)

All of this dances too cleverly around the exact technique and potential costs of rearranging the heap. That is not to be ignored, and it is the next task in evaluating and accepting/dismissing this potential technique. In combination with memory pooling, this technique might also allow memory to be better distributed to other users.

5. DYNAMIC RUN HANDLING (in Final Merge)

Another way of addressing a) is to simply make better use of memory itself. Let's look at that in more detail.

The number of runs that can be merged at once is currently fixed, based upon available memory. This has the underlying assumption that all runs will be concurrently active during final merging, which may not always be true.

If we have random data then almost all runs will overlap with all other runs, i.e. the min and max values are sufficiently wide that the runs do all overlap. In many cases, data arrives in somewhat sorted order, e.g. financial data is fairly regular with some late payers but not many, and those trail off with a fairly tight decay. In the somewhat sorted case we find that the actual overlap is less than total, so there are many later runs that don't overlap the earlier ones. In the best case we would find runs 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4 overlap.

This is also the point where I suggest breaking away from Knuth completely. All of the main algorithms described by Knuth are tape sorts. A run is written to a particular tape and then stays there until "moved" to another tape. That means we have to get super-clever about how runs should be written and formed (see Knuth). If we realise that the runs aren't fixed to particular tapes but are all just independent runs, we can radically rethink sorting. There is no need to implement Cascade Sort, but we do need to rethink merging from the ground up. (All of which is a relief, because Knuth et al are definitely smarter than me, but I've got disks and lots of memory and those guys had tapes.)

If we track the min and max values for each run, then when run building is finished we will be able to build a merging plan that allows us to be smart about which runs we should bring together. We start with the run with the lowest min value, as well as all runs that overlap that run. When that run is exhausted we move to the next lowest, and at that point start merging all runs that overlap that one. This means we may be able to begin final merging with more runs than the current cut-off. It's possible that we could merge an effectively unlimited number of runs in the final merge with fixed memory. If we *do* need to pre-merge, we can work out which runs are our best pre-merge candidates, based upon how big they are and which other runs they overlap. (That's much better than being forced to merge tapes 2, 7 and 17 because some bizarre math says so (see Knuth).) Anyway, claiming to have found a better way than Knuth makes me feel a little nervous, so some searching questions on this are very welcome.

Interestingly, if we combine this technique with dynamic heap management we may be able to allow a very large number of efficiently written runs to form without it causing any merging. mac_man recently noted the possibility that some runs don't overlap at all and so can be merged for free.
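To make the merge-plan idea in (5) a bit more concrete, here's a rough sketch of grouping runs by their [min,max] bounds once run building has finished. The RunBound struct and the int64 keys are simplifications I've made up for illustration; the real thing would track whatever the leading sort key is. Runs whose ranges overlap (transitively) end up in one merge group, and groups can simply be emitted one after another with no comparisons, i.e. merged for free.

/* Sketch: build a merge plan from per-run [min,max] key bounds.  Sort the
 * runs by min, then sweep, extending a group while the next run's range
 * overlaps the group's current max.  Each group must be merged together;
 * separate groups just concatenate. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct RunBound
{
    int     runno;
    int64_t min;
    int64_t max;
} RunBound;

static int
cmp_by_min(const void *a, const void *b)
{
    const RunBound *ra = a, *rb = b;

    return (ra->min > rb->min) - (ra->min < rb->min);
}

/* Prints one line per merge group; a group of one run needs no merging. */
static void
plan_merges(RunBound *runs, int nruns)
{
    int     i = 0;

    qsort(runs, nruns, sizeof(RunBound), cmp_by_min);
    while (i < nruns)
    {
        int64_t group_max = runs[i].max;
        int     j = i + 1;

        /* extend the group while the next run overlaps it */
        while (j < nruns && runs[j].min <= group_max)
        {
            if (runs[j].max > group_max)
                group_max = runs[j].max;
            j++;
        }
        printf("merge group starting at run %d: %d run(s), keys %lld..%lld\n",
               runs[i].runno, j - i,
               (long long) runs[i].min, (long long) group_max);
        i = j;
    }
}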
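And going back to (4), this is roughly the growth decision I have in mind for the moment a tuple arrives that would start a new run. The struct fields and the estimate of remaining input are assumptions for the sketch, not existing tuplesort state; the point is just the shape of the i)/ii) decision and the x2 growth capped at M.

/* Sketch of the growth decision only, assuming we can estimate how much
 * input remains.  workmem_now starts at some small m and may double up to
 * the full work_mem M. */
#include <stdbool.h>
#include <stdint.h>

typedef struct RunBuildState
{
    int64_t workmem_now;          /* m: memory currently given to the heap */
    int64_t workmem_max;          /* M: the most we are allowed to use */
    int     runs_so_far;
    int     max_cheap_runs;       /* runs we can still merge in one final pass */
    int64_t est_input_remaining;  /* rough estimate of bytes still to come */
} RunBuildState;

/* Called when the next tuple would force a new run with the current heap.
 * Returns true if we should grow the heap instead of starting the run. */
static bool
should_grow_heap(RunBuildState *st)
{
    /* i) do we even care about another run?  If the runs we expect to
     * produce still fit in a single final merge, just let it form. */
    int64_t est_runs_remaining = st->est_input_remaining / st->workmem_now + 1;

    if (st->runs_so_far + est_runs_remaining <= st->max_cheap_runs)
        return false;

    /* ii) we do care: grow by at least x2, but never beyond M */
    if (st->workmem_now * 2 <= st->workmem_max)
    {
        st->workmem_now *= 2;
        return true;
    }
    return false;                 /* already at M, accept the new run */
}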
That's true, though it doesn't actually improve on the basic idea here, which is building a merge plan after runs have been formed, with an eye on minimizing and potentially eliminating the merge phase.

There are probably some typos or thinkos above, so go easy on me, Greg! They aren't there because I want to skim over anything.

I'm not likely to get a chance to do all of this in the near future, so documenting it now should help others to carry things forward.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com