On Thu, Feb 4, 2016 at 6:14 AM, Peter Geoghegan <p...@heroku.com> wrote:
> The economics of using 4MB or even 20MB to sort 10GB of data are
> already preposterously bad for everyone that runs a database server,
> no matter how budget conscious they may be. I can reluctantly accept
> that we need to still use a heap with very low work_mem settings to
> avoid the risk of a regression (in the event of a strong correlation)
> on general principle, but I'm well justified in proposing "just don't
> do that" as the best practical advice.
>
> I thought I had your agreement on that point, Robert; is that
> actually the case?
Peter and I spent a few hours talking on Skype this morning about this
point, and I believe we have agreed on an algorithm that will address
all of my concerns and that I hope will also be acceptable to him.
Peter, please weigh in and let me know if I've gotten anything
incorrect here, or if you think of other concerns afterwards.

The basic idea is that we will add a new GUC with a name like
replacement_sort_mem that will have a default value in the range of
20-30MB; or possibly we will hardcode this value, but for purposes of
this email I'm going to assume it's a GUC. If the value of work_mem
or maintenance_work_mem, whichever applies, is smaller than the value
of replacement_sort_mem, then the latter has no effect. However, if
replacement_sort_mem is the smaller value, then the amount of memory
that can be used for a heap with replacement selection is limited to
replacement_sort_mem: we can use more memory than that in total for
the sort, but the amount that can be used for a heap is restricted to
that value. The way we do this is explained in more detail below.

One thing I just thought of (after the call) is that it might be
better for this GUC to be in units of tuples rather than in units of
memory; it's not clear to me why the optimal heap size should depend
on the tuple size, so we could have a threshold like 300,000 tuples
or whatever. But that's a secondary issue and I might be wrong about
it. The point is that, in order to have a chance of winning, a heap
used for replacement selection needs to be not very big at all by the
standards of modern hardware, so the plan is to limit it to a size at
which it may have a chance.

Here's how that will work, assuming Peter and I understand each other:

1. We start reading the input data. If we reach the end of the input
data before (maintenance_)work_mem is exhausted, then we can simply
quicksort the data and we're done. This is no different from what we
already do today.

2. If (maintenance_)work_mem fills up completely, we will quicksort
all of the data we have in memory. We will then regard the tail end
of that sorted data, in an amount governed by replacement_sort_mem,
as a heap, and use it to perform replacement selection until no
tuples remain for the current run. Meanwhile, the rest of the sorted
data remains in memory untouched. Logically, we're constructing a run
of tuples which is split between memory and disk: the head of the run
(what fits in all of (maintenance_)work_mem except for
replacement_sort_mem) is in memory, and the tail of the run is on
disk.

3. If we reach the end of the input before replacement selection runs
out of tuples for the current run, and if the heap has found no
tuples for the next run before that point, then we are done. All of
the tuples form a single run, and we can return the tuples in memory
first, followed by the tuples on disk. This case is highly likely to
be a huge win over what we have today, because (a) some portion of
the tuples were sorted via quicksort rather than heapsort, and that's
faster; (b) the tuples that were sorted using a heap were sorted
using a small heap rather than a big one; and (c) we only wrote out
the minimal number of tuples to tape instead of, as we would have
done today, all of them.

4. If we reach this step, then replacement selection with a small
heap wasn't able to sort the input in a single run. We have a bunch
of sorted data in memory which is the head of the same run whose tail
is already on disk; we now spill all of these tuples to disk. That
leaves only the heapified tuples in memory. We just ignore the fact
that they are a heap and treat them as unsorted. We repeatedly do the
following: read tuples until work_mem is full, sort them, and dump
the result to disk as a run. When all runs have been created, we
merge the runs just as we do today.
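To make the control flow above concrete, here's a toy C sketch I put
together for this email; it is an illustration, not proposed code.
Ints stand in for tuples, stdout stands in for tape, and MEM_CAP and
HEAP_CAP are invented stand-ins for (maintenance_)work_mem and
replacement_sort_mem, counted in tuples. One simplification: where a
real implementation would keep next-run tuples in the heap tagged
with their run number, this sketch just sets them aside in a little
dead[] array, which comes to the same thing for this purpose. Note
how, after the up-front quicksort, the sorted tail of the array
already satisfies the min-heap invariant, so heapification is free:

#include <stdio.h>
#include <stdlib.h>

#define MEM_CAP  8      /* toy stand-in for (maintenance_)work_mem, in tuples */
#define HEAP_CAP 3      /* toy stand-in for replacement_sort_mem, in tuples */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/* Restore the min-heap property after the root has been replaced. */
static void sift_down(int *h, int n)
{
    for (int i = 0;;)
    {
        int c = 2 * i + 1;
        if (c >= n)
            break;
        if (c + 1 < n && h[c + 1] < h[c])
            c++;
        if (h[i] <= h[c])
            break;
        int t = h[i]; h[i] = h[c]; h[c] = t;
        i = c;
    }
}

int main(void)
{
    /* Nearly sorted input; the "middle" value 50 only arrives at the end. */
    int input[] = {10, 20, 30, 40, 60, 70, 80, 90, 100, 110, 120, 50};
    int ninput = sizeof(input) / sizeof(input[0]);
    int mem[MEM_CAP], dead[HEAP_CAP];   /* dead[] = displaced next-run tuples */
    int pos = 0, nmem = 0, ndead = 0, run = 1;

    /* Step 1: read until memory fills up or the input ends. */
    while (nmem < MEM_CAP && pos < ninput)
        mem[nmem++] = input[pos++];
    qsort(mem, nmem, sizeof(int), cmp_int);
    if (pos == ninput)
    {
        printf("run 1 (entirely in memory)\n");
        return 0;
    }

    /*
     * Step 2: the sorted tail of memory is already a valid min-heap.  The
     * sorted head stays in memory as the start of run 1, while replacement
     * selection extends the run's tail on "tape" (stdout, here).
     */
    int *heap = mem + (MEM_CAP - HEAP_CAP);
    int nheap = HEAP_CAP;

    printf("run 1:");
    for (int i = 0; i < MEM_CAP - HEAP_CAP; i++)
        printf(" %d", mem[i]);          /* head of run 1, still in memory */
    while (nheap > 0)
    {
        int out = heap[0];

        printf(" %d", out);             /* tail of run 1, written to tape */
        if (pos < ninput && input[pos] >= out)
            heap[0] = input[pos++];     /* still fits in the current run */
        else
        {
            if (pos < ninput)
                dead[ndead++] = input[pos++];   /* belongs to the next run */
            heap[0] = heap[--nheap];    /* heap shrinks */
        }
        sift_down(heap, nheap);
    }
    printf("\n");

    /*
     * Step 3: if ndead is 0 here, everything fit into a single run and we
     * are done.  Step 4: otherwise, the displaced tuples seed the next run,
     * and all remaining runs are built by quicksorting full loads of memory.
     */
    while (ndead > 0 || pos < ninput)
    {
        nmem = 0;
        while (ndead > 0 && nmem < MEM_CAP)
            mem[nmem++] = dead[--ndead];
        while (nmem < MEM_CAP && pos < ninput)
            mem[nmem++] = input[pos++];
        qsort(mem, nmem, sizeof(int), cmp_int);
        printf("run %d:", ++run);
        for (int i = 0; i < nmem; i++)
            printf(" %d", mem[i]);
        printf("\n");
    }
    return 0;
}

With these toy sizes, the head of run 1 (five tuples) never touches
the heap at all, which is the quicksort win from step 2 in miniature.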
This algorithm seems very likely to beat what we do today in
practically all cases. The benchmarking Peter and others have already
done shows that building runs with quicksort rather than replacement
selection can often win even if the larger number of runs requires a
multi-pass merge. The only cases where it didn't seem to be a clear
win involved data that was already in sorted order, or very close to
it. But with this algorithm, presorted input is fine: we'll quicksort
some of it (which is faster than replacement selection, because
quicksort checks for presorted input) and sort the rest with a
*small* heap (which is faster than our current approach of sorting it
with a big heap when the data is already in order). On top of that,
we'll only write out the minimal amount of data to disk rather than
all of it. So we should still win. If, on the other hand, the data is
out of order, then we will do only a little bit of replacement
selection before switching over to building runs by quicksorting,
which should also win.

The worst case I was able to think of for this algorithm is an input
stream that is larger than work_mem and almost sorted: the only
exception is that the record that should be exactly in the middle is
all the way at the end. In that case, today's code will use a large
heap and will consequently produce only a single run. The algorithm
above will end up producing two runs, the second containing only that
one tuple. That means we're going to incur the additional cost of a
merge pass. On the other hand, we're also going to have substantial
savings to offset that - the run-building stage will save by using
quicksort for some of the data and a small heap for the rest. So the
cost of merging the runs will be at least partially, maybe
completely, offset by reduced time spent building them. Furthermore,
Peter has other improvements in the patch which also make merging
faster, so if we don't save enough building the runs to completely
counterbalance the cost of the merge, well, we may still win for that
reason. Even if not, this is so much faster overall that a regression
in some sort of constructed worst case isn't really important. I feel
that presorted input is a sufficiently common case that we should try
hard not to regress it - but presorted input with the middle value
moved to the end is not. We need to not be horrible in that case, but
there's absolutely no reason to believe that we will be. We may even
be faster; we certainly shouldn't be abysmally slower.

Doing it this way also avoids the need for a cost model that decides
how to sort based on the anticipated size of the input. I'm really
very happy about that, because I feel that any such cost model, no
matter how good, is a risk: estimation errors are not uncommon. Maybe
a really sturdy cost model would be OK in the end, but not needing
one is better. We don't need to fear burning a lot of time on
replacement selection, because the heap is small - any significant
amount of out-of-order data will cause us to switch to the main
algorithm, which is building runs by quicksorting. The decision is
made based on the actual data we see rather than on any estimate.
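Going back to the constructed worst case for a second: the toy sketch
above is fed exactly that shape of input - values 10 through 120 in
order, with the middle value 50 held back until the very end. Unless
I've fumbled the trace, it prints:

    run 1: 10 20 30 40 60 70 80 90 100 110 120
    run 2: 50

The small heap absorbs in-order input indefinitely, so the first run
swallows everything except the one displaced tuple, and the extra
merge pass is a single pass over two runs, one of which contains a
single tuple. Toy numbers, obviously, but it shows the shape of the
regression: real, and bounded.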
There's only one potentially tunable parameter, replacement_sort_mem,
and it probably won't hurt you very much even if it's wrong by a
factor of two; there's also no reason to believe the right value will
be very different on one machine than on another. So this seems like
it should be pretty robust.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company