tuplesort.c initially reads tuples into memory until availMem is exhausted.
It then switches to the tape sort algorithm and allocates buffer space for each tape it will use. This substantially overruns allowedMem and drives availMem negative. It works off this deficit by writing tuples to tape and pfree-ing their spots in the heap; the pfreed space is more or less randomly scattered. Once that is done, it alternates between freeing more space in the heap and adding tuples to the heap (in a nearly strict 1:1 alternation if the tuples are of constant size).

The space freed up by the initial round of pfrees, while it is working off the deficit from inittapes, is never reused. Nor can it be paged out by the VM system, because it is scattered among actively used memory. I don't think small chunks can be reused from one memory context to another, but I haven't checked. Even if they can be, during a big sort like an index build the backend doing the build may have no other contexts that need the space. So having overrun workMem and stomped all over it to ensure no one else can reuse it, we then scrupulously refuse to benefit from that overrun amount ourselves.

The attached patch allows that memory to be reused. On my meager system it reduced the time to build an index on an integer column of a skinny, totally randomly ordered, 200-million-row table by about 3%, from a baseline of 25 minutes.

I think it would be better to pre-deduct from availMem the tape overhead we will need if we decide to switch to tape sort, before we even start reading (and then add it back if we do indeed make that switch). That way we would never overrun the memory in the first place. However, that would cause apparent regressions: sorts that previously fit into maintenance_work_mem would no longer do so.
Boosting maintenance_work_mem to the level that was actually being used before would fix those regressions, but pointing out that the previous behavior was not optimal doesn't change the fact that people are used to it and may have tuned for it. So the attached patch seems the more backwards-friendly approach.

Cheers,

Jeff
sort_mem_usage_v1.patch
Description: Binary data
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers