On 7/30/21 10:38 PM, Andres Freund wrote:
Hi,
On 2021-07-30 18:42:18 +1200, David Rowley wrote:
While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
sorts for single column sorts), I noticed that we use a separate
memory context to store tuple data, and we just reset it when we're done
if the sort fits in memory, or, when it does not, we dump the tuples to
disk in the same order they're added and then reset the context. There is
a little pfree() work going on via writetup_heap(), which I think
could possibly just be removed to get some additional gains.
Anyway, this context that stores tuples uses the standard aset.c
allocator, which has the usual power-of-2 wastage plus the additional
overheads of freelists etc. I wondered how much faster it would go if
I set it to use a generation context instead of an aset.c one.
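For reference, the change being described amounts to swapping the allocator used for tuplecontext in tuplesort_begin_common(). A minimal sketch, assuming the PG14-era GenerationContextCreate(parent, name, blockSize) signature; the block size shown here is just a placeholder, not necessarily what the patch passes:

    /* before: aset.c context, with power-of-2 chunk rounding */
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_SIZES);

    /* after (sketch): generation context with one fixed block size */
    state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                  "Caller tuples",
                                                  ALLOCSET_DEFAULT_MAXSIZE);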
After running make installcheck to create the tenk1 table and then
running the attached tuplesortbench script, I get this:
So it seems to save quite a bit of memory by getting away from rounding
allocations up to the next power of 2.
Performance-wise, there are some pretty good gains. (Results in TPS)
Very nice!
Yes, very nice. I wouldn't have expected such a significant difference,
particularly in CPU usage. It's pretty interesting that it reduces both
memory and CPU usage; I'd have guessed it would be one or the other.
I wonder if there are cases where generation.c would regress performance
compared to aset.c due to not having an initial / "keeper" block?
Not sure. I guess such a workload would need to allocate and free a single
block (so very little memory) very often. That's possible, but I'm not
aware of a place doing it very often. Although maybe decoding could do
that for a simple (serial) workload.
I'm not opposed to adding a keeper block to Generation, similarly to
what was discussed for Slab not too long ago.
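For anyone not familiar with the term: a "keeper" block, as aset.c has, is a first block whose storage is allocated together with the context itself, so a context reset can wipe and reuse it instead of returning it to malloc. A toy sketch with made-up names, not actual PostgreSQL structures:

    /* Toy sketch of a context with an embedded keeper block */
    typedef struct DemoBlock
    {
        struct DemoBlock *next;     /* next block in the context */
        char   *freeptr;            /* first free byte in this block */
        char   *endptr;             /* end of this block's storage */
    } DemoBlock;

    typedef struct DemoContext
    {
        DemoBlock  *blocks;         /* all blocks, keeper block first */
        DemoBlock   keeper;         /* embedded header, never free()d */
        char        keeper_data[1024];  /* embedded storage for the keeper */
    } DemoContext;

On reset, the keeper's freeptr is simply pointed back at keeper_data while every other block is freed, so a context that only ever needs a little memory never touches malloc again after creation.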
The patch is just a simple 1-liner at the moment. I likely do need to
adjust what I'm passing as the blockSize to GenerationContextCreate().
Maybe a better number would be something that's calculated from
work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64),
so that we just allocate at most a 16th of work_mem per chunk, but not
bigger than 8MB. I don't think changing this will affect the
performance of the above very much.
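Spelled out, that would look something like the following when creating the context (a sketch only; it assumes work_mem is in kilobytes as usual, so work_mem * 64 bytes is one sixteenth of work_mem):

    /* Sketch: cap the block size at 1/16 of work_mem, but at most 8MB */
    Size    tuple_blocksize = Min(ALLOCSET_DEFAULT_MAXSIZE,
                                  ((Size) work_mem) * 64);

    state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                  "Caller tuples",
                                                  tuple_blocksize);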
I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the side of too-big blocks to
handle large amounts of memory well, just so the contexts don't need to
deal with variably sized blocks, isn't a sensible tradeoff.
Well, back then it seemed like a sensible trade off to me, but I agree
it may have negative consequences. I'm not opposed to revisiting this.
I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There are plenty of cases where we'll just sort a handful of
tuples, and increasing the memory usage of those by a factor of 1024
isn't good. The Min() won't do any good if a larger work_mem is used.
Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size, both because of the allocation overhead and, importantly,
because that will make context resets more expensive.
True.
To me this says that we should transplant aset.c's block size growth
behavior into generation.c.
Yeah, maybe.
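As a standalone illustration of that idea (toy code, not PostgreSQL internals): start each context with a small block and double the block size for every additional block, up to a ceiling, the way aset.c does. Small sorts then only ever pay for a small block, while large sorts quickly reach the maximum block size instead of accumulating thousands of tiny blocks:

    #include <stdio.h>
    #include <stddef.h>

    /* Double the block size on each new block, but never exceed max_size. */
    static size_t
    next_block_size(size_t current, size_t max_size)
    {
        size_t  next = current * 2;

        return (next > max_size) ? max_size : next;
    }

    int
    main(void)
    {
        size_t  blksz = 8 * 1024;           /* 8kB, like ALLOCSET_DEFAULT_INITSIZE */
        size_t  maxsz = 8 * 1024 * 1024;    /* 8MB, like ALLOCSET_DEFAULT_MAXSIZE */

        for (int i = 0; i < 12; i++)
        {
            printf("block %d: %zu bytes\n", i, blksz);
            blksz = next_block_size(blksz, maxsz);
        }
        return 0;
    }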
There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree that
generation.c effectively not reusing that memory would matter:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.
Not sure, I'm not familiar with EOH_flatten_into or expanded records.
But I wonder if there's some sort of metric that we could track in
Generation and use it to identify "interesting" places.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company