Re: Use generation context to speed up tuplesorts
On Sunday, 18 June 2023 at 20:22:17 CEST, Tomas Vondra wrote:
> Hi Ronan,
>
> We briefly chatted about the glibc-tuning part of this thread at pgcon,
> so I wonder if you're still planning to pursue that. If you do, I
> suggest we start a fresh thread, so that it's not mixed with the already
> committed improvements of generation context.
>
> I wonder what's the situation with the generation context improvements
> already pushed - does setting the glibc thresholds still help, and if
> yes how much?

Hi Tomas!

I'm currently working on it, trying to evaluate the possible benefits for the
current ASet and Generation context use cases. I'll make sure to use a new
thread to post my findings. Thank you for remembering!

Best regards,

--
Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Hi Ronan,

We briefly chatted about the glibc-tuning part of this thread at pgcon,
so I wonder if you're still planning to pursue that. If you do, I
suggest we start a fresh thread, so that it's not mixed with the already
committed improvements of generation context.

I wonder what's the situation with the generation context improvements
already pushed - does setting the glibc thresholds still help, and if
yes how much?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
Thanks for raising this.

On Sun, 24 Apr 2022 at 12:59, Noah Misch wrote:
> This (commit 77bae39) did not change function parameter counts, and
> TUPLESORT_RANDOMACCESS generally has the same numeric value as "true".
> I get no warning if I pass "true" for the "sortopt" flags parameter.
> Hence, I suspect this did not break the API. Should we be happy about
> that? I'm fine with it.

I'd be open to doing something like making sortopt the first argument of
each of the tuplesort_begin* functions if people have concerns about this
causing problems. However, given that TUPLESORT_RANDOMACCESS and true
likely have the same value, it seems we'd be more likely to break code
down the line if we added some option that's needed but that wouldn't get
set by passing "true" as the sortopt.

If anyone were to use tuplesort_set_bound() without updating their
tuplesort_begin* call, then on non-assert builds nothing would misbehave.
There's just a risk of bad memory fragmentation from using the generation
context for bounded sorts.

David
Re: Use generation context to speed up tuplesorts
On Sat, Apr 23, 2022 at 5:59 PM Noah Misch wrote:
> This (commit 77bae39) did not change function parameter counts, and
> TUPLESORT_RANDOMACCESS generally has the same numeric value as "true".
> I get no warning if I pass "true" for the "sortopt" flags parameter.
> Hence, I suspect this did not break the API. Should we be happy about
> that? I'm fine with it.

If I happened to believe that this issue (or one like it) might have real
negative consequences, and that those consequences could easily be
avoided (by making the API break impossible to overlook), then I would
object -- why even take a small chance? Fortunately, I don't believe that
we're even taking a small chance here, all things considered.

And so I agree; this issue isn't a concern.

--
Peter Geoghegan
Re: Use generation context to speed up tuplesorts
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> 0002:
> This modifies the tuplesort API so that instead of having a
> randomAccess bool flag, this is changed to a bitwise flag to which we
> can add further options in the future. It's slightly annoying to break
> the API, but it's not exactly going to be hard for people to code
> around that.

For what it's worth, the three PGXN extensions using tuplesort_begin_*
are "citus", "pg_bulkload", and "vector". Nothing calls
tuplesort_set_bound().

This (commit 77bae39) did not change function parameter counts, and
TUPLESORT_RANDOMACCESS generally has the same numeric value as "true".
I get no warning if I pass "true" for the "sortopt" flags parameter.
Hence, I suspect this did not break the API. Should we be happy about
that? I'm fine with it.
Re: Use generation context to speed up tuplesorts
On Sat, 2 Apr 2022 at 02:25, Justin Pryzby wrote:
>
> On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> > 0002:
> > This modifies the tuplesort API so that instead of having a
> > randomAccess bool flag,
>
> Per cirrus, this missed updating one line for dtrace.

Thanks. I've pushed all 3 of the patches now.

David
Re: Use generation context to speed up tuplesorts
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> 0002:
> This modifies the tuplesort API so that instead of having a
> randomAccess bool flag,

Per cirrus, this missed updating one line for dtrace.

On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> I feel this is a pretty simple patch and if nobody has any objections
> then I plan to push all 3 of them on my New Zealand Monday morning.

I'll mark the CF as such.

--
Justin

>From a6271ac626b561cdafe60152eb4fb208e7fe7240 Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH 1/4] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have a minimum, initial and maximum
block sizes. The standard allocator allows this already but when the
generation context was added, it only allowed fixed-sized blocks. The
problem with fixed-sized blocks is that it's difficult to choose how
large to make the blocks. If the chosen size is too small then we'd end
up with a large number of blocks and a large number of malloc calls. If
the block size is made too large, then memory is wasted.

2. Add support for "keeper" blocks. This is a special block that is
allocated along with the context itself but is never freed. Instead,
when the last chunk in the keeper block is freed, we simply mark the
block as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again. We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there
is a "freeblock" and use that if it contains enough space for the
allocation.

Author: David Rowley, Tomas Vondra
Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
---
 src/backend/access/gist/gistvacuum.c      |   6 +
 .../replication/logical/reorderbuffer.c   |   7 +
 src/backend/utils/mmgr/generation.c       | 383 ++
 src/include/utils/memutils.h              |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index aac4afab8ff..f190decdff2 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * pages in page_set_context. Internally, the integer set will remember
 	 * this context so that the subsequent allocations for these integer sets
 	 * will be done from the same context.
+	 *
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
 													  "GiST VACUUM page set context",
+													  16 * 1024,
+													  16 * 1024,
 													  16 * 1024);
 	oldctx = MemoryContextSwitchTo(vstate.page_set_context);
 	vstate.internal_page_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d9be81fae..4702750a2e7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -370,8 +370,15 @@ ReorderBufferAllocate(void)
 										  SLAB_DEFAULT_BLOCK_SIZE,
 										  sizeof(ReorderBufferTXN));
 
+	/*
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
+	 */
 	buffer->tup_context = GenerationContextCreate(new_ctx,
 												  "Tuples",
+												  SLAB_LARGE_BLOCK_SIZE,
+												  SLAB_LARGE_BLOCK_SIZE,
 												  SLAB_LARGE_BLOCK_SIZE);
 
 	hash_ctl.keysize = sizeof(TransactionId);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 95c006f689f..29065201556 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -20,18 +20,15 @@
  *
  *	The memory context uses a very simple approach to free space management.
  *	Instead of a complex global freelist, each block tracks a number
- *	of allocated and freed chunks. Freed chunks are not reused, and once all
- *	chunks in a block are freed, the whole block is thrown away. When the
- *	chunks allocated in the same block have similar lifespan, this works
- *	very well and is very cheap.
+ *	of allocated and freed chunks. The block is classed as empty when the
+ *	number of free chunks is equal to the number of allocated chunks. When
+ *	this occurs, instead of freeing the block, we try to "recycle" it, i.e.
+ *	reuse it for new allocations. This is done by setting the block in the
Re: Use generation context to speed up tuplesorts
On Wed, 23 Mar 2022 at 04:08, David Rowley wrote:
> If nobody has any objections to the 0001 patch then I'd like to move
> ahead with it in the next few days. For the 0002 patch, I'm currently
> feeling like we shouldn't be using the Generation context for bounded
> sorts. The only way I can think to do that is with an API change to
> tuplesort. I feel 0001 is useful even without 0002.

0001:
I've made some small revisions to the 0001 patch so that the keeper
block and freeblock are only used again when they're entirely empty.

The situation I want to avoid is that, when the current block does not
have enough free space to store some new allocation, we'd then try the
freeblock and then the keeper block. The problem there is that we may
previously have partially filled the keeper block or freeblock, been
unable to store some medium-sized allocation, and moved on to a new
block. If we didn't check that the keeper and freeblock blocks were
empty first, we could end up storing some small allocation in there
where a previous medium-sized allocation couldn't fit.

While that's not necessarily an actual problem, it does mean that
consecutive allocations might not end up in the same block or the next
block. Over time, in a FIFO-type workload, it would be possible to get
fragmentation, which could result in being unable to free blocks. For
longer-lived contexts I imagine that could end up fairly bad. The
updated patch should avoid that problem.

0002:
This modifies the tuplesort API so that instead of having a
randomAccess bool flag, this is changed to a bitwise flag to which we
can add further options in the future. It's slightly annoying to break
the API, but it's not exactly going to be hard for people to code
around that. It might also mean we don't have to break the API in the
future if we're doing some change where we can just add a new bitwise
flag.
0003:
This adds a new flag for TUPLESORT_ALLOWBOUNDED and modifies the places
where we set a sort bound to pass the flag. The patch only uses the
generation context when the flag is not passed.

I feel this is a pretty simple patch and if nobody has any objections
then I plan to push all 3 of them on my New Zealand Monday morning.

David

From 053066f8276809af1b3f2c26e1e86b8e1db85bac Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH v5 1/3] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have a minimum, initial and maximum
block sizes. The standard allocator allows this already but when the
generation context was added, it only allowed fixed-sized blocks. The
problem with fixed-sized blocks is that it's difficult to choose how
large to make the blocks. If the chosen size is too small then we'd end
up with a large number of blocks and a large number of malloc calls. If
the block size is made too large, then memory is wasted.

2. Add support for "keeper" blocks. This is a special block that is
allocated along with the context itself but is never freed. Instead,
when the last chunk in the keeper block is freed, we simply mark the
block as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again. We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there
is a "freeblock" and use that if it contains enough space for the
allocation.

Author: David Rowley, Tomas Vondra
Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
---
 src/backend/access/gist/gistvacuum.c      |   6 +
 .../replication/logical/reorderbuffer.c   |   7 +
 src/backend/utils/mmgr/generation.c       | 383 ++
 src/include/utils/memutils.h              |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index aac4afab8f..f190decdff 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * pages in page_set_context. Internally, the integer set will remember
 	 * this context so that the subsequent allocations for these integer sets
 	 * will be done from the same context.
+	 *
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
Re: Use generation context to speed up tuplesorts
On Wed, 23 Feb 2022 at 15:25, Andres Freund wrote:
> On 2022-02-18 12:10:51 +1300, David Rowley wrote:
> > The other way I thought to fix it was by changing the logic for when
> > generation blocks are freed. In the problem case mentioned above, the
> > block being freed is the current block (which was just allocated). I
> > made some changes to adjust this behaviour so that we no longer free
> > the block when the final chunk is pfree()'d. Instead, that now lingers
> > and can be reused by future allocations, providing they fit inside it.
>
> That makes sense to me, as long as we keep just one such block.

I've rewritten the patch in such a way that it no longer matters which
block becomes free. What I did was add a new field to the
GenerationContext named "freeblock". We now simply store up to 1
recently emptied block there instead of calling free() on it. When
we're allocating new memory, we'll try to make use of the "freeblock"
when we don't have enough space in the current block.

I feel like this is quite a good change as it saves continual
free/malloc calls for FIFO-pattern users of the generation context.
Since that's pretty much the workload that this context is intended
for, that seems like a very good change to make.

I did consider only recording a block as free once it reaches the
maximum size. As I have the code currently, we'll continually reuse all
blocks, including the initial-sized ones. That will end up filling
blocks quicker, as we'll be reusing those smaller initial blocks
continually with FIFO workloads. I'm not entirely sure if that matters.
I just wanted to point out that I thought about it.

Aside from the freeblock, I've also added code for adding a "keeper"
block. This is allocated in the same malloc'd chunk as the Generation
context itself. One thing I'm not too sure about is if the keeper block
change is worth the trouble.

If we pfree() all of the chunks out of the keeper block, there's a
special case in GenerationFree() to empty that block rather than
free() it. This also means we need a special case in GenerationAlloc()
so we try to see if the keeper block has any space before we go and
allocate a new block.

My current thoughts are that the keeper block is really only useful for
Generation contexts that remain very small and don't ever require
allocation of additional blocks. This will mean all the memory can be
allocated along with the context, which saves a malloc and free call.
Does anyone have any thoughts on this?

> > The problem I see with this method is that there still could be some
> > pathological case that causes us to end up storing just a single
> > tuple per generation block.
>
> Crazy idea: Detect the situation, and recompact. Create a new context,
> copy all the tuples over, delete the old context. That could be a win
> even in less adversarial situations than "a single tuple per
> generation block".

I think that would need some new MemoryContext API functions which
callers could use to determine the fragmentation and do something about
it on the consumer side. Otherwise, as Tomas says, if we did it from
within the context code itself we'd have no visibility of what is
pointing to the memory we're moving around.

If nobody has any objections to the 0001 patch then I'd like to move
ahead with it in the next few days. For the 0002 patch, I'm currently
feeling like we shouldn't be using the Generation context for bounded
sorts. The only way I can think to do that is with an API change to
tuplesort. I feel 0001 is useful even without 0002.

David

From 546ac7a3f5ad993c76576e09dd031a453a70c692 Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH v4 1/2] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have a minimum, initial and maximum
block sizes. The standard allocator allows this already but when the
generation context was added, it only allowed fixed-sized blocks. The
problem with fixed-sized blocks is that it's difficult to choose how
large to make the blocks. If the chosen size is too small then we'd end
up with a large number of blocks and a large number of malloc calls. If
the block size is made too large, then memory is wasted.

2. Add support for "keeper" blocks. This is a special block that is
allocated along with the context itself but is never freed. Instead,
when the last chunk in the keeper block is freed, we simply mark the
block as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again. We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there
is a "freeblock" and use that if it contains enough space for the
allocation.

Author: David Rowley, Tomas Vondra
Discussion:
Re: Use generation context to speed up tuplesorts
On 2/23/22 03:25, Andres Freund wrote:
> Hi,
>
> On 2022-02-18 12:10:51 +1300, David Rowley wrote:
>> The other way I thought to fix it was by changing the logic for when
>> generation blocks are freed. In the problem case mentioned above, the
>> block being freed is the current block (which was just allocated). I
>> made some changes to adjust this behaviour so that we no longer free
>> the block when the final chunk is pfree()'d. Instead, that now lingers
>> and can be reused by future allocations, providing they fit inside it.
>
> That makes sense to me, as long as we keep just one such block.
>
>
>> The problem I see with this method is that there still could be some
>> pathological case that causes us to end up storing just a single tuple per
>> generation block.
>
> Crazy idea: Detect the situation, and recompact. Create a new context, copy
> all the tuples over, delete the old context. That could be a win even in less
> adversarial situations than "a single tuple per generation block".
>

What about pointers to the chunks in the old memory context?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
Hi,

On 2022-02-18 12:10:51 +1300, David Rowley wrote:
> The other way I thought to fix it was by changing the logic for when
> generation blocks are freed. In the problem case mentioned above, the
> block being freed is the current block (which was just allocated). I
> made some changes to adjust this behaviour so that we no longer free
> the block when the final chunk is pfree()'d. Instead, that now lingers
> and can be reused by future allocations, providing they fit inside it.

That makes sense to me, as long as we keep just one such block.

> The problem I see with this method is that there still could be some
> pathological case that causes us to end up storing just a single tuple per
> generation block.

Crazy idea: Detect the situation, and recompact. Create a new context,
copy all the tuples over, delete the old context. That could be a win
even in less adversarial situations than "a single tuple per generation
block".

Greetings,

Andres Freund
Re: Use generation context to speed up tuplesorts
On Sun, 13 Feb 2022 at 09:56, Tomas Vondra wrote:
> I'm not against pushing the generation context improvements, e.g.
> something like the patches posted in [1], because those seem reasonable
> in general. But I'm somewhat confused about the last patch (adjusting
> allocChunkLimit) causing fairly significant regressions.

My thoughts here are that we should pursue the patch that allows
growing of the block size in the generation context. I do think the
investigation of the malloc() behaviour and performance is worth some
pursuit, I just don't think it should be done here or as part of this
patch. I think it's a fairly large undertaking to ensure any changes to
the malloc options are an overall win and not just a win for whatever
benchmark we test them on. If they're really an overall win, then
shouldn't glibc know about them and maybe adopt them as the standard
options?

To get the ball rolling again on the changes to the generation context,
I took your patches, Tomas, and fixed a few things around the keeper
blocks not working correctly. I've now made the keeper block behaviour
match how aset.c works. There, the keeper block is allocated in the
same allocation as the context itself. That reduces the number of
malloc() calls when setting up a new memory context.

On testing this, I discovered a problem regarding the use of generation
contexts for storing tuples in tuplesort.c. The problem is that when
the sort is bounded (e.g. SELECT * FROM ... ORDER BY .. LIMIT N), we'll
store and directly throw away any tuples that sort greater than the
existing set of N tuples. This causes a problem because once we move
away from using the keeper block, initially we'll at some point end up
storing a tuple in a newly allocated generation block and then proceed
to pfree() that tuple due to it sorting greater than the existing N
tuples. Because we currently always free() newly empty blocks, we end
up thrashing free()/malloc().

This behaviour results in one of the regression test queries in
tuplesort.sql going from taking 10 ms to 1 minute, on my machine.

I had a few thoughts about how best to work around this problem.
Firstly, I thought that we should just *not* use the Generation context
when the sort is bounded. That turns out to be a bit icky, as we only
make the sort bounded when we call tuplesort_set_bound(), which is
after we've allocated the memory context for storing tuples. I thought
about maybe just adding another bool flag named "allowBounding" and
using a Generation context if that flag is not set. I just don't really
like that as a solution. We also cannot make setting the bound part of
the tuplesort_begin_heap() call, as nodeIncrementalSort.c does reset
the bound. Possibly we could ditch tuplesort_set_bound(), invent
tuplesort_reset_bound(), and change the API so the bound is set when
making the Tuplesortstate.

The other way I thought to fix it was by changing the logic for when
generation blocks are freed. In the problem case mentioned above, the
block being freed is the current block (which was just allocated). I
made some changes to adjust this behaviour so that we no longer free
the block when the final chunk is pfree()'d. Instead, that now lingers
and can be reused by future allocations, providing they fit inside it.
If they don't fit then we must free the block, otherwise it would just
remain empty forever.

The problem I see with this method is that there still could be some
pathological case that causes us to end up storing just a single tuple
per generation block. Hitting the worst case here does seem pretty
unlikely, so I'm a bit torn on whether we should just play it safe and
stick to the standard memory allocator for bounded sorts.

I've attached 2 patches. The 0001 patch adds the new logic to
generation.c to allow the block to grow and also adds the logic to skip
freeing the current block when it becomes empty.
The 0002 patch simply makes tuplesort.c use the generation context for
tuples unconditionally.

I've also attached the benchmark results I got from this patch running
the attached sortbenchall script. It's fairly obvious that there are
performance improvements here even when the malloc() usage pattern is
similar to aset.c's.

David

#!/bin/bash
dbname=postgres
mod=100
sec=5
psql -c "alter system set max_parallel_workers_per_gather TO 0;" $dbname
psql -c "select pg_reload_conf();" $dbname
for mem in "4MB" "4GB"
do
psql -c "alter system set work_mem TO '$mem';" $dbname
psql -c "select pg_reload_conf();" $dbname
psql -c "show work_mem" $dbname
sec=10
for sql in "select * from tenk1 order by two offset 100" "select * from tenk1 order by tenthous offset 100" "select * from tenk1 order by stringu1 offset 100" "select * from tenk1 order by stringu2 offset 100" "select * from tenk1 order by twenty offset 100" "select twenty,sum(unique1 order by unique1) from tenk1 group by twenty" "select unique1,sum(twenty order by twenty)
Re: Use generation context to speed up tuplesorts
On 1/20/22 08:36, Ronan Dunklau wrote:
> On Thursday, 20 January 2022 at 02:31:15 CET, David Rowley wrote:
>> On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote:
>>> I'm also a bit confused about which patch(es) should actually be reviewed
>>> in that thread. My understanding is that the original patch might not be
>>> relevant anymore but I might be wrong. The CF entry should probably be
>>> updated to clarify that, with an annotation / title change / author update
>>> or something.
>>>
>>> In the meantime I will switch the entry to Waiting on Author.
>>
>> I think what's going on here is a bit confusing. There's quite a few
>> patches on here that aim to do very different things.
>>
>> The patch I originally proposed was just to make tuplesort use
>> generation memory contexts. This helps speed up tuplesort because
>> generation contexts allow palloc() to be faster than the standard
>> allocator. Additionally, there's no power-of-2 wastage with generation
>> contexts like there is with the standard allocator. This helps fit
>> more tuples on fewer CPU cache lines.
>>
>> I believe Andres then mentioned that the fixed block size for
>> generation contexts might become a problem. With the standard
>> allocator the actual size of the underlying malloc() grows up until a
>> maximum size. With generation contexts this is always the same, so we
>> could end up with a very large number of blocks which means lots of
>> small mallocs which could end up slow. Tomas then posted a series of
>> patches to address this.
>>
>> I then posted another patch that has the planner make a choice on the
>> size of the blocks to use for the generation context based on the
>> tuple width and number of tuple estimates. My hope there was to
>> improve performance further by making a better choice for how large to
>> make the blocks in the first place. This reduces the need for Tomas'
>> patch, but does not eliminate the need for it.
>>
>> As of now, I still believe we'll need Tomas' patches to allow the
>> block size to grow up to a maximum size. I think those patches are
>> likely needed before we think about making tuplesort use generation
>> contexts. The reason I believe this is that we don't always have good
>> estimates for the number of tuples we're going to sort. nodeSort.c is
>> fairly easy as it just fetches tuples once and then sorts them. The
>> use of tuplesort for nodeIncrementalSort.c is much more complex as
>> we'll likely be performing a series of smaller sorts which are much
>> harder to get size estimates for. This means there's likely no magic
>> block size that we can use for the generation context that's used to
>> store the tuples in tuplesort. This is also the case for order by
>> aggregates and probably most other usages of tuplesort.
>>
>
> You left out the discussion I started about glibc's malloc-specific
> behaviour.
>
> I tried to demonstrate that a big part of the performance gain you were
> seeing was not in fact due to our allocator by itself, but to the way
> different block size allocations interact with this malloc
> implementation and how it handles releasing memory back to the system.
> I also tried to show that we can give DBAs more control over how much
> memory to "waste" as the price for faster allocations.
>
> I agree that the generation context memory manager is useful in the
> tuplesort context, if only for the fact that we fall back to disk less
> quickly due to the reduced wastage of memory, but I would be wary of
> drawing conclusions too quickly based on a specific malloc
> implementation which shows threshold effects, which are compounded by
> our growing block sizes code in general.
>

Yeah, I share this concern - how much of the improvement is real and how
much is due to accidentally not hitting some glibc-specific self-tuning
stuff? Would realistic workloads get the same improvement, or would it
cause regressions? What about non-glibc systems with other mallocs?

I'm not against pushing the generation context improvements, e.g.
something like the patches posted in [1], because those seem reasonable
in general. But I'm somewhat confused about the last patch (adjusting
allocChunkLimit) causing fairly significant regressions. I'll try
investigating this a bit more next week.

regards

[1] https://www.postgresql.org/message-id/flat/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com#a5ecd4e5a9e7d5b07ff25b5684da894f

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
On 1/25/22 13:34, Ronan Dunklau wrote:
>
> ...
>
> I've run the same 1-to-32-columns sort benchmark, using both glibc
> malloc and jemalloc, with the standard aset memory manager or with
> David's generation v2 patch. To use jemalloc, I just use an LD_PRELOAD
> env variable before starting postgres.
>
> I have not tried to measure jemalloc's memory footprint like I did for
> glibc's malloc in the previous benchmark, as I'm not trying to evaluate
> jemalloc as a glibc malloc replacement.
>
> Please find the results attached. As I suspected, most of the gain
> observed with David's patch comes from working around glibc's malloc
> idiosyncrasies, but the good news is that it also helps (to a lesser
> degree) with jemalloc.
>
> I'm not sure how that would behave with growing block sizes though: as
> Tomas' patch and stress-testing benchmarks showed, we can hit some
> particular thresholds (and regressions) in glibc's self-tuning
> behaviour.
>

Interesting results! So just switching to jemalloc can yield much better
performance (in some cases), even beating the generation context (on top
of glibc malloc). Of course, those are just simple (somewhat synthetic)
benchmarks, and there is no info about memory consumption.

Perhaps there are cases where this would be slower than glibc malloc,
and I doubt many people would switch to a non-default malloc
implementation. But I think in general this mostly confirms our
suspicion about the patch (somewhat accidentally) working around glibc's
internal behaviour.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
Hi: On Thu, Jan 20, 2022 at 9:31 AM David Rowley wrote: > > As of now, I still believe we'll need Tomas' patches to allow the > block size to grow up to a maximum size. I think those patches are > likely needed before we think about making tuplesort use generation > contexts. The reason I believe this is that we don't always have good > estimates for the number of tuples we're going to sort. +1 I spent some time studying the case here, and my current thought is: we can discuss/commit the minimum committable changes, which should be beneficial for some cases and harmless for others. Tomas's patch 0002 ensures no more blocks are needed if we switch to GenerationContext (compared with the standard context), and David's patch obviously reduces total memory usage and improves performance. So IMO Tomas's patch 0002 + David's patch is a committable patchset for a first round, and Tomas's 0001 patch would be good to have as well. I double-checked Tomas's 0002 patch and it looks good to me. I then applied David's patch with ALLOCSET_DEFAULT_SIZES, testing the same workload. Here are the results (numbers are tps):

work_mem = '4GB'

| Test Case | master | patched |
|-----------|--------|---------|
| Test 1    |    306 |     406 |
| Test 2    |    225 |     278 |
| Test 3    |    202 |     248 |
| Test 4    |    184 |     218 |
| Test 5    |    281 |     360 |

work_mem = '4MB'

| Test Case | master | patched |
|-----------|--------|---------|
| Test 1    |    124 |     409 |
| Test 2    |    106 |     280 |
| Test 3    |    100 |     249 |
| Test 4    |     97 |     218 |
| Test 5    |    120 |     369 |

I didn't get as much of a performance improvement as David did at the beginning; I think that is because David uses ALLOCSET_DEFAULT_MAXSIZE directly, which needs fewer rounds of memory allocation. AFAICS, Tomas's patch 0002 + David's patch should be ready to commit for round 1. We can then try other opportunities, like using the rows estimation to allocate the initial memory, and GenerationContext improvements like 0003/0004. Would this work? -- Best Regards Andy Fan
Re: Use generation context to speed up tuplesorts
Le jeudi 20 janvier 2022, 08:36:46 CET Ronan Dunklau a écrit : > Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote: > > > I'm also a bit confused about which patch(es) should actually be > > > reviewed > > > in that thread. My understanding is that the original patch might not > > > be > > > relevant anymore but I might be wrong. The CF entry should probably be > > > updated to clarify that, with an annotation/ title change / author > > > update > > > or something. > > > > > > In the meantime I will switch the entry to Waiting on Author. > > > > I think what's going on here is a bit confusing. There's quite a few > > patches on here that aim to do very different things. > > > > The patch I originally proposed was just to make tuplesort use > > generation memory contexts. This helps speed up tuplesort because > > generation contexts allow palloc() to be faster than the standard > > allocator. Additionally, there's no power-of-2 wastage with generation > > contexts like there is with the standard allocator. This helps fit > > more tuples on fewer CPU cache lines. > > > > I believe Andres then mentioned that the fixed block size for > > generation contexts might become a problem. With the standard > > allocator the actual size of the underlying malloc() grows up until a > > maximum size. With generation contexts this is always the same, so we > > could end up with a very large number of blocks which means lots of > > small mallocs which could end up slow. Tomas then posted a series of > > patches to address this. > > > > I then posted another patch that has the planner make a choice on the > > size of the blocks to use for the generation context based on the > > tuple width and number of tuple estimates. My hope there was to > > improve performance further by making a better choice for how large to > > make the blocks in the first place. 
This reduces the need for Tomas' > > patch, but does not eliminate the need for it. > > > > As of now, I still believe we'll need Tomas' patches to allow the > > block size to grow up to a maximum size. I think those patches are > > likely needed before we think about making tuplesort use generation > > contexts. The reason I believe this is that we don't always have good > > estimates for the number of tuples we're going to sort. nodeSort.c is > > fairly easy as it just fetches tuples once and then sorts them. The > > use of tuplesort for nodeIncrementalSort.c is much more complex as > > we'll likely be performing a series of smaller sorts which are much > > harder to get size estimates for. This means there's likely no magic > > block size that we can use for the generation context that's used to > > store the tuples in tuplesort. This is also the case for order by > > aggregates and probably most other usages of tuplesort. > > You left out the discussion I started about glibc's malloc specific > behaviour. > > I tried to demonstrate that a big part of the performance gain you were > seeing were not in fact due to our allocator by itself, but by the way > different block sizes allocations interact with this malloc implementation > and how it handles releasing memory back to the system. I also tried to > show that we can give DBAs more control over how much memory to "waste" as > the price for faster allocations. > > I agree that the generation context memory manager is useful in the > tuplesort context, if only for the fact that we fall back to disk less > quickly due to the reduced wastage of memory, but I would be wary of > drawing conclusions too quickly based on a specific malloc implementation > which shows threshold effects, which are compounded by our growing block > sizes code in general. 
> > I have on my TODO list to run the same kind of benchmarks using jemalloc, to > better isolate the performance gains expected from the generation allocator > itself from the side effect of avoiding glibc's pattern of releasing memory > back to the kernel too quickly. > I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and jemalloc, with the standard aset memory manager or with David's generation v2 patch. To use jemalloc, I just use an LD_PRELOAD env variable before starting postgres. I have not tried to measure jemalloc's memory footprint like I did for glibc's malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a glibc's malloc replacement. Please find the results attached. As I suspected, most of the gain observed with David's patch comes from working around glibc's malloc idiosyncrasies but the good news is that it also helps (to a lesser degree) with jemalloc. I'm not sure how that would behave with growing block sizes though: as Tomas' patch and stress-testing benchmarks showed, we can hit some particular thresholds (and regressions) in glibc's self-tuning behaviour. -- Ronan Dunklau Attachment: bench_jemalloc.ods
Re: Use generation context to speed up tuplesorts
Le jeudi 20 janvier 2022, 02:31:15 CET David Rowley a écrit : > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote: > > I'm also a bit confused about which patch(es) should actually be reviewed > > in that thread. My understanding is that the original patch might not be > > relevant anymore but I might be wrong. The CF entry should probably be > > updated to clarify that, with an annotation/ title change / author update > > or something. > > > > In the meantime I will switch the entry to Waiting on Author. > > I think what's going on here is a bit confusing. There's quite a few > patches on here that aim to do very different things. > > The patch I originally proposed was just to make tuplesort use > generation memory contexts. This helps speed up tuplesort because > generation contexts allow palloc() to be faster than the standard > allocator. Additionally, there's no power-of-2 wastage with generation > contexts like there is with the standard allocator. This helps fit > more tuples on fewer CPU cache lines. > > I believe Andres then mentioned that the fixed block size for > generation contexts might become a problem. With the standard > allocator the actual size of the underlying malloc() grows up until a > maximum size. With generation contexts this is always the same, so we > could end up with a very large number of blocks which means lots of > small mallocs which could end up slow. Tomas then posted a series of > patches to address this. > > I then posted another patch that has the planner make a choice on the > size of the blocks to use for the generation context based on the > tuple width and number of tuple estimates. My hope there was to > improve performance further by making a better choice for how large to > make the blocks in the first place. This reduces the need for Tomas' > patch, but does not eliminate the need for it. > > As of now, I still believe we'll need Tomas' patches to allow the > block size to grow up to a maximum size. 
I think those patches are > likely needed before we think about making tuplesort use generation > contexts. The reason I believe this is that we don't always have good > estimates for the number of tuples we're going to sort. nodeSort.c is > fairly easy as it just fetches tuples once and then sorts them. The > use of tuplesort for nodeIncrementalSort.c is much more complex as > we'll likely be performing a series of smaller sorts which are much > harder to get size estimates for. This means there's likely no magic > block size that we can use for the generation context that's used to > store the tuples in tuplesort. This is also the case for order by > aggregates and probably most other usages of tuplesort. > You left out the discussion I started about glibc's malloc specific behaviour. I tried to demonstrate that a big part of the performance gain you were seeing were not in fact due to our allocator by itself, but by the way different block sizes allocations interact with this malloc implementation and how it handles releasing memory back to the system. I also tried to show that we can give DBAs more control over how much memory to "waste" as the price for faster allocations. I agree that the generation context memory manager is useful in the tuplesort context, if only for the fact that we fall back to disk less quickly due to the reduced wastage of memory, but I would be wary of drawing conclusions too quickly based on a specific malloc implementation which shows threshold effects, which are compounded by our growing block sizes code in general. I have on my TODO list to run the same kind of benchmarks using jemalloc, to better isolate the performance gains expected from the generation allocator itself from the side effect of avoiding glibc's pattern of releasing memory back to the kernel too quickly. Julien, sorry for the confusion of adding that discussion and those patches to the same thread, but I don't really see a way of dissociating those two aspects. 
-- Ronan Dunklau
Re: Use generation context to speed up tuplesorts
On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud wrote: > I'm also a bit confused about which patch(es) should actually be reviewed in > that thread. My understanding is that the original patch might not be > relevant > anymore but I might be wrong. The CF entry should probably be updated to > clarify that, with an annotation/ title change / author update or something. > > In the meantime I will switch the entry to Waiting on Author. I think what's going on here is a bit confusing. There's quite a few patches on here that aim to do very different things. The patch I originally proposed was just to make tuplesort use generation memory contexts. This helps speed up tuplesort because generation contexts allow palloc() to be faster than the standard allocator. Additionally, there's no power-of-2 wastage with generation contexts like there is with the standard allocator. This helps fit more tuples on fewer CPU cache lines. I believe Andres then mentioned that the fixed block size for generation contexts might become a problem. With the standard allocator the actual size of the underlying malloc() grows up until a maximum size. With generation contexts this is always the same, so we could end up with a very large number of blocks which means lots of small mallocs which could end up slow. Tomas then posted a series of patches to address this. I then posted another patch that has the planner make a choice on the size of the blocks to use for the generation context based on the tuple width and number of tuple estimates. My hope there was to improve performance further by making a better choice for how large to make the blocks in the first place. This reduces the need for Tomas' patch, but does not eliminate the need for it. As of now, I still believe we'll need Tomas' patches to allow the block size to grow up to a maximum size. I think those patches are likely needed before we think about making tuplesort use generation contexts. 
The reason I believe this is that we don't always have good estimates for the number of tuples we're going to sort. nodeSort.c is fairly easy as it just fetches tuples once and then sorts them. The use of tuplesort for nodeIncrementalSort.c is much more complex as we'll likely be performing a series of smaller sorts which are much harder to get size estimates for. This means there's likely no magic block size that we can use for the generation context that's used to store the tuples in tuplesort. This is also the case for order by aggregates and probably most other usages of tuplesort. David
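The planner-driven block sizing David describes could look roughly like the sketch below. This is illustrative only: the constant names, the function name, and the "clamp to a power of two between the default min and max block sizes" rule are assumptions for the sketch, not the actual patch code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounds, mirroring PostgreSQL's ALLOCSET_DEFAULT_* sizes. */
#define MIN_BLOCK_SIZE ((size_t) 8 * 1024)        /* 8 kB */
#define MAX_BLOCK_SIZE ((size_t) 8 * 1024 * 1024) /* 8 MB */

/*
 * Pick a generation-context block size from planner estimates: aim to
 * fit the whole estimated input in one block, rounded up to a power of
 * two and clamped to [MIN_BLOCK_SIZE, MAX_BLOCK_SIZE].  When the
 * estimates are wrong (the problem David raises), the chosen size can
 * be far too small or too large, hence the need for growing blocks.
 */
static size_t
choose_sort_block_size(double est_rows, size_t tuple_width)
{
    double  need = est_rows * (double) tuple_width;
    size_t  size = MIN_BLOCK_SIZE;

    while (size < MAX_BLOCK_SIZE && (double) size < need)
        size *= 2;
    return size;
}
```

A bad row estimate simply lands the clamp at the wrong end, which is why a fixed per-sort block size cannot cover incremental sort's many small sorts.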
Re: Use generation context to speed up tuplesorts
Hi, On Fri, Jan 07, 2022 at 12:03:55PM +0100, Ronan Dunklau wrote: > > (Sorry for trying to merge back the discussion on the two sides of the thread) > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I > expressed the idea of being able to tune glibc's malloc behaviour. > > I implemented that (patch 0001) to provide a new hook which is called on > backend startup, and anytime we set work_mem. This hook is # defined > depending > on the malloc implementation: currently a default, no-op implementation is > provided as well as a glibc's malloc implementation. This patch apparently misses something to have the no-op on windows: https://cirrus-ci.com/task/4978247640285184 [03:03:01.676] Build FAILED. [03:03:01.676] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(19,15): warning C4013: 'mallinfo2' undefined; assuming extern returning int [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(18,19): error C2079: 'm' uses undefined struct 'mallinfo2' [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.arena' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.hblkhd' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.uordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.fordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] [03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.keepcost' must have struct/union type [c:\cirrus\malloc_stats.vcxproj] I'm also a bit confused about which patch(es) should actually be reviewed in that thread. 
My understanding is that the original patch might not be relevant anymore, but I might be wrong. The CF entry should probably be updated to clarify that, with an annotation / title change / author update or something. In the meantime I will switch the entry to Waiting on Author.
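The Windows build failure above comes from `mallinfo2()` being a glibc-only API (added in glibc 2.33). A sketch of the kind of guard that would give the intended no-op on other platforms might look like this (the function name is made up for illustration; this is not the patch's actual code):

```c
#include <stdio.h>
#include <assert.h>

/* stdio.h (any glibc header) defines __GLIBC__ / __GLIBC_PREREQ on glibc. */
#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
#if __GLIBC_PREREQ(2, 33)
#define HAVE_MALLINFO2 1
#include <malloc.h>
#endif
#endif

/*
 * Print allocator statistics when glibc 2.33+ provides mallinfo2();
 * otherwise compile to a no-op (e.g. on Windows).  Returns 1 when
 * stats were available, 0 when not.
 */
static int
print_malloc_stats(void)
{
#ifdef HAVE_MALLINFO2
    struct mallinfo2 m = mallinfo2();   /* fields are size_t */

    printf("arena=%zu hblkhd=%zu uordblks=%zu fordblks=%zu keepcost=%zu\n",
           m.arena, m.hblkhd, m.uordblks, m.fordblks, m.keepcost);
    return 1;
#else
    return 0;
#endif
}
```

The pre-2.33 `mallinfo()` exists too, but its `int` fields overflow on large heaps, so gating on 2.33 rather than falling back is the simpler choice.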
Re: Use generation context to speed up tuplesorts
Le vendredi 7 janvier 2022, 13:03:28 CET Tomas Vondra a écrit : > On 1/7/22 12:03, Ronan Dunklau wrote: > > Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : > >> I've attached some benchmark results that I took recently. The > >> spreadsheet contains results from 3 versions. master, master + 0001 - > >> 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit > >> more conservative about the chunk sizes it allocates and also tries to > >> allocate the tuple array according to the number of tuples we expect > >> to be able to sort in a single batch for when the sort is not > >> estimated to fit inside work_mem. > > > > (Sorry for trying to merge back the discussion on the two sides of the > > thread) > > > > In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, > > I expressed the idea of being able to tune glibc's malloc behaviour. > > > > I implemented that (patch 0001) to provide a new hook which is called on > > backend startup, and anytime we set work_mem. This hook is # defined > > depending on the malloc implementation: currently a default, no-op > > implementation is provided as well as a glibc's malloc implementation. > > Not sure I'd call this a hook - that usually means a way to plug-in > custom code through a callback, and this is simply ifdefing a block of > code to pick the right implementation. Which may be a good way to do > that, just let's not call that a hook. > > There's a commented-out MallocTuneHook() call, probably not needed. Ok, I'll clean that up if we decide to proceed with this. > > I wonder if #ifdefing is sufficient solution, because it happens at > compile time, so what if someone overrides the allocator in LD_PRELOAD? > That was a fairly common way to use a custom allocator in an existing > application. But I don't know how many people do that with Postgres (I'm > not aware of anyone doing that) or if we support that (it'd probably > apply to other stuff too, not just malloc). 
So maybe it's OK, and I > can't think of a better way anyway. I couldn't think of a better way either, maybe there is something to be done with trying to dlsym something specific to glibc's malloc implementation ? > > > The glibc's malloc implementation relies on a new GUC, > > glibc_malloc_max_trim_threshold. When set to it's default value of -1, we > > don't tune malloc at all, exactly as in HEAD. If a different value is > > provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_TRESHOLD > > to this value, capped by work_mem / 2 and work_mem respectively. > > > > The net result is that we can then allow to keep more unused memory at the > > top of the heap, and to use mmap less frequently, if the DBA chooses too. > > A possible other use case would be to on the contrary, limit the > > allocated memory in idle backends to a minimum. > > > > The reasoning behind this is that glibc's malloc default way of handling > > those two thresholds is to adapt to the size of the last freed mmaped > > block. > > > > I've run the same "up to 32 columns" benchmark as you did, with this new > > patch applied on top of both HEAD and your v2 patchset incorporating > > planner estimates for the block sizez. Those are called "aset" and > > "generation" in the attached spreadsheet. For each, I've run it with > > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case > > > > I've measured two things: > > - query latency, as reported by pgbench > > - total memory allocated by malloc at backend ext after running each > > query > > > > three times. This represents the "idle" memory consumption, and thus what > > we waste in malloc inside of releasing back to the system. This > > measurement has been performed using the very small module presented in > > patch 0002. Please note that I in no way propose that we include this > > module, it was just a convenient way for me to measure memory footprint. 
> > > > My conclusion is that the impressive gains you see from using the > > generation context with bigger blocks mostly comes from the fact that we > > allocate bigger blocks, and that this moves the mmap thresholds > > accordingly. I wonder how much of a difference it would make on other > > malloc implementation: I'm afraid the optimisation presented here would > > in fact be specific to glibc's malloc, since we have almost the same > > gains with both allocators when tuning malloc to keep more memory. I > > still think both approaches are useful, and would be necessary. > Interesting measurements. It's intriguing that for generation contexts, > the default "-1" often outperforms "1MB" (but not the other options), > while for aset it's pretty much "the higher value the better". For generation context with "big block sizes" this result is expected, as the malloc dynamic tuning will adapt to the big block size. This can also be seen on the "idle memory" measurement: the memory consumption is identical to the 64MB value
Re: Use generation context to speed up tuplesorts
On 1/7/22 12:03, Ronan Dunklau wrote: Le vendredi 31 décembre 2021, 22:26:37 CET David Rowley a écrit : I've attached some benchmark results that I took recently. The spreadsheet contains results from 3 versions. master, master + 0001 - 0002, then master + 0001 - 0003. The 0003 patch makes the code a bit more conservative about the chunk sizes it allocates and also tries to allocate the tuple array according to the number of tuples we expect to be able to sort in a single batch for when the sort is not estimated to fit inside work_mem. (Sorry for trying to merge back the discussion on the two sides of the thread) In https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I expressed the idea of being able to tune glibc's malloc behaviour. I implemented that (patch 0001) to provide a new hook which is called on backend startup, and anytime we set work_mem. This hook is #defined depending on the malloc implementation: currently a default, no-op implementation is provided as well as a glibc malloc implementation. Not sure I'd call this a hook - that usually means a way to plug in custom code through a callback, and this is simply ifdefing a block of code to pick the right implementation. Which may be a good way to do that, just let's not call that a hook. There's a commented-out MallocTuneHook() call, probably not needed. I wonder if #ifdefing is a sufficient solution, because it happens at compile time, so what if someone overrides the allocator in LD_PRELOAD? That was a fairly common way to use a custom allocator in an existing application. But I don't know how many people do that with Postgres (I'm not aware of anyone doing that) or if we support that (it'd probably apply to other stuff too, not just malloc).
If a different value is provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD to this value, capped by work_mem / 2 and work_mem respectively. The net result is that we can then keep more unused memory at the top of the heap, and use mmap less frequently, if the DBA chooses to. A possible other use case would be to, on the contrary, limit the allocated memory in idle backends to a minimum. The reasoning behind this is that glibc malloc's default way of handling those two thresholds is to adapt to the size of the last freed mmapped block. I've run the same "up to 32 columns" benchmark as you did, with this new patch applied on top of both HEAD and your v2 patchset incorporating planner estimates for the block sizes. Those are called "aset" and "generation" in the attached spreadsheet. For each, I've run it with glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case I've measured two things: - query latency, as reported by pgbench - total memory allocated by malloc at backend exit after running each query three times. This represents the "idle" memory consumption, and thus what we waste in malloc instead of releasing it back to the system. This measurement has been performed using the very small module presented in patch 0002. Please note that I in no way propose that we include this module, it was just a convenient way for me to measure memory footprint. My conclusion is that the impressive gains you see from using the generation context with bigger blocks mostly come from the fact that we allocate bigger blocks, and that this moves the mmap thresholds accordingly. I wonder how much of a difference it would make on other malloc implementations: I'm afraid the optimisation presented here would in fact be specific to glibc's malloc, since we have almost the same gains with both allocators when tuning malloc to keep more memory. I still think both approaches are useful, and would be necessary. Interesting measurements.
It's intriguing that for generation contexts, the default "-1" often outperforms "1MB" (but not the other options), while for aset it's pretty much "the higher value the better". Since this affects all memory allocations, I need to come up with other meaningful scenarios to benchmark. OK. Are you thinking about a different microbenchmark, or something closer to a real workload? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
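The tuning rule described in the quoted patch summary - M_TRIM_THRESHOLD set to the GUC value and M_MMAP_THRESHOLD to half of it, capped by work_mem and work_mem / 2 - can be sketched with glibc's `mallopt()` as follows. The function name is invented for this sketch; only the two `mallopt()` calls and the capping rule come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>   /* a glibc header, so __GLIBC__ is visible below */

#ifdef __GLIBC__
#include <malloc.h>
#endif

/*
 * Apply the glibc_malloc_max_trim_threshold policy described above:
 * -1 leaves glibc's self-tuning defaults alone; otherwise set
 * M_TRIM_THRESHOLD = value (capped by work_mem) and
 * M_MMAP_THRESHOLD = value / 2 (capped by work_mem / 2).
 * On non-glibc platforms this is a no-op.
 */
static bool
tune_glibc_malloc(long max_trim_threshold, long work_mem_bytes)
{
    if (max_trim_threshold < 0)
        return true;            /* -1: don't tune at all */

#ifdef __GLIBC__
    {
        long trim = max_trim_threshold;
        long mmap_thr = max_trim_threshold / 2;

        if (trim > work_mem_bytes)
            trim = work_mem_bytes;
        if (mmap_thr > work_mem_bytes / 2)
            mmap_thr = work_mem_bytes / 2;

        /* mallopt() returns nonzero on success */
        return mallopt(M_TRIM_THRESHOLD, (int) trim) != 0 &&
               mallopt(M_MMAP_THRESHOLD, (int) mmap_thr) != 0;
    }
#else
    return true;                /* other allocators: nothing to tune */
#endif
}
```

One side effect worth noting: any explicit `mallopt()` call disables glibc's dynamic threshold adaptation, which is exactly what makes the behaviour predictable for the DBA.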
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit : > I wasn't really suggesting to investigate those other allocators in this > patch - it seems like a task requiring a pretty significant amount of > work/time. My point was that we should make it reasonably easy to add > tweaks for those other environments, if someone is interested enough to > do the legwork. > > >> 2) In fact, I wonder if different glibc versions behave differently? > >> Hopefully it's not changing that much, though. Ditto kernel versions, > >> but the mmap/sbrk interface is likely more stable. We can test this. > > > > That could be tested, yes. As a matter of fact, a commit removing the > > upper > > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to > > glibc, > > which means we can service much bigger allocation without mmap. > > Yeah, I noticed that commit too. Most systems stick to one glibc > version, so it'll take time to reach most systems. Let's continue with > just one glibc version and then maybe test other versions. Yes, I also need to figure out how to detect we're using glibc as I'm not very familiar with configure. > > >> 3) If we bump the thresholds, won't that work against reusing the > >> memory? I mean, if we free a whole block (from any allocator we have), > >> glibc might return it to kernel, depending on mmap threshold value. It's > >> not guaranteed, but increasing the malloc thresholds will make that even > >> less likely. So we might just as well increase the minimum block size, > >> with about the same effect, no? > > > > It is my understanding that malloc will try to compact memory by moving it > > around. So the memory should be actually be released to the kernel at some > > point. In the meantime, malloc can reuse it for our next invocation (which > > can be in a different memory context on our side). 
> > > > If we increase the minimum block size, this is memory we will actually > > > > reserve, and it will not protect us against the ramping-up behaviour: > > - the first allocation of a big block may be over mmap_threshold, and > > serviced> > > by an expensive mmap > > > > - when it's free, the threshold is doubled > > - next invocation is serviced by an sbrk call > > - freeing it will be above the trim threshold, and it will be returned. > > > > After several "big" allocations, the thresholds will raise to their > > maximum > > values (well, it used to, I need to check what happens with that latest > > patch of glibc...) > > > > This will typically happen several times as malloc doubles the threshold > > each time. This is probably the reason quadrupling the block sizes was > > more effective. > > Hmmm, OK. Can we we benchmark the case with large initial block size, at > least for comparison? The benchmark I called "fixed" was with a fixed block size of ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the same performance profile as the growing blocks + malloc tuning. But if I understand correctly, you implemented the growing blocks logic after concerns about wasting memory with a constant large block size. If we tune malloc, that memory would not be wasted if we don't alloc it, just not released as eagerly when it's allocated. Or do you want a benchmark with an even bigger initial block size ? With the growing blocks patch with a large initial size ? I can run either, I just want to understand what is interesting to you. One thing that would be interesting would be to trace the total amount of memory allocated in the different cases. This is something I will need to do anyway when I propose that patch; Best regards, -- Ronan Dunklau
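The ramp-up pattern described in the quoted text can be captured by a toy model: freeing an mmap-serviced block raises the threshold, so repeated large alloc/free cycles push it up until large allocations are serviced by sbrk instead. This is a deliberately simplified illustration - the quoted description says "doubled", so the model doubles; real glibc adapts the thresholds to the freed chunk's size, and this is not glibc's actual algorithm.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of glibc's dynamic mmap-threshold adaptation. */
struct malloc_model {
    long mmap_threshold;    /* current dynamic threshold */
    long max_threshold;     /* cap on the ramp-up */
};

/*
 * Model one alloc+free cycle.  Returns true if the allocation would be
 * serviced by mmap; per the description above, freeing such a block
 * doubles the threshold (up to the cap), so the next identical request
 * may fall below it and be serviced by sbrk instead.
 */
static bool
model_alloc_free(struct malloc_model *m, long size)
{
    bool via_mmap = size >= m->mmap_threshold;

    if (via_mmap && m->mmap_threshold * 2 <= m->max_threshold)
        m->mmap_threshold *= 2;
    return via_mmap;
}

/* Three 1 MB alloc/free cycles from the 128 kB default; report the result. */
static long
ramp_demo(void)
{
    struct malloc_model m = {128 * 1024, 32L * 1024 * 1024};

    for (int i = 0; i < 3; i++)
        model_alloc_free(&m, 1024 * 1024);
    return m.mmap_threshold;
}
```

In the demo, each free doubles the threshold (128 kB → 256 kB → 512 kB → 1 MB), which matches the point above: growing our block sizes repeatedly trips this adaptation, so quadrupling blocks crosses the moving bar in fewer steps.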
Re: Use generation context to speed up tuplesorts
On 12/17/21 09:08, Ronan Dunklau wrote: > Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : >> On 12/16/21 17:03, Ronan Dunklau wrote: >>> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : I will follow up with a benchmark of the test sorting a table with a width varying from 1 to 32 columns. >>> >>> So please find attached another benchmark for that case. >>> >>> The 3 different patchsets tested are: >>> - master >>> - fixed (David's original patch) >>> - adjust (Thomas growing blocks patch) >> >> Presumably Thomas is me, right? > > I'm really sorry for this typo... Please accept my apologies. > Nah, no apology needed. I was just wondering if I missed some patch from Thomas Munro ... >> >>> So it looks like tuning malloc for this would be very benificial for any >>> kind of allocation, and by doing so we reduce the problems seen with the >>> growing blocks patch to next to nothing, while keeping the ability to not >>> allocate too much memory from the get go. >> >> Thanks for running those tests and investigating the glibc behavior! I >> find those results very interesting. My conclusions from this is that >> the interaction interaction between "our" allocator and the allocator in >> malloc (e.g. glibc) can be problematic. Which makes benchmarking and >> optimization somewhat tricky because code changes may trigger behavior >> change in glibc (or whatever allocator backs malloc). >> >> I think it's worth exploring if we can tune this in a reasonable way, >> but I have a couple concerns related to that: >> >> 1) I wonder how glibc-specific this is - I'd bet it applies to other >> allocators (either on another OS or just different allocator on Linux) >> too. Tweaking glibc parameters won't affect those systems, of course, >> but maybe we should allow tweaking those systems too ... > > I agree, finding their specific problems and see if we can workaround it > would > be interesting. 
I suppose glibc's malloc is the most commonly used allocator > in production, as it is the default for most Linux distributions. > I wasn't really suggesting to investigate those other allocators in this patch - it seems like a task requiring a pretty significant amount of work/time. My point was that we should make it reasonably easy to add tweaks for those other environments, if someone is interested enough to do the legwork. >> >> 2) In fact, I wonder if different glibc versions behave differently? >> Hopefully it's not changing that much, though. Ditto kernel versions, >> but the mmap/sbrk interface is likely more stable. We can test this. > > That could be tested, yes. As a matter of fact, a commit removing the upper > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, > which means we can service much bigger allocation without mmap. > Yeah, I noticed that commit too. Most systems stick to one glibc version, so it'll take time to reach most systems. Let's continue with just one glibc version and then maybe test other versions. > >> >> 3) If we bump the thresholds, won't that work against reusing the >> memory? I mean, if we free a whole block (from any allocator we have), >> glibc might return it to kernel, depending on mmap threshold value. It's >> not guaranteed, but increasing the malloc thresholds will make that even >> less likely. So we might just as well increase the minimum block size, >> with about the same effect, no? > > It is my understanding that malloc will try to compact memory by moving it > around. So the memory should be actually be released to the kernel at some > point. In the meantime, malloc can reuse it for our next invocation (which > can > be in a different memory context on our side). 
> If we increase the minimum block size, this is memory we will actually
> reserve, and it will not protect us against the ramping-up behaviour:
> - the first allocation of a big block may be over mmap_threshold, and
> serviced by an expensive mmap
> - when it's freed, the threshold is doubled
> - the next invocation is serviced by an sbrk call
> - freeing it will be above the trim threshold, and it will be returned.
>
> After several "big" allocations, the thresholds will rise to their maximum
> values (well, they used to; I need to check what happens with that latest
> patch to glibc...)
>
> This will typically happen several times as malloc doubles the threshold
> each time. This is probably the reason quadrupling the block sizes was
> more effective.
>

Hmmm, OK. Can we benchmark the case with a large initial block size, at
least for comparison?

>>> I would like to try to implement some dynamic glibc malloc tuning, if that
>>> is something we don't reject on principle from the get go.
>>
>> +1 to that
>
> Ok, I'll work on a patch for this and submit a new thread.
>

Cool!

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit :
> It is my understanding that malloc will try to compact memory by moving it
> around. So the memory should actually be released to the kernel at some
> point. In the meantime, malloc can reuse it for our next invocation (which
> can be in a different memory context on our side).

I've been told off-list this comment wasn't clear: I meant that malloc
compacts *free* memory, consolidating it into larger blocks that will
eventually reach the threshold and be released.

--
Ronan Dunklau
Re: Use generation context to speed up tuplesorts
Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit : > On 12/16/21 17:03, Ronan Dunklau wrote: > > Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit : > >> I will follow up with a benchmark of the test sorting a table with a > >> width > >> varying from 1 to 32 columns. > > > > So please find attached another benchmark for that case. > > > > The 3 different patchsets tested are: > > - master > > - fixed (David's original patch) > > - adjust (Thomas growing blocks patch) > > Presumably Thomas is me, right? I'm really sorry for this typo... Please accept my apologies. > > > So it looks like tuning malloc for this would be very benificial for any > > kind of allocation, and by doing so we reduce the problems seen with the > > growing blocks patch to next to nothing, while keeping the ability to not > > allocate too much memory from the get go. > > Thanks for running those tests and investigating the glibc behavior! I > find those results very interesting. My conclusions from this is that > the interaction interaction between "our" allocator and the allocator in > malloc (e.g. glibc) can be problematic. Which makes benchmarking and > optimization somewhat tricky because code changes may trigger behavior > change in glibc (or whatever allocator backs malloc). > > I think it's worth exploring if we can tune this in a reasonable way, > but I have a couple concerns related to that: > > 1) I wonder how glibc-specific this is - I'd bet it applies to other > allocators (either on another OS or just different allocator on Linux) > too. Tweaking glibc parameters won't affect those systems, of course, > but maybe we should allow tweaking those systems too ... I agree, finding their specific problems and see if we can workaround it would be interesting. I suppose glibc's malloc is the most commonly used allocator in production, as it is the default for most Linux distributions. > > 2) In fact, I wonder if different glibc versions behave differently? 
> Hopefully it's not changing that much, though. Ditto kernel versions,
> but the mmap/sbrk interface is likely more stable. We can test this.

That could be tested, yes. As a matter of fact, a commit removing the upper
limit for MALLOC_MMAP_THRESHOLD was committed to glibc just yesterday,
which means we can service much bigger allocations without mmap.

>
> 3) If we bump the thresholds, won't that work against reusing the
> memory? I mean, if we free a whole block (from any allocator we have),
> glibc might return it to kernel, depending on mmap threshold value. It's
> not guaranteed, but increasing the malloc thresholds will make that even
> less likely. So we might just as well increase the minimum block size,
> with about the same effect, no?

It is my understanding that malloc will try to compact memory by moving it
around. So the memory should actually be released to the kernel at some
point. In the meantime, malloc can reuse it for our next invocation (which
can be in a different memory context on our side).

If we increase the minimum block size, this is memory we will actually
reserve, and it will not protect us against the ramping-up behaviour:
- the first allocation of a big block may be over mmap_threshold, and
serviced by an expensive mmap
- when it's freed, the threshold is doubled
- the next invocation is serviced by an sbrk call
- freeing it will be above the trim threshold, and it will be returned.

After several "big" allocations, the thresholds will rise to their maximum
values (well, they used to; I need to check what happens with that latest
patch to glibc...)

This will typically happen several times as malloc doubles the threshold
each time. This is probably the reason quadrupling the block sizes was
more effective.

>
> > I would like to try to implement some dynamic glibc malloc tuning, if that
> > is something we don't reject on principle from the get go.
>
> +1 to that

Ok, I'll work on a patch for this and submit a new thread.

--
Ronan Dunklau
Re: Use generation context to speed up tuplesorts
On 12/16/21 17:03, Ronan Dunklau wrote:
> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
>> I will follow up with a benchmark of the test sorting a table with a
>> width varying from 1 to 32 columns.
>
> So please find attached another benchmark for that case.
>
> The 3 different patchsets tested are:
> - master
> - fixed (David's original patch)
> - adjust (Thomas growing blocks patch)

Presumably Thomas is me, right?

> So it looks like tuning malloc for this would be very beneficial for any
> kind of allocation, and by doing so we reduce the problems seen with the
> growing blocks patch to next to nothing, while keeping the ability to
> not allocate too much memory from the get go.

Thanks for running those tests and investigating the glibc behavior! I
find those results very interesting. My conclusion from this is that the
interaction between "our" allocator and the allocator in malloc (e.g.
glibc) can be problematic. Which makes benchmarking and optimization
somewhat tricky, because code changes may trigger behavior changes in
glibc (or whatever allocator backs malloc).

I think it's worth exploring if we can tune this in a reasonable way, but
I have a couple of concerns related to that:

1) I wonder how glibc-specific this is - I'd bet it applies to other
allocators (either on another OS or just a different allocator on Linux)
too. Tweaking glibc parameters won't affect those systems, of course, but
maybe we should allow tweaking those systems too ...

2) In fact, I wonder if different glibc versions behave differently?
Hopefully it's not changing that much, though. Ditto kernel versions, but
the mmap/sbrk interface is likely more stable. We can test this.

3) If we bump the thresholds, won't that work against reusing the memory?
I mean, if we free a whole block (from any allocator we have), glibc
might return it to the kernel, depending on the mmap threshold value.
It's not guaranteed, but increasing the malloc thresholds will make that
even less likely. So we might just as well increase the minimum block
size, with about the same effect, no?

> I would like to try to implement some dynamic glibc malloc tuning, if
> that is something we don't reject on principle from the get go.

+1 to that

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
> I will follow up with a benchmark of the test sorting a table with a width
> varying from 1 to 32 columns.

So please find attached another benchmark for that case.

The 3 different patchsets tested are:
- master
- fixed (David's original patch)
- adjust (Thomas growing blocks patch)

So it looks like tuning malloc for this would be very beneficial for any
kind of allocation, and by doing so we reduce the problems seen with the
growing blocks patch to next to nothing, while keeping the ability to not
allocate too much memory from the get go.

I would like to try to implement some dynamic glibc malloc tuning, if that
is something we don't reject on principle from the get go.

--
Ronan Dunklau

bench_results.ods
Description: application/vnd.oasis.opendocument.spreadsheet
Re: Use generation context to speed up tuplesorts
Le mercredi 8 décembre 2021, 22:07:12 CET Tomas Vondra a écrit : > On 12/8/21 16:51, Ronan Dunklau wrote: > > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : > >> And now comes the funny part - if I run it in the same backend as the > >> > >> "full" benchmark, I get roughly the same results: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ++---+--+- > >> > >>32768 |512 | 806256640 |37159 | 76669 > >> > >> but if I reconnect and run it in the new backend, I get this: > >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms > >> > >> ++---+--+- > >> > >>32768 |512 | 806158336 | 233909 | 100785 > >> > >> (1 row) > >> > >> It does not matter if I wait a bit before running the query, if I run it > >> repeatedly, etc. The machine is not doing anything else, the CPU is set > >> to use "performance" governor, etc. > > > > I've reproduced the behaviour you mention. > > I also noticed asm_exc_page_fault showing up in the perf report in that > > case. > > > > Running an strace on it shows that in one case, we have a lot of brk > > calls, > > while when we run in the same process as the previous tests, we don't. > > > > My suspicion is that the previous workload makes glibc malloc change it's > > trim_threshold and possibly other dynamic options, which leads to > > constantly moving the brk pointer in one case and not the other. > > > > Running your fifo test with absurd malloc options shows that indeed that > > might be the case (I needed to change several, because changing one > > disable the dynamic adjustment for every single one of them, and malloc > > would fall back to using mmap and freeing it on each iteration): > > > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > > > I get the following results for your self contained test. 
I ran the query > > twice, in each case, seeing the importance of the first allocation and the > > subsequent ones: > > > > With default malloc options: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ++---+--+- > > > > 32768 |512 | 795836416 | 300156 | 207557 > > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ++---+--+- > > > > 32768 |512 | 795836416 | 211942 | 77207 > > > > With the oversized values above: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ++---+--+- > > > > 32768 |512 | 795836416 | 219000 | 36223 > > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > > > > ++---+--+- > > > > 32768 |512 | 795836416 |75761 | 78082 > > > > (1 row) > > > > I can't tell how representative your benchmark extension would be of real > > life allocation / free patterns, but there is probably something we can > > improve here. > > Thanks for looking at this. I think those allocation / free patterns are > fairly extreme, and there probably are no workloads doing exactly this. > The idea is the actual workloads are likely some combination of these > extreme cases. > > > I'll try to see if I can understand more precisely what is happening. > > Thanks, that'd be helpful. Maybe we can learn something about tuning > malloc parameters to get significantly better performance. > Apologies for the long email, maybe what I will state here is obvious for others but I learnt a lot, so... I think I understood what the problem was in your generation tests: depending on the sequence of allocations we will raise a different maximum for mmap_threshold and trim_threshold. When an mmap block is freed, malloc will raise it's threshold as it consider memory freed to be better served by regular moving of the sbrk pointer, instead of using mmap to map memory. This threshold is upped by multiplying it by two anytime we free a mmap'ed block. 
At the same time, the trim_threshold is raised to double the mmap_threshold, considering that this amount of memory should not be released to the OS because we have a good chance of reusing it. This can be demonstrated using the attached systemtap script, along with a patch adding new traces to generation_context for this purpose: When running your query: select block_size, chunk_size, x.* from (values (512)) AS a(chunk_size), (values (32768)) AS b(block_size), lateral generation_bench_fifo(100, block_size, chunk_size,
Re: Use generation context to speed up tuplesorts
On 12/8/21 16:51, Ronan Dunklau wrote: > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit : >> And now comes the funny part - if I run it in the same backend as the >> "full" benchmark, I get roughly the same results: >> >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms >> ++---+--+- >>32768 |512 | 806256640 |37159 | 76669 >> >> but if I reconnect and run it in the new backend, I get this: >> >> block_size | chunk_size | mem_allocated | alloc_ms | free_ms >> ++---+--+- >>32768 |512 | 806158336 | 233909 | 100785 >> (1 row) >> >> It does not matter if I wait a bit before running the query, if I run it >> repeatedly, etc. The machine is not doing anything else, the CPU is set >> to use "performance" governor, etc. > > I've reproduced the behaviour you mention. > I also noticed asm_exc_page_fault showing up in the perf report in that case. > > Running an strace on it shows that in one case, we have a lot of brk calls, > while when we run in the same process as the previous tests, we don't. > > My suspicion is that the previous workload makes glibc malloc change it's > trim_threshold and possibly other dynamic options, which leads to constantly > moving the brk pointer in one case and not the other. > > Running your fifo test with absurd malloc options shows that indeed that > might > be the case (I needed to change several, because changing one disable the > dynamic adjustment for every single one of them, and malloc would fall back > to > using mmap and freeing it on each iteration): > > mallopt(M_TOP_PAD, 1024 * 1024 * 1024); > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024); > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long)); > > I get the following results for your self contained test. 
I ran the query > twice, in each case, seeing the importance of the first allocation and the > subsequent ones: > > With default malloc options: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ++---+--+- > 32768 |512 | 795836416 | 300156 | 207557 > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ++---+--+- > 32768 |512 | 795836416 | 211942 | 77207 > > > With the oversized values above: > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ++---+--+- > 32768 |512 | 795836416 | 219000 | 36223 > > > block_size | chunk_size | mem_allocated | alloc_ms | free_ms > ++---+--+- > 32768 |512 | 795836416 |75761 | 78082 > (1 row) > > I can't tell how representative your benchmark extension would be of real > life > allocation / free patterns, but there is probably something we can improve > here. > Thanks for looking at this. I think those allocation / free patterns are fairly extreme, and there probably are no workloads doing exactly this. The idea is the actual workloads are likely some combination of these extreme cases. > I'll try to see if I can understand more precisely what is happening. > Thanks, that'd be helpful. Maybe we can learn something about tuning malloc parameters to get significantly better performance. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806256640 |    37159 |   76669
>
> but if I reconnect and run it in the new backend, I get this:
>
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806158336 |   233909 |  100785
> (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention. I also noticed
asm_exc_page_fault showing up in the perf report in that case.

Running strace on it shows that in one case we have a lot of brk calls,
while when we run in the same process as the previous tests, we don't.

My suspicion is that the previous workload makes glibc malloc change its
trim_threshold and possibly other dynamic options, which leads to
constantly moving the brk pointer in one case and not the other.

Running your fifo test with absurd malloc options shows that indeed that
might be the case (I needed to change several, because changing one
disables the dynamic adjustment for every single one of them, and malloc
would fall back to using mmap and freeing it on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));

I get the following results for your self-contained test.
I ran the query twice, in each case seeing the importance of the first
allocation and the subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207

With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension would be of
real-life allocation/free patterns, but there is probably something we can
improve here.

I'll try to see if I can understand more precisely what is happening.

--
Ronan Dunklau
Re: Use generation context to speed up tuplesorts
On Mon, 9 Aug 2021 at 00:38, Tomas Vondra wrote: > I'm not sure quadrupling the size is a good idea, though, because it > increases the amount of memory we might be wasting. With the doubling, > the amount of wasted /unused memory is limited to ~50%, because the next > block is (roughly) equal to sum of already allocated blocks, so > allocating just 1B on it leaves us with 50%. But quadrupling the size > means we'll end up with ~75% free space. Of course, this is capped by > the maximum block size etc. but still ... Yeah, not sure what is best. It does however seem likely that the majority of the performance improvement that I saw is due to either malloc()/free() calls or just having fewer blocks in the context. Maybe it's worth getting the planner on board with deciding how to do the allocations. It feels a bit overcautious to go allocating blocks in each power of two starting at 8192 bytes when doing a 1GB sort. Maybe we should be looking towards doing something more like making the init allocation size more like pg_prevpower2_64(Min(work_mem * 1024L, sort_tuples * tuple_width)), or maybe half or quarter that. It would certainly not be the only executor node to allocate memory based on what the planner thought. Just look at ExecHashTableCreate(). David
Re: Use generation context to speed up tuplesorts
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra wrote: > I did run the same set of benchmarks as for Slab, measuring some usual > allocation patterns. The results for i5-2500k machine are attached (for > the xeon it's almost exactly the same behavior). While running those > tests I realized the last patch is wrong and sets allocChunkLimit=1, > which is bogus and causes significant regression. So here's an updated > version of the patch series too. I know you're not done with these yet, but FWIW, I was getting an Assert failure with these patches on: Assert(total_allocated == context->mem_allocated); It seems to be because you've forgotten to ignore keeper blocks when adjusting context->mem_allocated in GenerationReset() David
Re: Use generation context to speed up tuplesorts
On 8/8/21 12:11 PM, David Rowley wrote:
> On Sat, 7 Aug 2021 at 12:10, Tomas Vondra wrote:
>>> All of the tests show that the patches to improve the allocation
>>> efficiency of generation.c don't help to improve the results of the
>>> test cases. I wondered if it's maybe worth trying to see what happens
>>> if instead of doubling the allocations each time, quadruple them
>>> instead. I didn't try this.
>>
>> I doubt quadrupling the allocations will help very much, but I suspect
>> the problem might be in the 0004 patch - at least that's what shows
>> regression in my results. Could you try with just 0001-0003 applied?
>
> I tried the quadrupling of the buffer instead of doubling it each time
> and got the attached. Column E, or green in the graphs, shows the
> results of that. It's now much closer to the original patch which just
> made the block size 8MB.

Interesting, I wouldn't have expected that to make such a difference.

I'm not sure quadrupling the size is a good idea, though, because it
increases the amount of memory we might be wasting. With the doubling,
the amount of wasted/unused memory is limited to ~50%, because the next
block is (roughly) equal to the sum of already allocated blocks, so
allocating just 1B on it leaves us with 50%. But quadrupling the size
means we'll end up with ~75% free space. Of course, this is capped by the
maximum block size etc., but still ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
On 8/8/21 9:02 AM, David Rowley wrote:
> On Sat, 7 Aug 2021 at 12:10, Tomas Vondra wrote:
>> On 8/6/21 3:07 PM, David Rowley wrote:
>>> All of the tests show that the patches to improve the allocation
>>> efficiency of generation.c don't help to improve the results of the
>>> test cases. I wondered if it's maybe worth trying to see what happens
>>> if instead of doubling the allocations each time, quadruple them
>>> instead. I didn't try this.
>>
>> I doubt quadrupling the allocations will help very much, but I suspect
>> the problem might be in the 0004 patch - at least that's what shows
>> regression in my results. Could you try with just 0001-0003 applied?
>
> But 0004 only changes the logic which controls the threshold of when we
> allocate an oversized chunk. It looks like the threshold is 512KB with
> the 0004 patch. My test is only doing a maximum allocation of 296 bytes
> so will never allocate an oversized chunk. Can you explain why you think
> 0004 would cause performance regressions?

It's based solely on the results of my benchmarks, where this patch seems
to cause a performance regression. I agree it's a bit bizarre, considering
what the patch does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra wrote:
>> All of the tests show that the patches to improve the allocation
>> efficiency of generation.c don't help to improve the results of the
>> test cases. I wondered if it's maybe worth trying to see what happens
>> if instead of doubling the allocations each time, quadruple them
>> instead. I didn't try this.
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

I tried the quadrupling of the buffer instead of doubling it each time and
got the attached. Column E, or green in the graphs, shows the results of
that. It's now much closer to the original patch which just made the block
size 8MB.

David

generation context tuplesort_v2.ods
Description: application/vnd.oasis.opendocument.spreadsheet
Re: Use generation context to speed up tuplesorts
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra wrote:
> On 8/6/21 3:07 PM, David Rowley wrote:
>> All of the tests show that the patches to improve the allocation
>> efficiency of generation.c don't help to improve the results of the
>> test cases. I wondered if it's maybe worth trying to see what happens
>> if instead of doubling the allocations each time, quadruple them
>> instead. I didn't try this.
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

But 0004 only changes the logic which controls the threshold of when we
allocate an oversized chunk. It looks like the threshold is 512KB with the
0004 patch. My test is only doing a maximum allocation of 296 bytes so
will never allocate an oversized chunk. Can you explain why you think 0004
would cause performance regressions?

David
Re: Use generation context to speed up tuplesorts
On 8/6/21 3:07 PM, David Rowley wrote:
> On Wed, 4 Aug 2021 at 02:10, Tomas Vondra wrote:
>> A review would be nice, although it can wait - It'd be interesting to
>> know if those patches help with the workload(s) you've been looking at.
>
> I tried out the v2 set of patches using the attached scripts. The
> attached spreadsheet includes the original tests and compares master
> with the patch which uses the generation context vs that patch plus
> your v2 patch.
>
> I've also included 4 additional tests, each of which starts with a
> 1-column table and then adds another 32 columns, testing the
> performance after adding each additional column. I did this because I
> wanted to see if the performance was more similar to master when the
> allocations had less power-of-2 wastage from allocset. If, for example,
> you look at row 123 of the spreadsheet, you can see that both patched
> and unpatched the allocations were 272 bytes each, yet there was still
> a 50% performance improvement with just the generation context patch
> when compared to master.
>
> Looking at the spreadsheet, you'll also notice that in the 2-column
> test of each of the 4 new tests the number of bytes used for each
> allocation is larger with the generation context: 56 vs 48. This is due
> to the GenerationChunk struct size being larger than the Allocset's
> version by 8 bytes. This is because it also holds the GenerationBlock.
> So with the patch there are some cases where we'll use slightly more
> memory.
>
> Additional tests:
> 1. Sort 1 tuples on a column with values 0-99 in memory.
> 2. As #1 but with 1 million tuples.
> 3. As #1 but with a large OFFSET to remove the overhead of sending to
> the client.
> 4. As #2 but with a large OFFSET.
>
> Test #3 above is the most similar one to the original tests and shows
> similar gains. When the sort becomes larger (1 million tuple test), the
> gains reduce. This indicates the gains are coming from improved CPU
> cache efficiency from the removal of the power-of-2 wastage in memory
> allocations.
>
> All of the tests show that the patches to improve the allocation
> efficiency of generation.c don't help to improve the results of the
> test cases. I wondered if it's maybe worth trying to see what happens
> if instead of doubling the allocations each time, quadruple them
> instead. I didn't try this.

Thanks for the scripts and the spreadsheet with results.

I doubt quadrupling the allocations will help very much, but I suspect the
problem might be in the 0004 patch - at least that's what shows regression
in my results. Could you try with just 0001-0003 applied?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra wrote:
> A review would be nice, although it can wait - It'd be interesting to
> know if those patches help with the workload(s) you've been looking at.

I tried out the v2 set of patches using the attached scripts. The attached
spreadsheet includes the original tests and compares master with the patch
which uses the generation context vs that patch plus your v2 patch.

I've also included 4 additional tests, each of which starts with a
1-column table and then adds another 32 columns, testing the performance
after adding each additional column. I did this because I wanted to see if
the performance was more similar to master when the allocations had less
power-of-2 wastage from allocset. If, for example, you look at row 123 of
the spreadsheet, you can see that both patched and unpatched the
allocations were 272 bytes each, yet there was still a 50% performance
improvement with just the generation context patch when compared to
master.

Looking at the spreadsheet, you'll also notice that in the 2-column test
of each of the 4 new tests the number of bytes used for each allocation is
larger with the generation context: 56 vs 48. This is due to the
GenerationChunk struct size being larger than the Allocset's version by 8
bytes. This is because it also holds the GenerationBlock. So with the
patch there are some cases where we'll use slightly more memory.

Additional tests:
1. Sort 1 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the
client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (1 million tuple test), the
gains reduce. This indicates the gains are coming from improved CPU cache
efficiency from the removal of the power-of-2 wastage in memory
allocations.
All of the tests show that the patches to improve the allocation efficiency of generation.c don't help to improve the results of the test cases. I wondered if it's maybe worth trying to see what happens if instead of doubling the allocations each time, quadruple them instead. I didn't try this. David #!/bin/bash sec=5 dbname=postgres records=100 mod=100 psql -c "drop table if exists t" $dbname psql -c "create table t (a bigint not null)" $dbname psql -c "insert into t select x % $mod from generate_series(1,$records) x" $dbname psql -c "vacuum freeze t" $dbname for i in {1..32} do echo $i psql -c "explain analyze select * from t order by a offset 10" $dbname | grep -E "Memory|Disk" echo "select * from t order by a offset 10" > bench.sql for loops in {1..3} do pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps done psql -c "alter table t add column c$i bigint" $dbname psql -c "update t set c$i = a" $dbname psql -c "vacuum full t" $dbname psql -c "vacuum freeze t" $dbname done generation context tuplesort.ods Description: application/vnd.oasis.opendocument.spreadsheet #!/bin/bash sec=5 dbname=postgres records=1 mod=100 psql -c "drop table if exists t" $dbname psql -c "create table t (a bigint not null)" $dbname psql -c "insert into t select x % $mod from generate_series(1,$records) x" $dbname psql -c "vacuum freeze t" $dbname for i in {1..32} do echo $i psql -c "explain analyze select * from t order by a offset 10" $dbname | grep -E "Memory|Disk" echo "select * from t order by a offset 10" > bench.sql for loops in {1..3} do pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps done psql -c "alter table t add column c$i bigint" $dbname psql -c "update t set c$i = a" $dbname psql -c "vacuum full t" $dbname psql -c "vacuum freeze t" $dbname done
Re: Use generation context to speed up tuplesorts
Hi,

On 2021-08-03 10:59:25 +1200, David Rowley wrote:
> On Sat, 31 Jul 2021 at 08:38, Andres Freund wrote:
> > There is at least one path using tuplecontext that reaches code that
> > could end up freeing memory to a significant enough degree to care about
> > generation.c effectively not using that memory:
> > tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> > On a quick look I didn't find any expanded record user that frees
> > nontrivial amounts of memory, but I didn't look all that carefully.
>
> I guess we could just use a normal context for datum sorts if we
> thought that might be a problem.

I think that's probably a cure worse than the disease. I suspect datum sorts can benefit from the higher density quite a bit...

> I'm not too familiar with the expanded object code, but I'm struggling
> to imagine why anything would need to do a pfree in there. We just do
> EOH_get_flat_size() to determine how big to make the allocation then
> allocate some memory for EOH_flatten_into() to use to expand the
> object into.

I can see some scenarios with a bit more creative uses of expanded objects. We've e.g. been talking about using EA to avoid repeated and partial detoasting overhead, and you might need to do some more toast fetches when flattening. Toast fetches always allocate, and if the fetch were only for later parts of the tuple, the fetched data would need to be freed.

It's probably fine to deal with this at a later time, and just leave a comment somewhere. It could be addressed by having a bump-style allocator combined with freelists. It's not like the tuplesort.c case is actually interested in the generational behaviour of generation.c (which makes freelists uninteresting), it's just that generation.c is the densest allocator that we have right now...

Greetings,

Andres Freund
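The "bump style allocator" Andres mentions could look roughly like the following minimal sketch. All names are hypothetical and the freelist layer he pairs it with is omitted; the point is only the density: no per-chunk header, no rounding beyond alignment.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 8192

/* Hypothetical single-block bump allocator, not PostgreSQL code. */
typedef struct BumpBlock
{
	char	   *freeptr;		/* next free byte within mem[] */
	char		mem[BLOCK_SIZE];
} BumpBlock;

static void
bump_init(BumpBlock *blk)
{
	blk->freeptr = blk->mem;
}

/*
 * Bump allocation: just advance the pointer.  Individual chunks can
 * never be pfree'd, which is exactly why freelists would be needed on
 * top before something like tuplesort's datum path could rely on it.
 */
static void *
bump_alloc(BumpBlock *blk, size_t size)
{
	size_t		avail = (size_t) (blk->mem + BLOCK_SIZE - blk->freeptr);

	size = (size + 7) & ~(size_t) 7;	/* MAXALIGN-style rounding */
	if (size > avail)
		return NULL;			/* caller would grab a new block */
	blk->freeptr += size;
	return blk->freeptr - size;
}
```

Two 272-byte tuples land exactly 272 bytes apart, with zero bookkeeping between them, which is the density advantage being discussed.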
Re: Use generation context to speed up tuplesorts
On Sat, 31 Jul 2021 at 08:38, Andres Freund wrote:
> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

I guess we could just use a normal context for datum sorts if we thought that might be a problem.

I'm not too familiar with the expanded object code, but I'm struggling to imagine why anything would need to do a pfree in there. We just do EOH_get_flat_size() to determine how big to make the allocation, then allocate some memory for EOH_flatten_into() to use to expand the object into.

David
Re: Use generation context to speed up tuplesorts
On Sat, 31 Jul 2021 at 14:34, Tomas Vondra wrote:
> I spent a bit of time hacking on the Generation context, adding the
> improvements discussed in this thread:
>
> 1) internal handling of block sizes, similar to what AllocSet does (it
> pretty much just copies parts of it)
>
> 2) keeper block (we keep one empty block instead of freeing it)
>
> 3) I've also added allocChunkLimit, which makes it look a bit more like
> AllocSet (instead of using just blockSize/8, which does not work too
> well with dynamic blockSize)
>
> I haven't done any extensive tests on it, but it does pass check-world
> with asserts etc. I haven't touched the comments, those need updating.

Thanks for starting work on that. I've only had a quick look, but I can have a more detailed look once you've got it more complete.

For now, it does not really look like the keeper block stuff is wired up the same way as in aset.c. I'd expect you to be allocating that in the same malloc as you're using to allocate the context struct itself in GenerationContextCreate(). Also, likely as a result of the above, minContextSize does not seem to be wired up to anything apart from an Assert().

David
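The keeper-block wiring David is asking for — aset.c style, where the context header and the keeper block come from a single malloc() — might be sketched like this. All names here are hypothetical; this is an illustration of the layout, not the actual PostgreSQL implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context with an in-line keeper block, aset.c style. */
typedef struct DemoContext
{
	size_t		initBlockSize;
	char	   *keeper;			/* keeper block, just past the header */
} DemoContext;

static DemoContext *
demo_context_create(size_t minContextSize, size_t initBlockSize)
{
	/* aset.c uses minContextSize (if nonzero) to size this first,
	 * combined allocation; otherwise the initial block size is used */
	size_t		firstBlockSize = minContextSize ? minContextSize
											    : initBlockSize;
	/* one malloc covers both the header and the keeper block, so an
	 * empty-but-live context costs a single allocation and a context
	 * reset never has to free (and later re-malloc) its last block */
	DemoContext *ctx = malloc(sizeof(DemoContext) + firstBlockSize);

	if (ctx == NULL)
		return NULL;
	ctx->initBlockSize = initBlockSize;
	ctx->keeper = (char *) (ctx + 1);	/* keeper lives in-line */
	return ctx;
}
```

Under this layout, minContextSize actually does something — it sizes the combined header-plus-keeper chunk — which addresses the observation that the patch only consults it in an Assert().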
Re: Use generation context to speed up tuplesorts
Hi,

I spent a bit of time hacking on the Generation context, adding the improvements discussed in this thread:

1) internal handling of block sizes, similar to what AllocSet does (it pretty much just copies parts of it)

2) keeper block (we keep one empty block instead of freeing it)

3) I've also added allocChunkLimit, which makes it look a bit more like AllocSet (instead of using just blockSize/8, which does not work too well with dynamic blockSize)

I haven't done any extensive tests on it, but it does pass check-world with asserts etc. I haven't touched the comments, those need updating.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

>From 6ca0ae674964157c1f2d2ae446cb415c9be509b6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra
Date: Fri, 30 Jul 2021 23:53:52 +0200
Subject: [PATCH 1/3] Generation: grow blocks

---
 src/backend/access/gist/gistvacuum.c            |  2 +-
 .../replication/logical/reorderbuffer.c         |  2 +-
 src/backend/utils/mmgr/generation.c             | 53 ++-
 src/include/utils/memutils.h                    |  4 +-
 4 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531..1818ed06fc 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -161,7 +161,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
 													  "GiST VACUUM page set context",
-													  16 * 1024);
+													  ALLOCSET_DEFAULT_SIZES);
 	oldctx = MemoryContextSwitchTo(vstate.page_set_context);
 	vstate.internal_page_set = intset_create();
 	vstate.empty_leaf_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f99e45f22d..89018dc99d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -336,7 +336,7 @@ ReorderBufferAllocate(void)
 	buffer->tup_context = GenerationContextCreate(new_ctx,
 												  "Tuples",
-												  SLAB_LARGE_BLOCK_SIZE);
+												  ALLOCSET_DEFAULT_SIZES);
 
 	hash_ctl.keysize = sizeof(TransactionId);
 	hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 584cd614da..771a2525ca 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -60,7 +60,9 @@ typedef struct GenerationContext
 	MemoryContextData header;	/* Standard memory-context fields */
 
 	/* Generational context parameters */
-	Size		blockSize;		/* standard block size */
+	Size		initBlockSize;	/* initial block size */
+	Size		maxBlockSize;	/* maximum block size */
+	Size		nextBlockSize;	/* next block size to allocate */
 
 	GenerationBlock *block;		/* current (most recently allocated) block */
 	dlist_head	blocks;			/* list of blocks */
@@ -196,7 +198,9 @@ static const MemoryContextMethods GenerationMethods = {
 MemoryContext
 GenerationContextCreate(MemoryContext parent,
 						const char *name,
-						Size blockSize)
+						Size minContextSize,
+						Size initBlockSize,
+						Size maxBlockSize)
 {
 	GenerationContext *set;
 
@@ -208,16 +212,20 @@ GenerationContextCreate(MemoryContext parent,
 		"padding calculation in GenerationChunk is wrong");
 
 	/*
-	 * First, validate allocation parameters. (If we're going to throw an
-	 * error, we should do so before the context is created, not after.) We
-	 * somewhat arbitrarily enforce a minimum 1K block size, mostly because
-	 * that's what AllocSet does.
+	 * First, validate allocation parameters. Once these were regular runtime
+	 * test and elog's, but in practice Asserts seem sufficient because nobody
+	 * varies their parameters at runtime. We somewhat arbitrarily enforce a
+	 * minimum 1K block size.
 	 */
-	if (blockSize != MAXALIGN(blockSize) ||
-		blockSize < 1024 ||
-		!AllocHugeSizeIsValid(blockSize))
-		elog(ERROR, "invalid blockSize for memory context: %zu",
-			 blockSize);
+	Assert(initBlockSize == MAXALIGN(initBlockSize) &&
+		   initBlockSize >= 1024);
+	Assert(maxBlockSize == MAXALIGN(maxBlockSize) &&
+		   maxBlockSize >= initBlockSize &&
+		   AllocHugeSizeIsValid(maxBlockSize));	/* must be safe to double */
+	Assert(minContextSize == 0 ||
+		   (minContextSize == MAXALIGN(minContextSize) &&
+			minContextSize >= 1024 &&
+			minContextSize <= maxBlockSize));
 
 	/*
 	 * Allocate the context header. Unlike aset.c, we never try to combine
@@ -242,7 +250,9 @@ GenerationContextCreate(MemoryContext parent,
 	 */
 
 	/* Fill in GenerationContext-specific header fields */
-	set->blockSize = blockSize;
+	set->initBlockSize = initBlockSize;
+	set->maxBlockSize = maxBlockSize;
+	set->nextBlockSize = initBlockSize;
 
 	set->block = NULL;
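In outline, the growth policy the patch introduces — start at initBlockSize, double on each new block, cap at maxBlockSize — can be sketched as follows. This is a simplified standalone illustration of the nextBlockSize idea, not the patch's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified mirror of the patch's block-size fields. */
typedef struct GrowState
{
	size_t		nextBlockSize;	/* size the next block will get */
	size_t		maxBlockSize;	/* hard cap on block size */
} GrowState;

/*
 * Return the size to use for the block being allocated now, and
 * double the size for the following one, never exceeding the cap.
 * This is the same shape as aset.c's block-size growth.
 */
static size_t
alloc_block_size(GrowState *st)
{
	size_t		blksize = st->nextBlockSize;

	if (st->nextBlockSize < st->maxBlockSize)
		st->nextBlockSize *= 2;
	if (st->nextBlockSize > st->maxBlockSize)
		st->nextBlockSize = st->maxBlockSize;
	return blksize;
}
```

Starting at 1kB with an 8kB cap, successive blocks come out as 1024, 2048, 4096, 8192, 8192, ... — small contexts stay small while large ones quickly reach full-size blocks, which is the trade-off Andres asked for.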
Re: Use generation context to speed up tuplesorts
On 7/30/21 10:38 PM, Andres Freund wrote:
> On 2021-07-30 18:42:18 +1200, David Rowley wrote:
>> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
>> sorts for single column sorts), I noticed that we use a separate
>> memory context to store tuple data and we just reset when we're done
>> if the sort fits in memory, or we dump the tuples to disk in the same
>> order they're added and reset the context when it does not. There is
>> a little pfree() work going on via writetup_heap() which I think
>> possibly could just be removed to get some additional gains.
>>
>> Anyway, this context that stores tuples uses the standard aset.c
>> allocator which has the usual power of 2 wastage and additional
>> overheads of freelists etc. I wondered how much faster it would go if
>> I set it to use a generation context instead of an aset.c one.
>>
>> After running make installcheck to make the tenk1 table, running the
>> attached tuplesortbench script, I get this:
>>
>> So it seems to save quite a bit of memory getting away from rounding
>> up allocations to the next power of 2.
>>
>> Performance-wise, there's some pretty good gains. (Results in TPS)
>
> Very nice!

Yes, very nice. I wouldn't have expected such a significant difference, particularly in CPU usage. It's pretty interesting that it both reduces memory and CPU usage, I'd have guessed it's either one or the other.

> I wonder if there's cases where generation.c would regress performance
> over aset.c due to not having an initial / "keeper" block?

Not sure. I guess such a workload would need to allocate and free a single block (so very little memory) very often. I guess that's possible, but I'm not aware of a place doing that very often. Although, maybe decoding could do that for a simple (serial) workload.

I'm not opposed to adding a keeper block to Generation, similarly to what was discussed for Slab not too long ago.

>> The patch is just a simple 1-liner at the moment. I likely do need to
>> adjust what I'm passing as the blockSize to GenerationContextCreate().
>> Maybe a better number would be something that's calculated from
>> work_mem, e.g Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
>> so that we just allocate at most a 16th of work_mem per chunk, but not
>> bigger than 8MB. I don't think changing this will affect the
>> performance of the above very much.
>
> I think it's bad that both generation and slab don't have internal
> handling of block sizes. Needing to err on the side of too-big blocks
> to handle large amounts of memory well, just so the contexts don't need
> to deal with variably sized blocks, isn't a sensible tradeoff.

Well, back then it seemed like a sensible trade-off to me, but I agree it may have negative consequences. I'm not opposed to revisiting this.

> I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
> Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for tuplesort.c.
> There's plenty of cases where we'll just sort a handful of tuples, and
> increasing the memory usage of those by a factor of 1024 isn't good.
> The Min() won't do any good if a larger work_mem is used. Nor will it
> be good to use thousands of small allocations for a large in-memory
> tuplesort just because we're concerned about the initial allocation
> size. Both because of the allocation overhead, but importantly also
> because that will make context resets more expensive.

True.

> To me this says that we should transplant aset.c's block size growing
> into generation.c.

Yeah, maybe.

> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

Not sure, I'm not familiar with EOH_flatten_into or expanded records. But I wonder if there's some sort of metric that we could track in Generation and use it to identify "interesting" places.
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Use generation context to speed up tuplesorts
Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:
> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
> sorts for single column sorts), I noticed that we use a separate
> memory context to store tuple data and we just reset when we're done
> if the sort fits in memory, or we dump the tuples to disk in the same
> order they're added and reset the context when it does not. There is
> a little pfree() work going on via writetup_heap() which I think
> possibly could just be removed to get some additional gains.
>
> Anyway, this context that stores tuples uses the standard aset.c
> allocator which has the usual power of 2 wastage and additional
> overheads of freelists etc. I wondered how much faster it would go if
> I set it to use a generation context instead of an aset.c one.
>
> After running make installcheck to make the tenk1 table, running the
> attached tuplesortbench script, I get this:
>
> So it seems to save quite a bit of memory getting away from rounding
> up allocations to the next power of 2.
>
> Performance-wise, there's some pretty good gains. (Results in TPS)

Very nice!

I wonder if there's cases where generation.c would regress performance over aset.c due to not having an initial / "keeper" block?

> The patch is just a simple 1-liner at the moment. I likely do need to
> adjust what I'm passing as the blockSize to GenerationContextCreate().
> Maybe a better number would be something that's calculated from
> work_mem, e.g Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
> so that we just allocate at most a 16th of work_mem per chunk, but not
> bigger than 8MB. I don't think changing this will affect the
> performance of the above very much.

I think it's bad that both generation and slab don't have internal handling of block sizes. Needing to err on the side of too-big blocks to handle large amounts of memory well, just so the contexts don't need to deal with variably sized blocks, isn't a sensible tradeoff.

I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for tuplesort.c. There's plenty of cases where we'll just sort a handful of tuples, and increasing the memory usage of those by a factor of 1024 isn't good. The Min() won't do any good if a larger work_mem is used. Nor will it be good to use thousands of small allocations for a large in-memory tuplesort just because we're concerned about the initial allocation size. Both because of the allocation overhead, but importantly also because that will make context resets more expensive.

To me this says that we should transplant aset.c's block size growing into generation.c.

There is at least one path using tuplecontext that reaches code that could end up freeing memory to a significant enough degree to care about generation.c effectively not using that memory: tuplesort_putdatum()->datumCopy()->EOH_flatten_into()

On a quick look I didn't find any expanded record user that frees nontrivial amounts of memory, but I didn't look all that carefully.

Greetings,

Andres Freund
Re: Use generation context to speed up tuplesorts
On Fri, Jul 30, 2021 at 2:42 AM David Rowley wrote:
> Master:
>    Sort Method: quicksort Memory: 5541kB
> Patched:
>    Sort Method: quicksort Memory: 3197kB

Whoa.

> work_mem = '4GB';
> Test    master  gen sort  compare
> Test1   317.2   665.6     210%
> Test2   228.6   388.9     170%
> Test3   207.4   330.7     159%
> Test4   185.5   279.4     151%
> Test5   292.2   563.9     193%

Very impressive.

An early version of what eventually became DSA worked with backend-local memory, and I saw very substantial memory usage improvements on large sorts, similar to what you show here. I am not sure I saw the same CPU improvements, and in any case I abandoned the idea of using that infrastructure to manage backend-local memory at some point, since the whole thing had lots of problems that I didn't know how to solve. What you've done here looks like a much more promising approach.

--
Robert Haas
EDB: http://www.enterprisedb.com
Use generation context to speed up tuplesorts
Hackers,

While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum sorts for single column sorts), I noticed that we use a separate memory context to store tuple data and we just reset it when we're done if the sort fits in memory, or we dump the tuples to disk in the same order they're added and reset the context when it does not. There is a little pfree() work going on via writetup_heap() which I think possibly could just be removed to get some additional gains.

Anyway, this context that stores tuples uses the standard aset.c allocator, which has the usual power-of-2 wastage and additional overheads of freelists etc. I wondered how much faster it would go if I set it to use a generation context instead of an aset.c one.

After running make installcheck to make the tenk1 table, running the attached tuplesortbench script, I get this:

Master:
work_mem = '4MB';
Sort Method: external merge Disk: 2496kB
work_mem = '4GB';
Sort Method: quicksort Memory: 5541kB

Patched:
work_mem = '4MB';
Sort Method: quicksort Memory: 3197kB

So it seems to save quite a bit of memory getting away from rounding up allocations to the next power of 2.

Performance-wise, there's some pretty good gains. (Results in TPS)

work_mem = '4GB';
Test    master  gen sort  compare
Test1   317.2   665.6     210%
Test2   228.6   388.9     170%
Test3   207.4   330.7     159%
Test4   185.5   279.4     151%
Test5   292.2   563.9     193%

If I drop work_mem down to the standard setting, the unpatched version goes to disk, but the patched version does not. The gains get a little bigger.

work_mem = '4MB';
Test    master  gen sort  compare
Test1   177.5   658.2     371%
Test2   149.7   385.2     257%
Test3   137.5   330.0     240%
Test4   129.0   275.1     213%
Test5   161.7   546.4     338%

The patch is just a simple 1-liner at the moment. I likely do need to adjust what I'm passing as the blockSize to GenerationContextCreate().
Maybe a better number would be something that's calculated from work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64), so that we just allocate at most a 16th of work_mem per chunk, but not bigger than 8MB. I don't think changing this will affect the performance of the above very much.

David

#!/bin/bash

sec=10

# This should improve
echo "select * from tenk1 order by two offset 100" > bench.sql
echo Test1
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by tenthous offset 100" > bench.sql
echo Test2
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by stringu1 offset 100" > bench.sql
echo Test3
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by stringu2 offset 100" > bench.sql
echo Test4
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by twenty offset 100" > bench.sql
echo Test5
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

generation_tuplesort.patch
Description: Binary data
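Since work_mem is measured in kilobytes, the Min() formula above works out to min(8MB, work_mem/16) bytes. A hypothetical standalone sketch of the arithmetic (`suggested_block_size` is illustrative, not a PostgreSQL function):

```c
#include <assert.h>
#include <stddef.h>

#define ALLOCSET_DEFAULT_MAXSIZE (8 * 1024 * 1024)	/* 8MB, as in memutils.h */

/*
 * work_mem is a setting in kilobytes; multiplying by 64 yields
 * (work_mem * 1024) / 16 bytes, i.e. a 16th of work_mem, before
 * applying the 8MB cap.
 */
static size_t
suggested_block_size(int work_mem)
{
	size_t		candidate = (size_t) work_mem * 64;

	return candidate < ALLOCSET_DEFAULT_MAXSIZE ? candidate
												: ALLOCSET_DEFAULT_MAXSIZE;
}
```

For the default work_mem of 4MB (4096kB) this gives 256kB blocks; at work_mem = 4GB the 8MB cap kicks in, which is exactly the case Andres objects to later in the thread, since the Min() stops helping once work_mem is large.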