Re: Use generation context to speed up tuplesorts

2023-06-19 Thread Ronan Dunklau
On Sunday 18 June 2023 at 20:22:17 CEST, Tomas Vondra wrote:
> Hi Ronan,
> 
> We briefly chatted about the glibc-tuning part of this thread at pgcon,
> so I wonder if you're still planning to pursue that. If you do, I
> suggest we start a fresh thread, so that it's not mixed with the already
> committed improvements of generation context.
> 
> I wonder what's the situation with the generation context improvements
> already pushed - does setting the glibc thresholds still help, and if
> yes how much?

Hi Tomas!

I'm currently working on it, trying to evaluate the possible benefits for the
current ASet and Generation context use cases.

I'll make sure to use a new thread to post my findings.

Thank you for remembering!

Best regards,

--
Ronan Dunklau








Re: Use generation context to speed up tuplesorts

2023-06-18 Thread Tomas Vondra
Hi Ronan,

We briefly chatted about the glibc-tuning part of this thread at pgcon,
so I wonder if you're still planning to pursue that. If you do, I
suggest we start a fresh thread, so that it's not mixed with the already
committed improvements of generation context.

I wonder what's the situation with the generation context improvements
already pushed - does setting the glibc thresholds still help, and if
yes how much?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2022-04-23 Thread David Rowley
Thanks for raising this.

On Sun, 24 Apr 2022 at 12:59, Noah Misch  wrote:
> This (commit 77bae39) did not change function parameter counts, and
> TUPLESORT_RANDOMACCESS generally has the same numeric value as "true".  I
> get no warning if I pass "true" for the "sortopt" flags parameter.  Hence, I
> suspect this did not break the API.  Should we be happy about that?  I'm fine
> with it.

I'd be open to doing something like making sortopt the first argument of
each of the tuplesort_begin* functions if people have concerns about
this causing problems. However, given that TUPLESORT_RANDOMACCESS and
true likely have the same value, it seems we'd be more likely to break
code down the line if we later added some required option that wouldn't
get set by passing "true" as the sortopt.

If anyone were to use tuplesort_set_bound() without updating their
tuplesort_begin* call, then on non-assert builds nothing would misbehave.
There's just a risk of having bad memory fragmentation from using the
generation context for bounded sorts.

David




Re: Use generation context to speed up tuplesorts

2022-04-23 Thread Peter Geoghegan
On Sat, Apr 23, 2022 at 5:59 PM Noah Misch  wrote:
> This (commit 77bae39) did not change function parameter counts, and
> TUPLESORT_RANDOMACCESS generally has the same numeric value as "true".  I
> get no warning if I pass "true" for the "sortopt" flags parameter.  Hence, I
> suspect this did not break the API.  Should we be happy about that?  I'm fine
> with it.

If I happened to believe that this issue (or one like it) might have
real negative consequences, and that those consequences could easily
be avoided (by making the API break impossible to overlook), then I
would object -- why even take a small chance? Fortunately I don't
believe that we're even taking a small chance here, all things
considered. And so I agree; this issue isn't a concern.

-- 
Peter Geoghegan




Re: Use generation context to speed up tuplesorts

2022-04-23 Thread Noah Misch
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> 0002:
> This modifies the tuplesort API so that instead of having a
> randomAccess bool flag, this is changed to a bitwise flag that we can
> add further options in the future.  It's slightly annoying to break
> the API, but it's not exactly going to be hard for people to code
> around that.

For what it's worth, the three PGXN extensions using tuplesort_begin_* are
"citus", "pg_bulkload", and "vector".  Nothing calls tuplesort_set_bound().

This (commit 77bae39) did not change function parameter counts, and
TUPLESORT_RANDOMACCESS generally has the same numeric value as "true".  I
get no warning if I pass "true" for the "sortopt" flags parameter.  Hence, I
suspect this did not break the API.  Should we be happy about that?  I'm fine
with it.




Re: Use generation context to speed up tuplesorts

2022-04-04 Thread David Rowley
On Sat, 2 Apr 2022 at 02:25, Justin Pryzby  wrote:
>
> On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> > 0002:
> > This modifies the tuplesort API so that instead of having a
> > randomAccess bool flag,
>
> Per cirrus, this missed updating one line for dtrace.

Thanks.

I've pushed all 3 of the patches now.

David




Re: Use generation context to speed up tuplesorts

2022-04-01 Thread Justin Pryzby
On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> 0002:
> This modifies the tuplesort API so that instead of having a
> randomAccess bool flag,

Per cirrus, this missed updating one line for dtrace.

On Fri, Apr 01, 2022 at 10:00:08PM +1300, David Rowley wrote:
> I feel this is a pretty simple patch and if nobody has any objections
> then I plan to push all 3 of them on my New Zealand Monday morning.

I'll mark the CF as such.

-- 
Justin
From a6271ac626b561cdafe60152eb4fb208e7fe7240 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH 1/4] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have minimum, initial and maximum block
sizes. The standard allocator allows this already but when the generation
context was added, it only allowed fixed-sized blocks.  The problem with
fixed-sized blocks is that it's difficult to choose how large to make the
blocks.  If the chosen size is too small then we'd end up with a large
number of blocks and a large number of malloc calls. If the block size is
made too large, then memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again.  We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use that if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
---
 src/backend/access/gist/gistvacuum.c  |   6 +
 .../replication/logical/reorderbuffer.c   |   7 +
 src/backend/utils/mmgr/generation.c   | 383 ++
 src/include/utils/memutils.h  |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index aac4afab8ff..f190decdff2 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * pages in page_set_context.  Internally, the integer set will remember
 	 * this context so that the subsequent allocations for these integer sets
 	 * will be done from the same context.
+	 *
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
 	  "GiST VACUUM page set context",
+	  16 * 1024,
+	  16 * 1024,
 	  16 * 1024);
 	oldctx = MemoryContextSwitchTo(vstate.page_set_context);
 	vstate.internal_page_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c2d9be81fae..4702750a2e7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -370,8 +370,15 @@ ReorderBufferAllocate(void)
 			SLAB_DEFAULT_BLOCK_SIZE,
 			sizeof(ReorderBufferTXN));
 
+	/*
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
+	 */
 	buffer->tup_context = GenerationContextCreate(new_ctx,
   "Tuples",
+  SLAB_LARGE_BLOCK_SIZE,
+  SLAB_LARGE_BLOCK_SIZE,
   SLAB_LARGE_BLOCK_SIZE);
 
 	hash_ctl.keysize = sizeof(TransactionId);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 95c006f689f..29065201556 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -20,18 +20,15 @@
  *
  *	The memory context uses a very simple approach to free space management.
  *	Instead of a complex global freelist, each block tracks a number
- *	of allocated and freed chunks. Freed chunks are not reused, and once all
- *	chunks in a block are freed, the whole block is thrown away. When the
- *	chunks allocated in the same block have similar lifespan, this works
- *	very well and is very cheap.
+ *	of allocated and freed chunks.  The block is classed as empty when the
+ *	number of free chunks is equal to the number of allocated chunks.  When
+ *	this occurs, instead of freeing the block, we try to "recycle" it, i.e.
+ *	reuse it for new allocations.  This is done by setting the block in the

Re: Use generation context to speed up tuplesorts

2022-04-01 Thread David Rowley
On Wed, 23 Mar 2022 at 04:08, David Rowley  wrote:
> If nobody has any objections to the 0001 patch then I'd like to move
> ahead with it in the next few days.  For the 0002 patch, I'm currently
> feeling like we shouldn't be using the Generation context for bounded
> sorts. The only way I can think to do that is with an API change to
> tuplesort. I feel 0001 is useful even without 0002.

0001:
I've made some small revisions to the 0001 patch so that the keeper
and freeblock are only used again when they're entirely empty.  The
situation I want to avoid here is that when the current block does not
have enough free space to store some new allocation, we'll then try
the freeblock and then the keeper block.  The problem I saw there is
that we may previously have partially filled the keeper or freeblock
block and have been unable to store some medium sized allocation which
caused us to move to a new block.  If we didn't check that the keeper
and freeblock blocks were empty first then we could end up being able
to store some small allocation in there where some previous medium
sized allocation couldn't fit.  While that's not necessarily an actual
problem, what it does mean is that consecutive allocations might not
end up in the same block or the next block.  Over time in a FIFO type
workload it would be possible to get fragmentation, which could result
in being unable to free blocks.  For longer lived contexts I imagine
that could end up fairly bad. The updated patch should avoid that
problem.
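
To make that concrete, the allocation path now effectively works like the
following sketch (illustrative pseudocode only; the type and helper names
are made up, not the committed GenerationAlloc() code):

/* Illustrative pseudocode only: type and helper names are hypothetical. */
static void *
generation_alloc_sketch(GenCtx *set, size_t size)
{
	if (block_has_space(set->block, size))
		return allocate_from(set->block, size);

	/*
	 * Consider the freeblock and the keeper block only when they are
	 * completely empty; reusing a partially filled block here could
	 * scatter consecutive allocations across blocks and cause the
	 * fragmentation described above under FIFO workloads.
	 */
	if (set->freeblock != NULL &&
		block_is_empty(set->freeblock) &&
		block_has_space(set->freeblock, size))
		return allocate_from(set->freeblock, size);

	if (block_is_empty(set->keeper) &&
		block_has_space(set->keeper, size))
		return allocate_from(set->keeper, size);

	/* Otherwise malloc a new block, sized by the block-growing policy. */
	return allocate_from(new_block(set, size), size);
}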

0002:
This modifies the tuplesort API so that instead of having a
randomAccess bool flag, this is changed to a bitwise flag that we can
add further options in the future.  It's slightly annoying to break
the API, but it's not exactly going to be hard for people to code
around that.  It might also mean we don't have to break the API in the
future if we're doing some change where we can just add a new bitwise
flag.

0003:
This adds a new flag for TUPLESORT_ALLOWBOUNDED and modifies the
places where we set a sort bound to pass the flag.  The patch only
uses the generation context when the flag is not passed.
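
For anyone following along with out-of-core code, here is a sketch of what
the migration looks like for a caller (the variables are placeholders; the
flag names are the ones these patches add):

Tuplesortstate *ts;

/* Before: the final argument was a bool randomAccess */
ts = tuplesort_begin_heap(tupDesc, nkeys, attNums, sortOperators,
                          sortCollations, nullsFirstFlags, work_mem,
                          NULL /* coordinate */, true);

/* After: an int sortopt bitmask; pass TUPLESORT_NONE when no options apply */
ts = tuplesort_begin_heap(tupDesc, nkeys, attNums, sortOperators,
                          sortCollations, nullsFirstFlags, work_mem,
                          NULL /* coordinate */,
                          TUPLESORT_RANDOMACCESS | TUPLESORT_ALLOWBOUNDED);

/* Only valid when TUPLESORT_ALLOWBOUNDED was passed at begin time */
tuplesort_set_bound(ts, bound);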

I feel this is a pretty simple patch and if nobody has any objections
then I plan to push all 3 of them on my New Zealand Monday morning.

David
From 053066f8276809af1b3f2c26e1e86b8e1db85bac Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH v5 1/3] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have minimum, initial and maximum block
sizes. The standard allocator allows this already but when the generation
context was added, it only allowed fixed-sized blocks.  The problem with
fixed-sized blocks is that it's difficult to choose how large to make the
blocks.  If the chosen size is too small then we'd end up with a large
number of blocks and a large number of malloc calls. If the block size is
made too large, then memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again.  We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use that if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0...@enterprisedb.com
---
 src/backend/access/gist/gistvacuum.c  |   6 +
 .../replication/logical/reorderbuffer.c   |   7 +
 src/backend/utils/mmgr/generation.c   | 383 ++
 src/include/utils/memutils.h  |   4 +-
 4 files changed, 322 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index aac4afab8f..f190decdff 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -158,9 +158,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * pages in page_set_context.  Internally, the integer set will remember
 	 * this context so that the subsequent allocations for these integer sets
 	 * will be done from the same context.
+	 *
+	 * XXX the allocation sizes used below pre-date generation context's block
+	 * growing code.  These values should likely be benchmarked and set to
+	 * more suitable values.
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,


Re: Use generation context to speed up tuplesorts

2022-03-22 Thread David Rowley
On Wed, 23 Feb 2022 at 15:25, Andres Freund  wrote:
> On 2022-02-18 12:10:51 +1300, David Rowley wrote:
> > The other way I thought to fix it was by changing the logic for when
> > generation blocks are freed.  In the problem case mentioned above, the
> > block being freed is the current block (which was just allocated).  I
> > made some changes to adjust this behaviour so that we no longer free
> > the block when the final chunk is pfree()'d. Instead, that now lingers
> > and can be reused by future allocations, providing they fit inside it.
>
> That makes sense to me, as long as we keep just one such block.

I've rewritten the patch in such a way that it no longer matters which
block becomes free.  What I did was add a new field to the
GenerationContext named "freeblock".  We now simply store up to 1
recently emptied block there instead of calling free() on it.  When
we're allocating new memory, we'll try and make use of the "freeblock"
when we don't have enough space in the current block.

I feel like this is quite a good change as it saves continual
free/malloc calls for FIFO pattern users of the generation context.
Since that's pretty much the workload that this context is intended
for, that seems like a very good change to make.

I did consider only recording a block as free once it reaches the
maximum size.  As I have the code currently, we'll continually reuse
all blocks, including the initial-sized ones.  With FIFO workloads, that
means we keep reusing those smaller initial blocks, so blocks will fill
up more quickly. I'm not entirely sure if that matters. I just wanted
to point out that I thought about it.

Aside from the freeblock, I've also added code for adding a "keeper"
block.  This is allocated in the same malloc'd chunk as the Generation
context itself. One thing I'm not too sure about is if the keeper
block change is worth the trouble.  If we pfree() all of the chunks
out of the keeper block, there's a special case in GenerationFree() to
empty that block rather than free() it.  This also means we need a
special case in GenerationAlloc() so we try to see if the keeper block
has any space before we go and allocate a new block.   My current
thoughts are that the keeper block is really only useful for
Generation contexts that remain very small and don't ever require
allocation of additional blocks.  This will mean all the memory can be
allocated along with the context, which saves a malloc and free call.
Does anyone have any thoughts on this?

> > The problem I see with this method is that there still could be some
> > pathological case that causes us to end up storing just a single tuple per
> > generation block.
>
> Crazy idea: Detect the situation, and recompact. Create a new context, copy
> all the tuples over, delete the old context. That could be a win even in less
> adversarial situations than "a single tuple per generation block".

I think that would need some new MemoryContext API functions which
callers could use to determine the fragmentation and do something
about it on the consumer side. Otherwise, as Tomas says, if we did it
from within the context code itself we'd have no visibility of what is
pointing to the memory we're moving around.

If nobody has any objections to the 0001 patch then I'd like to move
ahead with it in the next few days.  For the 0002 patch, I'm currently
feeling like we shouldn't be using the Generation context for bounded
sorts. The only way I can think to do that is with an API change to
tuplesort. I feel 0001 is useful even without 0002.

David
From 546ac7a3f5ad993c76576e09dd031a453a70c692 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Wed, 16 Feb 2022 10:23:32 +1300
Subject: [PATCH v4 1/2] Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator:

1. Allow generation contexts to have minimum, initial and maximum block
sizes. The standard allocator allows this already but when the generation
context was added, it only allowed fixed-sized blocks.  The problem with
fixed-sized blocks is that it's difficult to choose how large to make the
blocks.  If the chosen size is too small then we'd end up with a large
number of blocks and a large number of malloc calls. If the block size is
made too large, then memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again.  We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use that if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Discussion: 

Re: Use generation context to speed up tuplesorts

2022-02-23 Thread Tomas Vondra
On 2/23/22 03:25, Andres Freund wrote:
> Hi,
> 
> On 2022-02-18 12:10:51 +1300, David Rowley wrote:
>> The other way I thought to fix it was by changing the logic for when
>> generation blocks are freed.  In the problem case mentioned above, the
>> block being freed is the current block (which was just allocated).  I
>> made some changes to adjust this behaviour so that we no longer free
>> the block when the final chunk is pfree()'d. Instead, that now lingers
>> and can be reused by future allocations, providing they fit inside it.
> 
> That makes sense to me, as long as we keep just one such block.
> 
> 
>> The problem I see with this method is that there still could be some
>> pathological case that causes us to end up storing just a single tuple per
>> generation block.
> 
> Crazy idea: Detect the situation, and recompact. Create a new context, copy
> all the tuples over, delete the old context. That could be a win even in less
> adversarial situations than "a single tuple per generation block".
> 

What about pointers to the chunks in the old memory context?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2022-02-22 Thread Andres Freund
Hi,

On 2022-02-18 12:10:51 +1300, David Rowley wrote:
> The other way I thought to fix it was by changing the logic for when
> generation blocks are freed.  In the problem case mentioned above, the
> block being freed is the current block (which was just allocated).  I
> made some changes to adjust this behaviour so that we no longer free
> the block when the final chunk is pfree()'d. Instead, that now lingers
> and can be reused by future allocations, providing they fit inside it.

That makes sense to me, as long as we keep just one such block.


> The problem I see with this method is that there still could be some
> pathological case that causes us to end up storing just a single tuple per
> generation block.

Crazy idea: Detect the situation, and recompact. Create a new context, copy
all the tuples over, delete the old context. That could be a win even in less
adversarial situations than "a single tuple per generation block".


Greetings,

Andres Freund




Re: Use generation context to speed up tuplesorts

2022-02-17 Thread David Rowley
On Sun, 13 Feb 2022 at 09:56, Tomas Vondra wrote:
> I'm not against pushing the generation context improvements, e.g.
> something like the patches posted in [1], because those seem reasonable
> in general. But I'm somewhat confused about the last patch (adjusting
> allocChunkLimit) causing fairly significant regressions.

My thoughts here are that we should pursue the patch that allows
growing of the block size in the generation context.  I do think the
investigation of the malloc() behaviour and performance is worth
pursuing, but I don't think it should be done here or as part of this
patch.  I think it's a fairly large undertaking to ensure any changes
to the malloc options are an overall win and not just a win for
whatever benchmark we test them on. If they're really an overall win,
then shouldn't glibc know about them and maybe adopt them as the
standard options?

To get the ball rolling again on the changes to the generation
context, I took your patches, Tomas, and fixed a few things around the
keeper blocks not working correctly.  I've now made the keeper block
behaviour match how aset.c works.  There the keeper block is allocated
in the same allocation as the context itself. That reduces the number
of malloc() calls when setting up a new memory context.
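
For reference, creating one of these growing generation contexts now mirrors
the aset API; a sketch (ALLOCSET_DEFAULT_SIZES expands to the usual
0 / 8 kB / 8 MB min/init/max triple):

MemoryContext cxt = GenerationContextCreate(CurrentMemoryContext,
                                            "tuple context",
                                            ALLOCSET_DEFAULT_SIZES);

Blocks start at the initial size and grow up to the maximum, and the keeper
block lives in the same allocation as the context header.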

On testing this, I discovered a problem regarding the use of
generation contexts for storing tuples in tuplesort.c.  The problem is
when the sort is bounded (e.g. SELECT * FROM ... ORDER BY .. LIMIT N),
we'll store and directly throw away any tuples that sort greater than
the existing set of N tuples. This causes a problem because once we
move away from using the keeper block, initially, we'll at some point
end up storing a tuple in a newly allocated generation block then
proceed to pfree() that tuple due to it sorting greater than the
existing N tuples. Because we currently always free() newly empty
blocks, we end up thrashing free()/malloc().  This behaviour results
in one of the regression test queries in tuplesort.sql going from
taking 10 ms to 1 minute, on my machine.

I had a few thoughts about how best to work around this problem.
Firstly, I thought that we should just *not* use the Generation
context when the sort is bounded.  That turns out to be a bit icky as
we only make the sort bounding when we call tuplesort_set_bound(),
which is after we've allocated the memory context for storing tuples.
I thought about maybe just adding another bool flag named
"allowBounding" and use a Generation context if that flag is not set.
I just don't really like that as a solution.  We also cannot make the
set bound part of the tuplesort_begin_heap() call as
nodeIncrementalSort.c does reset the bound. Possibly we could ditch
tuplesort_set_bound() and invent tuplesort_reset_bound() and change
the API so the bound is set when making the Tuplesortstate.

The other way I thought to fix it was by changing the logic for when
generation blocks are freed.  In the problem case mentioned above, the
block being freed is the current block (which was just allocated).  I
made some changes to adjust this behaviour so that we no longer free
the block when the final chunk is pfree()'d. Instead, that now lingers
and can be reused by future allocations, providing they fit inside it.
If they don't fit then we must free the block, otherwise it would just
remain empty forever.   The problem I see with this method is that
there still could be some pathological case that causes us to end up
storing just a single tuple per generation block.

Hitting the worst case here does seem pretty unlikely, so I'm a bit
torn on whether we should just play it safe and stick to the standard
memory allocator for bounded sorts.

I've attached 2 patches. The 0001 patch adds the new logic to
generation.c to allow the block to grow and also adds the logic to
skip freeing the current block when it becomes empty.  The 0002 patch
simply makes tuplesort.c use the generation context for tuples
unconditionally.

I've also attached the benchmark results I got from this patch running
the attached sortbenchall script. It's fairly obvious that there are
performance improvements here even when the malloc() usage pattern is
similar to aset.c's.

David
#!/bin/bash

dbname=postgres
mod=100
sec=5

psql -c "alter system set max_parallel_workers_per_gather TO 0;" $dbname
psql -c "select pg_reload_conf();" $dbname

for mem in "4MB" "4GB"
do

psql -c "alter system set work_mem TO '$mem';" $dbname
psql -c "select pg_reload_conf();" $dbname
psql -c "show work_mem" $dbname

sec=10

for sql in "select * from tenk1 order by two offset 100" "select * 
from tenk1 order by tenthous offset 100" "select * from tenk1 order by 
stringu1 offset 100" "select * from tenk1 order by stringu2 offset 100" 
"select * from tenk1 order by twenty offset 100" "select twenty,sum(unique1 
order by unique1) from tenk1 group by twenty" "select unique1,sum(twenty order 
by twenty) 

Re: Use generation context to speed up tuplesorts

2022-02-12 Thread Tomas Vondra
On 1/20/22 08:36, Ronan Dunklau wrote:
> On Thursday 20 January 2022 at 02:31:15 CET, David Rowley wrote:
>> On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud  wrote:
>>> I'm also a bit confused about which patch(es) should actually be reviewed
>>> in that thread.  My understanding is that the original patch might not be
>>> relevant anymore but I might be wrong.  The CF entry should probably be
>>> updated to clarify that, with an annotation/ title change / author update
>>> or something.
>>>
>>> In the meantime I will switch the entry to Waiting on Author.
>>
>> I think what's going on here is a bit confusing. There's quite a few
>> patches on here that aim to do very different things.
>>
>> The patch I originally proposed was just to make tuplesort use
>> generation memory contexts. This helps speed up tuplesort because
>> generation contexts allow palloc() to be faster than the standard
>> allocator. Additionally, there's no power-of-2 wastage with generation
>> contexts like there is with the standard allocator. This helps fit
>> more tuples on fewer CPU cache lines.
>>
>> I believe Andres then mentioned that the fixed block size for
>> generation contexts might become a problem. With the standard
>> allocator the actual size of the underlying malloc() grows up until a
>> maximum size.  With generation contexts this is always the same, so we
>> could end up with a very large number of blocks which means lots of
>> small mallocs which could end up slow.  Tomas then posted a series of
>> patches to address this.
>>
>> I then posted another patch that has the planner make a choice on the
>> size of the blocks to use for the generation context based on the
>> tuple width and number of tuple estimates. My hope there was to
>> improve performance further by making a better choice for how large to
>> make the blocks in the first place.  This reduces the need for Tomas'
>> patch, but does not eliminate the need for it.
>>
>> As of now, I still believe we'll need Tomas' patches to allow the
>> block size to grow up to a maximum size.  I think those patches are
>> likely needed before we think about making tuplesort use generation
>> contexts.  The reason I believe this is that we don't always have good
>> estimates for the number of tuples we're going to sort.  nodeSort.c is
>> fairly easy as it just fetches tuples once and then sorts them.  The
>> use of tuplesort for nodeIncrementalSort.c is much more complex as
>> we'll likely be performing a series of smaller sorts which are much
>> harder to get size estimates for. This means there's likely no magic
>> block size that we can use for the generation context that's used to
>> store the tuples in tuplesort.  This is also the case for order by
>> aggregates and probably most other usages of tuplesort.
>>
> 
> You left out the discussion I started about glibc's malloc specific behaviour.
> 
> I tried to demonstrate that a big part of the performance gain you were
> seeing was not in fact due to our allocator by itself, but to the way
> different block size allocations interact with this malloc implementation
> and how it handles
> releasing memory back to the system. I also tried to show that we can give 
> DBAs more control over how much memory to "waste" as the price for faster 
> allocations.
> 
> I agree that the generation context memory manager is useful in the tuplesort 
> context, if only for the fact that we fall back to disk less quickly due to 
> the reduced wastage of memory, but I would be wary of drawing conclusions too 
> quickly based on a specific malloc implementation which shows threshold
> effects, which are compounded by our growing block sizes code in general.
> 

Yeah, I share this concern - how much of the improvement is real and how
much is due to accidentally not hitting some glibc-specific self-tuning
stuff? Would realistic workloads get the same improvement or would they
cause regressions? What about non-glibc systems with other mallocs?

I'm not against pushing the generation context improvements, e.g.
something like the patches posted in [1], because those seem reasonable
in general. But I'm somewhat confused about the last patch (adjusting
allocChunkLimit) causing fairly significant regressions.

I'll try investigating this a bit more next week.


regards


[1]
https://www.postgresql.org/message-id/flat/bcdd4e3e-c12d-cd2b-7ead-a91ad416100a%40enterprisedb.com#a5ecd4e5a9e7d5b07ff25b5684da894f

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2022-02-12 Thread Tomas Vondra
On 1/25/22 13:34, Ronan Dunklau wrote:
>
> ...
>
> I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and 
> jemalloc, with the standard aset memory manager or with David's generation v2 
> patch.  To use jemalloc, I just use a LD_PRELOAD env variable before starting 
> postgres. 
> 
> I have not tried to measure jemalloc's memory footprint like I did for 
> glibc's 
> malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a 
> glibc's malloc replacement. 
> 
> Please find the results attached. As I suspected, most of the gain observed 
> with David's patch comes from working around glibc's malloc idiosyncrasies, but 
> the good news is that it also helps (to a lesser degree) with jemalloc.
> 
> I'm not sure how that would behave with growing block sizes though: as Tomas' 
> patch and stress-testing benchmarks showed, we can hit some particular 
> thresholds (and regressions) in glibc's self-tuning behaviour.
> 

Interesting results!

So just switching to jemalloc can yield much better performance (in some
cases), even beating the generation context (on top of glibc malloc).

Of course, those are just simple (somewhat synthetic) benchmarks, and
there is no info about memory consumption. Perhaps there are cases where
this would be slower than glibc malloc, and I doubt many people would
switch to a non-default malloc implementation.

But I think in general this mostly confirms our suspicion about the
patch (somewhat accidentally) working around glibc internal behavior.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2022-02-06 Thread Andy Fan
Hi:

On Thu, Jan 20, 2022 at 9:31 AM David Rowley  wrote:

>
> As of now, I still believe we'll need Tomas' patches to allow the
> block size to grow up to a maximum size.  I think those patches are
> likely needed before we think about making tuplesort use generation
> contexts.  The reason I believe this is that we don't always have good
> estimates for the number of tuples we're going to sort.


+1

I spent some time studying the case here, and my current thought is
that we can discuss/commit the minimum committable changes, which
should be beneficial for some cases and harmless for others.
Tomas's patch 0002 means no additional blocks are needed if we
switch to GenerationContext (compared with the standard context), and
David's patch can obviously reduce total memory usage and improve
performance.  So IMO Tomas's patch 0002 + David's patch is a
committable patchset for the first round, and Tomas's 0001 patch
would be good to have as well.

I double-checked Tomas's 0002 patch and it looks good to me.  I then
applied David's patch with ALLOCSET_DEFAULT_SIZES and tested the same
workload.  Here are the results (numbers are tps):

work_mem = '4GB'

| Test Case | master | patched |
|-----------+--------+---------|
| Test 1    | 306    | 406     |
| Test 2    | 225    | 278     |
| Test 3    | 202    | 248     |
| Test 4    | 184    | 218     |
| Test 5    | 281    | 360     |

work_mem = '4MB'

| Test Case | master | patched |
|-----------+--------+---------|
| Test 1    | 124    | 409     |
| Test 2    | 106    | 280     |
| Test 3    | 100    | 249     |
| Test 4    | 97     | 218     |
| Test 5    | 120    | 369     |


I didn't get as much of a performance improvement as David did at the
beginning; I think that is because David used ALLOCSET_DEFAULT_MAXSIZE
directly, which needs fewer memory allocations.

AFAICS, Tomas's patch 0002 + David's patch should be ready to commit
for round 1.  We can then try other opportunities, like using row
estimates to allocate the initial memory, and GenerationContext
improvements like 0003/0004.  Would this work?

-- 
Best Regards
Andy Fan


Re: Use generation context to speed up tuplesorts

2022-01-25 Thread Ronan Dunklau
On Thursday 20 January 2022 at 08:36:46 CET, Ronan Dunklau wrote:
> On Thursday 20 January 2022 at 02:31:15 CET, David Rowley wrote:
> > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud  wrote:
> > > I'm also a bit confused about which patch(es) should actually be
> > > reviewed
> > > in that thread.  My understanding is that the original patch might not
> > > be
> > > relevant anymore but I might be wrong.  The CF entry should probably be
> > > updated to clarify that, with an annotation/ title change / author
> > > update
> > > or something.
> > > 
> > > In the meantime I will switch the entry to Waiting on Author.
> > 
> > I think what's going on here is a bit confusing. There's quite a few
> > patches on here that aim to do very different things.
> > 
> > The patch I originally proposed was just to make tuplesort use
> > generation memory contexts. This helps speed up tuplesort because
> > generation contexts allow palloc() to be faster than the standard
> > allocator. Additionally, there's no power-of-2 wastage with generation
> > contexts like there is with the standard allocator. This helps fit
> > more tuples on fewer CPU cache lines.
> > 
> > I believe Andres then mentioned that the fixed block size for
> > generation contexts might become a problem. With the standard
> > allocator the actual size of the underlying malloc() grows up until a
> > maximum size.  With generation contexts this is always the same, so we
> > could end up with a very large number of blocks which means lots of
> > small mallocs which could end up slow.  Tomas then posted a series of
> > patches to address this.
> > 
> > I then posted another patch that has the planner make a choice on the
> > size of the blocks to use for the generation context based on the
> > tuple width and number of tuple estimates. My hope there was to
> > improve performance further by making a better choice for how large to
> > make the blocks in the first place.  This reduces the need for Tomas'
> > patch, but does not eliminate the need for it.
> > 
> > As of now, I still believe we'll need Tomas' patches to allow the
> > block size to grow up to a maximum size.  I think those patches are
> > likely needed before we think about making tuplesort use generation
> > contexts.  The reason I believe this is that we don't always have good
> > estimates for the number of tuples we're going to sort.  nodeSort.c is
> > fairly easy as it just fetches tuples once and then sorts them.  The
> > use of tuplesort for nodeIncrementalSort.c is much more complex as
> > we'll likely be performing a series of smaller sorts which are much
> > harder to get size estimates for. This means there's likely no magic
> > block size that we can use for the generation context that's used to
> > store the tuples in tuplesort.  This is also the case for order by
> > aggregates and probably most other usages of tuplesort.
> 
> You left out the discussion I started about glibc's malloc specific
> behaviour.
> 
> I tried to demonstrate that a big part of the performance gain you were
> seeing were not in fact due to our allocator by itself, but by the way
> different block sizes allocations interact with this malloc implementation
> and how it handles releasing memory back to the system. I also tried to
> show that we can give DBAs more control over how much memory to "waste" as
> the price for faster allocations.
> 
> I agree that the generation context memory manager is useful in the
> tuplesort context, if only for the fact that we fall back to disk less
> quickly due to the reduced wastage of memory, but I would be wary of
> drawing conclusions too quickly based on a specific malloc implementation
> which shows threshold effects, which are compounded by our growing block
> sizes code in general.
> 
> I have on my TODO list to run the same kind of benchmarks using jemalloc, to
> better isolate the performance gains expected from the generation allocator
> itself from the side effect of avoiding glibc's pattern of releasing memory
> back to the kernel too quickly.
> 

I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and 
jemalloc, with the standard aset memory manager or with David's generation v2 
patch.  To use jemalloc, I just use a LD_PRELOAD env variable before starting 
postgres. 
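
For anyone wanting to reproduce this, the invocation is roughly as follows
(the jemalloc library path is just an example and varies by distribution):

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
    pg_ctl -D "$PGDATA" -l logfile start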

I have not tried to measure jemalloc's memory footprint like I did for glibc's 
malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a 
glibc's malloc replacement. 

Please find the results attached. As I suspected, most of the gain observed 
with David's patch comes from working around glibc's malloc idiosyncrasies, but 
the good news is that it also helps (to a lesser degree) with jemalloc.

I'm not sure how that would behave with growing block sizes though: as Tomas' 
patch and stress-testing benchmarks showed, we can hit some particular 
thresholds (and regressions) in glibc's self-tuning behaviour.

-- 
Ronan Dunklau


Re: Use generation context to speed up tuplesorts

2022-01-19 Thread Ronan Dunklau
On Thursday 20 January 2022 at 02:31:15 CET, David Rowley wrote:
> On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud  wrote:
> > I'm also a bit confused about which patch(es) should actually be reviewed
> > in that thread.  My understanding is that the original patch might not be
> > relevant anymore but I might be wrong.  The CF entry should probably be
> > updated to clarify that, with an annotation/ title change / author update
> > or something.
> > 
> > In the meantime I will switch the entry to Waiting on Author.
> 
> I think what's going on here is a bit confusing. There's quite a few
> patches on here that aim to do very different things.
> 
> The patch I originally proposed was just to make tuplesort use
> generation memory contexts. This helps speed up tuplesort because
> generation contexts allow palloc() to be faster than the standard
> allocator. Additionally, there's no power-of-2 wastage with generation
> contexts like there is with the standard allocator. This helps fit
> more tuples on fewer CPU cache lines.
> 
> I believe Andres then mentioned that the fixed block size for
> generation contexts might become a problem. With the standard
> allocator the actual size of the underlying malloc() grows up until a
> maximum size.  With generation contexts this is always the same, so we
> could end up with a very large number of blocks which means lots of
> small mallocs which could end up slow.  Tomas then posted a series of
> patches to address this.
> 
> I then posted another patch that has the planner make a choice on the
> size of the blocks to use for the generation context based on the
> tuple width and number of tuple estimates. My hope there was to
> improve performance further by making a better choice for how large to
> make the blocks in the first place.  This reduces the need for Tomas'
> patch, but does not eliminate the need for it.
> 
> As of now, I still believe we'll need Tomas' patches to allow the
> block size to grow up to a maximum size.  I think those patches are
> likely needed before we think about making tuplesort use generation
> contexts.  The reason I believe this is that we don't always have good
> estimates for the number of tuples we're going to sort.  nodeSort.c is
> fairly easy as it just fetches tuples once and then sorts them.  The
> use of tuplesort for nodeIncrementalSort.c is much more complex as
> we'll likely be performing a series of smaller sorts which are much
> harder to get size estimates for. This means there's likely no magic
> block size that we can use for the generation context that's used to
> store the tuples in tuplesort.  This is also the case for order by
> aggregates and probably most other usages of tuplesort.
> 

You left out the discussion I started about glibc's malloc specific behaviour.

I tried to demonstrate that a big part of the performance gain you were seeing
was not in fact due to our allocator by itself, but to the way different block
size allocations interact with this malloc implementation and how it handles 
releasing memory back to the system. I also tried to show that we can give 
DBAs more control over how much memory to "waste" as the price for faster 
allocations.

I agree that the generation context memory manager is useful in the tuplesort 
context, if only for the fact that we fall back to disk less quickly due to 
the reduced wastage of memory, but I would be wary of drawing conclusions too 
quickly based on a specific malloc implementation which shows threshold
effects, which are compounded by our growing block sizes code in general.

I have on my TODO list to run the same kind of benchmarks using jemalloc, to 
better isolate the performance gains expected from the generation allocator 
itself from the side effect of avoiding glibc's pattern of releasing memory 
back to the kernel too quickly. 

Julien, sorry for the confusion of adding that discussion and those patches 
to the same thread, but I don't really see a way of dissociating those two 
aspects.

-- 
Ronan Dunklau






Re: Use generation context to speed up tuplesorts

2022-01-19 Thread David Rowley
On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud  wrote:
> I'm also a bit confused about which patch(es) should actually be reviewed in
> that thread.  My understanding is that the original patch might not be 
> relevant
> anymore but I might be wrong.  The CF entry should probably be updated to
> clarify that, with an annotation/ title change / author update or something.
>
> In the meantime I will switch the entry to Waiting on Author.

I think what's going on here is a bit confusing. There's quite a few
patches on here that aim to do very different things.

The patch I originally proposed was just to make tuplesort use
generation memory contexts. This helps speed up tuplesort because
generation contexts allow palloc() to be faster than the standard
allocator. Additionally, there's no power-of-2 wastage with generation
contexts like there is with the standard allocator. This helps fit
more tuples on fewer CPU cache lines.
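
To put a rough number on that wastage: the standard allocator rounds each
chunk up to the next power of two, so a 520-byte tuple, say, occupies a
1024-byte chunk (nearly half of it wasted), whereas a generation context
hands out just the requested 520 bytes plus a small chunk header.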

I believe Andres then mentioned that the fixed block size for
generation contexts might become a problem. With the standard
allocator the actual size of the underlying malloc() grows up until a
maximum size.  With generation contexts this is always the same, so we
could end up with a very large number of blocks which means lots of
small mallocs which could end up slow.  Tomas then posted a series of
patches to address this.

I then posted another patch that has the planner make a choice on the
size of the blocks to use for the generation context based on the
tuple width and number of tuple estimates. My hope there was to
improve performance further by making a better choice for how large to
make the blocks in the first place.  This reduces the need for Tomas'
patch, but does not eliminate the need for it.

As of now, I still believe we'll need Tomas' patches to allow the
block size to grow up to a maximum size.  I think those patches are
likely needed before we think about making tuplesort use generation
contexts.  The reason I believe this is that we don't always have good
estimates for the number of tuples we're going to sort.  nodeSort.c is
fairly easy as it just fetches tuples once and then sorts them.  The
use of tuplesort for nodeIncrementalSort.c is much more complex as
we'll likely be performing a series of smaller sorts which are much
harder to get size estimates for. This means there's likely no magic
block size that we can use for the generation context that's used to
store the tuples in tuplesort.  This is also the case for order by
aggregates and probably most other usages of tuplesort.

David




Re: Use generation context to speed up tuplesorts

2022-01-17 Thread Julien Rouhaud
Hi,

On Fri, Jan 07, 2022 at 12:03:55PM +0100, Ronan Dunklau wrote:
> 
> (Sorry for trying to merge back the discussion on the two sides of the thread)
> 
> In  https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I 
> expressed the idea of being able to tune glibc's malloc behaviour. 
> 
> I implemented that (patch 0001) to provide a new hook which is called on 
> backend startup, and anytime we set work_mem. This hook is #defined
> depending on the malloc implementation: currently a default, no-op
> implementation is provided, as well as a glibc malloc implementation.

This patch apparently misses something to have the no-op on windows:

https://cirrus-ci.com/task/4978247640285184
[03:03:01.676] Build FAILED.
[03:03:01.676]
[03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(19,15): warning C4013: 'mallinfo2' undefined; assuming extern returning int [c:\cirrus\malloc_stats.vcxproj]
[03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(18,19): error C2079: 'm' uses undefined struct 'mallinfo2' [c:\cirrus\malloc_stats.vcxproj]
[03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.arena' must have struct/union type [c:\cirrus\malloc_stats.vcxproj]
[03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.hblkhd' must have struct/union type [c:\cirrus\malloc_stats.vcxproj]
[03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.uordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj]
[03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.fordblks' must have struct/union type [c:\cirrus\malloc_stats.vcxproj]
[03:03:01.676] c:\cirrus\contrib\malloc_stats\malloc_stats.c(32,1): error C2224: left of '.keepcost' must have struct/union type [c:\cirrus\malloc_stats.vcxproj]

I'm also a bit confused about which patch(es) should actually be reviewed in
that thread.  My understanding is that the original patch might not be relevant
anymore but I might be wrong.  The CF entry should probably be updated to
clarify that, with an annotation/ title change / author update or something.

In the meantime I will switch the entry to Waiting on Author.




Re: Use generation context to speed up tuplesorts

2022-01-07 Thread Ronan Dunklau
On Friday 7 January 2022 at 13:03:28 CET, Tomas Vondra wrote:
> On 1/7/22 12:03, Ronan Dunklau wrote:
> > On Friday 31 December 2021 at 22:26:37 CET, David Rowley wrote:
> >> I've attached some benchmark results that I took recently.  The
> >> spreadsheet contains results from 3 versions. master, master + 0001 -
> >> 0002, then master + 0001 - 0003.  The 0003 patch makes the code a bit
> >> more conservative about the chunk sizes it allocates and also tries to
> >> allocate the tuple array according to the number of tuples we expect
> >> to be able to sort in a single batch for when the sort is not
> >> estimated to fit inside work_mem.
> > 
> > (Sorry for trying to merge back the discussion on the two sides of the
> > thread)
> > 
> > In  https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan,
> > I expressed the idea of being able to tune glibc's malloc behaviour.
> > 
> > I implemented that (patch 0001) to provide a new hook which is called on
> > backend startup, and anytime we set work_mem. This hook is #defined
> > depending on the malloc implementation: currently a default, no-op
> > implementation is provided, as well as a glibc malloc implementation.
> 
> Not sure I'd call this a hook - that usually means a way to plug in
> custom code through a callback, and this is simply ifdefing a block of
> code to pick the right implementation. Which may be a good way to do
> that, just let's not call that a hook.
> 
> There's a commented-out MallocTuneHook() call, probably not needed.

Ok, I'll clean that up if we decide to proceed with this.

> 
> I wonder if #ifdefing is sufficient solution, because it happens at
> compile time, so what if someone overrides the allocator in LD_PRELOAD?
> That was a fairly common way to use a custom allocator in an existing
> application. But I don't know how many people do that with Postgres (I'm
> not aware of anyone doing that) or if we support that (it'd probably
> apply to other stuff too, not just malloc). So maybe it's OK, and I
> can't think of a better way anyway.

I couldn't think of a better way either, maybe there is something to be done 
with trying to dlsym something specific to glibc's malloc implementation?


> 
> > The glibc malloc implementation relies on a new GUC,
> > glibc_malloc_max_trim_threshold. When set to its default value of -1, we
> > don't tune malloc at all, exactly as in HEAD. If a different value is
> > provided, we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD
> > to this value, capped by work_mem / 2 and work_mem respectively.
> > 
> > The net result is that we can then keep more unused memory at the
> > top of the heap, and use mmap less frequently, if the DBA chooses to.
> > A possible other use case would be, on the contrary, to limit the
> > allocated memory in idle backends to a minimum.
> > 
> > The reasoning behind this is that glibc's malloc default way of handling
> > those two thresholds is to adapt to the size of the last freed mmaped
> > block.
> > 
> > I've run the same "up to 32 columns" benchmark as you did, with this new
> > patch applied on top of both HEAD and your v2 patchset incorporating
> > planner estimates for the block sizes. Those are called "aset" and
> > "generation" in the attached spreadsheet. For each, I've run it with
> > glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case
> > 
> > I've measured two things:
> >   - query latency, as reported by pgbench
> >   - total memory allocated by malloc at backend exit after running each
> >   query three times. This represents the "idle" memory consumption, and
> > thus what we waste in malloc instead of releasing back to the system. This
> > measurement has been performed using the very small module presented in
> > patch 0002. Please note that I in no way propose that we include this
> > module, it was just a convenient way for me to measure memory footprint.
> > 
> > My conclusion is that the impressive gains you see from using the
> > generation context with bigger blocks mostly come from the fact that we
> > allocate bigger blocks, and that this moves the mmap thresholds
> > accordingly. I wonder how much of a difference it would make on other
> > malloc implementations: I'm afraid the optimisation presented here would
> > in fact be specific to glibc's malloc, since we have almost the same
> > gains with both allocators when tuning malloc to keep more memory. I
> > still think both approaches are useful, and would be necessary.
> Interesting measurements. It's intriguing that for generation contexts,
> the default "-1" often outperforms "1MB" (but not the other options),
> while for aset it's pretty much "the higher value the better".

For generation contexts with "big block sizes" this result is expected, as the 
malloc dynamic tuning will adapt to the big block size. This can also be seen 
on the "idle memory" measurement: the memory consumption is identical to the 
64MB value 

Re: Use generation context to speed up tuplesorts

2022-01-07 Thread Tomas Vondra

On 1/7/22 12:03, Ronan Dunklau wrote:

On Friday 31 December 2021 at 22:26:37 CET, David Rowley wrote:

I've attached some benchmark results that I took recently.  The
spreadsheet contains results from 3 versions. master, master + 0001 -
0002, then master + 0001 - 0003.  The 0003 patch makes the code a bit
more conservative about the chunk sizes it allocates and also tries to
allocate the tuple array according to the number of tuples we expect
to be able to sort in a single batch for when the sort is not
estimated to fit inside work_mem.


(Sorry for trying to merge back the discussion on the two sides of the thread)

In  https://www.postgresql.org/message-id/4776839.iZASKD2KPV%40aivenronan, I
expressed the idea of being able to tune glibc's malloc behaviour.

I implemented that (patch 0001) to provide a new hook which is called on
backend startup, and anytime we set work_mem. This hook is #defined depending
on the malloc implementation: currently a default, no-op implementation is
provided, as well as a glibc malloc implementation.



Not sure I'd call this a hook - that usually means a way to plug in 
custom code through a callback, and this is simply ifdefing a block of 
code to pick the right implementation. Which may be a good way to do 
that, just let's not call that a hook.


There's a commented-out MallocTuneHook() call, probably not needed.

I wonder if #ifdefing is sufficient solution, because it happens at 
compile time, so what if someone overrides the allocator in LD_PRELOAD? 
That was a fairly common way to use a custom allocator in an existing 
application. But I don't know how many people do that with Postgres (I'm 
not aware of anyone doing that) or if we support that (it'd probably 
apply to other stuff too, not just malloc). So maybe it's OK, and I 
can't think of a better way anyway.



The glibc malloc implementation relies on a new GUC,
glibc_malloc_max_trim_threshold. When set to its default value of -1, we
don't tune malloc at all, exactly as in HEAD. If a different value is provided,
we set M_MMAP_THRESHOLD to half this value, and M_TRIM_THRESHOLD to this value,
capped by work_mem / 2 and work_mem respectively.

The net result is that we can keep more unused memory at the top of the heap,
and use mmap less frequently, if the DBA chooses to. Another possible use case
would be, on the contrary, to limit the allocated memory in idle backends to a
minimum.

The reasoning behind this is that glibc's malloc, by default, handles those
two thresholds by adapting them to the size of the last freed mmap'ed block.
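
For reference, a minimal sketch of what such a hook could look like (an
illustration only, not the actual patch; it assumes work_mem and the GUC are
both expressed in kilobytes):

#include <malloc.h>

/*
 * Sketch: apply the glibc-specific tuning described above.
 * glibc_malloc_max_trim_threshold is assumed to be -1 ("leave
 * malloc alone") or a size in kB.
 */
static void
MallocTuneHook(void)
{
	if (glibc_malloc_max_trim_threshold >= 0)
	{
		/* cap by work_mem so idle backends don't hoard memory */
		long	trim = Min((long) glibc_malloc_max_trim_threshold,
						   (long) work_mem) * 1024L;

		mallopt(M_TRIM_THRESHOLD, (int) trim);
		mallopt(M_MMAP_THRESHOLD, (int) (trim / 2));
	}
}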

I've run the same "up to 32 columns" benchmark as you did, with this new patch
applied on top of both HEAD and your v2 patchset incorporating planner
estimates for the block sizes. Those are called "aset" and "generation" in the
attached spreadsheet. For each, I've run it with
glibc_malloc_max_trim_threshold set to -1, 1MB, 4MB and 64MB. In each case
I've measured two things:
  - query latency, as reported by pgbench
total memory allocated by malloc at backend exit after running each query
three times. This represents the "idle" memory consumption, and thus what we
waste in malloc instead of releasing back to the system. This measurement has
been performed using the very small module presented in patch 0002. Please
note that I in no way propose that we include this module, it was just a
convenient way for me to measure memory footprint.

My conclusion is that the impressive gains you see from using the generation
context with bigger blocks mostly comes from the fact that we allocate bigger
blocks, and that this moves the mmap thresholds accordingly. I wonder how much
of a difference it would make on other malloc implementation: I'm afraid the
optimisation presented here would in fact be specific to glibc's malloc, since
we have almost the same gains with both allocators when tuning malloc to keep
more memory. I still think both approaches are useful, and would be necessary.



Interesting measurements. It's intriguing that for generation contexts, 
the default "-1" often outperforms "1MB" (but not the other options), 
while for aset it's pretty much "the higher value the better".



Since this affects all memory allocations, I need to come up with other
meaningful scenarios to benchmark.



OK. Are you thinking about a different microbenchmark, or something 
closer to real workload?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-12-17 Thread Ronan Dunklau
Le vendredi 17 décembre 2021, 14:39:10 CET Tomas Vondra a écrit :
> I wasn't really suggesting to investigate those other allocators in this
> patch - it seems like a task requiring a pretty significant amount of
> work/time. My point was that we should make it reasonably easy to add
> tweaks for those other environments, if someone is interested enough to
> do the legwork.
> 
> >> 2) In fact, I wonder if different glibc versions behave differently?
> >> Hopefully it's not changing that much, though. Ditto kernel versions,
> >> but the mmap/sbrk interface is likely more stable. We can test this.
> > 
> > That could be tested, yes. As a matter of fact, a commit removing the
> > upper
> > limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to
> > glibc,
> > which means we can service much bigger allocation without mmap.
> 
> Yeah, I noticed that commit too. Most systems stick to one glibc
> version, so it'll take time to reach most systems. Let's continue with
> just one glibc version and then maybe test other versions.

Yes, I also need to figure out how to detect we're using glibc as I'm not very 
familiar with configure. 

> 
> >> 3) If we bump the thresholds, won't that work against reusing the
> >> memory? I mean, if we free a whole block (from any allocator we have),
> >> glibc might return it to kernel, depending on mmap threshold value. It's
> >> not guaranteed, but increasing the malloc thresholds will make that even
> >> less likely. So we might just as well increase the minimum block size,
> >> with about the same effect, no?
> > 
> > It is my understanding that malloc will try to compact memory by moving it
> > around. So the memory should actually be released to the kernel at some
> > point. In the meantime, malloc can reuse it for our next invocation (which
> > can be in a different memory context on our side).
> > 
> > If we increase the minimum block size, this is memory we will actually
> > 
> > reserve, and it will not protect us against the ramping-up behaviour:
> >  - the first allocation of a big block may be over mmap_threshold, and
> >    serviced by an expensive mmap
> > 
> >  - when it's free, the threshold is doubled
> >  - next invocation is serviced by an sbrk call
> >  - freeing it will be above the trim threshold, and it will be returned.
> > 
> > After several "big" allocations, the thresholds will rise to their
> > maximum values (well, they used to; I need to check what happens with
> > that latest patch of glibc...)
> > 
> > This will typically happen several times as malloc doubles the threshold
> > each time. This is probably the reason quadrupling the block sizes was
> > more effective.
> 
> Hmmm, OK. Can we we benchmark the case with large initial block size, at
> least for comparison?

The benchmark I called "fixed" was with a fixed block size of 
ALLOCSET_DEFAULT_MAXSIZE (first proposed patch) and showed roughly the same 
performance profile as the growing blocks + malloc tuning.
But if I understand correctly, you implemented the growing-blocks logic after
concerns about wasting memory with a constant large block size. If we tune
malloc, that memory would not be wasted when we don't allocate it, just not
released as eagerly once it has been allocated.

Or do you want a benchmark with an even bigger initial block size ? With the 
growing blocks patch with a large initial size ? I can run either, I just want 
to understand what is interesting to you.

One thing that would be interesting would be to trace the total amount of 
memory allocated in the different cases. This is something I will need to do 
anyway when I propose that patch.
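
On a recent glibc (>= 2.33), that footprint can be sampled with mallinfo2();
a sketch of what such a measurement could look like:

#include <malloc.h>

/*
 * Sketch: report how much memory malloc currently holds, whether in
 * use or cached for reuse (sbrk arena plus mmap'ed regions).
 */
static size_t
malloc_footprint(void)
{
	struct mallinfo2 mi = mallinfo2();

	return mi.arena + mi.hblkhd;
}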

Best regards,

-- 
Ronan Dunklau






Re: Use generation context to speed up tuplesorts

2021-12-17 Thread Tomas Vondra
On 12/17/21 09:08, Ronan Dunklau wrote:
> Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit :
>> On 12/16/21 17:03, Ronan Dunklau wrote:
>>> Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
 I will follow up with a benchmark of the test sorting a table with a
 width
 varying from 1 to 32 columns.
>>>
>>> So please find attached another benchmark for that case.
>>>
>>> The 3 different patchsets tested are:
>>>   - master
>>>   - fixed (David's original patch)
>>>   - adjust (Thomas growing blocks patch)
>>
>> Presumably Thomas is me, right?
> 
> I'm really sorry for this typo... Please accept my apologies. 
> 

Nah, no apology needed. I was just wondering if I missed some patch from
Thomas Munro ...

>>
>>> So it looks like tuning malloc for this would be very beneficial for any
>>> kind of allocation, and by doing so we reduce the problems seen with the
>>> growing blocks patch to next to nothing, while keeping the ability to not
>>> allocate too much memory from the get go.
>>
>> Thanks for running those tests and investigating the glibc behavior! I
>> find those results very interesting. My conclusions from this is that
>> the interaction between "our" allocator and the allocator in
>> malloc (e.g. glibc) can be problematic. Which makes benchmarking and
>> optimization somewhat tricky because code changes may trigger behavior
>> change in glibc (or whatever allocator backs malloc).
>>
>> I think it's worth exploring if we can tune this in a reasonable way,
>> but I have a couple concerns related to that:
>>
>> 1) I wonder how glibc-specific this is - I'd bet it applies to other
>> allocators (either on another OS or just different allocator on Linux)
>> too. Tweaking glibc parameters won't affect those systems, of course,
>> but maybe we should allow tweaking those systems too ...
> 
> I agree, finding their specific problems and seeing if we can work around
> them would
> be interesting. I suppose glibc's malloc is the most commonly used allocator 
> in production, as it is the default for most Linux distributions.
> 

I wasn't really suggesting to investigate those other allocators in this
patch - it seems like a task requiring a pretty significant amount of
work/time. My point was that we should make it reasonably easy to add
tweaks for those other environments, if someone is interested enough to
do the legwork.

>>
>> 2) In fact, I wonder if different glibc versions behave differently?
>> Hopefully it's not changing that much, though. Ditto kernel versions,
>> but the mmap/sbrk interface is likely more stable. We can test this.
> 
> That could be tested, yes. As a matter of fact, a commit removing the upper 
> limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, 
> which means we can service much bigger allocation without mmap.
> 

Yeah, I noticed that commit too. Most systems stick to one glibc
version, so it'll take time to reach most systems. Let's continue with
just one glibc version and then maybe test other versions.

> 
>>
>> 3) If we bump the thresholds, won't that work against reusing the
>> memory? I mean, if we free a whole block (from any allocator we have),
>> glibc might return it to kernel, depending on mmap threshold value. It's
>> not guaranteed, but increasing the malloc thresholds will make that even
>> less likely. So we might just as well increase the minimum block size,
>> with about the same effect, no?
> 
> It is my understanding that malloc will try to compact memory by moving it 
> around. So the memory should actually be released to the kernel at some
> point. In the meantime, malloc can reuse it for our next invocation (which 
> can 
> be in a different memory context on our side). 
> 
> If we increase the minimum block size, this is memory we will actually 
> reserve, and it will not protect us against the ramping-up behaviour:
>  - the first allocation of a big block may be over mmap_threshold, and 
> serviced 
> by an expensive mmap
>  - when it's free, the threshold is doubled
>  - next invocation is serviced by an sbrk call
>  - freeing it will be above the trim threshold, and it will be returned.
> 
> After several "big" allocations, the thresholds will rise to their maximum
> values (well, they used to; I need to check what happens with that latest patch
> of glibc...)
>  
> This will typically happen several times as malloc doubles the threshold each 
> time. This is probably the reason quadrupling the block sizes was more 
> effective. 
> 

Hmmm, OK. Can we we benchmark the case with large initial block size, at
least for comparison?

> 
>>
>>> I would like to try to implement some dynamic glibc malloc tuning, if that
>>> is something we don't reject on principle from the get go.
>>
>> +1 to that
> 
> Ok, I'll work on a patch for this and submit a new thread. 
> 

Cool!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-12-17 Thread Ronan Dunklau
Le vendredi 17 décembre 2021, 09:08:06 CET Ronan Dunklau a écrit :
> It is my understanding that malloc will try to compact memory by moving it
> around. So the memory should actually be released to the kernel at some
> point. In the meantime, malloc can reuse it for our next invocation (which
> can be in a different memory context on our side).

I've been told off-list this comment wasn't clear: I meant that it compacts 
*free* memory, consolidating it into larger blocks that will eventually reach
the threshold and be released.

-- 
Ronan Dunklau






Re: Use generation context to speed up tuplesorts

2021-12-17 Thread Ronan Dunklau
Le jeudi 16 décembre 2021, 18:00:56 CET Tomas Vondra a écrit :
> On 12/16/21 17:03, Ronan Dunklau wrote:
> > Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
> >> I will follow up with a benchmark of the test sorting a table with a
> >> width
> >> varying from 1 to 32 columns.
> > 
> > So please find attached another benchmark for that case.
> > 
> > The 3 different patchsets tested are:
> >   - master
> >   - fixed (David's original patch)
> >   - adjust (Thomas growing blocks patch)
> 
> Presumably Thomas is me, right?

I'm really sorry for this typo... Please accept my apologies. 

> 
> > So it looks like tuning malloc for this would be very beneficial for any
> > kind of allocation, and by doing so we reduce the problems seen with the
> > growing blocks patch to next to nothing, while keeping the ability to not
> > allocate too much memory from the get go.
> 
> Thanks for running those tests and investigating the glibc behavior! I
> find those results very interesting. My conclusions from this is that
> the interaction between "our" allocator and the allocator in
> malloc (e.g. glibc) can be problematic. Which makes benchmarking and
> optimization somewhat tricky because code changes may trigger behavior
> change in glibc (or whatever allocator backs malloc).
> 
> I think it's worth exploring if we can tune this in a reasonable way,
> but I have a couple concerns related to that:
> 
> 1) I wonder how glibc-specific this is - I'd bet it applies to other
> allocators (either on another OS or just different allocator on Linux)
> too. Tweaking glibc parameters won't affect those systems, of course,
> but maybe we should allow tweaking those systems too ...

I agree, finding their specific problems and seeing if we can work around them would
be interesting. I suppose glibc's malloc is the most commonly used allocator 
in production, as it is the default for most Linux distributions.

> 
> 2) In fact, I wonder if different glibc versions behave differently?
> Hopefully it's not changing that much, though. Ditto kernel versions,
> but the mmap/sbrk interface is likely more stable. We can test this.

That could be tested, yes. As a matter of fact, a commit removing the upper 
limit for MALLOC_MMAP_THRESHOLD has just been committed yesterday to glibc, 
which means we can service much bigger allocation without mmap.


> 
> 3) If we bump the thresholds, won't that work against reusing the
> memory? I mean, if we free a whole block (from any allocator we have),
> glibc might return it to kernel, depending on mmap threshold value. It's
> not guaranteed, but increasing the malloc thresholds will make that even
> less likely. So we might just as well increase the minimum block size,
> with about the same effect, no?

It is my understanding that malloc will try to compact memory by moving it 
around. So the memory should actually be released to the kernel at some
point. In the meantime, malloc can reuse it for our next invocation (which can 
be in a different memory context on our side). 

If we increase the minimum block size, this is memory we will actually 
reserve, and it will not protect us against the ramping-up behaviour:
 - the first allocation of a big block may be over mmap_threshold, and serviced 
by an expensive mmap
 - when it's free, the threshold is doubled
 - next invocation is serviced by an sbrk call
 - freeing it will be above the trim threshold, and it will be returned.

After several "big" allocations, the thresholds will rise to their maximum
values (well, they used to; I need to check what happens with that latest patch
of glibc...)
 
This will typically happen several times as malloc doubles the threshold each 
time. This is probably the reason quadrupling the block sizes was more 
effective. 
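
In pseudo-C, the dynamic adjustment glibc applies when an mmap'ed chunk is
freed looks roughly like this (simplified from glibc's malloc/malloc.c, not
the verbatim code):

/* on free() of an mmap'ed chunk of size sz: */
if (sz > mmap_threshold && sz <= DEFAULT_MMAP_THRESHOLD_MAX)
{
	mmap_threshold = sz;					/* next block this big uses sbrk */
	trim_threshold = 2 * mmap_threshold;	/* and isn't trimmed when freed */
}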


> 
> > I would like to try to implement some dynamic glibc malloc tuning, if that
> > is something we don't reject on principle from the get go.
> 
> +1 to that

Ok, I'll work on a patch for this and submit a new thread. 


-- 
Ronan Dunklau






Re: Use generation context to speed up tuplesorts

2021-12-16 Thread Tomas Vondra

On 12/16/21 17:03, Ronan Dunklau wrote:

Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :

I will follow up with a benchmark of the test sorting a table with a width
varying from 1 to 32 columns.



So please find attached another benchmark for that case.

The 3 different patchsets tested are:

  - master
  - fixed (David's original patch)
  - adjust (Thomas growing blocks patch)



Presumably Thomas is me, right?


So it looks like tuning malloc for this would be very beneficial for any kind
of allocation, and by doing so we reduce the problems seen with the growing
blocks patch to next to nothing, while keeping the ability to not allocate too
much memory from the get go.



Thanks for running those tests and investigating the glibc behavior! I 
find those results very interesting. My conclusions from this is that 
the interaction between "our" allocator and the allocator in
malloc (e.g. glibc) can be problematic. Which makes benchmarking and 
optimization somewhat tricky because code changes may trigger behavior 
change in glibc (or whatever allocator backs malloc).


I think it's worth exploring if we can tune this in a reasonable way, 
but I have a couple concerns related to that:


1) I wonder how glibc-specific this is - I'd bet it applies to other 
allocators (either on another OS or just different allocator on Linux) 
too. Tweaking glibc parameters won't affect those systems, of course, 
but maybe we should allow tweaking those systems too ...


2) In fact, I wonder if different glibc versions behave differently? 
Hopefully it's not changing that much, though. Ditto kernel versions, 
but the mmap/sbrk interface is likely more stable. We can test this.


3) If we bump the thresholds, won't that work against reusing the 
memory? I mean, if we free a whole block (from any allocator we have), 
glibc might return it to kernel, depending on mmap threshold value. It's 
not guaranteed, but increasing the malloc thresholds will make that even 
less likely. So we might just as well increase the minimum block size, 
with about the same effect, no?




I would like to try to implement some dynamic glibc malloc tuning, if that is
something we don't reject on principle from the get go.



+1 to that


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-12-16 Thread Ronan Dunklau
Le jeudi 16 décembre 2021, 11:56:15 CET Ronan Dunklau a écrit :
> I will follow up with a benchmark of the test sorting a table with a width
> varying from 1 to 32 columns.
> 

So please find attached another benchmark for that case.

The 3 different patchsets tested are:

 - master
 - fixed (David's original patch)
 - adjust (Thomas growing blocks patch)

So it looks like tuning malloc for this would be very beneficial for any kind
of allocation, and by doing so we reduce the problems seen with the growing 
blocks patch to next to nothing, while keeping the ability to not allocate too 
much memory from the get go.

I would like to try to implement some dynamic glibc malloc tuning, if that is 
something we don't reject on principle from the get go. 


-- 
Ronan Dunklau

bench_results.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: Use generation context to speed up tuplesorts

2021-12-16 Thread Ronan Dunklau
Le mercredi 8 décembre 2021, 22:07:12 CET Tomas Vondra a écrit :
> On 12/8/21 16:51, Ronan Dunklau wrote:
> > Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
> >> And now comes the funny part - if I run it in the same backend as the
> >> 
> >> "full" benchmark, I get roughly the same results:
> >>   block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >>  
> >>  ++---+--+-
> >>  
> >>32768 |512 | 806256640 |37159 |   76669
> >> 
> >> but if I reconnect and run it in the new backend, I get this:
> >>   block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >>  
> >>  ++---+--+-
> >>  
> >>32768 |512 | 806158336 |   233909 |  100785
> >>  
> >>  (1 row)
> >> 
> >> It does not matter if I wait a bit before running the query, if I run it
> >> repeatedly, etc. The machine is not doing anything else, the CPU is set
> >> to use "performance" governor, etc.
> > 
> > I've reproduced the behaviour you mention.
> > I also noticed asm_exc_page_fault showing up in the perf report in that
> > case.
> > 
> > Running an strace on it shows that in one case, we have a lot of brk
> > calls,
> > while when we run in the same process as the previous tests, we don't.
> > 
> > My suspicion is that the previous workload makes glibc malloc change its
> > trim_threshold and possibly other dynamic options, which leads to
> > constantly moving the brk pointer in one case and not the other.
> > 
> > Running your fifo test with absurd malloc options shows that this might
> > indeed be the case (I needed to change several, because changing one
> > disables the dynamic adjustment for every single one of them, and malloc
> > would fall back to using mmap and freeing it on each iteration):
> > 
> > mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
> > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
> > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));
> > 
> > I get the following results for your self contained test. I ran the query
> > twice, in each case, seeing the importance of the first allocation and the
> > subsequent ones:
> > 
> > With default malloc options:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   300156 |  207557
> > 
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   211942 |   77207
> > 
> > With the oversized values above:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   219000 |   36223
> > 
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |    75761 |   78082
> > 
> > (1 row)
> > 
> > I can't tell how representative your benchmark extension would be of real
> > life allocation / free patterns, but there is probably something we can
> > improve here.
> 
> Thanks for looking at this. I think those allocation / free patterns are
> fairly extreme, and there probably are no workloads doing exactly this.
> The idea is that actual workloads are likely some combination of these
> extreme cases.
> 
> > I'll try to see if I can understand more precisely what is happening.
> 
> Thanks, that'd be helpful. Maybe we can learn something about tuning
> malloc parameters to get significantly better performance.
> 

Apologies for the long email, maybe what I will state here is obvious for 
others but I learnt a lot, so... 

I think I understood what the problem was in your generation tests: depending 
on the sequence of allocations we will raise a different maximum for 
mmap_threshold and trim_threshold. When an mmap'ed block is freed, malloc will
raise its threshold, as it considers freed memory to be better served by
regular moving of the sbrk pointer instead of using mmap to map memory.  This
threshold is raised by doubling it anytime we free an mmap'ed block.

At the same time, the trim_threshold is raised to double the mmap_threshold, 
considering  that this amount of memory should not be released to the OS 
because we have a good chance of reusing it. 

This can be demonstrated using the attached systemtap script, along with a 
patch adding new traces to generation_context for this purpose:
When running your query:

select block_size, chunk_size, x.* from
(values (512)) AS a(chunk_size),
(values (32768)) AS b(block_size),
 lateral generation_bench_fifo(100, block_size, chunk_size,
   

Re: Use generation context to speed up tuplesorts

2021-12-08 Thread Tomas Vondra
On 12/8/21 16:51, Ronan Dunklau wrote:
> Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
>> And now comes the funny part - if I run it in the same backend as the
>> "full" benchmark, I get roughly the same results:
>>
>>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>> ------------+------------+---------------+----------+---------
>>       32768 |        512 |     806256640 |    37159 |   76669
>>
>> but if I reconnect and run it in the new backend, I get this:
>>
>>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
>> ------------+------------+---------------+----------+---------
>>       32768 |        512 |     806158336 |   233909 |  100785
>>  (1 row)
>>
>> It does not matter if I wait a bit before running the query, if I run it
>> repeatedly, etc. The machine is not doing anything else, the CPU is set
>> to use "performance" governor, etc.
> 
> I've reproduced the behaviour you mention.
> I also noticed asm_exc_page_fault showing up in the perf report in that case.
> 
> Running an strace on it shows that in one case, we have a lot of brk calls, 
> while when we run in the same process as the previous tests, we don't.
> 
> My suspicion is that the previous workload makes glibc malloc change its
> trim_threshold and possibly other dynamic options, which leads to constantly 
> moving the brk pointer in one case and not the other.
> 
> Running your fifo test with absurd malloc options shows that this might
> indeed be the case (I needed to change several, because changing one
> disables the dynamic adjustment for every single one of them, and malloc
> would fall back to
> using mmap and freeing it on each iteration):
> 
> mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
> mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
> mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));
> 
> I get the following results for your self contained test. I ran the query 
> twice, in each case, seeing the importance of the first allocation and the 
> subsequent ones:
> 
> With default malloc options:
> 
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |   300156 |  207557
> 
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |   211942 |   77207
> 
> 
> With the oversized values above:
> 
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |   219000 |   36223
> 
> 
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     795836416 |    75761 |   78082
> (1 row)
> 
> I can't tell how representative your benchmark extension would be of real 
> life 
> allocation / free patterns, but there is probably something we can improve 
> here.
> 

Thanks for looking at this. I think those allocation / free patterns are
fairly extreme, and there probably are no workloads doing exactly this.
The idea is that actual workloads are likely some combination of these
extreme cases.

> I'll try to see if I can understand more precisely what is happening.
> 

Thanks, that'd be helpful. Maybe we can learn something about tuning
malloc parameters to get significantly better performance.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-12-08 Thread Ronan Dunklau
Le jeudi 9 septembre 2021, 15:37:59 CET Tomas Vondra a écrit :
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
> 
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806256640 |    37159 |   76669
> 
> but if I reconnect and run it in the new backend, I get this:
> 
>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
>       32768 |        512 |     806158336 |   233909 |  100785
>  (1 row)
> 
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention.
I also noticed asm_exc_page_fault showing up in the perf report in that case.

Running an strace on it shows that in one case, we have a lot of brk calls, 
while when we run in the same process as the previous tests, we don't.

My suspicion is that the previous workload makes glibc malloc change its
trim_threshold and possibly other dynamic options, which leads to constantly 
moving the brk pointer in one case and not the other.

Running your fifo test with absurd malloc options shows that this might indeed
be the case (I needed to change several, because changing one disables the
dynamic adjustment for every single one of them, and malloc would fall back to
using mmap and freeing it on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));

I get the following results for your self contained test. I ran the query 
twice, in each case, seeing the importance of the first allocation and the 
subsequent ones:

With default malloc options:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   300156 |  207557

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   211942 |   77207


With the oversized values above:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |   219000 |   36223


 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     795836416 |    75761 |   78082
(1 row)

I can't tell how representative your benchmark extension would be of real life 
allocation / free patterns, but there is probably something we can improve 
here.

I'll try to see if I can understand more precisely what is happening.

-- 
Ronan Dunklau






Re: Use generation context to speed up tuplesorts

2021-08-08 Thread David Rowley
On Mon, 9 Aug 2021 at 00:38, Tomas Vondra  wrote:
> I'm not sure quadrupling the size is a good idea, though, because it
> increases the amount of memory we might be wasting. With the doubling,
> the amount of wasted/unused memory is limited to ~50%, because the next
> block is (roughly) equal to sum of already allocated blocks, so
> allocating just 1B on it leaves us with 50%. But quadrupling the size
> means we'll end up with ~75% free space. Of course, this is capped by
> the maximum block size etc. but still ...

Yeah, not sure what is best. It does however seem likely that the
majority of the performance improvement that I saw is due to either
malloc()/free() calls or just having fewer blocks in the context.

Maybe it's worth getting the planner on board with deciding how to do
the allocations.  It feels a bit overcautious to go allocating blocks
in each power of two starting at 8192 bytes when doing a 1GB sort.
Maybe we should be looking towards doing something more like making
the init allocation size more like pg_prevpower2_64(Min(work_mem *
1024L, sort_tuples * tuple_width)), or maybe half or quarter that.

It would certainly not be the only executor node to allocate memory
based on what the planner thought. Just look at ExecHashTableCreate().
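
A sketch of that heuristic, assuming the planner's row and width estimates
are passed down to tuplesort (pg_prevpower2_64() is from pg_bitutils.h; the
variable names are illustrative):

	/* cap the initial block size by work_mem, then round down to a power of 2 */
	Size	est_size = (Size) sort_tuples * tuple_width;
	Size	mem_limit = (Size) work_mem * 1024L;
	Size	init_block = pg_prevpower2_64(Min(mem_limit, est_size));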

David




Re: Use generation context to speed up tuplesorts

2021-08-08 Thread David Rowley
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra  wrote:
> I did run the same set of benchmarks as for Slab, measuring some usual
> allocation patterns. The results for i5-2500k machine are attached (for
> the xeon it's almost exactly the same behavior). While running those
> tests I realized the last patch is wrong and sets allocChunkLimit=1,
> which is bogus and causes significant regression. So here's an updated
> version of the patch series too.

I know you're not done with these yet, but FWIW, I was getting an
Assert failure with these patches on:

Assert(total_allocated == context->mem_allocated);

It seems to be because you've forgotten to ignore keeper blocks when
adjusting context->mem_allocated in GenerationReset()
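
Presumably something along these lines in GenerationReset() (a hypothetical
sketch; field names approximate):

	dlist_mutable_iter miter;

	dlist_foreach_modify(miter, &set->blocks)
	{
		GenerationBlock *block = dlist_container(GenerationBlock, node, miter.cur);

		if (block == set->keeper)
			continue;			/* keeper block is reused, not freed, so
								 * leave mem_allocated alone for it */

		context->mem_allocated -= block->blkSize;
		dlist_delete(miter.cur);
		free(block);
	}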

David




Re: Use generation context to speed up tuplesorts

2021-08-08 Thread Tomas Vondra

On 8/8/21 12:11 PM, David Rowley wrote:

On Sat, 7 Aug 2021 at 12:10, Tomas Vondra  wrote:

All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.



I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows
regression in my results. Could you try with just 0001-0003 applied?


I tried the quadrupling of the buffer instead of doubling it each time
and got the attached.  Column E, or green in the graphs show the
results of that. It's now much closer to the original patch which just
made the block size 8MB.



Interesting, I wouldn't have expected that to make such difference.

I'm not sure quadrupling the size is a good idea, though, because it 
increases the amount of memory we might be wasting. With the doubling, 
the amount of wasted/unused memory is limited to ~50%, because the next
block is (roughly) equal to sum of already allocated blocks, so 
allocating just 1B on it leaves us with 50%. But quadrupling the size 
means we'll end up with ~75% free space. Of course, this is capped by 
the maximum block size etc. but still ...



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-08-08 Thread Tomas Vondra




On 8/8/21 9:02 AM, David Rowley wrote:

On Sat, 7 Aug 2021 at 12:10, Tomas Vondra  wrote:


On 8/6/21 3:07 PM, David Rowley wrote:

All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.



I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows
regression in my results. Could you try with just 0001-0003 applied?


But 0004 only changes the logic which controls the threshold of when
we allocate an oversized chunk.  It looks like the threshold is 512KB
with the 0004 patch.  My test is only doing a maximum allocation of
296 bytes so will never allocate an oversized chunk.

Can you explain why you think 0004 would cause performance regressions?



It's based solely on results of my benchmarks, where this patch seems to 
cause performance regression. I agree it's a bit bizarre, considering
what the patch does.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-08-08 Thread David Rowley
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra  wrote:
> > All of the tests show that the patches to improve the allocation
> > efficiency of generation.c don't help to improve the results of the
> > test cases. I wondered if it's maybe worth trying to see what happens
> > if instead of doubling the allocations each time, quadruple them
> > instead. I didn't try this.
> >
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

I tried the quadrupling of the buffer instead of doubling it each time
and got the attached.  Column E, or green in the graphs show the
results of that. It's now much closer to the original patch which just
made the block size 8MB.

David


generation context tuplesort_v2.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: Use generation context to speed up tuplesorts

2021-08-08 Thread David Rowley
On Sat, 7 Aug 2021 at 12:10, Tomas Vondra  wrote:
>
> On 8/6/21 3:07 PM, David Rowley wrote:
> > All of the tests show that the patches to improve the allocation
> > efficiency of generation.c don't help to improve the results of the
> > test cases. I wondered if it's maybe worth trying to see what happens
> > if instead of doubling the allocations each time, quadruple them
> > instead. I didn't try this.
> >
>
> I doubt quadrupling the allocations will help very much, but I suspect
> the problem might be in the 0004 patch - at least that's what shows
> regression in my results. Could you try with just 0001-0003 applied?

But 0004 only changes the logic which controls the threshold of when
we allocate an oversized chunk.  It looks like the threshold is 512KB
with the 0004 patch.  My test is only doing a maximum allocation of
296 bytes so will never allocate an oversized chunk.

Can you explain why you think 0004 would cause performance regressions?

David




Re: Use generation context to speed up tuplesorts

2021-08-06 Thread Tomas Vondra

On 8/6/21 3:07 PM, David Rowley wrote:

On Wed, 4 Aug 2021 at 02:10, Tomas Vondra  wrote:

A review would be nice, although it can wait - It'd be interesting to
know if those patches help with the workload(s) you've been looking at.


I tried out the v2 set of patches using the attached scripts.  The
attached spreadsheet includes the original tests and compares master
with the patch which uses the generation context vs that patch plus
your v2 patch.

I've also included 4 additional tests, each of which starts with a 1
column table and then adds another 32 columns testing the performance
after adding each additional column. I did this because I wanted to
see if the performance was more similar to master when the allocations
had less power of 2 wastage from allocset. If, for example, you look
at row 123 of the spreadsheet you can see that, both patched and unpatched,
the allocations were 272 bytes each, yet there was still a 50%
performance improvement with just the generation context patch when
compared to master.

Looking at the spreadsheet, you'll also notice that in the 2 column
test of each of the 4 new tests the number of bytes used for each
allocation is larger with the generation context. 56 vs 48.  This is
due to the GenerationChunk struct size being larger than the Allocset's
version by 8 bytes.  This is because it also holds the
GenerationBlock.  So with the patch there are some cases where we'll
use slightly more memory.

Additional tests:

1. Sort 1 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (1 million tuple test),
the gains reduce. This indicates the gains are coming from improved
CPU cache efficiency from the removal of the power of 2 wastage in
memory allocations.
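
For example, aset.c rounds a 296-byte request up to the next power of two,
512 bytes, so roughly 42% of each chunk is padding that still has to travel
through the CPU caches; generation.c hands out (approximately) just the 296
bytes plus its chunk header.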

All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.



Thanks for the scripts and the spreadsheet with results.

I doubt quadrupling the allocations will help very much, but I suspect
the problem might be in the 0004 patch - at least that's what shows 
regression in my results. Could you try with just 0001-0003 applied?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-08-06 Thread David Rowley
On Wed, 4 Aug 2021 at 02:10, Tomas Vondra  wrote:
> A review would be nice, although it can wait - It'd be interesting to
> know if those patches help with the workload(s) you've been looking at.

I tried out the v2 set of patches using the attached scripts.  The
attached spreadsheet includes the original tests and compares master
with the patch which uses the generation context vs that patch plus
your v2 patch.

I've also included 4 additional tests, each of which starts with a 1
column table and then adds another 32 columns testing the performance
after adding each additional column. I did this because I wanted to
see if the performance was more similar to master when the allocations
had less power of 2 wastage from allocset. If, for example, you look
at row 123 of the spreadsheet you can see both patched and unpatched
the allocations were 272 bytes each yet there was still a 50%
performance improvement with just the generation context patch when
compared to master.

Looking at the spreadsheet, you'll also notice that in the 2 column
test of each of the 4 new tests the number of bytes used for each
allocation is larger with the generation context. 56 vs 48.  This is
due to the GenerationChunk struct size being larger than the Allocset's
version by 8 bytes.  This is because it also holds the
GenerationBlock.  So with the patch there are some cases where we'll
use slightly more memory.

Additional tests:

1. Sort 1 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to the client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (1 million tuple test),
the gains reduce. This indicates the gains are coming from improved
CPU cache efficiency from the removal of the power of 2 wastage in
memory allocations.

All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of the
test cases. I wondered if it's maybe worth trying to see what happens
if instead of doubling the allocations each time, quadruple them
instead. I didn't try this.

David
#!/bin/bash

sec=5
dbname=postgres
records=100
mod=100

psql -c "drop table if exists t" $dbname
psql -c "create table t (a bigint not null)" $dbname
psql -c "insert into t select x % $mod from generate_series(1,$records) x" 
$dbname
psql -c "vacuum freeze t" $dbname

for i in {1..32}
do
echo $i
psql -c "explain analyze select * from t order by a offset 10" 
$dbname | grep -E "Memory|Disk"
echo "select * from t order by a offset 10" > bench.sql
for loops in {1..3}
do
pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps
done

psql -c "alter table t add column c$i bigint" $dbname
psql -c "update t set c$i = a" $dbname
psql -c "vacuum full t" $dbname
psql -c "vacuum freeze t" $dbname
done



generation context tuplesort.ods
Description: application/vnd.oasis.opendocument.spreadsheet
#!/bin/bash

sec=5
dbname=postgres
records=1
mod=100

psql -c "drop table if exists t" $dbname
psql -c "create table t (a bigint not null)" $dbname
psql -c "insert into t select x % $mod from generate_series(1,$records) x" 
$dbname
psql -c "vacuum freeze t" $dbname

for i in {1..32}
do
echo $i
psql -c "explain analyze select * from t order by a offset 10" 
$dbname | grep -E "Memory|Disk"
echo "select * from t order by a offset 10" > bench.sql
for loops in {1..3}
do
pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps
done

psql -c "alter table t add column c$i bigint" $dbname
psql -c "update t set c$i = a" $dbname
psql -c "vacuum full t" $dbname
psql -c "vacuum freeze t" $dbname
done



Re: Use generation context to speed up tuplesorts

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-03 10:59:25 +1200, David Rowley wrote:
> On Sat, 31 Jul 2021 at 08:38, Andres Freund  wrote:
> > There is at least one path using tuplecontext that reaches code that
> > could end up freeing memory to a significant enough degree to care about
> > generation.c effectively not using that memory:
> > tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> > On a quick look I didn't find any expanded record user that frees
> > nontrivial amounts of memory, but I didn't look all that carefully.
> 
> I guess we could just use a normal context for datum sorts if we
> thought that might be a problem.

I think that's probably a cure worse than the disease. I suspect datum sorts
can benefit from the higher density quite a bit...


> I'm not too familiar with the expanded object code, but I'm struggling
> to imagine why anything would need to do a pfree in there. We just do
> EOH_get_flat_size() to determine how big to make the allocation then
> allocate some memory for EOH_flatten_into() to use to expand the
> object into.

I can see some scenarios with a bit more creative uses of expanded
objects. We've e.g. been talking about using EA to avoid repeated and partial
detoasting overhead and you might need to do some more toast fetches when
flattening. Toast fetches always allocate, and if the fetch were only for
later parts of the tuple, the fetched data would need to be freed.

It's probably fine to deal with this at a later time, and just leave a comment
somewhere.

It could be addressed by having a bump style allocator combined with having
freelists. It's not like the tuplesort.c case is actually interested in the
generational behaviour of generation.c (which makes freelists uninteresting),
it's just that generation.c is the densest allocator that we have right now...
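
For illustration, the core of a bump allocator is tiny (a sketch using
PostgreSQL's Size and MAXALIGN, not actual PostgreSQL code): there is no
per-chunk header, individual chunks cannot be pfree'd, and a reset just
rewinds a pointer.

typedef struct BumpBlock
{
	char	   *cur;			/* next free byte */
	char	   *end;			/* one past the end of the block */
} BumpBlock;

static void *
bump_alloc(BumpBlock *blk, Size size)
{
	void	   *p;

	size = MAXALIGN(size);
	if ((Size) (blk->end - blk->cur) < size)
		return NULL;			/* caller would grab a new block */
	p = blk->cur;
	blk->cur += size;
	return p;
}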

Greetings,

Andres Freund




Re: Use generation context to speed up tuplesorts

2021-08-02 Thread David Rowley
On Sat, 31 Jul 2021 at 08:38, Andres Freund  wrote:
> There is at least one path using tuplecontext that reaches code that
> could end up freeing memory to a significant enough degree to care about
> generation.c effectively not using that memory:
> tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> On a quick look I didn't find any expanded record user that frees
> nontrivial amounts of memory, but I didn't look all that carefully.

I guess we could just use a normal context for datum sorts if we
thought that might be a problem.

I'm not too familiar with the expanded object code, but I'm struggling
to imagine why anything would need to do a pfree in there. We just do
EOH_get_flat_size() to determine how big to make the allocation then
allocate some memory for EOH_flatten_into() to use to expand the
object into.
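
The pattern in datumCopy() for an expanded object is roughly (a sketch, error
handling omitted; the allocation lands in the sort's tuplecontext):

	ExpandedObjectHeader *eoh = DatumGetEOHP(value);
	Size		flat_size = EOH_get_flat_size(eoh);
	void	   *flat = palloc(flat_size);

	EOH_flatten_into(eoh, flat, flat_size);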

David




Re: Use generation context to speed up tuplesorts

2021-08-02 Thread David Rowley
On Sat, 31 Jul 2021 at 14:34, Tomas Vondra
 wrote:
> I spent a bit of time hacking on the Generation context, adding the two
> improvements discussed in this thread:
>
> 1) internal handling of block sizes, similar to what AllocSet does (it
> pretty much just copies parts of it)
>
> 2) keeper block (we keep one empty block instead of freeing it)
>
> 3) I've also added allocChunkLimit, which makes it look a bit more like
> AllocSet (instead of using just blockSize/8, which does not work too
> well with dynamic blockSize)
>
> I haven't done any extensive tests on it, but it does pass check-world
> with asserts etc. I haven't touched the comments, those need updating.
> regards

Thanks for starting work on that. I've only had a quick look, but I
can have a more detailed look once you've got it more complete.

For now it does not really look like the keeper block stuff is wired
up the same way as in aset.c. I'd expect you to be allocating that in
the same malloc as you're using to allocate the context struct itself
in GenerationContextCreate().

Also, likely as a result of the above, minContextSize does not seem to
be wired up to anything apart from an Assert().

David




Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Tomas Vondra

Hi,

I spent a bit of time hacking on the Generation context, adding the two 
improvements discussed in this thread:


1) internal handling of block sizes, similar to what AllocSet does (it 
pretty much just copies parts of it)


2) keeper block (we keep one empty block instead of freeing it)

3) I've also added allocChunkLimit, which makes it look a bit more like 
AllocSet (instead of using just blockSize/8, which does not work too 
well with dynamic blockSize)


I haven't done any extensive tests on it, but it does pass check-world 
with asserts etc. I haven't touched the comments, those need updating.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 6ca0ae674964157c1f2d2ae446cb415c9be509b6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 30 Jul 2021 23:53:52 +0200
Subject: [PATCH 1/3] Generation: grow blocks

---
 src/backend/access/gist/gistvacuum.c  |  2 +-
 .../replication/logical/reorderbuffer.c   |  2 +-
 src/backend/utils/mmgr/generation.c   | 53 ++-
 src/include/utils/memutils.h  |  4 +-
 4 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531..1818ed06fc 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -161,7 +161,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
 	  "GiST VACUUM page set context",
-	  16 * 1024);
+	  ALLOCSET_DEFAULT_SIZES);
 	oldctx = MemoryContextSwitchTo(vstate.page_set_context);
 	vstate.internal_page_set = intset_create();
 	vstate.empty_leaf_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f99e45f22d..89018dc99d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -336,7 +336,7 @@ ReorderBufferAllocate(void)
 
 	buffer->tup_context = GenerationContextCreate(new_ctx,
   "Tuples",
-  SLAB_LARGE_BLOCK_SIZE);
+  ALLOCSET_DEFAULT_SIZES);
 
 	hash_ctl.keysize = sizeof(TransactionId);
 	hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 584cd614da..771a2525ca 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -60,7 +60,9 @@ typedef struct GenerationContext
 	MemoryContextData header;	/* Standard memory-context fields */
 
 	/* Generational context parameters */
-	Size		blockSize;		/* standard block size */
+	Size		initBlockSize;	/* initial block size */
+	Size		maxBlockSize;	/* maximum block size */
+	Size		nextBlockSize;	/* next block size to allocate */
 
 	GenerationBlock *block;		/* current (most recently allocated) block */
 	dlist_head	blocks;			/* list of blocks */
@@ -196,7 +198,9 @@ static const MemoryContextMethods GenerationMethods = {
 MemoryContext
 GenerationContextCreate(MemoryContext parent,
 		const char *name,
-		Size blockSize)
+		Size minContextSize,
+		Size initBlockSize,
+		Size maxBlockSize)
 {
 	GenerationContext *set;
 
@@ -208,16 +212,20 @@ GenerationContextCreate(MemoryContext parent,
 	 "padding calculation in GenerationChunk is wrong");
 
 	/*
-	 * First, validate allocation parameters.  (If we're going to throw an
-	 * error, we should do so before the context is created, not after.)  We
-	 * somewhat arbitrarily enforce a minimum 1K block size, mostly because
-	 * that's what AllocSet does.
+	 * First, validate allocation parameters.  Once these were regular runtime
+	 * test and elog's, but in practice Asserts seem sufficient because nobody
+	 * varies their parameters at runtime.  We somewhat arbitrarily enforce a
+	 * minimum 1K block size.
 	 */
-	if (blockSize != MAXALIGN(blockSize) ||
-		blockSize < 1024 ||
-		!AllocHugeSizeIsValid(blockSize))
-		elog(ERROR, "invalid blockSize for memory context: %zu",
-			 blockSize);
+	Assert(initBlockSize == MAXALIGN(initBlockSize) &&
+		   initBlockSize >= 1024);
+	Assert(maxBlockSize == MAXALIGN(maxBlockSize) &&
+		   maxBlockSize >= initBlockSize &&
+		   AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
+	Assert(minContextSize == 0 ||
+		   (minContextSize == MAXALIGN(minContextSize) &&
+			minContextSize >= 1024 &&
+			minContextSize <= maxBlockSize));
 
 	/*
 	 * Allocate the context header.  Unlike aset.c, we never try to combine
@@ -242,7 +250,9 @@ GenerationContextCreate(MemoryContext parent,
 	 */
 
 	/* Fill in GenerationContext-specific header fields */
-	set->blockSize = blockSize;
+	set->initBlockSize = initBlockSize;
+	set->maxBlockSize = maxBlockSize;
+	set->nextBlockSize = initBlockSize;
 	set->block = NULL;
 	

Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Tomas Vondra

On 7/30/21 10:38 PM, Andres Freund wrote:

Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:

While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
sorts for single column sorts), I noticed that we use a separate
memory context to store tuple data and we just reset when we're done
if the sort fits in memory, or we dump the tuples to disk in the same
order they're added and reset the context when it does not.  There is
a little pfree() work going on via writetup_heap() which I think
possibly could just be removed to get some additional gains.

Anyway, this context that stores tuples uses the standard aset.c
allocator which has the usual power of 2 wastage and additional
overheads of freelists etc.  I wondered how much faster it would go if
I set it to use a generation context instead of an aset.c one.

After running make installcheck to make the tenk1 table, running the
attached tuplesortbench script, I get this:



So it seems to save quite a bit of memory getting away from rounding
up allocations to the next power of 2.

Performance-wise, there's some pretty good gains. (Results in TPS)


Very nice!



Yes, very nice. I wouldn't have expected such a significant difference, 
particularly in CPU usage. It's pretty interesting that it reduces both 
memory and CPU usage; I'd have guessed it would be one or the other.




I wonder if there's cases where generation.c would regress performance
over aset.c due to not having an initial / "keeper" block?



Not sure. I guess such a workload would need to allocate and free a single 
block (so very little memory) very often. I guess that's possible, but 
I'm not aware of a place doing that very often. Although, maybe decoding 
could do that for simple (serial) workload.


I'm not opposed to adding a keeper block to Generation, similarly to 
what was discussed for Slab not too long ago.





The patch is just a simple 1-liner at the moment.  I likely do need to
adjust what I'm passing as the blockSize to GenerationContextCreate().
   Maybe a better number would be something that's calculated from
work_mem, e.g Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64))
so that we just allocate at most a 16th of work_mem per chunk, but not
bigger than 8MB. I don't think changing this will affect the
performance of the above very much.


I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the size of too big blocks to
handle large amounts of memory well, just so the contexts don't need to
deal with variably sized blocks isn't a sensible tradeoff.



Well, back then it seemed like a sensible trade off to me, but I agree 
it may have negative consequences. I'm not opposed to revisiting this.



I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There's plenty cases where we'll just sort a handful of
tuples, and increasing the memory usage of those by a factor of 1024
isn't good. The Min() won't do any good if a larger work_mem is used.

Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size. Both because of the allocation overhead, but
importantly also because that will make context resets more expensive.



True.


To me this says that we should transplant aset.c's block size growing
into generation.c.



Yeah, maybe.



There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.



Not sure, I'm not familiar with EOH_flatten_into or expanded records. 
But I wonder if there's some sort of metric that we could track in 
Generation and use it to identify "interesting" places.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:
> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
> sorts for single column sorts), I noticed that we use a separate
> memory context to store tuple data and we just reset when we're done
> if the sort fits in memory, or we dump the tuples to disk in the same
> order they're added and reset the context when it does not.  There is
> a little pfree() work going on via writetup_heap() which I think
> possibly could just be removed to get some additional gains.
> 
> Anyway, this context that stores tuples uses the standard aset.c
> allocator which has the usual power of 2 wastage and additional
> overheads of freelists etc.  I wondered how much faster it would go if
> I set it to use a generation context instead of an aset.c one.
> 
> After running make installcheck to make the tenk1 table, running the
> attached tuplesortbench script, I get this:

> So it seems to save quite a bit of memory getting away from rounding
> up allocations to the next power of 2.
> 
> Performance-wise, there's some pretty good gains. (Results in TPS)

Very nice!


I wonder if there's cases where generation.c would regress performance
over aset.c due to not having an initial / "keeper" block?


> The patch is just a simple 1-liner at the moment.  I likely do need to
> adjust what I'm passing as the blockSize to GenerationContextCreate().
> Maybe a better number would be something that's calculated from
> work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
> so that we just allocate at most a 16th of work_mem per chunk, but not
> bigger than 8MB. I don't think changing this will affect the
> performance of the above very much.

I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the side of too big blocks to
handle large amounts of memory well, just so the contexts don't need to
deal with variably sized blocks isn't a sensible tradeoff.

I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There's plenty of cases where we'll just sort a handful of
tuples, and increasing the memory usage of those by a factor of 1024
isn't good. The Min() won't do any good if a larger work_mem is used.

Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size. Both because of the allocation overhead, but
importantly also because that will make context resets more expensive.

To me this says that we should transplant aset.c's block size growing
into generation.c.
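
For reference, the aset.c behaviour meant here is roughly this
(simplified from AllocSetAlloc()):

/* Simplified from aset.c: each newly malloc()ed block doubles the
 * previous block size, clamped to maxBlockSize, so small contexts stay
 * small while large ones need only a few mallocs. */
blksize = set->nextBlockSize;
set->nextBlockSize <<= 1;
if (set->nextBlockSize > set->maxBlockSize)
    set->nextBlockSize = set->maxBlockSize;
/* ... aset.c additionally grows blksize here if the request is larger ... */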


There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.
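
For context on why freeing matters here: pfree() into a generation
context does not make the space reusable. Paraphrasing generation.c's
GenerationFree():

/* Paraphrased sketch: a freed chunk is never handed out again; its
 * block only goes back to libc once every chunk on it has been freed. */
block->nfree += 1;
if (block->nfree == block->nchunks)
{
    dlist_delete(&block->node);
    if (set->block == block)
        set->block = NULL;      /* it was the block being filled */
    free(block);
}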


Greetings,

Andres Freund




Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Robert Haas
On Fri, Jul 30, 2021 at 2:42 AM David Rowley  wrote:
> Master:
>    Sort Method: quicksort  Memory: 5541kB
> Patched:
>    Sort Method: quicksort  Memory: 3197kB

Whoa.

> work_mem = '4GB';
> Test     master   gen sort   compare
> Test1    317.2    665.6      210%
> Test2    228.6    388.9      170%
> Test3    207.4    330.7      159%
> Test4    185.5    279.4      151%
> Test5    292.2    563.9      193%

Very impressive.

An early version of what eventually became DSA worked with
backend-local memory and I saw very substantial memory usage
improvements on large sorts, similar to what you show here. I am not
sure I saw the same CPU improvements, and in any case I abandoned the
idea of using that infrastructure to manage backend-local memory at
some point, since the whole thing had lots of problems that I didn't
know how to solve. What you've done here looks like a much more
promising approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Use generation context to speed up tuplesorts

2021-07-30 Thread David Rowley
Hackers,

While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
sorts for single column sorts), I noticed that we use a separate
memory context to store tuple data and we just reset when we're done
if the sort fits in memory, or we dump the tuples to disk in the same
order they're added and reset the context when it does not.  There is
a little pfree() work going on via writetup_heap() which I think
possibly could just be removed to get some additional gains.
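
For reference, that's presumably the heap_free_minimal_tuple() call at
the end of writetup_heap(), roughly:

/* Roughly the per-tuple free done by writetup_heap().  With a context
 * that gets reset wholesale anyway, this per-tuple work is arguably
 * avoidable. */
if (!state->slabAllocatorUsed)
{
    FREEMEM(state, GetMemoryChunkSpace(tuple));
    heap_free_minimal_tuple(tuple);     /* ends up in pfree() */
}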

Anyway, this context that stores tuples uses the standard aset.c
allocator which has the usual power of 2 wastage and additional
overheads of freelists etc.  I wondered how much faster it would go if
I set it to use a generation context instead of an aset.c one.

After running make installcheck to make the tenk1 table, running the
attached tuplesortbench script, I get this:

Master:
work_mem = '4MB';
   Sort Method: external merge  Disk: 2496kB
work_mem = '4GB';
   Sort Method: quicksort  Memory: 5541kB

Patched:
work_mem = '4MB';
   Sort Method: quicksort  Memory: 3197kB

So it seems to save quite a bit of memory getting away from rounding
up allocations to the next power of 2.
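
As a rough worked example of that rounding: a 72-byte tuple allocation
is rounded up to a 128-byte chunk by aset.c (close to 44% of it wasted),
whereas generation.c hands back exactly the 72 bytes requested plus its
per-chunk header. Accumulated over every tuple in the sort, that is the
kind of difference behind the 5541kB vs 3197kB numbers above.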

Performance-wise, there's some pretty good gains. (Results in TPS)

work_mem = '4GB';
Test     master   gen sort   compare
Test1    317.2    665.6      210%
Test2    228.6    388.9      170%
Test3    207.4    330.7      159%
Test4    185.5    279.4      151%
Test5    292.2    563.9      193%

If I drop work_mem down to the standard setting, the unpatched version
goes to disk, but the patched version does not.  The gains get a little
bigger.

work_mem = '4MB';
Test     master   gen sort   compare
Test1    177.5    658.2      371%
Test2    149.7    385.2      257%
Test3    137.5    330.0      240%
Test4    129.0    275.1      213%
Test5    161.7    546.4      338%

The patch is just a simple 1-liner at the moment.  I likely do need to
adjust what I'm passing as the blockSize to GenerationContextCreate().
Maybe a better number would be something that's calculated from
work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64)
so that we just allocate at most a 16th of work_mem per chunk, but not
bigger than 8MB. I don't think changing this will affect the
performance of the above very much.
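
For concreteness, a sketch of what that could look like in
tuplesort_begin_common() (the sizing shown is the Min() idea above, not
necessarily what the attached patch does):

/* Sketch only: switch the per-tuple context from aset.c to
 * generation.c.  work_mem is in kilobytes, so work_mem * 64 bytes is
 * work_mem/16; ALLOCSET_DEFAULT_MAXSIZE caps the block size at 8MB. */
state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                              "Caller tuples",
                                              Min(ALLOCSET_DEFAULT_MAXSIZE,
                                                  ((Size) work_mem) * 64));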

David
#!/bin/bash

sec=10

# This should improve
echo "select * from tenk1 order by two offset 100" > bench.sql
echo Test1
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by tenthous offset 100" > bench.sql
echo Test2
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by stringu1 offset 100" > bench.sql
echo Test3
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by stringu2 offset 100" > bench.sql
echo Test4
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by twenty offset 100" > bench.sql
echo Test5
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps




generation_tuplesort.patch
Description: Binary data