On Wednesday, December 8th, 2021 22:07:12 CET Tomas Vondra wrote:
> On 12/8/21 16:51, Ronan Dunklau wrote:
> > On Thursday, September 9th, 2021 15:37:59 CET Tomas Vondra wrote:
> >> And now comes the funny part - if I run it in the same backend as the
> >> "full" benchmark, I get roughly the same results:
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806256640 |    37159 |   76669
> >> 
> >> but if I reconnect and run it in the new backend, I get this:
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806158336 |   233909 |  100785
> >> (1 row)
> >> 
> >> It does not matter if I wait a bit before running the query, if I run it
> >> repeatedly, etc. The machine is not doing anything else, the CPU is set
> >> to use "performance" governor, etc.
> > 
> > I've reproduced the behaviour you mention.
> > I also noticed asm_exc_page_fault showing up in the perf report in that
> > case.
> > 
> > Running an strace on it shows that in one case, we have a lot of brk
> > calls,
> > while when we run in the same process as the previous tests, we don't.
> > 
> > My suspicion is that the previous workload makes glibc malloc change its
> > trim_threshold and possibly other dynamic options, which leads to
> > constantly moving the brk pointer in one case and not the other.
> > 
> > Running your fifo test with absurd malloc options shows that this might
> > indeed be the case (I needed to change several of them, because changing
> > one disables the dynamic adjustment for all of them, and malloc would
> > fall back to using mmap and freeing it on each iteration):
> > 
> > mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
> > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
> > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));
> > 
> > I get the following results for your self-contained test. In each case I
> > ran the query twice, to see the difference between the first run and the
> > subsequent ones:
> > 
> > With default malloc options:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   300156 |  207557
> > 
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   211942 |   77207
> > 
> > With the oversized values above:
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   219000 |   36223
> > 
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |    75761 |   78082
> > (1 row)
> > 
> > I can't tell how representative your benchmark extension would be of real
> > life allocation / free patterns, but there is probably something we can
> > improve here.
> 
> Thanks for looking at this. I think those allocation / free patterns are
> fairly extreme, and there probably are no workloads doing exactly this.
> The idea is that actual workloads are likely some combination of these
> extreme cases.
> 
> > I'll try to see if I can understand more precisely what is happening.
> 
> Thanks, that'd be helpful. Maybe we can learn something about tuning
> malloc parameters to get significantly better performance.
> 

Apologies for the long email; maybe what I state here is obvious to others,
but I learnt a lot, so...

I think I understood what the problem was in your generation tests: depending
on the sequence of allocations, we end up with a different maximum for
mmap_threshold and trim_threshold. When an mmap'ed block is freed, malloc
raises its mmap threshold, on the theory that memory which gets freed is
better served by moving the sbrk pointer than by mmap'ing fresh memory.
Whenever we free an mmap'ed block bigger than the current threshold (and
below a hard maximum), the threshold is raised to that block's size.

At the same time, trim_threshold is raised to twice the new mmap_threshold,
on the assumption that this much memory should not be released back to the OS
because we have a good chance of reusing it.
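
For reference, here is roughly what I believe that adjustment looks like on
glibc's side when free()ing an mmap'ed chunk (a sketch in C with illustrative
names, not the actual glibc source; this is the code path that fires the
memory_mallopt_free_dyn_thresholds probe used below):

#include <stddef.h>

#define MMAP_THRESHOLD_MAX  (4 * 1024 * 1024 * sizeof(long))   /* 32MB on 64-bit */

static size_t mmap_threshold = 128 * 1024;      /* glibc's initial default */
static size_t trim_threshold = 128 * 1024;
static int no_dyn_threshold = 0;                /* set once mallopt() pins a value */

static void
maybe_raise_thresholds(size_t freed_mmapped_size)
{
    if (!no_dyn_threshold &&
        freed_mmapped_size > mmap_threshold &&
        freed_mmapped_size <= MMAP_THRESHOLD_MAX)
    {
        /* blocks of this size do get freed, so serve them from the heap... */
        mmap_threshold = freed_mmapped_size;
        /* ...and keep up to twice that much heap around instead of trimming */
        trim_threshold = 2 * mmap_threshold;
    }
}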

This can be demonstrated using the attached systemtap script, along with a
patch adding new probes to the generation context for this purpose.
When running your query:

select block_size, chunk_size, x.* from
        (values (512)) AS a(chunk_size),
        (values (32768)) AS b(block_size),
     lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                   2*chunk_size, 100, 10000, 5000) x;

We obtain the following trace of threshold adjustments:

2167837: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2167837: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2167837: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2167837: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2167837: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes

When running the full benchmark, we reach a higher threshold at some point:

2181301: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2181301: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2181301: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2181301: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2181301: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes
2181301: New thresholds: mmap: 24002560 bytes, trim: 48005120 bytes

This is because at some point in the full benchmark we allocate a block
bigger than the mmap threshold, which malloc serves with an mmap, and then
free it, which also raises the trim_threshold. The subsequent allocations in
the lone query end up between those thresholds, so they are served by moving
the sbrk pointer, and the memory is then released back to the OS, which turns
out to be expensive too.
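
To see in isolation how expensive that sbrk ping-pong is, here is a small
standalone repro of mine (not the benchmark extension): it pins the
thresholds so that a 1MB block is always served from the heap and always
trimmed back on free, so every iteration moves the brk pointer up and down,
which is easy to confirm with "strace -c -e trace=brk":

#include <malloc.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    /* keep 1MB requests on the heap (served through sbrk) instead of mmap... */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);
    /* ...and hand freed heap memory back to the kernel immediately */
    mallopt(M_TRIM_THRESHOLD, 0);

    for (int i = 0; i < 100000; i++)
    {
        char *p = malloc(1024 * 1024);

        if (p == NULL)
            return 1;
        memset(p, 1, 1024 * 1024);  /* actually fault the pages in */
        free(p);                    /* heap top exceeds the trim threshold: brk shrinks */
    }
    return 0;
}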

One thing I observed is that this effect of constantly moving the sbrk
pointer still happens in a real tuplesort workload, but not nearly as much. I
haven't tested the reorder buffer yet, which is the other part of the code
using a generation context.

So, I decided to benchmark while trying to control the malloc thresholds.
These results are from my laptop, with an i7 processor, for the benchmark
initially proposed by David, with 4GB work_mem and otherwise default settings.

I benchmarked master, the original patch with a fixed block size, and the one
adjusting the block size.

For each of them, I ran the benchmark with the default malloc options (i.e.
thresholds adjusted dynamically), and again with the mmap threshold set to
32MB (the maximum on my platform, 4 * 1024 * 1024 * sizeof(long)) and the
trim threshold set to the value malloc would have picked by itself on
reaching that mmap threshold (i.e. 64MB). I did that by setting the
corresponding environment variables.

The results are in the attached spreadsheet.

I will follow up with a benchmark of the test sorting a table with a width 
varying from 1 to 32 columns.

As of now, my conclusion is that for glibc malloc the block size we use
doesn't really matter, as long as we tell malloc not to release a certain
amount of memory back to the system once it has been allocated. Setting the
mmap_threshold to min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX) and the
trim_threshold to twice that would IMO take us to where a long-lived backend
would likely end up anyway: once we have allocated min(work_mem, 32MB) we
never give it back to the system, which saves us a huge number of syscalls in
common cases. Doing that would not change the allocation profile on other
platforms, and should be safe for them.
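
For illustration, a rough sketch of what that could look like (hypothetical:
the function name and where to call it are mine, say at backend start and
whenever work_mem changes; Min and work_mem are the usual PostgreSQL
macro/GUC, and only mallopt() and its M_* parameters are real glibc API):

#include <malloc.h>

static void
TuneGlibcMallocThresholds(void)
{
#ifdef __GLIBC__
    /* 4 * 1024 * 1024 * sizeof(long) is DEFAULT_MMAP_THRESHOLD_MAX on 64-bit */
    long    mmap_threshold = Min((long) work_mem * 1024L,
                                 4L * 1024L * 1024L * (long) sizeof(long));

    mallopt(M_MMAP_THRESHOLD, (int) mmap_threshold);
    mallopt(M_TRIM_THRESHOLD, (int) (mmap_threshold * 2));
#endif
}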

The only problem I see with doing that is allocations in excess of work_mem:
they would no longer trigger a threshold raise, because once we are in
"non-dynamic" mode glibc's malloc keeps our manually-set values. I guess the
proposal could be refined by adjusting the thresholds dynamically ourselves
when pfreeing blocks, just like malloc does.
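
For illustration again, that refinement could be a small helper called from
the places that free() whole blocks, e.g. GenerationFree(), mirroring glibc's
own rule from above (entirely hypothetical, the names are mine; Size is the
usual PostgreSQL typedef):

#include <malloc.h>

/* largest block freed so far by this backend; start at glibc's default */
static Size largest_freed_block = 128 * 1024;

static void
MaybeRaiseMallocThresholds(Size blksize)
{
#ifdef __GLIBC__
    if (blksize > largest_freed_block &&
        blksize <= 32 * 1024 * 1024)        /* DEFAULT_MMAP_THRESHOLD_MAX */
    {
        largest_freed_block = blksize;
        mallopt(M_MMAP_THRESHOLD, (int) largest_freed_block);
        mallopt(M_TRIM_THRESHOLD, (int) (largest_freed_block * 2));
    }
#endif
}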

What do you think of this analysis and idea?


-- 
Ronan Dunklau
commit eb883bc52d6938d965870714e513ec00d18b05b0
Author: Ronan Dunklau <ronan.dunk...@aiven.io>
Date:   Fri Dec 10 15:51:54 2021 +0100

    Add tracing to generation context

diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 2c98877953..b61a9c024d 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -41,6 +41,7 @@
 #include "lib/ilist.h"
 #include "utils/memdebug.h"
 #include "utils/memutils.h"
+#include "pg_trace.h"
 
 
 #define Generation_BLOCKHDRSZ	MAXALIGN(sizeof(GenerationBlock))
@@ -278,7 +279,7 @@ GenerationContextCreate(MemoryContext parent,
 						&GenerationMethods,
 						parent,
 						name);
-
+	TRACE_POSTGRESQL_GENERATION_ALLOC_CREATE();
 	return (MemoryContext) set;
 }
 
@@ -313,8 +314,9 @@ GenerationReset(MemoryContext context)
 #ifdef CLOBBER_FREED_MEMORY
 		wipe_mem(block, block->blksize);
 #endif
-
+		TRACE_POSTGRESQL_GENERATION_ALLOC_FREE(block->blksize);
 		free(block);
+
 	}
 
 	set->block = NULL;
@@ -335,6 +337,7 @@ GenerationDelete(MemoryContext context)
 {
 	/* Reset to release all the GenerationBlocks */
 	GenerationReset(context);
+	TRACE_POSTGRESQL_GENERATION_ALLOC_DESTROY();
 	/* And free the context header */
 	free(context);
 }
@@ -458,6 +461,7 @@ GenerationAlloc(MemoryContext context, Size size)
 			blksize <<= 1;
 
 		block = (GenerationBlock *) malloc(blksize);
+		TRACE_POSTGRESQL_GENERATION_ALLOC_MALLOC(blksize);
 
 		if (block == NULL)
 			return NULL;
@@ -596,6 +600,7 @@ GenerationFree(MemoryContext context, void *pointer)
 	dlist_delete(&block->node);
 
 	context->mem_allocated -= block->blksize;
+	TRACE_POSTGRESQL_GENERATION_ALLOC_FREE(block->blksize);
 	free(block);
 }
 
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index b0c50a3c7f..0531ff03c8 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -91,4 +91,10 @@ provider postgresql {
 	probe wal__switch();
 	probe wal__buffer__write__dirty__start();
 	probe wal__buffer__write__dirty__done();
+
+  probe generation_alloc_malloc(Size);
+  probe generation_alloc_free(Size);
+
+  probe generation_alloc_create();
+  probe generation_alloc_destroy();
 };
#! /usr/bin/env stap

# http://developerblog.redhat.com/2015/01/06/malloc-systemtap-probes-an-example/


global sbrk, waits, arenalist, mmap_threshold = 131072, heaplist
global trim_threshold = 0
global context_creation_sbrk = 0
global cnt_sbrk_more = 0
global cnt_sbrk_less = 0
global palloced_mem = 0
global pfreed_mem = 0
global sum_sbrk_more = 0
global sum_sbrk_less = 0

# sbrk accounting

probe process("/lib*/libc.so.6").mark("memory_sbrk_more")
{
  sbrk += $arg2
  cnt_sbrk_more += 1
  sum_sbrk_more += $arg2
  if ($arg2 < 0)
  {
    printf("sbrk_more called with %ld", $arg2)
  }
}

probe process("/lib*/libc.so.6").mark("memory_sbrk_less")
{
  sbrk -= $arg2
  cnt_sbrk_less += 1
  sum_sbrk_less += $arg2
}



# threshold tracking

probe process("/lib*/libc.so.6").mark("memory_mallopt_free_dyn_thresholds")
{
  if ($arg1 != mmap_threshold || $arg2 != trim_threshold)
  {
  printf("%d: New thresholds: mmap: %ld bytes, trim: %ld bytes\n", tid(), $arg1,
         $arg2)
  }
  mmap_threshold = $arg1
  trim_threshold = $arg2
}

probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_malloc")
{
  palloced_mem += $arg1
}

probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_free")
{
  pfreed_mem += $arg1
}


probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_create")
{
  palloced_mem = 0;
  pfreed_mem = 0;
  cnt_sbrk_more = 0;
  cnt_sbrk_less = 0;
  sum_sbrk_more = 0;
  context_creation_sbrk = sbrk;
}

probe process("/home/ro/dev/postgresql-install/bin/postgres").mark("generation_alloc_destroy")
{
  printf("Palloced mem in generation: %ld\n", palloced_mem);
  printf("Pfreed mem in generation: %ld\n", pfreed_mem);
  printf("Sbrk moved: %ld\n", sbrk - context_creation_sbrk);
  printf("Sbrk calls: More: %d Less: %d\n", cnt_sbrk_more, cnt_sbrk_less);
  printf("Sbrk bytes more: %d\n", sum_sbrk_more);
}

# reporting

probe begin
{
  if (target() == 0) error ("please specify target process with -c / -x")
}

Attachment: bench_results_5tests.ods
Description: application/vnd.oasis.opendocument.spreadsheet
