On Wed, 25 May 2022 at 04:02, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Here's a draft patch for the other way of doing it.  I'd first tried
> to make the side-effects completely local to generation.c, but that
> fails in view of code like
>
>         MemoryContextAlloc(GetMemoryChunkContext(x), ...)
>
> Thus we pretty much have to have some explicit awareness of this scheme
> in GetMemoryChunkContext().  There's more than one way it could be
> done, but I thought a clean way is to invent a separate NodeTag type
> to identify the indirection case.

Thanks for coding that up. This seems like a much better idea than mine.
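
For anyone skimming the thread, my rough reading of that scheme is below.
The names are only illustrative of the idea, not what the actual
invent-MemoryContextLink-1.patch uses:

/*
 * Sketch only: the pointer stored directly before each chunk can now
 * reference either the owning context or a small link node carrying its
 * own NodeTag, so GetMemoryChunkContext() has to check the tag and chase
 * the indirection.  T_MemoryContextLink is hypothetical here.
 */
typedef struct MemoryContextLinkData
{
    NodeTag         type;       /* T_MemoryContextLink */
    MemoryContext   context;    /* the context that owns the chunk */
} MemoryContextLinkData;

static inline MemoryContext
GetMemoryChunkContext(void *pointer)
{
    MemoryContext context;

    /* fetch the back-pointer stored immediately before the chunk */
    context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));

    /* follow the indirection when the back-pointer is really a link */
    if (IsA(context, MemoryContextLink))
        context = ((MemoryContextLinkData *) context)->context;

    return context;
}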

I ran the same benchmark as the one from the blog post against your patch
and got the attached sort_bench.png.  You can see that the patch fixes the
64MB work_mem performance regression, as we'd expect.

To get an idea of the overhead of this, I came up with the attached
allocate_performance_function.patch, which adds a function named
pg_allocate_generation_memory().  It takes a chunk_size for the number of
bytes to allocate at once, a keep_memory that tells the function how much
memory to keep around before it starts to pfree previous allocations, and
a final parameter that defines the total amount of memory to allocate.
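For example, pg_allocate_generation_memory(64, 1024*1024, 1024*1024*1024)
does 1GB worth of 64-byte allocations while keeping no more than 1MB
palloc'd at any one time.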

The attached script calls the function with chunk sizes varying from 8
bytes up to 2048 bytes, doubling each step. It keeps 1MB of memory and
does a total of 1GB of allocations.

I ran the script against today's master and master + the
invent-MemoryContextLink-1.patch and got the attached tps_results.txt.

The worst case is the 8-byte allocation size, where performance drops by
around 7%. For the larger chunk sizes the drop is much smaller, mostly in
the <1% to ~6% range. For the 2048-byte chunks, performance even seems to
improve (?).  Obviously, the test is pretty demanding as far as palloc and
pfree go; I imagine we don't come close to anything like that in the
actual code. The test was just aimed at giving us an idea of the overhead,
so it might not be enough information to judge whether we should be
concerned about more realistic palloc/pfree workloads.

I didn't test the performance of an aset.c context. I imagine the overhead
would be proportionally smaller there, since aset.c is generally slower
from having to jump through a few more hoops during palloc/pfree.
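(If anyone wants to check, swapping the GenerationContextCreate() call in
the test function for AllocSetContextCreate() with the same arguments
should be enough to try that out.)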

David
diff --git a/src/backend/utils/adt/mcxtfuncs.c b/src/backend/utils/adt/mcxtfuncs.c
index bb7cc94024..d11495b716 100644
--- a/src/backend/utils/adt/mcxtfuncs.c
+++ b/src/backend/utils/adt/mcxtfuncs.c
@@ -193,3 +193,122 @@ pg_log_backend_memory_contexts(PG_FUNCTION_ARGS)
 
        PG_RETURN_BOOL(true);
 }
+
+typedef struct AllocateTestNext
+{
+       struct AllocateTestNext *next;          /* ptr to the next allocation */
+} AllocateTestNext;
+
+/* #define ALLOCATE_TEST_DEBUG */
+/*
+ * pg_allocate_generation_memory
+ *             Used to test the performance of a generation memory context
+ */
+Datum
+pg_allocate_generation_memory(PG_FUNCTION_ARGS)
+{
+       int32   chunk_size = PG_GETARG_INT32(0);
+       int64   keep_memory = PG_GETARG_INT64(1);
+       int64   total_alloc = PG_GETARG_INT64(2);
+       int64   curr_memory_use = 0;
+       int64   remaining_alloc_bytes = total_alloc;
+       MemoryContext context;
+       MemoryContext oldContext;
+       AllocateTestNext           *next_free_ptr = NULL;
+       AllocateTestNext           *last_alloc = NULL;
+
+       if (chunk_size < sizeof(AllocateTestNext))
+               elog(ERROR, "chunk_size (%d) must be at least %ld bytes", chunk_size,
+                        sizeof(AllocateTestNext));
+       if (keep_memory > total_alloc)
+               elog(ERROR, "keep_memory (" INT64_FORMAT ") must be less than total_alloc (" INT64_FORMAT ")",
+                        keep_memory, total_alloc);
+
+       context = GenerationContextCreate(CurrentMemoryContext,
+                                         "pg_allocate_generation_memory",
+                                         ALLOCSET_DEFAULT_SIZES);
+
+       oldContext = MemoryContextSwitchTo(context);
+
+       while (remaining_alloc_bytes > 0)
+       {
+               AllocateTestNext *curr_alloc;
+
+               CHECK_FOR_INTERRUPTS();
+
+               /* Allocate the memory and update the counters */
+               curr_alloc = (AllocateTestNext *) palloc(chunk_size);
+               remaining_alloc_bytes -= chunk_size;
+               curr_memory_use += chunk_size;
+
+#ifdef ALLOCATE_TEST_DEBUG
+               elog(NOTICE, "alloc %p (curr_memory_use " INT64_FORMAT " bytes, remaining_alloc_bytes " INT64_FORMAT ")", curr_alloc, curr_memory_use, remaining_alloc_bytes);
+#endif
+
+               /*
+                * This might be the last allocation, so point this to NULL so we
+                * know when to stop looping in the cleanup loop.
+                */
+               curr_alloc->next = NULL;
+
+               /*
+                * If the currently allocated memory has reached or exceeded the
+                * amount of memory we want to keep allocated at once then we'd
+                * better free some.  Since all allocations are the same size we
+                * only need to free one allocation per loop.
+                */
+               if (curr_memory_use >= keep_memory)
+               {
+                       AllocateTestNext         *next = next_free_ptr->next;
+
+                       /* free the memory and update the current memory usage */
+                       pfree(next_free_ptr);
+                       curr_memory_use -= chunk_size;
+
+#ifdef ALLOCATE_TEST_DEBUG
+                       elog(NOTICE, "free %p (curr_memory_use " INT64_FORMAT " bytes, remaining_alloc_bytes " INT64_FORMAT ")", next_free_ptr, curr_memory_use, remaining_alloc_bytes);
+#endif
+                       /* get the next chunk to free */
+                       next_free_ptr = next;
+               }
+               else if (next_free_ptr == NULL)
+               {
+                       /*
+                        * Remember the first chunk to free. We will follow the
+                        * ->next pointers to find the next chunk to free when
+                        * freeing memory.
+                        */
+                       next_free_ptr = curr_alloc;
+               }
+
+               /*
+                * Store a pointer to curr_alloc in the memory we allocated in the
+                * last iteration.  This allows us to use the memory we're
+                * allocating to store a pointer to the next allocation.
+                */
+               if (last_alloc != NULL)
+                       last_alloc->next = curr_alloc;
+
+               /* remember this allocation so we have it in the next loop */
+               last_alloc = curr_alloc;
+       }
+
+       /* cleanup loop -- pfree remaining memory */
+       while (next_free_ptr != NULL)
+       {
+               AllocateTestNext         *next = next_free_ptr->next;
+
+               /* free the memory and update the current memory usage */
+               pfree(next_free_ptr);
+               curr_memory_use -= chunk_size;
+
+#ifdef ALLOCATE_TEST_DEBUG
+               elog(NOTICE, "free %p (curr_memory_use " INT64_FORMAT " bytes, remaining_alloc_bytes " INT64_FORMAT ")", next_free_ptr, curr_memory_use, remaining_alloc_bytes);
+#endif
+
+               next_free_ptr = next;
+       }
+
+       MemoryContextSwitchTo(oldContext);
+
+       PG_RETURN_VOID();
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 87aa571a33..12ed69408b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8093,6 +8093,12 @@
   prorettype => 'bool', proargtypes => 'int4',
   prosrc => 'pg_log_backend_memory_contexts' },
 
+# just for testing memory context allocation speed
+{ oid => '9319', descr => 'for testing performance of generation allocation and freeing',
+  proname => 'pg_allocate_generation_memory', provolatile => 'v',
+  prorettype => 'void', proargtypes => 'int4 int8 int8',
+  prosrc => 'pg_allocate_generation_memory' },
+
 # non-persistent series generator
 { oid => '1066', descr => 'non-persistent series generator',
   proname => 'generate_series', prorows => '1000',
#!/bin/bash

dbname=postgres
sec=20


for bytes in 8 16 32 64 128 256 512 1024 2048
do
	for sql in "select pg_allocate_generation_memory($bytes, 1024*1024, 1024*1024*1024);"
	do
		echo "$sql" > bench.sql
		pgbench -n -f bench.sql -T $sec -M prepared $dbname
	done
done
alloc size    MemoryContextLink (tps)    master (tps)    compare
8                             1.16201         1.24689     93.19%
16                            2.43831         2.45643     99.26%
32                            4.72368         4.87890     96.82%
64                            8.90167         9.01659     98.73%
128                          17.64886        16.97465    103.97%
256                          31.71387        33.66108     94.22%
512                          64.51953        65.44095     98.59%
1024                        119.99962       123.36183     97.27%
2048                        248.73356       214.43335    116.00%
