Hi,

On 2026-03-11 18:03:34 +0000, Alexandre Felipe wrote:
> Hi Hackers,
> 
> Please find attached version 2 of this patch and the results of the
> ReadBuffer/ReleaseBuffer benchmark
> 
> For those who prefer plain text, here is an excerpt:
>                         baseline   patch
> num_buffers pattern
> 1           random        194.76  183.50
>             sequential    183.75  168.02
> 2           random        204.74  187.02
>             sequential    195.27  169.58
> 3           random        222.62  192.57
>             sequential    212.81  169.55
> 7           random        233.96  219.64
>             sequential    221.94  171.94
> 11          random        365.50  247.96
>             sequential    318.24  183.96
> 19          random        390.58  267.96
>             sequential    369.51  195.35
> 100         random        424.03  286.33
>             sequential    420.74  280.23
> 988         random        445.75  301.37
>             sequential    453.61  313.30
> 10000       random        587.13  418.91
>             sequential    497.92  394.10

Nice numbers!

It'd be good to also evaluate in the context of queries, as such a focused
microbenchmark will have much higher L1/L2 cache hit ratios than workloads
that actually look at data.



> > > Here I assume that the buffer sequence is independent enough from the
> > > array size, so I use the buffer as the hash key directly, omitting a hash
> > > function call.
> >
> > I doubt that that's good enough. The hash table code relies on the bits
> > being well mixed, but you won't get that with buffer ids.
> >
> 
> Introduced murmurhash32; it is noticeably worse for sequential access
> (by buffer index), but has some mixing.

It's not surprising it's worse, I just don't see how we could get away without
some mixing.  It's not at all crazy to have access patterns that just differ
in the higher bits of the buffer number (e.g. a nestloop join where the inner
side always needs a certain number of buffers). We need to be somewhat
resistant to that causing a lot of collisions (which will trigger a hashtable
growth cycle, without that necessarily fixing the issue!).

> 
> > I'm rather sceptical that the overhead of needing to shift and mask is
> > worth it. I suspect we'll also want to add more state for each entry (e.g. I
> > think it may be worth getting rid of the io-in-progress resowner).
> >
> 
> The io-in-progress resowner name suggests that it runs only for reads...
> so maybe it could be taken out of the way of buffer hits?

It's not used for hits.

But it's used for both reads and writes.


> 
> > >    REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
> > >    lookup, it is worth trying to remove it completely and keep only the hash.
> > >    For small values we would be trading a few branches for a buffer % SIZE;
> > >    for the use case of prefetch, where we pin/unpin in a FIFO fashion, it
> > >    will save an 8-entry array lookup, and some extra data moves.
> >
> > I doubt that that's ok, in the vast majority of accesses we will have 0-2
> > buffers pinned. And even when we have pinned more buffers, it's *exceedingly*
> > common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin);
> > adding a few cycles to each of those repeated accesses will quickly show up.
> >
> 
> I brought back the array, but I eliminated the linear search.

Why? In my benchmarks allowing vectorization helped a decent amount in real
queries, because it does away with all the branch misses.


> 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
> 
> 2a. the dynamic array case
> REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
> will grow the array when it reaches a certain occupancy level.
> I have set the default occupancy level to 86% so that, if enabled, for a
> random input it will grow when we have about 2*size pins in total.
> If we find a sequential pattern then it will grow without growing the hash
> table.
> For the array lookup I don't use a hash, so for a small number of pins
> it will be very fast.

I doubt it makes sense to basically have two levels of hash tables.


> 2b. the static case
> REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
> will use a static array, just as we had before, and will not perform the
> linear search. It still has to read the size and mask the input.
>
> I tested the 4 variations and the winner is the static array without
> the cache for the last entry.
> I increased the array size from 8 to 32, since you suggested before
> that this could help. At that point it would have had the tradeoff of a
> longer linear search, so it may help even more now.

Does your benchmark actually test repeated accesses to the same refcount? As
mentioned, those are very very common (access sequences like Pin, (Lock,
Unlock)+, Unpin are extremely common).



> From 5de90fcd14e5d32c3165c3e7a278adaa44f4d9d1 Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe <[email protected]>
> Date: Wed, 11 Mar 2026 12:05:50 +0000
> Subject: [PATCH 2/5] Refactoring reference counting
> 
> This moves the private reference count logic out of
> backend/storage/buffer/buffmgr.c, which is still 8000+
> lines long, preparing for the changes to come.


I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
that does not *at all* communicate that this is backend private state.

I'd strongly advise separating moving code from a large scale rename.  I
certainly won't waste time trying to see the difference between what was just
moved and what was changed at the same time.


> diff --git a/.gitignore b/.gitignore
> index 4e911395fe..fddb7f861d 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -43,3 +43,5 @@ lib*.pc
>  /Release/
>  /tmp_install/
>  /portlock/
> +
> +.*
> \ No newline at end of file

What? Certainly not.



> From bf146598cd94110da255c6fb45b4a5c58002a91d Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe <[email protected]>
> Date: Wed, 11 Mar 2026 12:07:54 +0000
> Subject: [PATCH 3/5] Using simplehash
> 
> This patch replaces the HTAB implementation with a simplehash
> as suggested by Andres Freund [1]. The simplehash implements a templated
> open addressing hash, which has entry size information at compile time.
> 
> The access strategy of the simplehash is very close to the plain array
> that was seen to be very efficient compared to the HTAB implementation,
> with the additional advantage of using the key value to start the
> search closer to where the entry actually is, instead of always scanning
> the entire array.
> 
> Adds one macro to the simplehash implementation to allow overriding the
> empty position detection, enabling the use of buffer == InvalidBuffer
> instead of requiring a separate status field.
> 
> [1] 
> https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5mw4
> 
> ---
>  src/backend/storage/buffer/bufmgr_refcnt.c | 89 +++++++++++-----------
>  src/include/lib/simplehash.h               | 59 +++++++++-----
>  2 files changed, 84 insertions(+), 64 deletions(-)

I don't think it's a good idea to introduce new simplehash infrastructure as
part of this larger change.  You also haven't documented the new stuff.



> From 4b96a723bfc3a6fe78e05bc3ce454b0b160679d1 Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe <[email protected]>
> Date: Wed, 11 Mar 2026 12:08:03 +0000
> Subject: [PATCH 4/5] Compact PrivateRefCountEntry
> 
> The previous implementation used an 8-byte (64-bit) entry to store
> a uint32 count and a uint32 lock mode. That is fine when we store
> the data separate from the key (buffer). But in the simplehash
> {key, value} are stored together, so each entry is 12 bytes.
> This is somewhat awkward, as we have to either pad the entry to 16 bytes,
> or accesses will alternate between aligned and misaligned addresses.
> 
> Lock can assume only 4 values, and 2^30 is a decent limit for the
> number of pins on a single buffer. So this change packs
> {count[31:2], lock[1:0]} into a single uint32.
> 
> Incrementing/decrementing the count continues to work the same, just
> using 4 instead of 1; lock mode access will require one or two additional
> bitwise operations. The exact count requires one shift, and is used
> only for debugging. A special function is provided to check whether
> count == 1.

Have you actually evaluated the benefit from this? Pretty sceptical it's worth
it.

Greetings,

Andres Freund

