On Wed, Mar 11, 2026 at 6:55 PM Andres Freund <[email protected]> wrote:

> Nice numbers!
>
> It'd be good to also evaluate in the context of queries, as such a focused
> microbenchmark will have much higher L1/L2 cache hit ratios than workloads
> that actually look at data
>

I ran with the query you first used to illustrate the issue, and it was
slower.
I went back and tried a strict copy of the code, and it was still 3% slower.
Passing CFLAGS += -falign-functions=64 makes the degradation disappear.
Maybe LTO is what we need here?


> It's not surprising it's worse, I just don't see how we could get away
> without some mixing.


I will keep that for the future.



> It's not at all crazy to have access patterns that just differ
> in the higher bits of the buffer number (e.g. a nestloop join where the
> inner
> side always needs a certain number of buffers). We need to be somewhat
> resistant to that causing a lot of collisions (which will trigger a
> hashtable
> growth cycle, without that necessarily fixing the issue!).
>

Yes, I understand that possibility.


> > I brought back the array, but I eliminated the linear search.
>
> Why? In my benchmarks allowing vectorization helped a decent amount in real
> queries, because it does away with all the branch misses.


I basically treated the array as a hash table with a weak hash function and
delegated collisions to simplehash.
In the common case it does a simple array[buffer & mask] lookup that
succeeds with one branch; in ~3% of the cases it misses and takes two
branches.
It also eliminates the branches needed to check the 1-entry cache.

But note that this was the last patch; if you like everything except that,
it is just a matter of picking 02, 03, 04.

> > 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
> >
> > 2a. the dynamic array case
> > REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
> > will grow the array when it reaches a certain level of occupation.
> > I have set the default occupation level to 86% so that, if enabled, for a
> > random input it will grow when we have about 2*size pins in total.
> > If we find a sequential pattern then it will grow without growing the
> > hash
> > table.
> > For the array lookup I don't use a hash, so for small number of pins
> > it will be very fast.
>
> I doubt it makes sense to basically have two levels of hash tables.
>
>
> > 2b. the static case
> > REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
> > will use a static array, just as we had before and will not perform the
> > linear
> > search. It still has to read the size and do mask input.
> >
> > I tested the 4 variations and the winner is with the static array without
> > the cache for the last entry.
> > I increased the array size from 8 to 32, since you suggested before that
> > that this could help. At that point it would have the tradeoff of a
> > longer
> > linear search, so it may help even more now.
>
> Does your benchmark actually test repeated accesses to the same refcount?
> As
> mentioned, those are very very common (access sequences like Pin, (Lock,
> Unlock)+, Unpin are extremely common).


Yes, that is the case: a FIFO with one buffer. I didn't include the
lock/unlock tests here; I did run them at some point.

> I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
> that does not *at all* communicate that this is backend private state.
>

I suspected that. GetEntryForPrivateReferenceCountOfSharedBuffer would be
more accurate, right?
I will probably stick with the original names.


> I'd strongly advise separating moving code from a large scale rename.  I
> certainly won't waste time trying to see the difference between what was
> just
> moved and what was changed at the same time.
>

Would you prefer not moving the code at all? One of the main reasons
for this was the data structure changes in patch 04, which I will not
include in the next version.



> > diff --git a/.gitignore b/.gitignore
> > index 4e911395fe..fddb7f861d 100644
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -43,3 +43,5 @@ lib*.pc
> >  /Release/
> >  /tmp_install/
> >  /portlock/
> > +
> > +.*
> > \ No newline at end of file
>
> What? Certainly not.
>

Do you mean we should certainly not exclude hidden files from git?
I usually build with the prefix postgresql/.build/patch-*/, so whenever I
check out something I have to keep adding this entry again.

> I don't think it's a good idea to introduce new simplehash infrastructure as
> part of this larger change.


Do you think it is worth doing that as a separate patch? That way we get it
out of the way of this series, which will probably go through a few more
versions.

> You also haven't documented the new stuff.


Do you mean as source comments, or is there separate documentation
for this?



> > The previous implementation used an 8-bytes (64-bit) entry to store
> > a uint32 count and an uint32 lock mode. That is fine when we store
> > the data separate from the key (buffer). But in the simplehash
> > {key, value} are stored together, so each entry is 12-bytes.
> > This is somewhat awkward as we have to either pad the entry to 16-bytes,
> > or the accesses will alternate between aligned and misaligned addresses.
> >
> > Lock can assume only 4 values, and 2^30 is a decent limit for the
> > number of pins on a single buffer. So this change is packing the
> > {count[31:2], lock[1:0]} into a single uint32.
> >
> > Incrementing/decrementing the count continues the same, just using
> > 4 instead of 1; lock mode access will require one or two additional
> > bitwise operations. The exact count requires one shift, and is used
> > only for debugging. A special function is provided to check whether
> > count == 1.
>
> Have you actually evaluated the benefit from this? Pretty sceptical it's
> worth
> it.
>

I tested, and I agree: it is not worth it from a speed perspective.
At this point the only part left is the introduction of the simplehash.

However, what I will try is to store just the buffer number in the hash
and keep a separate array for the entries; who knows, that may work better.
