Hello hackers,

A colleague of mine, Jim Mlodgenski, has been poking at NUMA behavior on some 
of the newer AWS bare-metal instance types (r8i in particular, which exposes 6 
NUMA nodes via SNC3 on a 2-socket box), and in the process landed on a very 
small change to freelist.c that I think is worth showing around.  His patch is 
attached with some tweaks of my own.

Full disclosure: the exploration that led Jim to this patch idea was done with 
help from an AI assistant (Kiro); the idea, the benchmarking, and the final 
shape of the patch are human-driven, but I wanted to be up front about how his 
investigation started.  Happy to discuss that separately if people want to.

The one-line summary: instead of advancing nextVictimBuffer one buffer at a 
time via pg_atomic_fetch_add_u32, each backend claims a batch of 64 consecutive 
buffer IDs from the shared hand and then iterates them privately.  Global sweep 
order is preserved -- every buffer is still
visited exactly once per complete pass -- but the atomic contention on that one 
cache line drops by roughly the batch size.


Why this matters
----------------

On multi-socket boxes under eviction pressure, every backend that needs a 
victim buffer ends up CAS'ing the same cache line.  On a single socket, a 
locked RMW on that cache line stays warm in L1/L2 and completes in ~20ns.  On 
2+ sockets, the line bounces over QPI/UPI at ~100-200ns per op, and with 
hundreds of backends running StrategyGetBuffer() concurrently, the line 
ping-pongs constantly.  It's a textbook NUMA scalability bottleneck, and once 
shared_buffers is smaller than the working set and the sweep is running 
continuously, that single atomic is what you hit in a perf profile (elevated 
bus-cycles, cache-misses on the cache line holding nextVictimBuffer).

Andres pointed at the same spot in his pgconf.eu 2024 talk, and Tomas called it 
out in the "Adding basic NUMA awareness" thread [1] -- so this isn't news to 
anyone who's been looking at this area.  What I think is new is a fix that's 
just this, without any of the surrounding architectural change.

The framing (credit to Jim): the clock hand is doing two jobs.  It 
*coordinates* backends so they don't redundantly decrement usage_count on the 
same buffers and so they eventually visit every buffer in the pool exactly once 
per pass.  It also *serializes* access to the counter.  Coordination is the 
part we want.  Serialization is the part that's killing us on bigger NUMA 
boxes.  Batching keeps the coordination and thins out the serialization.


How it works
------------

Two per-backend statics, MyBatchPos and MyBatchEnd.  When a backend calls 
ClockSweepTick() and its local batch is exhausted, it does a single fetch-add 
of CLOCK_SWEEP_BATCH_SIZE (64) against nextVictimBuffer and now owns that 
range.  Subsequent ticks just bump the local counter.

Wraparound got a small rewrite.  The original code had the backend that crossed 
NBuffers drive completePasses++ under the spinlock via a CAS loop.  With 
batching, multiple backends can each land a fetch-add that returns a value >= 
NBuffers in the same pass, so the logic now is: whoever sees a start >= 
NBuffers takes the spinlock, re-reads the counter, and if it's still out of 
range does a single CAS to wrap it and bumps completePasses.  If somebody else 
already wrapped, we just release and move on.  StrategySyncStart() still sees a 
consistent (nextVictimBuffer, completePasses) pair.

The batch size is gated on whether we actually have multiple NUMA nodes.  On a 
single-socket box the atomic is already socket-local, batching just makes 
backends skip further ahead than they need to, so we fall back to batch size 1 
-- which is bit-for-bit the original behavior.  The guard:

    if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
        ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE, (uint32) NBuffers);
    else
        ClockSweepBatchSize = 1;

Min() against NBuffers covers the small-shared_buffers corner so a batch never 
wraps the pool multiple times in one claim.


Does batching mess up the meaning of usage_count?
--------------------------------------------------

Short answer: no.  I want to walk through this because it was my first concern 
too, and I think it's the question that will come up most on review.

The clock sweep's usage_count is an access-frequency approximation measured in 
units of *complete passes*.  A buffer with usage_count = N survives N passes 
without a re-pin.  The semantic meaning lives at pass granularity, not at 
individual-buffer granularity.

What batching changes: intra-pass temporal ordering.  Without batching, with N 
backends sweeping, decrements are interleaved -- backend A hits B[0], backend B 
hits B[1], backend C hits B[2].  With batching, backend A hits B[0..63] in a 
tight local burst, then backend B hits B[64..127], etc.  The 64-buffer chunks 
are decremented in bursts rather than individually.

Why it doesn't matter:

  1. Every buffer still gets decremented exactly once per complete
     pass.  The invariant the algorithm actually depends on is
     untouched.

  2. A buffer's survival window is the time between consecutive
     passes.  That's milliseconds to seconds under load.  Whether
     B[0] gets decremented 50us before or 50us after B[63] within
     the same pass is below the resolution of anything usage_count
     is trying to measure.

  3. The bgwriter's feedback loop reads (nextVictimBuffer,
     completePasses, numBufferAllocs) via StrategySyncStart() every
     ~200ms.  nextVictimBuffer still advances at the same *total*
     rate (64 per atomic op, but atomic ops happen 1/64 as often).
     The position it reports can jitter by up to 64 buffers relative
     to the one-at-a-time case, but BgBufferSync()'s smoothed
     estimates operate over thousands of buffers per cycle, so the
     jitter disappears into the averaging.  numBufferAllocs still
     increments once per allocation.  strategy_delta,
     smoothed_alloc, smoothed_density, reusable_buffers_est -- all
     unaffected in any way I can see.

Table form, because it's easier to argue with:

  Property                          | Unpatched      | Batched
  ----------------------------------+----------------+----------------
  Buffers visited per pass          | NBuffers       | NBuffers
  Decrements per buffer per pass    | 1              | 1
  Eviction threshold                | usage_count==0 | usage_count==0
  Max survival (passes)             | 6              | 6
  Decrement ordering within a pass  | interleaved    | chunked
  bgwriter allocation rate signal   | accurate       | accurate
  Cross-socket atomic traffic       | 1 per buffer   | 1 per 64

There is one subtle difference worth naming.  When a backend finds a victim at 
B[5] of its batch, it returns with MyBatchEnd still sitting at B[63].  The next 
time that backend needs a victim it resumes at B[6], not at wherever the global 
hand now points.  So the backend drains its batch over multiple 
StrategyGetBuffer() calls rather than all at once.  Under heavy load, where 
batches are consumed in microseconds, this is invisible.  Under light load, the 
implication is that some buffers can sit with slightly stale usage_count for 
longer than they would have before.  But "light load" means "the sweep is 
barely moving and nothing wants to evict anyway" -- so the effect
doesn't show up where it would hurt.

There's also a small positive side-effect: cache locality.  The backend that 
just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in L1/L2. 
 Walking B[0..63] locally is cheaper than walking a striped interleaving where 
each descriptor was last touched by a different core.  I haven't tried to 
isolate this in perf, but it falls out naturally.


Benchmarks
----------

Jim ran these; I'm still working on reproducing them locally and will post 
independent numbers in a follow-up.  All bare metal, Linux, huge pages enabled 
throughout (more on that below), postmaster pinned to node 0 with `numactl 
--cpunodebind=0` because otherwise stock TPS varied from 31K to 40K depending 
on which node the postmaster happened to land on at launch -- worth flagging 
for anyone trying to reproduce.

Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the working 
set always spills and the sweep is hot.

  r8i.metal-96xl (384 vCPUs, 2 sockets, 6 NUMA nodes via SNC3):

    pgbench RO:
      Clients   Stock    Patched   Delta
      64        31,457   36,353    +16%
      128       31,678   37,864    +20%
      256       31,510   37,558    +19%
      384       31,431   37,464    +19%
      512       31,329   37,040    +18%

    pgbench RW:
      Clients   Stock    Patched   Delta
      64         7,685    7,713     0%
      128       10,420   10,541    +1%
      256       12,393   12,463    +1%
      384       15,317   15,197    -1%
      512       17,930   17,978     0%

  m6i.metal (128 vCPUs, 2 sockets, Ice Lake):
    RO +19-20%, RW within noise.

  c8i.metal-48xl (192 vCPUs, 1 socket):
    Single-socket -> batch_size=1 -> original code path.  No
    behavioral change.  (I double-checked this one specifically
    because it's the sanity test for the gate.)

  HammerDB TPC-C on m6i.metal (1000 warehouses):
    VUs   Stock     Patched   Delta
    128   358,518   349,787    -2%
    256   332,098   330,272    -1%
    384   365,782   377,519    +3%
    512   370,663   386,526    +4%

No TPC-C regression, which was the thing we were most worried about. An earlier 
attempt (per-socket partitioned sweep, see below) was -13% on this same 
workload.

The general shape is: the scaling curve flattens later.  Unpatched, TPS tops 
out around 128 clients and stays flat up to 512 because backends are spending 
cycles waiting on the cache line rather than
doing work.  Patched, the curve keeps rising past the point where unpatched 
plateaus.

Huge pages caveat: all of the above was run with huge pages on, on large-memory 
instances (the r8i.96xl has 3TB, so Jim never considered running without them). 
 We have not characterized the non-huge-pages case.  That's on my list; I don't 
expect it to change the conclusion, but I shouldn't speak for data I haven't 
collected.


Relationship to Tomas's NUMA series
-----------------------------------

Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer 
interleaving across nodes, partitioned freelists, partitioned clock sweep, 
PGPROC interleaving, and related pieces.  I want to be careful here because I 
don't think we should frame this patch as competing with that work.

One thing I found striking as I re-read the thread: in the benchmarks Tomas 
posted later in the series, *most of the benefit comes from partitioning the 
clock sweep*, and the NUMA memory-placement layer on top sometimes runs slower 
than partitioning alone.  His own conclusion, quoted roughly: the benefit 
mostly comes from just partitioning the clock sweep, and it's largely 
independent of the NUMA stuff; the NUMA partitioning is often slower.

That observation is the thing that makes me think batching is worth considering 
on its own.  It's going after the same bottleneck Tomas's partitioning 
addresses, but:

  - without splitting global eviction visibility (which is where
    cross-partition stealing gets complicated),
  - without requiring NUMA-aware buffer placement (which has huge
    page alignment, descriptor-partition-mid-page, and resize
    complications that are still being worked out in that thread),
  - without touching PGPROC or bgwriter.

What this patch does *not* do:
  - place buffers on specific NUMA nodes
  - partition the freelist
  - touch PGPROC
  - add new GUCs
  - change bgwriter

What this patch *does* do:
  - target exactly the clock-sweep contention that Tomas's
    partitioning targets, and reduce it by ~64x, in ~30 lines.

If Tomas's series lands in full, this patch becomes redundant for its primary 
use case (though even within a partitioned sweep, the per-partition atomic 
still benefits from batching, so it's arguably a useful primitive either way).  
If Tomas's series lands incrementally over several cycles -- which the open 
items in that thread suggest is the realistic path -- this gets us a real chunk 
of the multi-socket win now.

This patch is also orthogonal to my earlier thread about removing the freelist 
entirely [2], but given the proximity to that code Jim agreed that I could 
propose/steward it here on the list for consideration.


Open questions / things I'd like feedback on
--------------------------------------------

- Batch size.  64 is a round number that worked well in testing, but
  Nathan raised the reasonable point that on small shared_buffers
  with high concurrency, a fixed 64 could be unfortunate.  Options:
  scale with shared_buffers (Min(64, NBuffers / N) for some N), scale
  with max_connections, keep it fixed but let operators tune it, or
  make it a function of NUMA node count.  I don't have a strong
  opinion yet; the Min(batch, NBuffers) cap covers the "obviously
  wrong" corner but doesn't speak to the "several hundred backends
  on a few-MB shared_buffers" shape.  Numbers/ideas/proposals welcome.

- NUMA detection.  The gate uses pg_numa_init() /
  pg_numa_get_max_node().  On systems where libnuma isn't available,
  or where get_mempolicy is blocked (some container configurations),
  we fall back to batch size 1.  That's safe but it misses the
  "single socket, many cores, still benefits from fewer atomics"
  case.  Might be worth a way to force-enable, or batching on all
  systems with a smaller batch size when single-socket.  I'd like to
  measure before deciding.

- Eviction pattern on reads.  Nathan also flagged that with batching,
  the buffers a backend ends up pinning in one StrategyGetBuffer()
  call will tend to be contiguous in buffer-id space rather than
  scattered, which is a different allocation pattern than today.
  The usage_count analysis above says this is benign, but if anyone
  has an intuition for a workload where this would be observable
  (e.g., something that cares about the mapping between buffer-id
  and relation locality), I'd like to hear it.

- nextVictimBuffer wraparound.  The current code has a mild overflow
  concern papered over with "highly unlikely and wouldn't be
  particularly harmful".  With batching this is no worse than before,
  but if we're already touching this function, it might be worth
  thinking about whether to tighten it up in the same patch or a
  follow-up.

- Should the non-NUMA value for this be derived from core counts that
  imply L1/L2 cache layouts or simply default to 8 rather than 1 to
  realize some benefit?

- Should there be a postgresql.conf setting for this that takes
  precedence?


I'll run the non-huge-pages variant, reproduce the r8i numbers, poke at the 
small-shared_buffers corner, and post perf stat output showing the 
atomic/cache-miss deltas over the next few days.  In the meantime, eyeballs and 
skepticism welcome -- I would especially welcome comments from Andres, who's 
been in this code recently, and from Tomas, whose series has the most overlap.

I realize that we're past feature freeze and working on release notes for v19, 
so the chances of merging this are slim to none.  I think this could be 
considered a "performance bug fix for NUMA systems" in this release, but that 
is stretching it a bit.  It is a big ask at this stage to land a change like 
this.

best.

-greg

[1] 
https://www.postgresql.org/message-id/[email protected]
[2] 
https://www.postgresql.org/message-id/[email protected]

Attachment: v1-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch
Description: Binary data

Reply via email to