Re: Adding basic NUMA awareness

Jakub Wartak Tue, 30 Jun 2026 05:51:58 -0700

On Mon, Jun 29, 2026 at 9:42 AM Jakub Wartak
<[email protected]> wrote:
>
> On Thu, Jun 25, 2026 at 3:49 PM Tomas Vondra <[email protected]> wrote:
> >
> > >> I have some results from a new round of benchmarks, and it's a bit
> > >> disappointing. Or rather, there seem to be some issues that I can't
> > >> figure out, causing regressions.
> > > [..]
> > >> This chart is for median latency (in milliseconds):
> > >>
> > >>   clients       master     0003      0004    0003/on    0004/on
> > >>   -------------------------------------------------------------
> > >>         1        12767    12582     14509      12807      15307
> > >>         8        14383    14355     14149      14069      16165
> > >>        32        14756    15198     14836      14984      17128
> > >>        --------------------------------------------------------
> > >>         1                  103%      114%       100%       120%
> > >>         8                  101%       98%        98%       112%
> > >>        32                  102%      101%       102%       116%
> > >>


[..lots of variables..]

> > I'll try, but if you could try running some experiments on your own,
> > that might be helpful.
> [..]
> > > Hopefully next week I'll try to repro those numbers to see if I can
> > > help more.
> > >
> >
> > Thank you! That'd be great.
>
> Yeah, I'll try my best, we'll see how it goes. Right now I've just dropped
> that fscachenuma proggie to aid us in troubleshooting.
>
> -J.
>
> [0] - https://github.com/jakubwartakEDB/fscachenuma

Hi Tomas,

OK, so I've run a couple of tests, modified run.sh, and also tried to fix
some inefficiencies spotted while testing this. Note that the attached
performance matrix is in TPS (so more is better). Raw results/CSV and
scripts are attached too.

* run2 = 2 workloads, partitioned pgbench_accounts
* run3 = just pgbenchS w/o partitioning + warmup
* run4 = semi-like pgbenchS w/o partitioning but 100k rows + warmup

One important modification in those run shell scripts is that they
clean the page cache (drop_caches), as mentioned earlier, to avoid false
results where everything would land on node#N after pgbench -i ran. That is
probably why I did not get any of the regressions you've got. Or better yet,
just diff -u the run*.sh scripts.

The "inst-optimized" is just the same patchset (so "inst-patchset") + a crude
attempt in 0008, made while working on this, to further smooth things out and
avoid regressions. 0008 does a couple of things:

a. implements CPU/node caching instead of querying it for every single buffer.
   Even if on x86_64 that is optimized by vdso/kernel to avoid the real syscall,
   the semi-syscall tax seems to be visible when fetching lots of buffers.
   128 is arbitrary and still kind of low (128*8kB=1MB, and we are doing
   hundreds of MB/s, while rescheduling happened only every couple of
   seconds).

b1. minimizes attempts to use other partitions until some threshold is
   crossed (and then it relies on the scan-all-partitions fallback)

b2. avoids selecting idle partitions (defined as below avg_allocs/2) - if
   there are few allocations there, it is debatable whether better cache
   utilization or sticking to lower latency wins (e.g. in some workloads
   buffer reuse is close to 0, so lower latency is clearly better)

Results are attached, some observations:

0. There were vast differences depending on how pg_ctl is started (interleaved
  or not), so in the end I've decided to show results relative to both.

1. In run2/seqconcurrscans I've saturated my interconnect and that's why
  it's giving 129-155% there. I don't have access to physical hw, but I suspect
  that a modern 2-socket EPYC5 has ~614GB/s of RAM bandwidth per socket,
  while the max one-way bandwidth of the interconnect is around ~220GB/s
  (no way to prove it), so *IF* with hundreds of cores we were able to
  fetch at this rate, we could saturate modern hardware that way too (and
  we briefly touched on a related topic: the batched executor, accelerating
  it so much that those effects could be more easily achievable)

2. run3 has no partitioning because, according to perf and my eyes, the time
  was spent not on the buffers themselves (it was way heavier on CPU
  [partitioning] than on memory...), so that's how run3 was born without
  partitions :D

3. The warmup is critical for run3/pgbenchS, as I've noticed that, depending
  on ${luck}, if you start the "master" (baseline w/o interleaving) and pgbench
  it right away, everything might land on node0 (s_b, pagecache), so "master"
  was basically cheating in benchmarks, especially vs your patchset, where
  it was spreading way too soon. Having drop_caches, an additional warmup, and
  only then the proper pgbench kind of reduces that luck factor. In general I
  think all runs with c=1 seem to have a kind of low signal-to-noise ratio. I
  was thinking about pinning to always stick to the same NUMA node from the
  start to win against master just for these c=1 scenarios, but "meh".

3b. In short, for pgbench -S we can gain like 2-5%

4. run4 was made just to prove that a workload fetching more buffers than
  the standard pgbench -S (1 row?) seems to be the key to demonstrating the
  optimizations in 0008 (other than showing good benefits for seqconcurrscans,
  of course). So run4 just shows the benefit compared to 0001-0007 alone.

Still on the table:

1. maybe even better balancing is possible (?), but this one seems good enough?
   I'm out of other ideas, well other than the
   "shared-relation-use-by-foreign-node" idea described much earlier (but
   I won't be able to pull that off), so I'm not entering this rabbit hole
   any deeper.

2. Digging into io_method=worker optimizations (answering the question: are
   they necessary?). Maybe I'll throw in run5 quite soon, as this is going to
   be crucial to answer.

3. Potentially the BAS strategies mentioned earlier (forcing use of just the
   local partitions for known-to-be-only-local users: CTAS/VACUUM/etc), but
   I'm afraid that's not for me, as I would certainly break/violate some
   boundary that is invisible to me.

Maybe you could run those run*.sh with master vs inst-patchset/optimized?
(I'm not sure, maybe there's even a different factor at play too...)

-J.

Title: Performance Evaluation Matrix

Table 1: Performance Relative to Master Default (100% Baseline)

Values within ±2% of the baseline are considered noise and are uncolored.

Benchmark	Clients	master default (Baseline)	master interleave	optimized (numa=off, bal=off)	optimized (numa=on, bal=off)	optimized (numa=on, bal=on)	patched (numa=off, bal=off)	patched (numa=on, bal=off)	patched (numa=on, bal=on)
pgbenchS	1	100.00%	99.09%	98.80%	100.30%	101.56%	101.91%	101.77%	101.28%
pgbenchS	8	100.00%	100.34%	99.64%	100.06%	99.80%	100.78%	100.87%	101.24%
pgbenchS	32	100.00%	100.16%	99.59%	99.40%	99.53%	100.20%	100.16%	100.06%
seqconcurrscans	1	100.00%	77.29%	100.68%	97.11%	102.78%	97.76%	95.93%	75.08%
seqconcurrscans	8	100.00%	71.27%	86.54%	112.56%	110.55%	99.79%	103.61%	99.50%
seqconcurrscans	32	100.00%	94.33%	100.38%	118.67%	122.60%	109.97%	108.72%	107.00%

Table 2: Performance Relative to Master Interleave (100% Baseline)

Values within ±2% of the baseline are considered noise and are uncolored.

Benchmark	Clients	master interleave (Baseline)	master default	optimized (numa=off, bal=off)	optimized (numa=on, bal=off)	optimized (numa=on, bal=on)	patched (numa=off, bal=off)	patched (numa=on, bal=off)	patched (numa=on, bal=on)
pgbenchS	1	100.00%	100.92%	99.71%	101.22%	102.49%	102.84%	102.70%	102.21%
pgbenchS	8	100.00%	99.66%	99.30%	99.72%	99.46%	100.43%	100.53%	100.90%
pgbenchS	32	100.00%	99.84%	99.44%	99.25%	99.37%	100.04%	100.00%	99.90%
seqconcurrscans	1	100.00%	129.39%	130.27%	125.65%	132.98%	126.49%	124.13%	97.14%
seqconcurrscans	8	100.00%	140.31%	121.42%	157.93%	155.12%	140.01%	145.37%	139.60%
seqconcurrscans	32	100.00%	106.02%	106.42%	125.81%	129.97%	116.58%	115.26%	113.44%

numabenchhackersreview-2026-06-30.tgz
Description: application/compressed-tar

Title: Performance Evaluation Matrix

Table 1: Performance Relative to Master Default (100% Baseline)

Values within ±2% of the baseline are considered noise and are uncolored.

Benchmark	Clients	master default (Baseline)	master interleave	optimized (numa=off, bal=off)	optimized (numa=on, bal=off)	optimized (numa=on, bal=on)	patched (numa=off, bal=off)	patched (numa=on, bal=off)	patched (numa=on, bal=on)
pgbenchS	1	100.00%	118.07%	115.29%	98.55%	102.76%	115.50%	98.88%	113.40%
pgbenchS	8	100.00%	102.12%	102.74%	104.22%	105.09%	104.36%	104.70%	103.59%
pgbenchS	32	100.00%	100.15%	98.98%	98.67%	99.27%	100.38%	100.32%	100.51%

Table 2: Performance Relative to Master Interleave (100% Baseline)

Values within ±2% of the baseline are considered noise and are uncolored.

Benchmark	Clients	master interleave (Baseline)	master default	optimized (numa=off, bal=off)	optimized (numa=on, bal=off)	optimized (numa=on, bal=on)	patched (numa=off, bal=off)	patched (numa=on, bal=off)	patched (numa=on, bal=on)
pgbenchS	1	100.00%	84.69%	97.64%	83.46%	87.03%	97.82%	83.74%	96.04%
pgbenchS	8	100.00%	97.93%	100.61%	102.06%	102.91%	102.20%	102.53%	101.45%
pgbenchS	32	100.00%	99.85%	98.84%	98.53%	99.12%	100.23%	100.17%	100.36%

From 8513d188ed5ed999e72fc3a58046bbc1ff9f5688 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Tue, 30 Jun 2026 14:22:02 +0200
Subject: [PATCH v20260630-0008] clock-sweep: cached CPU/NUMA node and more
 locality-aware balancing

Enhancements on top of 0001-0007, to have slightly better NUMA locality
and performance.

1. Cache numa_node_of_cpu()/sched_getcpu() per backend in
   ClockSweepPartitionIndex(), refreshing every CLOCKSWEEP_CPU_NODE_REFRESH
   allocations rather than on every call (visible hot buffer path in perf)

2. CLOCKSWEEP_CUTOFF_THRESHOLD - make it less likely to redirect on any
   surplus of allocations (so scatter buffers LESS onto remote nodes).
   With this, a partition redirects its allocations to other (remote?)
   partitions only when its allocations exceed the per-partition average
   by more than this percentage.

3. Avoid redirects to "idle" partitions: a redirect target partition
   must have some traffic of its own, at least half of the per-partition
   average (avg_allocs / 2). This eliminates cold partitions, but we can
   still reach them using the scan-all-partitions fallback.
---
 src/backend/storage/buffer/freelist.c | 85 +++++++++++++++++++++++----
 1 file changed, 74 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e677c71e0b3..d64c2c67eb6 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -55,6 +55,9 @@
  */
 #define CLOCKSWEEP_HISTORY_COEFF	0.5
 
+/* How often should a backend re-fetch the CPU/node it is running on? */
+#define CLOCKSWEEP_CPU_NODE_REFRESH	128
+
 /*
  * GUCs controlling the NUMA-aware clock-sweep behavior.
  *
@@ -70,6 +73,7 @@
  * clocksweep_scan_all_partitions - when enabled, looking for a free buffer
  * scans all clock-sweep partitions (in a round-robin way), not just the
  * backend's "home" partition.
+ *
  */
 bool		clocksweep_balance = true;
 bool		clocksweep_balance_recalc = true;
@@ -368,13 +372,29 @@ ClockSweepPartitionIndex(void)
 #ifdef USE_LIBNUMA
 	if (shared_buffers_numa)
 	{
-		int		cpu;
+		/*
+		 * Cache the CPU/NUMA node, refreshing only every CLOCKSWEEP_CPU_NODE_REFRESH
+		 * allocations. It appears that sched_getcpu()/numa_node_of_cpu() are not free:
+		 * on some platforms they cost a full system call, while others (x86_64?)
+		 * can use the vDSO optimization. The backend rarely migrates between NUMA
+		 * nodes, and the balance logic only needs to notice migration after some time,
+		 * so an occasional refresh is good enough.
+		 */
+		static int		cached_node = -1;
+		static uint32	refresh_counter = 0;
+
+		if (cached_node < 0 || (refresh_counter++ % CLOCKSWEEP_CPU_NODE_REFRESH) == 0)
+		{
+		  int cpu;
 
-		/* XXX do we need to check sched_getcpu is available, somehow? */
-		if ((cpu = sched_getcpu()) < 0)
+		  /* XXX do we need to check sched_getcpu is available, somehow? */
+		  if ((cpu = sched_getcpu()) < 0)
 			elog(ERROR, "sched_getcpu failed: %m");
 
-		node = numa_node_of_cpu(cpu);
+		  /* XXX/JW: use libnuma wrapper for this */
+		  cached_node = numa_node_of_cpu(cpu);
+		}
+		node = cached_node;
 	}
 #endif
 
@@ -768,7 +788,8 @@ StrategySyncBalance(void)
 
 	uint32	total_allocs = 0,	/* total number of allocations */
 			avg_allocs,			/* average allocations (per partition) */
-			delta_allocs = 0;	/* sum of allocs above average */
+			delta_allocs = 0,	/* sum of allocs above average */
+			redirect_cutoff;	/* redirect only above this many allocs */
 
 	if (!clocksweep_balance || !clocksweep_balance_recalc)
 		return;
@@ -852,6 +873,20 @@ StrategySyncBalance(void)
 		return;
 	}
 
+	/*
+	 * A partition only redirects allocations to other partitions when it
+	 * exceeds the average by more than some threshold percent.
+	 * Below this cutoff we keep allocations local, to preserve NUMA locality.
+	 *
+	 * TODO: maybe a better value is possible. On 4s with 25 I've got good results,
+	 *       but with a value of 50 I've got slight degradation. Maybe it should
+	 *       be equal to 100/numa_nodes?
+	 *
+	 */
+#define CLOCKSWEEP_CUTOFF_THRESHOLD 25
+	redirect_cutoff = avg_allocs +
+		(uint32) ((uint64) avg_allocs * CLOCKSWEEP_CUTOFF_THRESHOLD / 100);
+
 	/*
 	 * The actual rebalancing
 	 *
@@ -884,10 +919,15 @@ StrategySyncBalance(void)
 		/* reset the weights to start from scratch */
 		memset(balance, 0, sizeof(uint8) * MAX_BUFFER_PARTITIONS);
 
-		/* does this partition has fewer or more than avg_allocs? */
-		if (allocs[i] < avg_allocs)
+		/*
+		 * Does this partition exceed its fair share by more than the
+		 * threshold? If not, keep all allocations local - redirecting them
+		 * would push memory onto remote NUMA nodes for no real benefit when
+		 * the load is already close to balanced.
+		 */
+		if (allocs[i] <= redirect_cutoff)
 		{
-			/* fewer - don't redirect any allocations elsewhere */
+			/* near fair share (or below) - keep allocations local */
 			balance[i] = 100;
 		}
 		else
@@ -902,22 +942,45 @@ StrategySyncBalance(void)
 			/* fraction of the "total" delta */
 			double	delta_frac = (allocs[i] - avg_allocs) * 1.0 / delta_allocs;
 
-			/* keep just enough allocations to meet the target */
-			balance[i] = (100.0 * avg_allocs / allocs[i]);
+			/* how much we keep local; we hand out the rest below */
+			int		kept = 100;
 
 			/* redirect the extra allocations */
 			for (int j = 0; j < StrategyControl->num_partitions; j++)
 			{
 				/* How many allocations to receive from i-th partition? */
 				uint32	receive_allocs = delta_frac * (avg_allocs - allocs[j]);
+				int		w;
+
+				/* do not redirect to ourselves */
+				if (j == i)
+					continue;
 
 				/* ignore partitions that don't need additional allocations */
 				if (allocs[j] > avg_allocs)
 					continue;
 
+				/*
+				 * Only use other partitions that actually have demand of
+				 * their own (avoid idle). If we fail, there's always the
+				 * scan-all-partitions fallback.
+				 *
+				 * TODO: just guessing, heuristics
+				 */
+				if (allocs[j] < (avg_allocs / 2))
+					continue;
+
 				/* fraction to redirect */
-				balance[j] = (100.0 * receive_allocs / allocs[i]) + 0.5;
+				w = (int) ((100.0 * receive_allocs / allocs[i]) + 0.5);
+				balance[j] = w;
+				kept -= w;
 			}
+
+			/* avoid negative balances */
+			if (kept > 0)
+				balance[i] = kept;
+			else
+				balance[i] = 1;
 		}
 
 		/* combine the old and new weights (hysteresis) */
-- 
2.43.0

Title: Performance Evaluation Matrix

Table 1: Performance Relative to Master Default (100% Baseline)

Values within ±2% of the baseline are considered noise and are uncolored.

Benchmark	Clients	master default (Baseline)	master interleave	optimized (numa=off, bal=off)	optimized (numa=on, bal=off)	optimized (numa=on, bal=on)	patched (numa=off, bal=off)	patched (numa=on, bal=off)	patched (numa=on, bal=on)
pgbenchS100krows	1	100.00%	91.38%	93.38%	98.57%	94.32%	81.72%	84.25%	76.65%
pgbenchS100krows	8	100.00%	100.98%	100.79%	100.75%	102.13%	87.25%	87.75%	87.32%
pgbenchS100krows	32	100.00%	100.79%	101.87%	103.10%	103.76%	90.77%	92.42%	92.02%

Table 2: Performance Relative to Master Interleave (100% Baseline)

Values within ±2% of the baseline are considered noise and are uncolored.

Benchmark	Clients	master interleave (Baseline)	master default	optimized (numa=off, bal=off)	optimized (numa=on, bal=off)	optimized (numa=on, bal=on)	patched (numa=off, bal=off)	patched (numa=on, bal=off)	patched (numa=on, bal=on)
pgbenchS100krows	1	100.00%	109.43%	102.18%	107.86%	103.22%	89.43%	92.19%	83.87%
pgbenchS100krows	8	100.00%	99.03%	99.80%	99.77%	101.14%	86.40%	86.90%	86.47%
pgbenchS100krows	32	100.00%	99.22%	101.07%	102.30%	102.95%	90.06%	91.69%	91.30%
