On Thu, Jun 25, 2026 at 3:49 PM Tomas Vondra <[email protected]> wrote:
>
> >> I have some results from a new round of benchmarks, and it's a bit
> >> disappointing. Or rather, there seem to be some issues that I can't
> >> figure out, causing regressions.
> > [..]
> >> This chart is for median latency (in milliseconds):
> >>
> >>   clients       master     0003      0004    0003/on    0004/on
> >>   -------------------------------------------------------------
> >>         1        12767    12582     14509      12807      15307
> >>         8        14383    14355     14149      14069      16165
> >>        32        14756    15198     14836      14984      17128
> >>        --------------------------------------------------------
> >>         1                  103%      114%       100%       120%
> >>         8                  101%       98%        98%       112%
> >>        32                  102%      101%       102%       116%
> >>
> >
> > I haven't tried it yet, however I can spot some things:
> >
> > No crystal clear idea why, but in the script I can see that you have
> > io_method = io_uring and are not dropping caches, so IMHO it is too complex
> > an interaction at this stage.
> >
>
> By caches I assume you mean page cache? The test is meant to simulate a
> cached system, copying data between shared buffers and page cache. My
> expectation is that once we start hitting I/O, it'll completely hide
> most differences due to NUMA.

No, it won't completely hide it; those differences still matter, at least here
(AFAIR around +/- 10% right now).

> > One hint: such a setup is going to be problematic for proving numbers.
> > At the meeting I tried to describe that I've been using io_method = sync
> > instead of 'worker' to get more predictable results (together with
> > echo 3 > drop_caches), because then it is that backend's CPU/$NODE doing the
> > pread()/pwrite() -- or any other operation performing the load --
> > and it is going to put that file onto that_specific_$NODE --
> > so even if you have a sequence like:
> >     pgbench -i
> >     pg_ctl restart
> >     pgbench -c XX
> >
>
> Hmm, I missed that point during the meeting. I wonder if "worker" is
> interacting with the NUMA somehow (I mean, does it load it into the
> right node?). But I'm using io_uring, and it's not clear to me why sync
> would be better for benchmarking?
>
> Ultimately, we need to make sure it works well with io_uring anyway,
> right? Even if "sync" happens to be better for benchmarking (or even for
> NUMA stuff), we have to make it work with worker/io_uring. Because
> that's what practical systems use.

Yes, we need to make it work with the more advanced I/O methods, but I don't
think we are there yet (we'll need some more patches in order to demonstrate it
reliably).

> > then pgbench -i, even with shared_buffers_numa=on, will spread the buffers
> > across many nodes, yet after the restart the VFS cache portion of the data
> > will still reside on the single specific $NODE that wrote it to the filesystem
> > (due to local-first-touch-affinity even for VFS cache),
> > [.. blabla , use io_method=sync ]
> >
>
> Ah, you're suggesting the page cache stuff will be placed on a single
> NUMA node? That may be true, it's a good point. And maybe it could skew
> the results in a bad way.

I've just published [0], see for yourself:

This happens especially after pgbench -i, so:
     pgbench -i # pagecache placement on one NUMA node
     pg_ctl restart
     pgbench -c XX

is as different as night and day from, let's say:
     pgbench -i
     echo 3 > /proc/sys/vm/drop_caches
     pg_ctl restart
     pgbench -c XX # pagecache placement happens by many backends
                   # potentially many NUMA nodes
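
A minimal sketch of that second sequence as a script, in case it helps to rerun
it consistently. Assumptions on my side, not something used for the numbers in
this thread: the helper names run and cold_cache_bench are hypothetical,
DRY_RUN defaults to printing the commands instead of executing them, the extra
sync is there because drop_caches only discards clean pages, and the
pgbench/pg_ctl options (scale, database, PGDATA) are left out just as in the
sequences above. Writing to drop_caches needs root.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are hypothetical.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Initialize, drop the page cache, restart, then benchmark with $1 clients.
cold_cache_bench() {
    clients="$1"
    run pgbench -i
    run sync                                        # flush dirty pages first
    run sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # needs root
    run pg_ctl restart
    run pgbench -c "$clients"
}
```

Calling cold_cache_bench 8 only prints the five steps; DRY_RUN=0
cold_cache_bench 8 would actually execute them.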

> Still, that would be the case even without the NUMA partitioning, no?

Right, in my experience we should not benchmark against master started
with the default pg_ctl (that is, without numactl --interleave=all), because
it is confusing to reason about, given how s_b could be laid out without
that interleaving. I mean, later we can switch to that default, but IMHO not
yet.
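
To see how s_b actually got laid out in a given run, the N<node>=<pages>
fields from numa_maps can be tallied per node. A small sketch: the function
name numa_pages_per_node is hypothetical, and it assumes the usual numa_maps
line format with N0=..., N1=... fields. The counts are in units of each
mapping's page size, so only mappings with the same kernelpagesize_kB should
be summed together.

```shell
#!/bin/sh
# Sketch only: numa_pages_per_node is a hypothetical helper name.
# Reads numa_maps-style lines on stdin and prints "<node> <pages>" per node.
numa_pages_per_node() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # Per-node page counts look like N0=123, N1=456, ...
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[kv[1]] += kv[2]
            }
        }
    }
    END {
        for (n in pages) print n, pages[n]
    }' | sort
}
```

For example: grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps |
numa_pages_per_node. With numactl --interleave=all I'd expect roughly equal
counts per node; a single dominant node would point at the imbalance
described above.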

> > Maybe some other suggestions:
> >
> > Q1) Maybe some crosschecks first?
> >        # balance should be equal between nodes even for baseline
> >        # linux kernel has tendency to fit shm into one if it fits
> >        find /sys/devices/system/node*/ -name 'free_hugepages' \
> >            -exec grep -H . {} \;
> >
> >        # check N0 and N1 even for default policy, might also reveal
> >        # imbalance
> >        # lots of RAM and too big huge_pages allows fitting whole shm
> >        # into just N0
> >        # see point 4 from [1]
> >        grep /anon_h /proc/$SOMEREALBACKENDPID/numa_maps
> >
> >        # then during pgbench -c run maybe those:
> >        mpstat -N ALL 1
> >        perf stat -a -e \
> >           uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
> >           --per-socket -I 1000
> >        # or -M memory_bandwidth_read,memory_bandwidth_write
> >
> >     (it might reveal that problem I've described above about io_method:
> >     even with pgbench -c 1 you might be reading from all sockets/wrong
> >     sockets instead of the correct one with affinity)
> >
>
> I'll try, but if you could try running some experiments on your own,
> that might be helpful.
[..]
> > Hopefully next week I'll try to repro those numbers to see if I can
> > help more.
> >
>
> Thank you! That'd be great.

Yeah, I'll try my best, we'll see how it goes. Right now I've just dropped
that fscachenuma proggie to aid us in troubleshooting.

-J.

[0] - https://github.com/jakubwartakEDB/fscachenuma

