On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <to...@vondra.me> wrote:
Hi Tomas, just a quick look here:

> 2) The PGPROC part introduces a similar registry, [..]
>
> There's also a view pg_buffercache_pgproc. The pg_buffercache location
> is a bit bogus - it has nothing to do with buffers, but it was good
> enough for now.

If you are looking for a better name, pg_shmem_pgproc_numa would sound
more natural.

> 3) The PGPROC partitioning is reworked and should fix the crash with
> the GUC set to "off".

Thanks!

> simple benchmark
> ----------------
[..]
> There's results for the three "pgbench pinning" strategies, and that can
> have pretty significant impact (colocated generally performs much better
> than either "none" or "random").

Hint: in the real world, network cards usually sit in a PCI slot assigned
to a certain node (so traffic flows from/to there), so it would probably
make sense to run pgbench outside this machine, removing this as a
"variable" and removing the need for that pgbench --pin-cpus in the
script. In optimal conditions, the most optimized layout would probably
be two cards in separate PCI slots, each for a different node, with LACP
between them and an xmit_hash_policy that distributes traffic across both
cards -- usually there isn't just a single IP/MAC talking to/from such a
server, so that would be real-world affinity (or lack of it).

Also, the classic pgbench workload seems to be a poor fit for testing
this out (at least v3-0001 buffers); there I would propose sticking to
lots of big (~s_b size) full-table seq scans to put stress on shared
memory. By my measurements, classic pgbench is usually not enough to put
serious bandwidth on the interconnect.

> For the "bigger" machine (with 176 cores) the incremental results look
> like this (for pinning=none, i.e. regular pgbench):
>
>  mode      s_b   buffers localal no-tail freelist sweep pgproc pinning
>  ====================================================================
>  prepared  16GB    99%    101%    100%    103%    111%    99%   102%
>            32GB    98%    102%     99%    103%    107%   101%   112%
>             8GB    97%    102%    100%    102%    101%   101%   106%
>  --------------------------------------------------------------------
>  simple    16GB   100%    100%     99%    105%    108%    99%   108%
>            32GB    98%    101%    100%    103%    100%   101%    97%
>             8GB   100%    100%    101%     99%    100%   104%   104%
>
> The way I read this is that the first three patches have about no impact
> on throughput. Then freelist partitioning and (especially) clocksweep
> partitioning can help quite a bit. pgproc is again close to ~0%, and
> PGPROC pinning can help again (but this part is merely experimental).

Isn't the "pinning" column just numa_procs_pin=on? (Shouldn't it also be
tested with numa_procs_interleave=on?)

[..]
> To quantify this kind of improvement, I think we'll need tests that
> intentionally cause (or try to) imbalance. If you have ideas for such
> tests, let me know.

Some ideas:
1. concurrent seq scans hitting an s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses the
   importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?

> reserving number of huge pages
> ------------------------------
[..]
> It took me ages to realize what's happening, but it's very simple. The
> nr_hugepages is a global limit, but it's also translated into limits for
> each NUMA node. So when you write 16828 to it, in a 4-node system each
> node gets 1/4 of that. See
>
>   $ numastat -cm
>
> Then we do the mmap(), and everything looks great, because there really
> is enough huge pages and the system can allocate memory from any NUMA
> node it needs.

Yup, a similar story as with OOMs, just per-zone/node.

> And then we come around, and do the numa_tonode_memory(). And that's
> where the issues start, because AFAIK this does not check the per-node
> limit of huge pages in any way. It just appears to work. And then later,
> when we finally touch the buffer, it tries to actually allocate the
> memory on the node, and realizes there's not enough huge pages. And
> triggers the SIGBUS.

I think that's why options for strict-policy NUMA allocation exist, and I
had the option to use it in my patches (anyway, with one big call to
numa_interleave_memory() for everything it was much simpler and avoided
micromanaging things). numa(3) is a good read, but e.g. mbind(2)
underneath will tell you that `Before Linux 5.7. MPOL_MF_STRICT was
ignored on huge page mappings.` (I was on 6.14.x, but it could be
happening for you too if you start using it). Anyway, numa_set_strict()
is just a wrapper around setting this exact flag.

And remember that volatile pg_numa_touch_mem_if_required()? Maybe it
should always be called in your patch series to pre-populate everything
during startup, so that others testing this get the proper guaranteed
layout even without issuing any pg_buffercache calls.
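Something along these lines is what I mean -- strictly a rough sketch
from me, not code from your patch series: bind_partition_to_node(),
part_ptr/part_size/node and the 2MB huge page size are made-up
placeholders, and in the real thing this would presumably sit wherever
the registry already calls numa_tonode_memory():

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

#define ASSUMED_HUGE_PAGE_SIZE	(2 * 1024 * 1024)	/* placeholder */

/*
 * Bind one shared memory partition to "node" with libnuma's strict flag
 * set, then read one byte per huge page right away (same idea as
 * pg_numa_touch_mem_if_required()), so the pages actually get allocated
 * on the intended node during startup and any per-node huge page
 * shortage shows up immediately instead of as a SIGBUS much later.
 */
void
bind_partition_to_node(char *part_ptr, size_t part_size, int node)
{
	volatile char sink;
	size_t		off;

	if (numa_available() < 0)
	{
		fprintf(stderr, "no NUMA support available\n");
		exit(1);
	}

	/* make the mbind() issued below carry MPOL_MF_STRICT */
	numa_set_strict(1);

	/* MPOL_BIND the whole range to the target node */
	numa_tonode_memory(part_ptr, part_size, node);

	/* fault every huge page in now, while we are still in startup */
	for (off = 0; off < part_size; off += ASSUMED_HUGE_PAGE_SIZE)
		sink = part_ptr[off];

	(void) sink;
}

(Builds with just -lnuma; whether the touch loop is cheap enough to
always run at startup for a large s_b is a separate question.)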
> The only way around this I found is by inflating the number of huge
> pages, significantly above the shared_memory_size_in_huge_pages value.
> Just to make sure the nodes get enough huge pages.
>
> I don't know what to do about this. It's quite annoying. If we only used
> huge pages for the partitioned parts, this wouldn't be a problem.

Meh, sacrificing a couple of huge pages (worst case 1GB?) just to get
NUMA affinity seems like a logical trade-off, doesn't it? But postgres -C
shared_memory_size_in_huge_pages still works OK to establish the exact
count for vm.nr_hugepages, right?

Regards,
-J.