On 7/30/25 10:29, Jakub Wartak wrote:
> On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <to...@vondra.me> wrote:
> 
> Hi Tomas,
> 
> just a quick look here:
> 
>> 2) The PGPROC part introduces a similar registry, [..]
>>
>> There's also a view pg_buffercache_pgproc. The pg_buffercache location
>> is a bit bogus - it has nothing to do with buffers, but it was good
>> enough for now.
> 
> If you are looking for better names: pg_shmem_pgproc_numa would sound
> like a more natural name.
> 
>> 3) The PGPROC partitioning is reworked and should fix the crash with the
>> GUC set to "off".
> 
> Thanks!
> 
>> simple benchmark
>> ----------------
> [..]
>> There are results for the three "pgbench pinning" strategies, and that can
>> have a pretty significant impact (colocated generally performs much better
>> than either "none" or "random").
> 
> Hint: in the real world, network cards are usually located in a PCI
> slot assigned to a certain node (so traffic flows from/to there), so it
> would probably make sense to put pgbench outside this machine anyway,
> removing this as a "variable" and removing the need for the pgbench
> --pin-cpus in the script. In optimal conditions, the most optimized
> layout would probably be 2 cards in separate PCI slots, each attached
> to a different node, with LACP between them and an xmit_hash_policy
> that distributes traffic across both cards -- usually there's not just
> a single IP/MAC talking to/from such a server, so that would be
> real-world affinity (or the lack of it).
> 

The pgbench pinning certainly reduces some of the noise / overhead you
get when using multiple machines. I use it to "isolate" patches, and
make the effects more visible.

> Also, the classic pgbench workload seems to be a poor fit for testing
> this (at least v3-0001, buffers); there I would propose sticking to
> just lots of big (~s_b-sized) full-table seq scans to put stress on
> shared memory. By my measurements, classic pgbench is usually not
> enough to put serious bandwidth on the interconnect.
> 

Yes, that's possible. The simple pgbench workload is a bit of a "worst
case" for the NUMA patches, in that it can benefit less from the
improvements, and it's also fairly sensitive to regressions.

I plan to do more tests with other types of workloads, like the one
doing a lot of large sequential scans, etc.

>> For the "bigger" machine (with 176 cores) the incremental results look
>> like this (for pinning=none, i.e. regular pgbench):
>>
>>
>>       mode   s_b buffers localal no-tail freelist sweep pgproc pinning
>>   ====================================================================
>>   prepared  16GB     99%    101%    100%     103%  111%    99%    102%
>>             32GB     98%    102%     99%     103%  107%   101%    112%
>>              8GB     97%    102%    100%     102%  101%   101%    106%
>>   --------------------------------------------------------------------
>>     simple  16GB    100%    100%     99%     105%  108%    99%    108%
>>             32GB     98%    101%    100%     103%  100%   101%     97%
>>              8GB    100%    100%    101%      99%  100%   104%    104%
>>
>> The way I read this is that the first three patches have essentially no
>> impact on throughput. Then freelist partitioning and (especially) clocksweep
>> partitioning can help quite a bit. pgproc is again close to ~0%, and
>> PGPROC pinning can help again (but this part is merely experimental).
> 
> Isn't the "pinning" column representing just numa_procs_pin=on ?
> (shouldn't it be tested with numa_procs_interleave = on?)
> 

Maybe I don't understand the question, but the last column (pinning)
compares two builds.

1) Build with all the patches up to "pgproc interleaving" (and all of
the GUCs set to "on").

2) Build with all the patches from (1), and "pinning" too (again, all
GUCs set to "on").

Or do I misunderstand the question?

> [..]
>> To quantify this kind of improvement, I think we'll need tests that
>> intentionally cause (or try to) imbalance. If you have ideas for such
>> tests, let me know.
> 
> Some ideas:
> 1. concurrent seq scans hitting s_b-sized table
> 2. one single giant PX-enabled seq scan with $VCPU workers (stresses
> the importance of interleaving dynamic shm for workers)
> 3. select txid_current() with -M prepared?
> 

Thanks. I think we'll try something like (1), but it'll need to be a bit
more elaborate, because scans on tables larger than 1/4 of shared buffers
use a small circular buffer and mostly bypass shared buffers.
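
For reference, this is the threshold I mean (rough numbers from memory,
not the literal heapam.c source; it assumes the default 8kB block size
and the 16GB shared_buffers case from the table above, and the
BAS_BULKREAD ring is only ~256kB):

    /*
     * Sketch, paraphrasing the heapam.c check from memory: a seq scan
     * on a table bigger than 1/4 of shared_buffers switches to the
     * small BAS_BULKREAD ring and leaves most of shared buffers
     * untouched.
     */
    #include <stdio.h>

    int
    main(void)
    {
        unsigned long long shared_buffers = 16ULL * 1024 * 1024 * 1024; /* 16GB */
        unsigned long long block_size = 8192;   /* default BLCKSZ */
        unsigned long long nbuffers = shared_buffers / block_size;
        unsigned long long threshold_blocks = nbuffers / 4;

        printf("with 16GB shared_buffers, tables over %llu blocks (~%llu MB)\n"
               "will use the ring buffer instead of shared buffers\n",
               threshold_blocks,
               threshold_blocks * block_size / (1024 * 1024));
        return 0;
    }

So it'll probably need multiple smaller tables (each staying below that
threshold), or some other way around the strategy.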

>> reserving number of huge pages
>> ------------------------------
> [..]
>> It took me ages to realize what's happening, but it's very simple. The
>> nr_hugepages is a global limit, but it's also translated into limits for
>> each NUMA node. So when you write 16828 to it, in a 4-node system each
>> node gets 1/4 of that. See
>>
>>   $ numastat -cm
>>
>> Then we do the mmap(), and everything looks great, because there really
>> is enough huge pages and the system can allocate memory from any NUMA
>> node it needs.
> 
> Yup, similar story as with OOMs, just per-zone/per-node.
> 
>> And then we come around, and do the numa_tonode_memory(). And that's
>> where the issues start, because AFAIK this does not check the per-node
>> limit of huge pages in any way. It just appears to work. And then later,
>> when we finally touch the buffer, it tries to actually allocate the
>> memory on the node, and realizes there's not enough huge pages. And
>> triggers the SIGBUS.
> 
> I think that's why the options for a strict NUMA allocation policy
> exist, and I had the option to use it in my patches (anyway, with one
> big call to numa_interleave_memory() for everything it was much
> simpler, just not micromanaging things). numa(3) is a good read, but
> e.g. mbind(2) underneath will tell you that `Before Linux 5.7.
> MPOL_MF_STRICT was ignored on huge page mappings.` (I was on 6.14.x,
> but it could be happening for you too if you start using it). Anyway,
> numa_set_strict() is just a wrapper around setting this exact flag.
> 
> Anyway, remember that volatile pg_numa_touch_mem_if_required()? Maybe
> that should always be called in your patch series, to pre-populate
> everything during startup, so that others testing this will get the
> proper guaranteed layout even without issuing any pg_buffercache calls.
> 

I think I tried using numa_set_strict, but it didn't change the behavior
(i.e. the numa_tonode_memory didn't error out).
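
To illustrate, here's a minimal standalone sketch (not the patch code;
it assumes 2MB huge pages, libnuma, and binds to node 0; the file name
numa_sigbus.c is just an example). Neither call returns a status to
check, and the failure only shows up when the memory is first touched:

    /*
     * Build: gcc -o numa_sigbus numa_sigbus.c -lnuma
     *
     * Maps a huge-page region, binds it to node 0 with "strict" set,
     * then touches it.
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <numa.h>

    #define SZ (64UL * 1024 * 1024)        /* 64MB = 32 x 2MB huge pages */

    int
    main(void)
    {
        void   *p;

        if (numa_available() < 0)
        {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }

        p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
        {
            perror("mmap");            /* fails only if the global pool is short */
            return 1;
        }

        /* ask for strict placement, then bind the range to node 0;
         * both functions are void, so there's nothing to check here */
        numa_set_strict(1);
        numa_tonode_memory(p, SZ, 0);

        /* the actual allocation happens at fault time -- if node 0 has
         * fewer than 32 free huge pages, this is where the SIGBUS hits */
        memset(p, 0, SZ);

        printf("touched %lu bytes on node 0\n", SZ);
        return 0;
    }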

>> The only way around this I found is by inflating the number of huge
>> pages, significantly above the shared_memory_size_in_huge_pages value.
>> Just to make sure the nodes get enough huge pages.
>>
>> I don't know what to do about this. It's quite annoying. If we only used
>> huge pages for the partitioned parts, this wouldn't be a problem.
> 
> Meh, sacrificing a couple of huge pages (worst case 1GB?) just to get
> NUMA affinity seems like a logical trade-off, doesn't it?
> But postgres -C shared_memory_size_in_huge_pages still works OK to
> establish the exact count for vm.nr_hugepages, right?
> 

Well, yes and no. It tells you the exact number of huge pages, but it
does not tell you how much you need to inflate it to account for the
non-shared buffer part that may get allocated on a random node.
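
FWIW the per-node pools are visible (and adjustable) under sysfs, so
something like the sketch below (assuming 2MB huge pages; for 1GB pages
the directory is hugepages-1048576kB, and it assumes nodes are numbered
consecutively from 0) can at least tell you up front whether each node
got a big enough share:

    /*
     * Print nr_hugepages / free_hugepages for each NUMA node, i.e. the
     * same numbers as "numastat -cm", read from
     * /sys/devices/system/node/nodeN/hugepages/hugepages-2048kB/
     */
    #include <stdio.h>

    static long
    read_node_value(int node, const char *file)
    {
        char    path[128];
        long    val = -1;
        FILE   *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/%s",
                 node, file);
        if ((f = fopen(path, "r")) == NULL)
            return -1;
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int
    main(void)
    {
        for (int node = 0;; node++)
        {
            long    nr = read_node_value(node, "nr_hugepages");
            long    nfree = read_node_value(node, "free_hugepages");

            if (nr < 0)
                break;          /* no more nodes */

            printf("node %d: nr_hugepages=%ld free_hugepages=%ld\n",
                   node, nr, nfree);
        }
        return 0;
    }

Writing to the per-node nr_hugepages file reserves pages on that
specific node, so instead of inflating the global vm.nr_hugepages it
should be possible to top up just the node(s) that end up short.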


regards

-- 
Tomas Vondra


