On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra <to...@vondra.me> wrote:
Hi Tomas, just a quick look here:

> 2) The PGPROC part introduces a similar registry, [..]
>
> There's also a view pg_buffercache_pgproc. The pg_buffercache location
> is a bit bogus - it has nothing to do with buffers, but it was good
> enough for now.

If you are looking for a better name, pg_shmem_pgproc_numa would sound
more natural.

> 3) The PGPROC partitioning is reworked and should fix the crash with
> the GUC set to "off".

Thanks!

> simple benchmark
> ----------------
[..]
> There's results for the three "pgbench pinning" strategies, and that can
> have pretty significant impact (colocated generally performs much better
> than either "none" or "random").

Hint: in the real world, network cards usually sit in a PCI slot assigned
to a certain node (so traffic flows from/to there), so it would probably
make sense to run pgbench outside this machine, removing this as a
"variable" and removing the need for that pgbench --pin-cpus in the
script. In optimal conditions, the most optimized layout would probably
be two cards in separate PCI slots, each for a different node, with LACP
between them and an xmit_hash_policy that distributes traffic across both
cards -- usually there isn't just a single IP/MAC talking to/from such a
server, so that would be real-world affinity (or lack of it).

Also, the classic pgbench workload seems to be a poor fit for testing
this out (at least v3-0001 buffers); there I would propose sticking to
lots of big (~s_b size) full-table seq scans to put stress on shared
memory. By my measurements, classic pgbench is usually not enough to put
serious bandwidth on the interconnect.

> For the "bigger" machine (with 176 cores) the incremental results look
> like this (for pinning=none, i.e. regular pgbench):
>
>  mode      s_b   buffers localal no-tail freelist sweep pgproc pinning
>  ====================================================================
>  prepared  16GB    99%    101%    100%    103%    111%    99%   102%
>            32GB    98%    102%     99%    103%    107%   101%   112%
>             8GB    97%    102%    100%    102%    101%   101%   106%
>  --------------------------------------------------------------------
>  simple    16GB   100%    100%     99%    105%    108%    99%   108%
>            32GB    98%    101%    100%    103%    100%   101%    97%
>             8GB   100%    100%    101%     99%    100%   104%   104%
>
> The way I read this is that the first three patches have about no impact
> on throughput. Then freelist partitioning and (especially) clocksweep
> partitioning can help quite a bit. pgproc is again close to ~0%, and
> PGPROC pinning can help again (but this part is merely experimental).

Isn't the "pinning" column just numa_procs_pin=on? (Shouldn't it also be
tested with numa_procs_interleave=on?)

[..]
> To quantify this kind of improvement, I think we'll need tests that
> intentionally cause (or try to) imbalance. If you have ideas for such
> tests, let me know.

Some ideas:
1. concurrent seq scans hitting an s_b-sized table
2. one single giant PX-enabled seq scan with $VCPU workers (stresses the
   importance of interleaving dynamic shm for workers)
3. select txid_current() with -M prepared?

> reserving number of huge pages
> ------------------------------
[..]
> It took me ages to realize what's happening, but it's very simple. The
> nr_hugepages is a global limit, but it's also translated into limits for
> each NUMA node. So when you write 16828 to it, in a 4-node system each
> node gets 1/4 of that. See
>
>   $ numastat -cm
>
> Then we do the mmap(), and everything looks great, because there really
> is enough huge pages and the system can allocate memory from any NUMA
> node it needs.

Yup, a similar story as with OOMs, just per-zone/node.

> And then we come around, and do the numa_tonode_memory(). And that's
> where the issues start, because AFAIK this does not check the per-node
> limit of huge pages in any way. It just appears to work. And then later,
> when we finally touch the buffer, it tries to actually allocate the
> memory on the node, and realizes there's not enough huge pages. And
> triggers the SIGBUS.

I think that's why options for strict-policy NUMA allocation exist, and I
had the option to use it in my patches (anyway, with one big call to
numa_interleave_memory() for everything it was much simpler and avoided
micromanaging things). numa(3) is a good read, but e.g. mbind(2)
underneath will tell you that `Before Linux 5.7. MPOL_MF_STRICT was
ignored on huge page mappings.` (I was on 6.14.x, but it could be
happening for you too if you start using it). Anyway, numa_set_strict()
is just a wrapper around setting this exact flag.

And remember that volatile pg_numa_touch_mem_if_required()? Maybe it
should always be called in your patch series to pre-populate everything
during startup, so that others testing this get the proper guaranteed
layout even without issuing any pg_buffercache calls.
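Something along these lines is what I mean -- strictly a rough sketch
from me, not code from your patch series: bind_partition_to_node(),
part_ptr/part_size/node and the 2MB huge page size are made-up
placeholders, and in the real thing this would presumably sit wherever
the registry already calls numa_tonode_memory():

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

#define ASSUMED_HUGE_PAGE_SIZE	(2 * 1024 * 1024)	/* placeholder */

/*
 * Bind one shared memory partition to "node" with libnuma's strict flag
 * set, then read one byte per huge page right away (same idea as
 * pg_numa_touch_mem_if_required()), so the pages actually get allocated
 * on the intended node during startup and any per-node huge page
 * shortage shows up immediately instead of as a SIGBUS much later.
 */
void
bind_partition_to_node(char *part_ptr, size_t part_size, int node)
{
	volatile char sink;
	size_t		off;

	if (numa_available() < 0)
	{
		fprintf(stderr, "no NUMA support available\n");
		exit(1);
	}

	/* make the mbind() issued below carry MPOL_MF_STRICT */
	numa_set_strict(1);

	/* MPOL_BIND the whole range to the target node */
	numa_tonode_memory(part_ptr, part_size, node);

	/* fault every huge page in now, while we are still in startup */
	for (off = 0; off < part_size; off += ASSUMED_HUGE_PAGE_SIZE)
		sink = part_ptr[off];

	(void) sink;
}

(Builds with just -lnuma; whether the touch loop is cheap enough to
always run at startup for a large s_b is a separate question.)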
> The only way around this I found is by inflating the number of huge
> pages, significantly above the shared_memory_size_in_huge_pages value.
> Just to make sure the nodes get enough huge pages.
>
> I don't know what to do about this. It's quite annoying. If we only used
> huge pages for the partitioned parts, this wouldn't be a problem.

Meh, sacrificing a couple of huge pages (worst case 1GB?) just to get
NUMA affinity seems like a logical trade-off, doesn't it? But postgres -C
shared_memory_size_in_huge_pages still works OK to establish the exact
count for vm.nr_hugepages, right?

Regards,
-J.