Hi,

On 2026-01-13 02:13:40 +0100, Tomas Vondra wrote:
> On the azure VM (scale 200, 32GB sb), there's still no difference:

One possibility is that the host is configured with memory interleaving.
That configures the memory so that physical memory addresses interleave
between the different NUMA nodes, instead of really being node local. That
can help avoid bad performance characteristics for NUMA-naive applications.
I don't quite know how to figure that out though, particularly from within
a VM :(. Even something like
https://github.com/nviennot/core-to-core-latency or intel's mlc will not
necessarily be helpful, because it depends on which node the measured
cacheline ends up on. But I'd probably still test it, just to see whether
you're observing very different latencies between the systems.
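FWIW, if you want to poke at this from inside the guest, a quick
pointer-chasing probe may be more telling than mlc's defaults, because you
control where the memory is placed. Below is an untested sketch (file name
and constants made up, the libnuma calls like numa_run_on_node() and
numa_alloc_onnode() are the stock ones; build with -lnuma): it pins the
thread to node 0 and measures dependent-load latency against memory
explicitly allocated on each node the guest exposes.

/*
 * numa_lat.c - rough sketch, not a real benchmark: measure dependent-load
 * latency from a CPU on node 0 against memory allocated on each NUMA node.
 * Build with: gcc -O2 numa_lat.c -o numa_lat -lnuma
 */
#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE    (256UL * 1024 * 1024)   /* well past LLC size */
#define STRIDE      64                      /* one cache line */
#define NSLOTS      (BUF_SIZE / STRIDE)
#define NLOADS      (16UL * 1024 * 1024)

/* build a single random cycle over the cache lines, then chase it */
static double
chase(uint64_t *buf)
{
    struct timespec start, end;
    uint64_t idx = 0;
    size_t i;

    /* Sattolo's algorithm: identity, then swaps => one big cycle; this
     * also faults the pages in on the node the buffer was bound to */
    for (i = 0; i < NSLOTS; i++)
        buf[i * STRIDE / 8] = i;
    for (i = NSLOTS - 1; i > 0; i--)
    {
        size_t j = (size_t) random() % i;
        uint64_t tmp = buf[i * STRIDE / 8];

        buf[i * STRIDE / 8] = buf[j * STRIDE / 8];
        buf[j * STRIDE / 8] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < NLOADS; i++)
        idx = buf[idx * STRIDE / 8];    /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (idx == (uint64_t) -1)           /* keep the loop from being elided */
        printf("unreachable\n");

    return ((end.tv_sec - start.tv_sec) * 1e9 +
            (end.tv_nsec - start.tv_nsec)) / NLOADS;
}

int
main(void)
{
    int node;

    if (numa_available() < 0)
    {
        fprintf(stderr, "no NUMA support visible in this guest\n");
        return 1;
    }

    /* stay on node 0's CPUs, read memory placed on each node in turn */
    numa_run_on_node(0);

    for (node = 0; node <= numa_max_node(); node++)
    {
        uint64_t *buf = numa_alloc_onnode(BUF_SIZE, node);

        if (buf == NULL)
        {
            fprintf(stderr, "allocation on node %d failed\n", node);
            continue;
        }
        printf("node 0 -> node %d: %.1f ns/load\n", node, chase(buf));
        numa_free(buf, BUF_SIZE);
    }
    return 0;
}

On bare metal you'd expect the remote-node numbers to be clearly higher; if
everything comes out within noise of each other inside the VM, that doesn't
prove interleaving, but it would at least be suggestive of it (or of the
guest topology not mapping to the host's at all).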
> > Interestingly I do see a performance difference, albeit a smaller one, even
> > with OFFSET. I see similar numbers on two different 2 socket machines.
>
> I wonder how significant is the number of sockets. The Azure is a single
> socket with 2 NUMA nodes, so maybe the latency differences are not
> significant enough to affect this kind of tests.

Ah, yes, a single socket machine might not show that much of an increase, at
least in simpler cases. One of my workstations has two sockets, and each
socket has two NUMA nodes; the latency difference between the local NUMA node
and the other NUMA node in the same socket is small, but the difference to
the other socket is ~1.5x.

Using intel's mlc:

Measuring idle latencies for sequential access (in ns)...
                Numa node
Numa node            0       1       2       3
       0          98.6   106.9   157.6   167.9
       1         105.8    99.4   158.4   170.5
       2         157.2   167.4   103.6   105.6
       3         158.4   171.2   104.5   104.3

So there's about a 2-10ns latency difference between 0,1 and 2,3, but about a
50-60ns difference across sockets...

> The xeon is a 2-socket machine, but it's also older (~10y).

It's perhaps worth noting that memory access latency has been *in*creasing in
the last generation or two of hardware...

Greetings,

Andres Freund
