On 1/13/26 02:08, Andres Freund wrote:
> Hi,
>
> On 2026-01-13 01:51:09 +0100, Tomas Vondra wrote:
>> On 1/13/26 01:10, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
>>>> On 1/10/26 02:42, Andres Freund wrote:
>>>>> psql -Xq -c 'SELECT pg_buffercache_evict_all();' -c 'SELECT numa_node,
>>>>> sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' && perf stat
>>>>> --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a psql -c
>>>>> 'SELECT sum(abalance) FROM pgbench_accounts;'
>>>
>>>> And then I initialized pgbench with scale that is much larger than
>>>> shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
>>>> then I ran
>>>>
>>>> select * from pgbench_accounts offset 1000000000;
>>>>
>>>> which does a sequential scan with the circular buffer you mention above
>>>
>>> Did you try it with the query I suggested? One plausible reason why you did
>>> not see an effect with your query is that with a huge offset you actually
>>> never deform the tuple, which is an important and rather latency sensitive
>>> path.
>>>
>>
>> I did try with the agg query too, and there's still no difference on
>> either machine.
>
> Could you provide numactl --hardware for both? There may be more than two
> numa nodes on a system with 2 sockets, due to one socket being split into two
> - in which case the latency between 0,1 might be a lot lower than say 0 and 3.
>
xeon:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
node 0 size: 32066 MB
node 0 free: 13081 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
42 43 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
node 1 size: 32210 MB
node 1 free: 17764 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

azure/epyc:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 193412 MB
node 0 free: 147949 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
92 93 94 95
node 1 size: 193513 MB
node 1 free: 151577 MB
node distances:
node   0   1
  0:  10  11
  1:  11  10
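
For completeness, the cross-node numbers further down are simply the
server started under numactl, and the pg_shmem_allocations_numa query
from your command upthread is what I'd use to double-check where the
shared memory actually ended up. Roughly this (the -D/-l arguments are
just placeholders for the actual data directory / log file):

  numactl --membind=0 --cpunodebind=1 \
    ~/builds/master-test/bin/pg_ctl -D "$PGDATA" -l server.log start

  psql -Xq -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;'
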
>
>> I can't do the perf on the Azure VM, because the Ubuntu image is
>> borked and does not allow installing the package. But on my xeon I can
>> do the perf, and that gives me this:
>>
>> numactl --membind=0 --cpunodebind=0 ~/builds/master-test/bin/pg_ctl
>> -----------------------------------------------------------------------
>> S0 1 24,677,226 UNC_M_CAS_COUNT.WR # 79.0 MB/s ... idth_write
>> S0 1 20,001,829,522 ns duration_time ...
>> S0 1 972,631,426 UNC_M_CAS_COUNT.RD # 3112.2 MB/s ... idth_read
>> S0 1 20,001,822,807 ns duration_time ...
>> S1 1 15,602,233 UNC_M_CAS_COUNT.WR # 49.9 MB/s ... idth_write
>> S1 1 712,431,146 UNC_M_CAS_COUNT.RD # 2279.6 MB/s ... idth_read
>>
>>
>> numactl --membind=0 --cpunodebind=1 ~/builds/master-test/bin/pg_ctl
>> -----------------------------------------------------------------------
>> S0 1 47,931,019 UNC_M_CAS_COUNT.WR # 153.4 MB/s ... idth_write
>> S0 1 20,002,933,380 ns duration_time ...
>> S0 1 1,007,386,994 UNC_M_CAS_COUNT.RD # 3223.2 MB/s ... idth_read
>> S0 1 20,002,927,341 ns duration_time ...
>> S1 1 10,310,201 UNC_M_CAS_COUNT.WR # 33.0 MB/s ... idth_write
>> S1 1 714,826,668 UNC_M_CAS_COUNT.RD # 2287.2 MB/s ... idth_read
>>
>> so there is a little bit of a difference for some stats, but not much.
>>
>>
>> FWIW this is from
>>
>> perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write
>> -a -- sleep 20
>>
>> while the agg query runs in a loop.
>
> FWIW doing one perf stat for each execution is preferable for comparison,
> because otherwise you can hide large differences in total number of memory
> accesses if the runtimes for the queries in the two "numa configurations" are
> different.
>
Good point, I'll do that next time. But in this case they are not all
that different, I think.
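
Something along these lines, I suppose, i.e. wrapping each execution in
its own perf stat, reusing your command from upthread (the number of
repetitions is arbitrary):

  for i in $(seq 1 5); do
    psql -Xq -c 'SELECT pg_buffercache_evict_all();' \
             -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;'
    perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a \
      psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
  done
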
regards
--
Tomas Vondra