Hi,
On 2026-01-12 19:10:00 -0500, Andres Freund wrote:
> On 2026-01-13 00:58:49 +0100, Tomas Vondra wrote:
> > On 1/10/26 02:42, Andres Freund wrote:
> > > psql -Xq \
> > >   -c 'SELECT pg_buffercache_evict_all();' \
> > >   -c 'SELECT numa_node, sum(size) FROM pg_shmem_allocations_numa GROUP BY 1;' \
> > >   && perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a \
> > >      psql -c 'SELECT sum(abalance) FROM pgbench_accounts;'
>
> > And then I initialized pgbench with a scale that is much larger than
> > shared buffers, but fits into RAM. So cached, but definitely > NB/4. And
> > then I ran
> >
> > select * from pgbench_accounts offset 1000000000;
> >
> > which does a sequential scan with the circular buffer you mention above
>
> Did you try it with the query I suggested? One plausible reason why you did
> not see an effect with your query is that with a huge offset you actually
> never deform the tuple, which is an important and rather latency-sensitive
> path.
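(A quick way to convince yourself that both queries read the same amount of
table data, and that the difference is in the per-tuple work, is something
like the following - untested sketch:

  psql -Xq -c 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pgbench_accounts OFFSET 100000000;'
  psql -Xq -c 'EXPLAIN (ANALYZE, BUFFERS) SELECT sum(abalance) FROM pgbench_accounts;'

Both have to scan the whole table, so the Buffers numbers should be
comparable; only the aggregate actually has to deform each tuple to get at
abalance.)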
Btw, this doesn't need anywhere near as much data; it should be visible as
soon as the working set is >> L3.
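I.e. it's enough that the table is much bigger than the last-level cache,
which is easy to check (a sketch - lscpu's output format varies between
distros):

  lscpu | grep -i 'L3'
  psql -Xq -c "SELECT pg_size_pretty(pg_relation_size('pgbench_accounts'));"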
To show why

  SELECT * FROM pgbench_accounts OFFSET 100000000;

doesn't show an effect but

  SELECT sum(abalance) FROM pgbench_accounts;

does, just look at the difference using the perf command I posted. Here on a
scale-200 pgbench database.
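Roughly, the shape of each run below - a sketch rather than the literal
command lines, assuming postgres itself is started under the numactl binding
shown in the headings and the table is already cached in shared buffers:

  numactl --membind 0 --cpunodebind 0 pg_ctl -D "$PGDATA" start
  perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a \
      psql -Xq -c 'SELECT * FROM pgbench_accounts OFFSET 100000000;'
  perf stat --per-socket -M memory_bandwidth_read,memory_bandwidth_write -a \
      psql -Xq -c 'SELECT sum(abalance) FROM pgbench_accounts;'
  # second set of numbers: restart with --cpunodebind 1, keeping --membind 0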
numactl --membind 0 --cpunodebind 0
offset:

 S0        6         47,138,135      UNC_M_CAS_COUNT.WR    #  3884.1 MB/s  memory_bandwidth_write
 S0       20        780,343,577      duration_time
 S0        6         61,685,331      UNC_M_CAS_COUNT.RD    #  5082.8 MB/s  memory_bandwidth_read
 S0       20        780,353,818      duration_time
 S1        6          1,238,568      UNC_M_CAS_COUNT.WR    #   102.1 MB/s  memory_bandwidth_write
 S1        6          1,475,224      UNC_M_CAS_COUNT.RD    #   121.6 MB/s  memory_bandwidth_read

        0.776715450 seconds time elapsed
agg:

 S0        6         53,145,706      UNC_M_CAS_COUNT.WR    #  2000.8 MB/s  memory_bandwidth_write
 S0       20      1,706,046,493      duration_time
 S0        6        111,390,488      UNC_M_CAS_COUNT.RD    #  4193.5 MB/s  memory_bandwidth_read
 S0       20      1,706,057,341      duration_time
 S1        6          3,968,454      UNC_M_CAS_COUNT.WR    #   149.4 MB/s  memory_bandwidth_write
 S1        6          4,026,212      UNC_M_CAS_COUNT.RD    #   151.6 MB/s  memory_bandwidth_read
numactl --membind 0 --cpunodebind 1
offset:

 S0        6         91,982,003      UNC_M_CAS_COUNT.WR    #  7036.4 MB/s  memory_bandwidth_write
 S0       20        842,785,290      duration_time
 S0        6        113,076,316      UNC_M_CAS_COUNT.RD    #  8650.1 MB/s  memory_bandwidth_read
 S0       20        842,797,430      duration_time
 S1        6          1,545,612      UNC_M_CAS_COUNT.WR    #   118.2 MB/s  memory_bandwidth_write
 S1        6          2,354,087      UNC_M_CAS_COUNT.RD    #   180.1 MB/s  memory_bandwidth_read

        0.836623794 seconds time elapsed
agg:

 S0        6        133,267,754      UNC_M_CAS_COUNT.WR    #  3980.9 MB/s  memory_bandwidth_write
 S0       20      2,146,221,284      duration_time
 S0        6        159,951,549      UNC_M_CAS_COUNT.RD    #  4777.9 MB/s  memory_bandwidth_read
 S0       20      2,146,233,675      duration_time
 S1        6         71,543,708      UNC_M_CAS_COUNT.WR    #  2137.1 MB/s  memory_bandwidth_write
 S1        6         49,584,957      UNC_M_CAS_COUNT.RD    #  1481.2 MB/s  memory_bandwidth_read

        2.142535432 seconds time elapsed
Note how much bigger the absolute numbers of reads and writes are for the
aggregate than for the OFFSET query.

Interestingly, I do see a performance difference, albeit a smaller one, even
with OFFSET. I see similar numbers on two different two-socket machines.
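To put numbers on it: the cross-node binding costs the OFFSET query ~8%
(0.78s -> 0.84s) but the aggregate ~26% (1.71s -> 2.14s), and the aggregate
already moves about 1.8x the read volume on socket 0 in the node-local run
(~111M vs ~62M read CAS).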
Greetings,
Andres Freund