Hi,
On 2023-04-24 16:39:36 -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:
> > Hi,
> >
> > I noticed that the numbers in pg_stat_io don't quite add up to what I
> > expected in write heavy workloads. Particularly for checkpointer, the
> > numbers for "write" in log_checkpoints output are larger than what is
> > visible in pg_stat_io.
> >
> > That partially is because log_checkpoints' "write" covers way too many
> > things, but there's an issue with pg_stat_io as well:
> >
> > Checkpoints, and some other sources of writes, will often end up doing a lot
> > of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
> > pre-existing forms of IO statistics.
> >
> > It seems pretty clear that we should track writeback as well. I wonder
> > if it's worth doing so for 16? It'd give a more complete picture that
> > way. The counter-argument I see is that we didn't track the time for it
> > in existing stats either, and that nobody complained - but I suspect
> > that's mostly because nobody knew to look.
>
> Not complaining about making pg_stat_io more accurate, but what exactly
> would we be tracking for smgrwriteback()? I assume you are talking about
> IO timing. AFAICT, on Linux, it does sync_file_range() with
> SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
> tracking the system call overhead time?
It starts blocking once "enough" IO is in flight. For things like an
immediate checkpoint, that can happen fairly quickly, unless you have a very
fast IO subsystem. So often it'll not matter whether we track
smgrwriteback(), but when it matters, it can matter a lot.
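To illustrate why timing it is meaningful even though the call is nominally
asynchronous, here's a minimal standalone sketch (not from the PostgreSQL
sources; the file name is made up) of the Linux call that smgrwriteback()
boils down to. SYNC_FILE_RANGE_WRITE only initiates writeback, but the kernel
will block the caller once enough IO is already queued:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int			fd = open("some_relation_segment", O_RDWR);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* ask the kernel to start writing back the first 1MB of the file */
	if (sync_file_range(fd, 0, 1024 * 1024, SYNC_FILE_RANGE_WRITE) != 0)
		perror("sync_file_range");

	close(fd);
	return 0;
}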
As an example, I initialized a pgbench database with scale 1000, on a decent
but not great NVMe SSD. Created dirty data with:

c=96;/srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -M prepared -c$c -j$c -T30 -P1

and then measured the checkpoint:

perf trace -s -p $pid_of_checkpointer psql -x -c "SELECT pg_stat_reset_shared('io')" -c "checkpoint"
postgres (367444), 1891280 events, 100.0%

   syscall             calls  errors     total       min       avg       max   stddev
                                         (msec)    (msec)    (msec)    (msec)      (%)
   ---------------  --------  ------  --------  --------  --------  --------  -------
   sync_file_range    359176       0  4560.670     0.002     0.013   238.955   10.36%
   pwrite64           582964       0  2874.634     0.003     0.005     0.156    0.06%
   fsync                 242       0   251.631     0.001     1.040    42.166   18.81%
   openat                317      65     2.171     0.002     0.007     0.068    5.69%
   rename                 69       0     1.973     0.012     0.029     0.084    5.81%
   fdatasync               1       0     1.543     1.543     1.543     1.543    0.00%
   unlink                150     137     1.278     0.002     0.009     0.062   10.69%
   close                 250       0     0.694     0.001     0.003     0.007    3.14%
   newfstatat            140      68     0.667     0.001     0.005     0.022    7.26%
   write                   5       0     0.067     0.005     0.013     0.025   24.55%
   lseek                  14       0     0.050     0.001     0.004     0.018   33.87%
   getdents64              8       0     0.047     0.002     0.006     0.022   39.51%
   kill                    3       0     0.029     0.008     0.010     0.011   10.18%
   epoll_wait              2       1     0.006     0.000     0.003     0.006  100.00%
   read                    1       0     0.004     0.004     0.004     0.004    0.00%
Log output:
2023-04-24 14:11:59.234 PDT [367444][checkpointer][:0][] LOG: checkpoint
starting: immediate force wait
2023-04-24 14:12:09.236 PDT [367444][checkpointer][:0][] LOG: checkpoint
complete: wrote 595974 buffers (28.4%); 0 WAL file(s) added, 0 removed, 68
recycled; write=9.740 s, sync=0.057 s, total=10.002 s; sync files=27,
longest=0.043 s, average=0.003 s; distance=4467386 kB, estimate=4467386 kB;
lsn=6/E5D33F98, redo lsn=6/E5D33F28
# SELECT writes, write_time, fsyncs, fsync_time FROM pg_stat_io WHERE
backend_type = 'checkpointer';
┌────────┬────────────────────┬────────┬────────────┐
│ writes │ write_time │ fsyncs │ fsync_time │
├────────┼────────────────────┼────────┼────────────┤
│ 595914 │ 4002.1730000000002 │ 24 │ 46.359 │
└────────┴────────────────────┴────────┴────────────┘
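A rough sketch of what tracking this could look like (just to illustrate, not
an actual patch; IOOP_WRITEBACK is a hypothetical new IOOp value and the IO
context is hard-coded for simplicity): time the smgrwriteback() calls issued
from IssuePendingWritebacks() the same way pg_stat_io already times writes:

	instr_time	io_start;

	io_start = pgstat_prepare_io_time();

	/* issue the actual writeback request, as IssuePendingWritebacks() does */
	smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);

	/* hypothetical IOOP_WRITEBACK; IOCONTEXT_NORMAL is a simplification */
	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
							IOOP_WRITEBACK, io_start, nblocks);

That would make both the number of writeback requests and the time spent in
them visible in pg_stat_io, alongside writes and fsyncs.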
Greetings,
Andres Freund