On 8/29/25 16:38, Tomas Vondra wrote:
> On 8/29/25 16:26, Tomas Vondra wrote:
>> ...
>>
>> I've seen these failures after changing checksums in both directions,
>> both after enabling and disabling. But I've only ever seen this after
>> an immediate shutdown, never after a fast shutdown. (It's interesting
>> that pg_checksums failed only after fast shutdowns ...)
>>
>
> Of course, right after I sent the message, it failed after a fast shutdown,
> contradicting my observation ...
>
>> Could it be that the redo happens to start from an older position, but
>> using the new checksum version?
>>
>
> ... but it also provided more data supporting this hypothesis. I added
> logging of pg_current_wal_lsn() before / after changing checksums on the
> primary, and I see this:
>
> 1) LSN before: 14/2B0F26A8
> 2) enable checksums
> 3) LSN after: 14/EE335D60
> 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
> 5) standby restarts with -m fast
> 6) redo starts at 14/230043B0, which is *before* enabling checksums
>
> I guess this is the root cause. A bit more detailed log attached.
>
I kept stress testing this over the weekend, and I think I found two issues causing the checksum failures, both on a single node and on a standby:

1) no checkpoint in the "disable" path

In the "enable" path, a checkpoint is enforced before flipping the state from "inprogress-on" to "on". It's hidden in ProcessAllDatabases, but it's there. But the "off" path does not do that, probably on the assumption that we'll always see the writes in WAL order, i.e. that we'll see the XLOG_CHECKSUMS record setting checksums=off before seeing any writes without checksums.

And in the happy path this works fine - the standby is happy, etc. But what about after a crash / immediate shutdown? Consider a sequence like this:

a) we have checksums=on
b) write to page P, updating the checksum
c) start disabling checksums
d) progress to inprogress-off
e) progress to off
f) write to page P, without updating the checksum
g) the write to P gets evicted (small shared buffers, ...)
h) crash / immediate shutdown

Recovery starts from an LSN before (a), so we believe checksums=on. We try to redo the write to P, which starts by reading the page from disk to check the page LSN. We still think checksums=on, and to read the LSN we need to verify the checksum. But the page was modified without the checksum, and evicted. Kabooom!

This is not that hard to trigger by hand. Add a long sleep at the end of SetDataChecksumsOff, start a pgbench on a scale larger than shared buffers, and call pg_disable_data_checksums(). Once it gets stuck on the sleep, give it more time to dirty and evict some pages, then kill -9. On recovery you should get the same checksum failures. FWIW I've only ever seen failures on fsm/vm forks, which matches what I see in the TAP tests. But isn't that a bit strange?

I think the "disable" path needs a checkpoint between the inprogress-off and off states, same as the "enable" path.

2) no restart point on the standby

The standby has a similar issue, I think. Even if the primary creates all the necessary checkpoints, the standby may not create restart points for them. If you look into xlog_redo, it only "remembers" the checkpoint position, it does not trigger a restart point. That only happens in XLogPageRead, based on the distance from the previous one.

So a very similar failure to the one on the primary is possible, even with the extra checkpoint fixing (1). The primary flips checksums in either direction, generating checkpoints, but the standby does not create the restart points. It still applies the WAL, though, and some of the pages without checksums get evicted. And then the standby fails, goes back to some redo position far in the past, and runs into the same checksum failure when trying to check the page LSN.

I think the standby needs some logic to force restart point creation when the checksum flag changes.
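To make that a bit more concrete, here's roughly the shape of the two fixes I have in mind. This is only a sketch, not the actual code on the WIP branch mentioned below - the helper names are made up, and the checkpoint flags are simply the ones the "enable" path already seems to request (the server log shows "checkpoint starting: fast force wait"):

    #include "postgres.h"

    #include "access/xlog.h"            /* CHECKPOINT_* flags */
    #include "postmaster/bgwriter.h"    /* RequestCheckpoint() */

    /*
     * Sketch for (1): after WAL-logging the transition out of an "inprogress"
     * state, wait for a checkpoint before advancing to the final on/off state,
     * so that crash recovery can never start from a redo point that still has
     * the old data_checksum_version in pg_control.
     */
    static void
    WaitForChecksumStateCheckpoint(void)    /* hypothetical helper */
    {
        RequestCheckpoint(CHECKPOINT_FAST | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
    }

    /*
     * Sketch for (2): the startup process remembers that it replayed an
     * XLOG_CHECKSUMS record (a static flag is enough), and then requests a
     * restart point unconditionally, instead of waiting for the WAL-distance
     * trigger in XLogPageRead().  During recovery the checkpointer turns such
     * a request into a restart point rather than a checkpoint.
     */
    static bool pendingChecksumRestartPoint = false;    /* set by XLOG_CHECKSUMS redo */

    static void
    MaybeForceChecksumRestartPoint(void)    /* hypothetical helper */
    {
        if (!pendingChecksumRestartPoint)
            return;

        RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
        pendingChecksumRestartPoint = false;
    }

The second helper would have to be called from the startup process, e.g. right after replaying a checkpoint record; the static flag is just the simplest way to express the idea.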
I have an experimental WIP branch at:

  https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks

It fixes the TAP issues reported earlier (and a couple more), and it does a bunch of additional tweaks:

a) A lot of debug messages that helped me figure this out. This is probably way too much - especially the controlfile updates can be very noisy on a standby.

b) Adds a simpler TAP test, exercising just a single node (should be easier to understand than failures on a standby).

c) Adds explicit checkpoints, to fix (1). It probably adds too many checkpoints, though? AFAICS a checkpoint after the "inprogress" phase should be enough, and the checkpoint after the "on/off" state can go away.

d) Forces creating a restart point on the first checkpoint after an XLOG_CHECKSUMS record. It's done in a bit of a silly way, using a static flag. Maybe there's a more elegant approach, say by comparing the checksum value in ControlFile to the received checkpoint?

e) Randomizes a couple more GUC values. This needs more thought - it was done blindly, before I better understood how the failures happen (they require buffers getting evicted, not hitting max_wal_size, ...). There are more params worth randomizing (e.g. the "fast" flag).

Anyway, with (c) and (d) applied, the checksum failures go away. It may not be 100% right (e.g. we could perhaps do with fewer checkpoints), but it seems to be the right direction. I don't have time to clean up the branch more - I've already spent too much time looking at LSNs advancing in weird ways :-( Hopefully it's good enough to show what needs to be fixed, etc. If there's a new version, I'm happy to rerun the tests on my machines, ofc.

However, there still are more bugs. Attached is a log from a crash after hitting the assert in AbsorbChecksumsOffBarrier:

  Assert((LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) &&
         (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
          LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION));

This happened while flipping checksums to 'off', but the backend already thinks checksums are 'off': LocalDataChecksumVersion == 0.

I think this implies some bug in setting up LocalDataChecksumVersion after connecting, because this is for a query checking the checksum state, executed by the TAP test (in a new connection, right?). I haven't looked into this more, but how come the "off" direction does not need to check InitialDataChecksumTransition?

I think the TAP test has turned out to be very useful so far. While investigating this, I thought about a couple more tweaks to make it detect additional issues (on top of the randomization):

- Right now the shutdowns/restarts happen only in very limited places. The checksums flip from on to off or from off to on, and then a restart happens. AFAICS it never happens in the "inprogress" phases, right?

- The pgbench clients connect once, so there are almost no new connections while flipping checksums. Maybe some of the pgbenches should run with "-C", to open new connections. It was pretty lucky that the TAP query hit the assert; this would make that more likely.

regards

--
Tomas Vondra
# Postmaster PID for node "main" is 1822197
[03:19:48.808](2.752s) ok 3198 - ensure checksums are set to on
[03:19:48.843](0.035s) # LSN before: B1/B72958C8
[03:19:48.846](0.003s) # disable_data_checksums fast true
[03:19:48.902](0.056s) ok 3199 - ensure data checksums are transitioned to off
[03:19:48.920](0.018s) # LSN after: B1/B73696F0
[03:19:50.996](2.076s) ok 3200 - ensure data pages can be read back on primary
[03:19:52.997](2.001s) # iteration 718 of 1000
# Running: pg_isready --timeout 180 --host /tmp/QlNkDRW0vA --port 18699
/tmp/QlNkDRW0vA:18699 - accepting connections
[03:19:55.542](2.545s) ok 3201 - ensure checksums are set to off
[03:19:55.554](0.012s) # LSN before: B1/BC2E1B00
[03:19:55.555](0.001s) # enable_data_checksums delay 0 limit 100 fast true
[03:19:55.603](0.048s) ok 3202 - ensure data checksums are transitioned to inprogress-on
[03:20:20.561](24.958s) ok 3203 - ensure data checksums are transitioned to on
[03:20:20.589](0.027s) # LSN after: B2/2FDA8620
[03:20:22.041](1.452s) ok 3204 - ensure data pages can be read back on primary
[03:20:23.042](1.001s) # iteration 719 of 1000
# Running: pg_isready --timeout 180 --host /tmp/QlNkDRW0vA --port 18699
/tmp/QlNkDRW0vA:18699 - accepting connections
[03:20:26.418](3.376s) ok 3205 - ensure checksums are set to on
[03:20:28.435](2.018s) # LSN before: B2/3569F408
[03:20:28.438](0.003s) # disable_data_checksums fast true
ack Broken pipe: write( 13, 'SELECT setting FROM pg_catalog.pg_settings WHERE name = 'data_checksums';' ) at /usr/share/perl5/IPC/Run/IO.pm line 550.
# No postmaster PID for node "main"
[03:21:02.179](33.741s) # Tests were run but no plan was declared and done_testing() was not seen.
[03:21:02.180](0.000s) # Looks like your test exited with 32 just after 3205.

000 s; distance=62367 kB, estimate=71775 kB; lsn=B2/2FCF2CA8, redo lsn=B2/2FCEE3A0, checksums=2 (2)
2025-09-01 03:20:20.524 BST checkpointer[1822201] LOG: checkpoint starting: fast force wait
2025-09-01 03:20:20.533 BST checkpointer[1822201] LOG: CreateCheckPoint ControlFile->data_checksum_version = 1 ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:20.533 BST checkpointer[1822201] LOG: update_controlfile ControlFile->data_checksum_version = 1 ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:20.533 BST checkpointer[1822201] LOG: checkpoint complete: wrote 31 buffers (24.2%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.009 s; sync files=0, longest=0.000 s, average=0.000 s; distance=53 kB, estimate=64603 kB; lsn=B2/2FD21908, redo lsn=B2/2FCFBA00, checksums=1 (1)
2025-09-01 03:20:24.703 BST checkpointer[1822201] LOG: checkpoints are occurring too frequently (4 seconds apart)
2025-09-01 03:20:24.703 BST checkpointer[1822201] HINT: Consider increasing the configuration parameter "max_wal_size".
2025-09-01 03:20:24.703 BST checkpointer[1822201] LOG: checkpoint starting: wal
2025-09-01 03:20:26.202 BST checkpointer[1822201] LOG: CreateCheckPoint ControlFile->data_checksum_version = 1 ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:26.202 BST checkpointer[1822201] LOG: update_controlfile ControlFile->data_checksum_version = 1 ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:26.202 BST checkpointer[1822201] LOG: checkpoint complete: wrote 4 buffers (3.1%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 4 recycled; write=1.498 s, sync=0.001 s, total=1.499 s; sync files=0, longest=0.000 s, average=0.000 s; distance=52249 kB, estimate=63368 kB; lsn=B2/342F3E38, redo lsn=B2/33002080, checksums=1 (1)
2025-09-01 03:20:28.468 BST client backend[1822725] 006_pgbench_single.pl LOG: disable_data_checksums fast 1
2025-09-01 03:20:28.468 BST client backend[1822725] 006_pgbench_single.pl STATEMENT: SELECT pg_disable_data_checksums(true);
2025-09-01 03:20:28.479 BST checkpointer[1822201] LOG: checkpoint starting: fast force wait
2025-09-01 03:20:28.480 BST checkpointer[1822201] LOG: CreateCheckPoint ControlFile->data_checksum_version = 3 ControlFile->checkPointCopy.data_checksum_version = 3
2025-09-01 03:20:28.480 BST checkpointer[1822201] LOG: update_controlfile ControlFile->data_checksum_version = 3 ControlFile->checkPointCopy.data_checksum_version = 3
2025-09-01 03:20:28.481 BST checkpointer[1822201] LOG: checkpoint complete: wrote 40 buffers (31.2%), wrote 4 SLRU buffers; 0 WAL file(s) added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=39883 kB, estimate=61019 kB; lsn=B2/356F6040, redo lsn=B2/356F4E78, checksums=3 (3)
TRAP: failed Assert("(LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) && (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION || LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION)"), File: "xlog.c", Line: 5002, PID: 1822728
postgres: main: debian postgres [local] authentication(ExceptionalCondition+0x84)[0x55564d0428b4]
postgres: main: debian postgres [local] authentication(AbsorbChecksumsOffBarrier+0x5c)[0x55564c96f764]
postgres: main: debian postgres [local] authentication(ProcessProcSignalBarrier+0x1f0)[0x55564ce0d364]
postgres: main: debian postgres [local] authentication(ProcessInterrupts+0x90c)[0x55564ce4f0c0]
postgres: main: debian postgres [local] authentication(ClientAuthentication+0x50)[0x55564cbd1930]
postgres: main: debian postgres [local] authentication(+0x8eb940)[0x55564d05b940]
postgres: main: debian postgres [local] authentication(InitPostgres+0x3f8)[0x55564d05c8b0]
postgres: main: debian postgres [local] authentication(PostgresMain+0x2b4)[0x55564ce50588]
postgres: main: debian postgres [local] authentication(+0x6d6a74)[0x55564ce46a74]
postgres: main: debian postgres [local] authentication(postmaster_child_launch+0x1d4)[0x55564cd33684]
postgres: main: debian postgres [local] authentication(+0x5ca9c4)[0x55564cd3a9c4]
postgres: main: debian postgres [local] authentication(+0x5c78a4)[0x55564cd378a4]
postgres: main: debian postgres [local] authentication(PostmasterMain+0x153c)[0x55564cd37118]
postgres: main: debian postgres [local] authentication(main+0x3d8)[0x55564cbe452c]
/lib/aarch64-linux-gnu/libc.so.6(+0x27744)[0x7fffa1a47744]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0x7fffa1a47818]
postgres: main: debian postgres [local] authentication(_start+0x30)[0x55564c8562f0]
2025-09-01 03:20:28.513 BST postmaster[1822197] LOG: client backend (PID 1822728) was terminated by signal 6: Aborted
2025-09-01 03:20:28.513 BST postmaster[1822197] LOG: terminating any other active server processes
2025-09-01 03:20:28.515 BST postmaster[1822197] LOG: shutting down because "restart_after_crash" is off
2025-09-01 03:20:28.516 BST postmaster[1822197] LOG: database system is shut down