On 8/29/25 16:26, Tomas Vondra wrote: > ... > > I've seen these failures after changing checksums in both directions, > both after enabling and disabling. But I've only ever saw this after > immediate shutdown, never after fast shutdown. (It's interesting the > pg_checksums failed only after fast shutdowns ...). >
Of course, right after I send a message, it fails after a fast shutdown, contradicting my observation ... > Could it be that the redo happens to start from an older position, but > using the new checksum version? > ... but it also provided more data supporting this hypothesis. I added logging of pg_current_wal_lsn() before / after changing checksums on the primary, and I see this: 1) LSN before: 14/2B0F26A8 2) enable checksums 3) LSN after: 14/EE335D60 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench) 5) standby restarts with -m fast 6) redo starts at 14/230043B0, which is *before* enabling checksums I guess this is the root cause. A bit more detailed log attached. regards -- Tomas Vondra
# Postmaster PID for node "standby_1" is 742124 [16:23:56.601](0.642s) ok 350 - ensure checksums are set to off [16:23:56.611](0.010s) ok 351 - ensure checksums are set to off >> LSN before: 14/2B0F26A8 [16:23:56.642](0.031s) ok 352 - ensure data checksums are transitioned to inprogress-on Waiting for replication conn standby_1's replay_lsn to pass 14/2BC7BD70 on main done [16:23:56.674](0.032s) ok 353 - ensure standby has absorbed the inprogress-on barrier [16:23:56.684](0.010s) ok 354 - ensure checksums are on, or in progress, on standby_1 [16:24:16.449](19.765s) ok 355 - ensure data checksums are transitioned to on >> LSN after: 14/EE335D60 [16:24:17.468](1.019s) ok 356 - ensure data checksums are transitioned to on [16:24:17.634](0.166s) ok 357 - ensure data pages can be read back on primary Waiting for replication conn standby_1's write_lsn to pass 14/F4E786E8 on main done ### Stopping node "standby_1" using mode fast # Running: pg_ctl --pgdata /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/t_006_concurrent_pgbench_standby_1_data/pgdata --mode fast stop waiting for server to shut down.... done server stopped # No postmaster PID for node "standby_1" # Running: pg_isready --timeout 180 --host /tmp/gmkA4nVVNY --port 15509 /tmp/gmkA4nVVNY:15509 - accepting connections # Running: pg_isready --timeout 180 --host /tmp/gmkA4nVVNY --port 15510 /tmp/gmkA4nVVNY:15510 - no response [16:24:17.768](0.134s) not ok 358 - no checksum validation errors in standby_1 log [16:24:17.769](0.000s) [16:24:17.769](0.000s) # Failed test 'no checksum validation errors in standby_1 log' # at t/006_concurrent_pgbench.pl line 298. [16:24:17.769](0.000s) # '2025-08-29 16:23:55.976 CEST postmaster[742124] LOG: starting PostgreSQL 19devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14+deb12u1) 12.2.0, 64-bit # 2025-08-29 16:23:55.976 CEST postmaster[742124] LOG: listening on Unix socket "/tmp/gmkA4nVVNY/.s.PGSQL.15510" # 2025-08-29 16:23:55.978 CEST startup[742130] LOG: database system was interrupted while in recovery at log time 2025-08-29 16:23:52 CEST # 2025-08-29 16:23:55.978 CEST startup[742130] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. # 2025-08-29 16:23:55.979 CEST startup[742130] LOG: entering standby mode # 2025-08-29 16:23:55.979 CEST startup[742130] LOG: redo starts at 14/230043B0 # 2025-08-29 16:23:55.991 CEST startup[742130] LOG: page verification failed, calculated checksum 42986 but expected 43570 # 2025-08-29 16:23:55.991 CEST startup[742130] CONTEXT: WAL redo at 14/232FFE00 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 4885280, isCatalogRel: F, nplans: 0, nredirected: 1, ndead: 0, nunused: 0, redirected: [59->66]; blkref #0: rel 1663/5/16 407, blk 167272 # 2025-08-29 16:23:55.991 CEST startup[742130] LOG: invalid page in block 43 of relation "base/5/16407_fsm"; zeroing out page # 2025-08-29 16:23:55.991 CEST startup[742130] CONTEXT: WAL redo at 14/232FFE00 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 4885280, isCatalogRel: F, nplans: 0, nredirected: 1, ndead: 0, nunused: 0, redirected: [59->66]; blkref #0: rel 1663/5/16407, blk 167272 # 2025-08-29 16:23:55.991 CEST startup[742130] WARNING: invalid page in block 43 of relation "base/5/16407_fsm"; zeroing out page # 2025-08-29 16:23:55.991 CEST startup[742130] CONTEXT: WAL redo at 14/232FFE00 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 4885280, isCatalogRel: F, nplans: 0, nredirected: 1, ndead: 0, nunused: 0, redirected: [59->66]; blkref #0: rel 1663/5/16407, blk 167272
