On 8/29/25 16:38, Tomas Vondra wrote:
> On 8/29/25 16:26, Tomas Vondra wrote:
>> ...
>>
>> I've seen these failures after changing checksums in both directions,
>> both after enabling and disabling. But I've only ever seen this after
>> immediate shutdown, never after fast shutdown. (It's interesting the
>> pg_checksums failed only after fast shutdowns ...).
>>
> 
> Of course, right after I send a message, it fails after a fast shutdown,
> contradicting my observation ...
> 
>> Could it be that the redo happens to start from an older position, but
>> using the new checksum version?
>>
> 
> ... but it also provided more data supporting this hypothesis. I added
> logging of pg_current_wal_lsn() before / after changing checksums on the
> primary, and I see this:
> 
> 1) LSN before: 14/2B0F26A8
> 2) enable checksums
> 3) LSN after: 14/EE335D60
> 4) standby waits for 14/F4E786E8 (higher, likely thanks to pgbench)
> 5) standby restarts with -m fast
> 6) redo starts at 14/230043B0, which is *before* enabling checksums
> 
> I guess this is the root cause. A bit more detailed log attached.
> 

I kept stress testing this over the weekend, and I think I found two
issues causing the checksum failures, both on a single node and on a
standby:

1) no checkpoint in the "disable" path

In the "enable" path, a checkpoint it enforced before flipping the state
from "inprogress-on" to "on". It's hidden in the ProcessAllDatabases,
but it's there. But the "off" path does not do that, probably on the
assumption that we'll always see the writes in the WAL order, so that
we'll see the XLOG_CHECKSUMS setting checksums=off before seeing any
writes without checksums.

And in the happy path this works fine - the standby is happy, etc. But
what about after a crash / immediate shutdown? Consider a sequence like
this:

a) we have checksums=on
b) write to page P, updating the checksum
c) start disabling checksums
d) progress to inprogress-off
e) progress to off
f) write to page P, without checksum update
g) the write to P gets evicted (small shared buffers, ...)
h) crash / immediate shutdown

Recovery starts from an LSN before (a), so we believe checksums=on. We
try to redo the write to P, which starts by reading the page from disk
to check the page LSN. We still think checksums=on, and to read the LSN
we need to verify the checksum. But the page was modified without the
checksum, and evicted. Kaboom!

This is not that hard to trigger by hand. Add a long sleep at the end of
SetDataChecksumsOff, start a pgbench with a scale larger than shared
buffers, and call pg_disable_data_checksums(). Once it gets stuck on the
sleep, give pgbench enough time to dirty and evict some pages, then kill
-9. On recovery you should get the same checksum failures.
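
(The "long sleep" is nothing clever, just a debugging hack along these
lines, placed at the end of SetDataChecksumsOff:)

    /*
     * Debugging only: stall after the state flip, so pgbench has time to
     * dirty and evict pages written without checksums, and we can kill -9
     * the cluster by hand.
     */
    pg_usleep(60L * USECS_PER_SEC);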

FWIW I've only ever seen failures on the fsm/vm forks, which matches what
I see in the TAP tests. But isn't that a bit strange?


I think the "disable" path needs a checkpoint between inprogress-off and
off states, same as the "enable" path.
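
Roughly what I have in mind is this (just a sketch, not literally what the
branch does - the two Transition* helpers are placeholders for whatever
the patch does, only RequestCheckpoint() and its flags are real; on
current master they show up as "fast force wait" in the checkpoint log
line, older branches spell the first one CHECKPOINT_IMMEDIATE):

    /* sketch of the "disable" path with the extra checkpoint */
    TransitionToInProgressOff();    /* placeholder: XLOG_CHECKSUMS + barrier,
                                     * stop verifying, keep writing checksums */

    /*
     * Force a checkpoint so the redo pointer moves past the state change.
     * Otherwise crash recovery may start before it, still believe
     * checksums=on, and fail to verify pages written after the final flip.
     */
    RequestCheckpoint(CHECKPOINT_FAST | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

    TransitionToOff();              /* placeholder: XLOG_CHECKSUMS + barrier,
                                     * checksums now fully off */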


2) no restart point on the standby

The standby has a similar issue, I think. Even if the primary creates
all the necessary checkpoints, the standby may not create restart points
for them. If you look into xlog_redo, it only "remembers" the checkpoint
position; it does not trigger a restart point. That only happens in
XLogPageRead, based on the distance from the previous one.

So a failure very similar to the one on the primary is possible, even
with the extra checkpoint fixing (1). The primary flips checksums in
either direction, generating checkpoints, but the standby does not create
restart points for them. It still applies the WAL, though, and some of
the pages written without checksums get evicted.

And when the standby then crashes or is shut down immediately, redo
starts from a position far back, and it runs into the same checksum
failure when trying to check the page LSN.

I think the standby needs some logic to force creation of a restart point
when the checksum flag changes.
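
Roughly something like this in xlog_redo() - again just a sketch of the
"static flag" hack I mention under (d) below; apart from
RequestCheckpoint(), RecoveryRestartPoint() and the record info codes,
the details are elided or assumed:

    static bool checksum_state_changed = false;

    ...
    else if (info == XLOG_CHECKSUMS)
    {
        /* apply the new checksum state as before (elided), and remember it */
        checksum_state_changed = true;
    }

    /* in the XLOG_CHECKPOINT_ONLINE / _SHUTDOWN branches, after the
     * existing RecoveryRestartPoint(&checkPoint, record) call: */
    if (checksum_state_changed)
    {
        /*
         * Force an actual restartpoint for the first checkpoint replayed
         * after a checksum state change, instead of waiting for
         * XLogPageRead to decide based on WAL distance.
         */
        RequestCheckpoint(CHECKPOINT_FAST | CHECKPOINT_WAIT);
        checksum_state_changed = false;
    }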


I have an experimental WIP branch at:

https://github.com/tvondra/postgres/tree/online-checksums-tap-tweaks

It fixes the TAP issues reported earlier (and a couple more), and it
does a bunch of additional tweaks:

a) A lot of debug messages that helped me figure this out. This is
probably way too much; especially the controlfile updates can be very
noisy on a standby.

b) Adds a simpler TAP test, covering just a single node (should be easier
to understand than the failures on a standby).

c) Adds explicit checkpoints, to fix (1). It probably adds too many
checkpoints, though? AFAICS a checkpoint after the "inprogress" phase
should be enough; the checkpoint after the "on/off" state can go away.

d) Forces creating a restart point on the first checkpoint after an
XLOG_CHECKSUMS record. It's done in a bit of a silly way, using a static
flag. Maybe there's a more elegant approach, say comparing the checksum
value in ControlFile to the one in the received checkpoint?

e) Randomizes a couple more GUC values. This needs more thought; it was
done blindly, before I better understood how the failures happen (it
requires buffers getting evicted, not hitting max_wal_size, ...). There
are more params worth randomizing (e.g. the "fast" flag).

Anyway, with (c) and (d) applied, the checksum failures go away. It may
not be 100% right (e.g. we could probably get by with fewer checkpoints),
but it seems to be the right direction.

I don't have time to clean up the branch more; I've already spent too
much time looking at LSNs advancing in weird ways :-( Hopefully it's
good enough to show what needs to be fixed. If there's a new version,
I'm happy to rerun the tests on my machines, ofc.


However, there are still more bugs. Attached is a log from a crash after
hitting the assert in AbsorbChecksumsOffBarrier:

 Assert((LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) &&
        (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
         LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION));

This happened while flipping checksums to 'off', but the backend already
thinks checksums are 'off':

    LocalDataChecksumVersion==0

I think this implies some bug in setting up LocalDataChecksumVersion for
new connections, because this came from a query checking the checksum
state, executed by the TAP test (in a new connection, right?).

I haven't looked into this more, but how come the "off" direction does
not need to check InitialDataChecksumTransition?
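
Purely speculating (I haven't verified the patch details - the signature
and the direct assignment below are assumptions): if a backend can start
after the shared state already flipped to "off" but before it absorbs the
pending barrier, maybe the absorb function simply needs to treat "already
off" as a no-op, similar to what the "on" direction presumably uses
InitialDataChecksumTransition for:

    bool
    AbsorbChecksumsOffBarrier(void)
    {
        /*
         * A backend that connected after the state already flipped may see
         * the barrier with checksums already disabled locally (the case
         * from the log above, LocalDataChecksumVersion == 0).
         */
        if (LocalDataChecksumVersion == 0)
            return true;        /* nothing to do */

        Assert(LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
               LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);

        LocalDataChecksumVersion = 0;
        return true;
    }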


I think the TAP test has turned out to be very useful so far. While
investigating this, I thought of a couple more tweaks to make it detect
additional issues (on top of the randomization).

- Right now the shutdowns/restarts happen only in very limited places.
The checksums flip from on to off or from off to on, and then a restart
happens. AFAICS it never happens in the "inprogress" phases, right?

- The pgbench clients connect once, so there are almost no new
connections while flipping checksums. Maybe some of the pgbench runs
should use "-C", to open a new connection for each transaction. It was
pretty lucky the TAP query hit the assert; this would make it more likely.


regards

-- 
Tomas Vondra
# Postmaster PID for node "main" is 1822197
[03:19:48.808](2.752s) ok 3198 - ensure checksums are set to on
[03:19:48.843](0.035s) # LSN before: B1/B72958C8
[03:19:48.846](0.003s) # disable_data_checksums fast true
[03:19:48.902](0.056s) ok 3199 - ensure data checksums are transitioned to off
[03:19:48.920](0.018s) # LSN after: B1/B73696F0
[03:19:50.996](2.076s) ok 3200 - ensure data pages can be read back on primary
[03:19:52.997](2.001s) # iteration 718 of 1000
# Running: pg_isready --timeout 180 --host /tmp/QlNkDRW0vA --port 18699
/tmp/QlNkDRW0vA:18699 - accepting connections
[03:19:55.542](2.545s) ok 3201 - ensure checksums are set to off
[03:19:55.554](0.012s) # LSN before: B1/BC2E1B00
[03:19:55.555](0.001s) # enable_data_checksums delay 0 limit 100 fast true
[03:19:55.603](0.048s) ok 3202 - ensure data checksums are transitioned to inprogress-on
[03:20:20.561](24.958s) ok 3203 - ensure data checksums are transitioned to on
[03:20:20.589](0.027s) # LSN after: B2/2FDA8620
[03:20:22.041](1.452s) ok 3204 - ensure data pages can be read back on primary
[03:20:23.042](1.001s) # iteration 719 of 1000
# Running: pg_isready --timeout 180 --host /tmp/QlNkDRW0vA --port 18699
/tmp/QlNkDRW0vA:18699 - accepting connections
[03:20:26.418](3.376s) ok 3205 - ensure checksums are set to on
[03:20:28.435](2.018s) # LSN before: B2/3569F408
[03:20:28.438](0.003s) # disable_data_checksums fast true
ack Broken pipe: write( 13, 'SELECT setting FROM pg_catalog.pg_settings WHERE name = 'data_checksums';' ) at /usr/share/perl5/IPC/Run/IO.pm line 550.
# No postmaster PID for node "main"
[03:21:02.179](33.741s) # Tests were run but no plan was declared and done_testing() was not seen.
[03:21:02.180](0.000s) # Looks like your test exited with 32 just after 3205.



000 s; distance=62367 kB, estimate=71775 kB; lsn=B2/2FCF2CA8, redo lsn=B2/2FCEE3A0, checksums=2 (2)
2025-09-01 03:20:20.524 BST checkpointer[1822201] LOG:  checkpoint starting: fast force wait
2025-09-01 03:20:20.533 BST checkpointer[1822201] LOG:  CreateCheckPoint ControlFile->data_checksum_version = 1 ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:20.533 BST checkpointer[1822201] LOG:  update_controlfile ControlFile->data_checksum_version = 1  ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:20.533 BST checkpointer[1822201] LOG:  checkpoint complete: wrote 31 buffers (24.2%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.009 s; sync files=0, longest=0.000 s, average=0.000 s; distance=53 kB, estimate=64603 kB; lsn=B2/2FD21908, redo lsn=B2/2FCFBA00, checksums=1 (1)
2025-09-01 03:20:24.703 BST checkpointer[1822201] LOG:  checkpoints are occurring too frequently (4 seconds apart)
2025-09-01 03:20:24.703 BST checkpointer[1822201] HINT:  Consider increasing the configuration parameter "max_wal_size".
2025-09-01 03:20:24.703 BST checkpointer[1822201] LOG:  checkpoint starting: wal
2025-09-01 03:20:26.202 BST checkpointer[1822201] LOG:  CreateCheckPoint ControlFile->data_checksum_version = 1 ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:26.202 BST checkpointer[1822201] LOG:  update_controlfile ControlFile->data_checksum_version = 1  ControlFile->checkPointCopy.data_checksum_version = 1
2025-09-01 03:20:26.202 BST checkpointer[1822201] LOG:  checkpoint complete: wrote 4 buffers (3.1%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 4 recycled; write=1.498 s, sync=0.001 s, total=1.499 s; sync files=0, longest=0.000 s, average=0.000 s; distance=52249 kB, estimate=63368 kB; lsn=B2/342F3E38, redo lsn=B2/33002080, checksums=1 (1)
2025-09-01 03:20:28.468 BST client backend[1822725] 006_pgbench_single.pl LOG:  disable_data_checksums fast 1
2025-09-01 03:20:28.468 BST client backend[1822725] 006_pgbench_single.pl STATEMENT:  SELECT pg_disable_data_checksums(true);
2025-09-01 03:20:28.479 BST checkpointer[1822201] LOG:  checkpoint starting: fast force wait
2025-09-01 03:20:28.480 BST checkpointer[1822201] LOG:  CreateCheckPoint ControlFile->data_checksum_version = 3 ControlFile->checkPointCopy.data_checksum_version = 3
2025-09-01 03:20:28.480 BST checkpointer[1822201] LOG:  update_controlfile ControlFile->data_checksum_version = 3  ControlFile->checkPointCopy.data_checksum_version = 3
2025-09-01 03:20:28.481 BST checkpointer[1822201] LOG:  checkpoint complete: wrote 40 buffers (31.2%), wrote 4 SLRU buffers; 0 WAL file(s) added, 0 removed, 2 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=39883 kB, estimate=61019 kB; lsn=B2/356F6040, redo lsn=B2/356F4E78, checksums=3 (3)
TRAP: failed Assert("(LocalDataChecksumVersion != PG_DATA_CHECKSUM_VERSION) && (LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION || LocalDataChecksumVersion == PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION)"), File: "xlog.c", Line: 5002, PID: 1822728
postgres: main: debian postgres [local] authentication(ExceptionalCondition+0x84)[0x55564d0428b4]
postgres: main: debian postgres [local] authentication(AbsorbChecksumsOffBarrier+0x5c)[0x55564c96f764]
postgres: main: debian postgres [local] authentication(ProcessProcSignalBarrier+0x1f0)[0x55564ce0d364]
postgres: main: debian postgres [local] authentication(ProcessInterrupts+0x90c)[0x55564ce4f0c0]
postgres: main: debian postgres [local] authentication(ClientAuthentication+0x50)[0x55564cbd1930]
postgres: main: debian postgres [local] authentication(+0x8eb940)[0x55564d05b940]
postgres: main: debian postgres [local] authentication(InitPostgres+0x3f8)[0x55564d05c8b0]
postgres: main: debian postgres [local] authentication(PostgresMain+0x2b4)[0x55564ce50588]
postgres: main: debian postgres [local] authentication(+0x6d6a74)[0x55564ce46a74]
postgres: main: debian postgres [local] authentication(postmaster_child_launch+0x1d4)[0x55564cd33684]
postgres: main: debian postgres [local] authentication(+0x5ca9c4)[0x55564cd3a9c4]
postgres: main: debian postgres [local] authentication(+0x5c78a4)[0x55564cd378a4]
postgres: main: debian postgres [local] authentication(PostmasterMain+0x153c)[0x55564cd37118]
postgres: main: debian postgres [local] authentication(main+0x3d8)[0x55564cbe452c]
/lib/aarch64-linux-gnu/libc.so.6(+0x27744)[0x7fffa1a47744]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0x7fffa1a47818]
postgres: main: debian postgres [local] authentication(_start+0x30)[0x55564c8562f0]
2025-09-01 03:20:28.513 BST postmaster[1822197] LOG:  client backend (PID 1822728) was terminated by signal 6: Aborted
2025-09-01 03:20:28.513 BST postmaster[1822197] LOG:  terminating any other active server processes
2025-09-01 03:20:28.515 BST postmaster[1822197] LOG:  shutting down because "restart_after_crash" is off
2025-09-01 03:20:28.516 BST postmaster[1822197] LOG:  database system is shut down
