On 8/27/25 14:42, Tomas Vondra wrote:
> On 8/27/25 14:39, Tomas Vondra wrote:
>> ...
>>
>> And this happened on Friday:
>>
>> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0
>> Author: Alexander Korotkov <[email protected]>
>> Date:   Fri Aug 22 18:44:39 2025 +0300
>>
>>     Revert "Get rid of WALBufMappingLock"
>>
>>     This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
>>     It appears that conditional variables are not suitable for use
>>     inside critical sections.  If WaitLatch()/WaitEventSetWaitBlock()
>>     face postmaster death, they exit, releasing all locks instead of
>>     PANIC.  In certain situations, this leads to data corruption.
>>
>>     ...
>>
>> I think it's very likely the checksums were broken by this. After all,
>> that linked thread has subject "VM corruption on standby" and I've only
>> ever seen checksum failures on standby on the _vm fork.
>>
> 
> Forgot to mention - I did try with c13070a27b reverted, and with that I
> can reproduce the checksum failures again (using the fixed TAP test).
> 
> It's not a definitive proof, but it's a hint c13070a27b63 was causing
> the checksum failures.
> 

Unfortunately, it seems I spoke too soon :-( I decided to test this on
multiple machines overnight, and it still fails on the slower ones.

Attached is a patch addressing a couple more issues, to makes the TAP
test work well enough. (Attached as .txt, to not confuse cfbot).

- The pgbench started by IPC::Run::start() needs to be finished, to
release resources. Otherwise it leaks file descriptors (and there's a
bunch of "defunct" pgbench processes), which may be a problem with
increased number of iterations.

- AFAICS the pgbench can't use stdin/stdout/stderr, otherwise the pipes
get broken when the command fails (after restart). I just used /dev/null
instead, the stdout/stderr was not used anyway (or was it?).

- commented out the pg_checksums call, because of the issues mentioned
earlier (I was trying to make it work by remembering the state, but it
seems to not make it into control file before shutdown occasionally)

I increased the number of iterations to 1000+ and ran it on three machines:

- ryzen (new machine from ~2024)
- xeon (old slow machine from ~2016)
- rpi5 (very slow machine)

I haven't seen a single failure on ryzen, after ~3000 iterations. But
both xeon and rpi5 show a number of failures. Xeon has about 35 reports
of 'Failed test', rpi5 and about 10.

My guess is it's something about timing. It works on the "fast" ryzen,
but fails on xeon which is ~3-4x slower. And rpi5, which is even slower.

The other reason why it seems unrelated to the reverted commit is that
it's not just about visibility maps (which was got corrupted). I see
checksum failures on VM and FSM. I think I forgot about the FSM cases,
and by the fact that I saw no failures on the ryzen post revert. But
clearly, other machines still have issues.

Another interesting fact is that the checksum failures happen both on
the primary and the standby, it's not just a standby issue. But again,
this sees to be machine-dependent. On the rpi5 I've only seen standby
issues. The xeon sees failures both on primary/standby (roughly 1:1).

There are more weird things. If I grep for page failures, I see this (a
more detailed log attached):

-----------
# 2025-08-28 22:33:28.195 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.197 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.199 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.201 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.206 CEST startup[177466] LOG:  page verification
failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.207 CEST startup[177466] LOG:  page verification
failed, calculated checksum 25350 but expected 44559
-----------

This is right after a single restart, while doing the recovery. The
weird thing is, this is all for just two FSM pages!

-----------
LOG:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing
out page
LOG:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing
out page
-----------

And the calculated/expected checksums repeat! It's just different WAL
records hitting the same page, and complaining about the same issue,
after claiming the page was zeroed out. Isn't that weird? How come the
page doesn't "get" the correct checksum after the first redo?

I've seen these failures after changing checksums in both directions,
both after enabling and disabling. But I've only ever saw this after
immediate shutdown, never after fast shutdown. (It's interesting the
pg_checksums failed only after fast shutdowns ...).

Could it be that the redo happens to start from an older position, but
using the new checksum version?


regards

-- 
Tomas Vondra
# 2025-08-28 22:33:28.195 CEST startup[177466] LOG:  page verification failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.195 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA04B460 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082091, isCatalogRel: F, nplans: 0, nredirected: 38, ndead: 0, nunused: 126, redirected: [1->102, 4->129, 6->199, 10->198, 12->135, 17->152, 19->99, 23->145, 27->177, 28->168, 32->69, 33->118, 38->127, 43->140, 44->170, 47->119, 48->200, 49->62, 54->201, 57->202, 79->115, 80->185, 90->100, 94->21, 120->107, 125->139, 133->167, 138->164, 141->93, 144->190, 156->187, 159->172, 160->59, 163->193, 173->98, 174->153, 176->181, 182->72], unused: [88, 8, 46, 76, 77, 158, 66, 192, 3, 7, 29, 81, 89, 154, 78, 165, 157, 56, 95, 83, 40, 194, 20, 97, 137, 74, 85, 16, 143, 148, 189, 24, 58, 162, 109, 175, 68, 92, 147, 30, 61, 65, 101, 155, 103, 108, 113, 124, 128, 131, 11, 37, 39, 51, 63, 166, 35, 114, 116, 70, 73, 186, 42, 50, 104, 84, 123, 179, 22, 87, 110, 134, 188, 18, 96, 191, 31, 53, 55, 195, 25, 151, 14, 196, 13, 75, 121, 132, 111, 122, 146, 150, 9, 117, 183, 26, 86, 171, 71, 41, 45, 112, 130, 149, 178, 52, 82, 169, 161, 5, 15, 34, 180, 60, 184, 142, 36, 197, 64, 126, 136, 106, 67, 105, 91, 2]; blkref #0: rel 1663/5/16410, blk 147
# 2025-08-28 22:33:28.195 CEST startup[177466] LOG:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page
# 2025-08-28 22:33:28.195 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA04B460 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082091, isCatalogRel: F, nplans: 0, nredirected: 38, ndead: 0, nunused: 126, redirected: [1->102, 4->129, 6->199, 10->198, 12->135, 17->152, 19->99, 23->145, 27->177, 28->168, 32->69, 33->118, 38->127, 43->140, 44->170, 47->119, 48->200, 49->62, 54->201, 57->202, 79->115, 80->185, 90->100, 94->21, 120->107, 125->139, 133->167, 138->164, 141->93, 144->190, 156->187, 159->172, 160->59, 163->193, 173->98, 174->153, 176->181, 182->72], unused: [88, 8, 46, 76, 77, 158, 66, 192, 3, 7, 29, 81, 89, 154, 78, 165, 157, 56, 95, 83, 40, 194, 20, 97, 137, 74, 85, 16, 143, 148, 189, 24, 58, 162, 109, 175, 68, 92, 147, 30, 61, 65, 101, 155, 103, 108, 113, 124, 128, 131, 11, 37, 39, 51, 63, 166, 35, 114, 116, 70, 73, 186, 42, 50, 104, 84, 123, 179, 22, 87, 110, 134, 188, 18, 96, 191, 31, 53, 55, 195, 25, 151, 14, 196, 13, 75, 121, 132, 111, 122, 146, 150, 9, 117, 183, 26, 86, 171, 71, 41, 45, 112, 130, 149, 178, 52, 82, 169, 161, 5, 15, 34, 180, 60, 184, 142, 36, 197, 64, 126, 136, 106, 67, 105, 91, 2]; blkref #0: rel 1663/5/16410, blk 147
# 2025-08-28 22:33:28.195 CEST startup[177466] WARNING:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page
# 2025-08-28 22:33:28.195 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA04B460 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082091, isCatalogRel: F, nplans: 0, nredirected: 38, ndead: 0, nunused: 126, redirected: [1->102, 4->129, 6->199, 10->198, 12->135, 17->152, 19->99, 23->145, 27->177, 28->168, 32->69, 33->118, 38->127, 43->140, 44->170, 47->119, 48->200, 49->62, 54->201, 57->202, 79->115, 80->185, 90->100, 94->21, 120->107, 125->139, 133->167, 138->164, 141->93, 144->190, 156->187, 159->172, 160->59, 163->193, 173->98, 174->153, 176->181, 182->72], unused: [88, 8, 46, 76, 77, 158, 66, 192, 3, 7, 29, 81, 89, 154, 78, 165, 157, 56, 95, 83, 40, 194, 20, 97, 137, 74, 85, 16, 143, 148, 189, 24, 58, 162, 109, 175, 68, 92, 147, 30, 61, 65, 101, 155, 103, 108, 113, 124, 128, 131, 11, 37, 39, 51, 63, 166, 35, 114, 116, 70, 73, 186, 42, 50, 104, 84, 123, 179, 22, 87, 110, 134, 188, 18, 96, 191, 31, 53, 55, 195, 25, 151, 14, 196, 13, 75, 121, 132, 111, 122, 146, 150, 9, 117, 183, 26, 86, 171, 71, 41, 45, 112, 130, 149, 178, 52, 82, 169, 161, 5, 15, 34, 180, 60, 184, 142, 36, 197, 64, 126, 136, 106, 67, 105, 91, 2]; blkref #0: rel 1663/5/16410, blk 147
# 2025-08-28 22:33:28.197 CEST startup[177466] LOG:  page verification failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.197 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA0DA580 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082155, isCatalogRel: F, nplans: 0, nredirected: 36, ndead: 0, nunused: 125, redirected: [1->130, 2->202, 3->178, 4->154, 5->88, 6->192, 7->170, 9->188, 10->133, 11->124, 12->155, 13->201, 14->129, 16->165, 17->200, 18->145, 19->80, 21->194, 22->181, 24->189, 25->190, 26->151, 27->70, 29->67, 32->174, 33->158, 34->152, 35->203, 38->149, 40->69, 54->150, 56->120, 60->132, 62->166, 84->176, 89->182], unused: [104, 48, 63, 116, 125, 128, 141, 42, 53, 82, 163, 137, 75, 110, 122, 139, 168, 61, 77, 94, 119, 185, 31, 198, 36, 76, 100, 177, 184, 20, 30, 65, 93, 159, 49, 169, 105, 193, 71, 44, 83, 195, 199, 78, 97, 127, 161, 23, 46, 47, 90, 108, 79, 102, 123, 157, 156, 87, 140, 142, 134, 73, 186, 50, 109, 72, 92, 113, 167, 171, 187, 59, 148, 41, 106, 172, 183, 66, 143, 28, 64, 112, 57, 131, 96, 126, 138, 179, 107, 146, 68, 95, 98, 180, 39, 55, 74, 160, 45, 101, 103, 114, 135, 164, 51, 58, 115, 111, 144, 121, 99, 86, 85, 81, 136, 196, 52, 117, 153, 173, 162, 43, 91, 118, 175]; blkref #0: rel 1663/5/16410, blk 153
# 2025-08-28 22:33:28.197 CEST startup[177466] LOG:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page
# 2025-08-28 22:33:28.197 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA0DA580 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082155, isCatalogRel: F, nplans: 0, nredirected: 36, ndead: 0, nunused: 125, redirected: [1->130, 2->202, 3->178, 4->154, 5->88, 6->192, 7->170, 9->188, 10->133, 11->124, 12->155, 13->201, 14->129, 16->165, 17->200, 18->145, 19->80, 21->194, 22->181, 24->189, 25->190, 26->151, 27->70, 29->67, 32->174, 33->158, 34->152, 35->203, 38->149, 40->69, 54->150, 56->120, 60->132, 62->166, 84->176, 89->182], unused: [104, 48, 63, 116, 125, 128, 141, 42, 53, 82, 163, 137, 75, 110, 122, 139, 168, 61, 77, 94, 119, 185, 31, 198, 36, 76, 100, 177, 184, 20, 30, 65, 93, 159, 49, 169, 105, 193, 71, 44, 83, 195, 199, 78, 97, 127, 161, 23, 46, 47, 90, 108, 79, 102, 123, 157, 156, 87, 140, 142, 134, 73, 186, 50, 109, 72, 92, 113, 167, 171, 187, 59, 148, 41, 106, 172, 183, 66, 143, 28, 64, 112, 57, 131, 96, 126, 138, 179, 107, 146, 68, 95, 98, 180, 39, 55, 74, 160, 45, 101, 103, 114, 135, 164, 51, 58, 115, 111, 144, 121, 99, 86, 85, 81, 136, 196, 52, 117, 153, 173, 162, 43, 91, 118, 175]; blkref #0: rel 1663/5/16410, blk 153
# 2025-08-28 22:33:28.197 CEST startup[177466] WARNING:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page
# 2025-08-28 22:33:28.197 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA0DA580 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082155, isCatalogRel: F, nplans: 0, nredirected: 36, ndead: 0, nunused: 125, redirected: [1->130, 2->202, 3->178, 4->154, 5->88, 6->192, 7->170, 9->188, 10->133, 11->124, 12->155, 13->201, 14->129, 16->165, 17->200, 18->145, 19->80, 21->194, 22->181, 24->189, 25->190, 26->151, 27->70, 29->67, 32->174, 33->158, 34->152, 35->203, 38->149, 40->69, 54->150, 56->120, 60->132, 62->166, 84->176, 89->182], unused: [104, 48, 63, 116, 125, 128, 141, 42, 53, 82, 163, 137, 75, 110, 122, 139, 168, 61, 77, 94, 119, 185, 31, 198, 36, 76, 100, 177, 184, 20, 30, 65, 93, 159, 49, 169, 105, 193, 71, 44, 83, 195, 199, 78, 97, 127, 161, 23, 46, 47, 90, 108, 79, 102, 123, 157, 156, 87, 140, 142, 134, 73, 186, 50, 109, 72, 92, 113, 167, 171, 187, 59, 148, 41, 106, 172, 183, 66, 143, 28, 64, 112, 57, 131, 96, 126, 138, 179, 107, 146, 68, 95, 98, 180, 39, 55, 74, 160, 45, 101, 103, 114, 135, 164, 51, 58, 115, 111, 144, 121, 99, 86, 85, 81, 136, 196, 52, 117, 153, 173, 162, 43, 91, 118, 175]; blkref #0: rel 1663/5/16410, blk 153
# 2025-08-28 22:33:28.199 CEST startup[177466] LOG:  page verification failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.199 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA141F18 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082018, isCatalogRel: F, nplans: 0, nredirected: 2, ndead: 0, nunused: 201, redirected: [1->201, 5->205], unused: [2, 3, 4, 8, 9, 10, 11, 12, 13, 16, 17, 19, 20, 23, 24, 25, 29, 31, 33, 34, 35, 36, 37, 38, 41, 44, 46, 47, 51, 54, 56, 57, 58, 66, 68, 70, 71, 72, 77, 78, 79, 80, 81, 83, 87, 88, 90, 91, 92, 93, 94, 96, 98, 101, 102, 103, 104, 105, 106, 107, 111, 112, 114, 116, 119, 120, 122, 123, 125, 126, 128, 129, 130, 132, 136, 138, 142, 143, 146, 147, 148, 149, 150, 151, 152, 157, 158, 159, 160, 162, 164, 165, 168, 169, 170, 171, 172, 175, 177, 178, 179, 181, 182, 184, 187, 189, 190, 194, 198, 6, 7, 14, 15, 18, 21, 22, 26, 27, 28, 30, 32, 39, 40, 42, 43, 45, 48, 49, 50, 52, 53, 55, 59, 60, 61, 62, 63, 64, 65, 67, 69, 73, 74, 75, 76, 82, 84, 85, 86, 89, 95, 97, 99, 100, 108, 109, 110, 113, 115, 117, 118, 121, 124, 127, 131, 133, 134, 135, 137, 139, 140, 141, 144, 145, 153, 154, 155, 156, 161, 163, 166, 167, 173, 174, 176, 180, 183, 185, 186, 188, 191, 192, 193, 195, 196, 197, 199, 200, 202, 203, 204]; blkref #0: rel 1663/5/16408, blk 9
# 2025-08-28 22:33:28.199 CEST startup[177466] LOG:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page
# 2025-08-28 22:33:28.199 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA141F18 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082018, isCatalogRel: F, nplans: 0, nredirected: 2, ndead: 0, nunused: 201, redirected: [1->201, 5->205], unused: [2, 3, 4, 8, 9, 10, 11, 12, 13, 16, 17, 19, 20, 23, 24, 25, 29, 31, 33, 34, 35, 36, 37, 38, 41, 44, 46, 47, 51, 54, 56, 57, 58, 66, 68, 70, 71, 72, 77, 78, 79, 80, 81, 83, 87, 88, 90, 91, 92, 93, 94, 96, 98, 101, 102, 103, 104, 105, 106, 107, 111, 112, 114, 116, 119, 120, 122, 123, 125, 126, 128, 129, 130, 132, 136, 138, 142, 143, 146, 147, 148, 149, 150, 151, 152, 157, 158, 159, 160, 162, 164, 165, 168, 169, 170, 171, 172, 175, 177, 178, 179, 181, 182, 184, 187, 189, 190, 194, 198, 6, 7, 14, 15, 18, 21, 22, 26, 27, 28, 30, 32, 39, 40, 42, 43, 45, 48, 49, 50, 52, 53, 55, 59, 60, 61, 62, 63, 64, 65, 67, 69, 73, 74, 75, 76, 82, 84, 85, 86, 89, 95, 97, 99, 100, 108, 109, 110, 113, 115, 117, 118, 121, 124, 127, 131, 133, 134, 135, 137, 139, 140, 141, 144, 145, 153, 154, 155, 156, 161, 163, 166, 167, 173, 174, 176, 180, 183, 185, 186, 188, 191, 192, 193, 195, 196, 197, 199, 200, 202, 203, 204]; blkref #0: rel 1663/5/16408, blk 9
# 2025-08-28 22:33:28.199 CEST startup[177466] WARNING:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page
# 2025-08-28 22:33:28.199 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA141F18 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082018, isCatalogRel: F, nplans: 0, nredirected: 2, ndead: 0, nunused: 201, redirected: [1->201, 5->205], unused: [2, 3, 4, 8, 9, 10, 11, 12, 13, 16, 17, 19, 20, 23, 24, 25, 29, 31, 33, 34, 35, 36, 37, 38, 41, 44, 46, 47, 51, 54, 56, 57, 58, 66, 68, 70, 71, 72, 77, 78, 79, 80, 81, 83, 87, 88, 90, 91, 92, 93, 94, 96, 98, 101, 102, 103, 104, 105, 106, 107, 111, 112, 114, 116, 119, 120, 122, 123, 125, 126, 128, 129, 130, 132, 136, 138, 142, 143, 146, 147, 148, 149, 150, 151, 152, 157, 158, 159, 160, 162, 164, 165, 168, 169, 170, 171, 172, 175, 177, 178, 179, 181, 182, 184, 187, 189, 190, 194, 198, 6, 7, 14, 15, 18, 21, 22, 26, 27, 28, 30, 32, 39, 40, 42, 43, 45, 48, 49, 50, 52, 53, 55, 59, 60, 61, 62, 63, 64, 65, 67, 69, 73, 74, 75, 76, 82, 84, 85, 86, 89, 95, 97, 99, 100, 108, 109, 110, 113, 115, 117, 118, 121, 124, 127, 131, 133, 134, 135, 137, 139, 140, 141, 144, 145, 153, 154, 155, 156, 161, 163, 166, 167, 173, 174, 176, 180, 183, 185, 186, 188, 191, 192, 193, 195, 196, 197, 199, 200, 202, 203, 204]; blkref #0: rel 1663/5/16408, blk 9
# 2025-08-28 22:33:28.201 CEST startup[177466] LOG:  page verification failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.201 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA1CFBF0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082318, isCatalogRel: F, nplans: 0, nredirected: 6, ndead: 0, nunused: 198, redirected: [7->199, 40->204, 56->208, 59->188, 85->209, 177->207], unused: [174, 3, 4, 22, 31, 38, 47, 67, 69, 70, 98, 104, 107, 109, 118, 136, 142, 149, 171, 175, 180, 187, 210, 1, 2, 6, 13, 15, 39, 42, 46, 49, 61, 63, 75, 76, 77, 81, 86, 90, 92, 96, 97, 99, 106, 112, 113, 129, 132, 133, 138, 147, 151, 159, 162, 165, 169, 185, 189, 192, 203, 205, 5, 11, 14, 20, 23, 28, 36, 41, 43, 51, 53, 62, 65, 68, 72, 74, 87, 88, 94, 102, 105, 110, 111, 114, 115, 116, 121, 124, 127, 140, 182, 186, 195, 196, 206, 9, 12, 17, 18, 21, 25, 30, 35, 48, 50, 52, 55, 57, 60, 71, 78, 82, 83, 84, 91, 101, 119, 120, 128, 130, 131, 141, 144, 146, 148, 153, 168, 179, 183, 184, 164, 8, 19, 24, 29, 34, 64, 66, 79, 93, 103, 122, 123, 134, 135, 137, 139, 143, 150, 152, 155, 156, 157, 160, 161, 163, 170, 178, 181, 191, 194, 198, 201, 202, 10, 16, 26, 27, 32, 33, 37, 44, 45, 54, 58, 73, 80, 89, 95, 100, 108, 117, 125, 126, 145, 154, 158, 166, 167, 172, 173, 176, 190, 193, 197, 200]; blkref #0: rel 1663/5/16408, blk 126
# 2025-08-28 22:33:28.201 CEST startup[177466] LOG:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page
# 2025-08-28 22:33:28.201 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA1CFBF0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082318, isCatalogRel: F, nplans: 0, nredirected: 6, ndead: 0, nunused: 198, redirected: [7->199, 40->204, 56->208, 59->188, 85->209, 177->207], unused: [174, 3, 4, 22, 31, 38, 47, 67, 69, 70, 98, 104, 107, 109, 118, 136, 142, 149, 171, 175, 180, 187, 210, 1, 2, 6, 13, 15, 39, 42, 46, 49, 61, 63, 75, 76, 77, 81, 86, 90, 92, 96, 97, 99, 106, 112, 113, 129, 132, 133, 138, 147, 151, 159, 162, 165, 169, 185, 189, 192, 203, 205, 5, 11, 14, 20, 23, 28, 36, 41, 43, 51, 53, 62, 65, 68, 72, 74, 87, 88, 94, 102, 105, 110, 111, 114, 115, 116, 121, 124, 127, 140, 182, 186, 195, 196, 206, 9, 12, 17, 18, 21, 25, 30, 35, 48, 50, 52, 55, 57, 60, 71, 78, 82, 83, 84, 91, 101, 119, 120, 128, 130, 131, 141, 144, 146, 148, 153, 168, 179, 183, 184, 164, 8, 19, 24, 29, 34, 64, 66, 79, 93, 103, 122, 123, 134, 135, 137, 139, 143, 150, 152, 155, 156, 157, 160, 161, 163, 170, 178, 181, 191, 194, 198, 201, 202, 10, 16, 26, 27, 32, 33, 37, 44, 45, 54, 58, 73, 80, 89, 95, 100, 108, 117, 125, 126, 145, 154, 158, 166, 167, 172, 173, 176, 190, 193, 197, 200]; blkref #0: rel 1663/5/16408, blk 126
# 2025-08-28 22:33:28.201 CEST startup[177466] WARNING:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page
# 2025-08-28 22:33:28.201 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA1CFBF0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082318, isCatalogRel: F, nplans: 0, nredirected: 6, ndead: 0, nunused: 198, redirected: [7->199, 40->204, 56->208, 59->188, 85->209, 177->207], unused: [174, 3, 4, 22, 31, 38, 47, 67, 69, 70, 98, 104, 107, 109, 118, 136, 142, 149, 171, 175, 180, 187, 210, 1, 2, 6, 13, 15, 39, 42, 46, 49, 61, 63, 75, 76, 77, 81, 86, 90, 92, 96, 97, 99, 106, 112, 113, 129, 132, 133, 138, 147, 151, 159, 162, 165, 169, 185, 189, 192, 203, 205, 5, 11, 14, 20, 23, 28, 36, 41, 43, 51, 53, 62, 65, 68, 72, 74, 87, 88, 94, 102, 105, 110, 111, 114, 115, 116, 121, 124, 127, 140, 182, 186, 195, 196, 206, 9, 12, 17, 18, 21, 25, 30, 35, 48, 50, 52, 55, 57, 60, 71, 78, 82, 83, 84, 91, 101, 119, 120, 128, 130, 131, 141, 144, 146, 148, 153, 168, 179, 183, 184, 164, 8, 19, 24, 29, 34, 64, 66, 79, 93, 103, 122, 123, 134, 135, 137, 139, 143, 150, 152, 155, 156, 157, 160, 161, 163, 170, 178, 181, 191, 194, 198, 201, 202, 10, 16, 26, 27, 32, 33, 37, 44, 45, 54, 58, 73, 80, 89, 95, 100, 108, 117, 125, 126, 145, 154, 158, 166, 167, 172, 173, 176, 190, 193, 197, 200]; blkref #0: rel 1663/5/16408, blk 126
# 2025-08-28 22:33:28.206 CEST startup[177466] LOG:  page verification failed, calculated checksum 59909 but expected 53920
# 2025-08-28 22:33:28.206 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA300858 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082445, isCatalogRel: F, nplans: 0, nredirected: 9, ndead: 0, nunused: 193, redirected: [1->157, 2->207, 3->199, 4->205, 7->187, 17->191, 18->192, 21->201, 25->206], unused: [208, 5, 10, 11, 24, 27, 32, 37, 45, 50, 56, 66, 67, 74, 84, 86, 89, 106, 109, 150, 154, 204, 23, 31, 33, 40, 51, 58, 62, 69, 72, 75, 77, 80, 85, 87, 88, 94, 137, 163, 188, 193, 197, 28, 34, 93, 95, 98, 101, 110, 123, 126, 127, 147, 149, 172, 181, 200, 8, 12, 15, 44, 46, 49, 59, 79, 83, 96, 105, 121, 122, 131, 139, 183, 189, 190, 195, 202, 203, 212, 6, 35, 47, 65, 71, 76, 90, 104, 113, 116, 129, 133, 136, 161, 167, 169, 176, 196, 16, 22, 26, 30, 39, 48, 57, 68, 78, 81, 99, 107, 119, 120, 124, 140, 148, 162, 166, 170, 178, 184, 29, 41, 42, 43, 52, 54, 55, 82, 97, 100, 114, 128, 130, 141, 142, 144, 145, 155, 158, 165, 168, 175, 180, 186, 210, 9, 13, 19, 60, 70, 73, 91, 92, 102, 117, 132, 134, 135, 138, 143, 146, 151, 152, 153, 156, 159, 164, 171, 177, 179, 182, 194, 213, 14, 20, 36, 38, 53, 61, 63, 64, 103, 108, 111, 112, 115, 118, 125, 160, 173, 174, 185, 198]; blkref #0: rel 1663/5/16408, blk 4
# 2025-08-28 22:33:28.206 CEST startup[177466] LOG:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page
# 2025-08-28 22:33:28.206 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA300858 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082445, isCatalogRel: F, nplans: 0, nredirected: 9, ndead: 0, nunused: 193, redirected: [1->157, 2->207, 3->199, 4->205, 7->187, 17->191, 18->192, 21->201, 25->206], unused: [208, 5, 10, 11, 24, 27, 32, 37, 45, 50, 56, 66, 67, 74, 84, 86, 89, 106, 109, 150, 154, 204, 23, 31, 33, 40, 51, 58, 62, 69, 72, 75, 77, 80, 85, 87, 88, 94, 137, 163, 188, 193, 197, 28, 34, 93, 95, 98, 101, 110, 123, 126, 127, 147, 149, 172, 181, 200, 8, 12, 15, 44, 46, 49, 59, 79, 83, 96, 105, 121, 122, 131, 139, 183, 189, 190, 195, 202, 203, 212, 6, 35, 47, 65, 71, 76, 90, 104, 113, 116, 129, 133, 136, 161, 167, 169, 176, 196, 16, 22, 26, 30, 39, 48, 57, 68, 78, 81, 99, 107, 119, 120, 124, 140, 148, 162, 166, 170, 178, 184, 29, 41, 42, 43, 52, 54, 55, 82, 97, 100, 114, 128, 130, 141, 142, 144, 145, 155, 158, 165, 168, 175, 180, 186, 210, 9, 13, 19, 60, 70, 73, 91, 92, 102, 117, 132, 134, 135, 138, 143, 146, 151, 152, 153, 156, 159, 164, 171, 177, 179, 182, 194, 213, 14, 20, 36, 38, 53, 61, 63, 64, 103, 108, 111, 112, 115, 118, 125, 160, 173, 174, 185, 198]; blkref #0: rel 1663/5/16408, blk 4
# 2025-08-28 22:33:28.206 CEST startup[177466] WARNING:  invalid page in block 2 of relation "base/5/16408_fsm"; zeroing out page
# 2025-08-28 22:33:28.206 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA300858 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082445, isCatalogRel: F, nplans: 0, nredirected: 9, ndead: 0, nunused: 193, redirected: [1->157, 2->207, 3->199, 4->205, 7->187, 17->191, 18->192, 21->201, 25->206], unused: [208, 5, 10, 11, 24, 27, 32, 37, 45, 50, 56, 66, 67, 74, 84, 86, 89, 106, 109, 150, 154, 204, 23, 31, 33, 40, 51, 58, 62, 69, 72, 75, 77, 80, 85, 87, 88, 94, 137, 163, 188, 193, 197, 28, 34, 93, 95, 98, 101, 110, 123, 126, 127, 147, 149, 172, 181, 200, 8, 12, 15, 44, 46, 49, 59, 79, 83, 96, 105, 121, 122, 131, 139, 183, 189, 190, 195, 202, 203, 212, 6, 35, 47, 65, 71, 76, 90, 104, 113, 116, 129, 133, 136, 161, 167, 169, 176, 196, 16, 22, 26, 30, 39, 48, 57, 68, 78, 81, 99, 107, 119, 120, 124, 140, 148, 162, 166, 170, 178, 184, 29, 41, 42, 43, 52, 54, 55, 82, 97, 100, 114, 128, 130, 141, 142, 144, 145, 155, 158, 165, 168, 175, 180, 186, 210, 9, 13, 19, 60, 70, 73, 91, 92, 102, 117, 132, 134, 135, 138, 143, 146, 151, 152, 153, 156, 159, 164, 171, 177, 179, 182, 194, 213, 14, 20, 36, 38, 53, 61, 63, 64, 103, 108, 111, 112, 115, 118, 125, 160, 173, 174, 185, 198]; blkref #0: rel 1663/5/16408, blk 4
# 2025-08-28 22:33:28.207 CEST startup[177466] LOG:  page verification failed, calculated checksum 25350 but expected 44559
# 2025-08-28 22:33:28.207 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA35A3D0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082474, isCatalogRel: F, nplans: 0, nredirected: 41, ndead: 0, nunused: 117, redirected: [1->217, 4->214, 6->167, 7->202, 14->180, 17->177, 18->186, 19->216, 23->199, 24->79, 29->205, 31->151, 37->210, 38->30, 39->118, 41->87, 43->122, 60->52, 63->35, 67->34, 86->215, 90->150, 100->113, 102->195, 106->132, 107->172, 111->130, 117->191, 120->91, 121->98, 124->46, 125->145, 127->160, 134->181, 147->76, 158->126, 161->154, 170->209, 174->208, 184->188, 185->54], unused: [183, 26, 115, 155, 144, 82, 101, 159, 192, 141, 12, 65, 189, 175, 206, 71, 162, 164, 105, 32, 55, 84, 171, 42, 74, 77, 201, 94, 178, 179, 207, 213, 59, 136, 139, 51, 194, 28, 203, 211, 70, 131, 137, 89, 93, 128, 157, 193, 96, 21, 85, 73, 16, 169, 83, 149, 197, 25, 138, 143, 166, 3, 13, 47, 78, 5, 11, 36, 99, 56, 62, 57, 114, 182, 45, 66, 68, 75, 88, 110, 61, 198, 20, 48, 187, 81, 140, 53, 64, 8, 148, 95, 129, 142, 204, 33, 108, 212, 2, 40, 109, 44, 104, 153, 50, 97, 146, 200, 22, 135, 103, 9, 196, 27, 49, 173, 112]; blkref #0: rel 1663/5/16410, blk 149
# 2025-08-28 22:33:28.207 CEST startup[177466] LOG:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page
# 2025-08-28 22:33:28.207 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA35A3D0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082474, isCatalogRel: F, nplans: 0, nredirected: 41, ndead: 0, nunused: 117, redirected: [1->217, 4->214, 6->167, 7->202, 14->180, 17->177, 18->186, 19->216, 23->199, 24->79, 29->205, 31->151, 37->210, 38->30, 39->118, 41->87, 43->122, 60->52, 63->35, 67->34, 86->215, 90->150, 100->113, 102->195, 106->132, 107->172, 111->130, 117->191, 120->91, 121->98, 124->46, 125->145, 127->160, 134->181, 147->76, 158->126, 161->154, 170->209, 174->208, 184->188, 185->54], unused: [183, 26, 115, 155, 144, 82, 101, 159, 192, 141, 12, 65, 189, 175, 206, 71, 162, 164, 105, 32, 55, 84, 171, 42, 74, 77, 201, 94, 178, 179, 207, 213, 59, 136, 139, 51, 194, 28, 203, 211, 70, 131, 137, 89, 93, 128, 157, 193, 96, 21, 85, 73, 16, 169, 83, 149, 197, 25, 138, 143, 166, 3, 13, 47, 78, 5, 11, 36, 99, 56, 62, 57, 114, 182, 45, 66, 68, 75, 88, 110, 61, 198, 20, 48, 187, 81, 140, 53, 64, 8, 148, 95, 129, 142, 204, 33, 108, 212, 2, 40, 109, 44, 104, 153, 50, 97, 146, 200, 22, 135, 103, 9, 196, 27, 49, 173, 112]; blkref #0: rel 1663/5/16410, blk 149
# 2025-08-28 22:33:28.207 CEST startup[177466] WARNING:  invalid page in block 2 of relation "base/5/16410_fsm"; zeroing out page
# 2025-08-28 22:33:28.207 CEST startup[177466] CONTEXT:  WAL redo at 2C/FA35A3D0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 11082474, isCatalogRel: F, nplans: 0, nredirected: 41, ndead: 0, nunused: 117, redirected: [1->217, 4->214, 6->167, 7->202, 14->180, 17->177, 18->186, 19->216, 23->199, 24->79, 29->205, 31->151, 37->210, 38->30, 39->118, 41->87, 43->122, 60->52, 63->35, 67->34, 86->215, 90->150, 100->113, 102->195, 106->132, 107->172, 111->130, 117->191, 120->91, 121->98, 124->46, 125->145, 127->160, 134->181, 147->76, 158->126, 161->154, 170->209, 174->208, 184->188, 185->54], unused: [183, 26, 115, 155, 144, 82, 101, 159, 192, 141, 12, 65, 189, 175, 206, 71, 162, 164, 105, 32, 55, 84, 171, 42, 74, 77, 201, 94, 178, 179, 207, 213, 59, 136, 139, 51, 194, 28, 203, 211, 70, 131, 137, 89, 93, 128, 157, 193, 96, 21, 85, 73, 16, 169, 83, 149, 197, 25, 138, 143, 166, 3, 13, 47, 78, 5, 11, 36, 99, 56, 62, 57, 114, 182, 45, 66, 68, 75, 88, 110, 61, 198, 20, 48, 187, 81, 140, 53, 64, 8, 148, 95, 129, 142, 204, 33, 108, 212, 2, 40, 109, 44, 104, 153, 50, 97, 146, 200, 22, 135, 103, 9, 196, 27, 49, 173, 112]; blkref #0: rel 1663/5/16410, blk 149
From 57bb79b1bc8faac646131336abcc1596711c5f32 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Thu, 28 Aug 2025 22:10:25 +0200
Subject: [PATCH] TAP fixes

---
 .../t/006_concurrent_pgbench.pl               | 88 ++++++++++++++-----
 1 file changed, 68 insertions(+), 20 deletions(-)

diff --git a/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl 
b/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl
index b33ca6e0c26..374eac7e6a3 100644
--- a/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl
+++ b/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl
@@ -23,11 +23,14 @@ my $node_primary_loglocation = 0;
 my $node_standby_1;
 my $node_standby_1_loglocation = 0;
 
+my $pgbench_primary = undef;
+my $pgbench_standby = undef;
+
 # The number of full test iterations which will be performed. The exact number
 # of tests performed and the wall time taken is non-deterministic as the test
 # performs a lot of randomized actions, but 50 iterations will be a long test
 # run regardless.
-my $TEST_ITERATIONS = 50;
+my $TEST_ITERATIONS = 1000;
 
 # Variables which record the current state of the cluster
 my $data_checksum_state = 'off';
@@ -55,7 +58,7 @@ if ($ENV{enable_injection_points} ne 'yes')
 # whether to turn things off during testing.
 sub cointoss
 {
-       return int(rand(2) == 1);
+       return int(rand() < 0.5);
 }
 
 # Helper for injecting random sleeps here and there in the testrun. The sleep
@@ -73,11 +76,16 @@ sub background_ro_pgbench
 {
        my ($port, $stdin, $stdout, $stderr) = @_;
 
-       my $pgbench_primary = IPC::Run::start(
-               [ 'pgbench', '-p', $port, '-S', '-T', '600', '-c', '10', 
'postgres' ],
-               '<' => \$stdin,
-               '>' => \$stdout,
-               '2>' => \$stderr,
+       if ($pgbench_standby)
+       {
+               $pgbench_standby->finish;
+       }
+
+       $pgbench_standby = IPC::Run::start(
+               [ 'pgbench', '-n', '-p', $port, '-S', '-T', '600', '-c', '10', 
'postgres' ],
+               '<' => '/dev/null',
+               '>' => '/dev/null',
+               '2>' => '/dev/null',
                IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default));
 }
 
@@ -87,11 +95,16 @@ sub background_rw_pgbench
 {
        my ($port, $stdin, $stdout, $stderr) = @_;
 
-       my $pgbench_primary = IPC::Run::start(
+       if ($pgbench_primary)
+       {
+               $pgbench_primary->finish;
+       }
+
+       $pgbench_primary = IPC::Run::start(
                [ 'pgbench', '-p', $port, '-T', '600', '-c', '10', 'postgres' ],
-               '<' => \$stdin,
-               '>' => \$stdout,
-               '2>' => \$stderr,
+               '<' => '/dev/null',
+               '>' => '/dev/null',
+               '2>' => '/dev/null',
                IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default));
 }
 
@@ -224,6 +237,9 @@ background_rw_pgbench(
        $node_primary->port, $pgb_primary_stdin,
        $pgb_primary_stdout, $pgb_primary_stderr);
 
+my $primary_shutdown_clean = 0;
+my $standby_shutdown_clean = 0;
+
 # Main test suite. This loop will start a pgbench run on the cluster and while
 # that's running flip the state of data checksums concurrently. It will then
 # randomly restart thec cluster (in fast or immediate) mode and then check for
@@ -246,9 +262,11 @@ for (my $i = 0; $i < $TEST_ITERATIONS; $i++)
                $node_primary_loglocation = -s $node_primary->logfile;
 
                # If data checksums are enabled, take the opportunity to verify 
them
-               # while the cluster is offline
-               $node_primary->checksum_verify_offline()
-                 unless $data_checksum_state eq 'off';
+               # while the cluster is offline (but only if stopped in a clean 
way,
+               # not after immediate shutdown)
+               #$node_primary->checksum_verify_offline()
+               #  unless $data_checksum_state eq 'off' or 
!$primary_shutdown_clean;
+
                random_sleep();
                $node_primary->start;
                # Start a pgbench in the background against the primary
@@ -270,9 +288,11 @@ for (my $i = 0; $i < $TEST_ITERATIONS; $i++)
                $node_standby_1_loglocation = -s $node_standby_1->logfile;
 
                # If data checksums are enabled, take the opportunity to verify 
them
-               # while the cluster is offline
-               $node_standby_1->checksum_verify_offline()
-                 unless $data_checksum_state eq 'off';
+               # while the cluster is offline (but only if stopped in a clean 
way,
+               # not after immediate shutdown)
+               #$node_standby_1->checksum_verify_offline()
+               #  unless $data_checksum_state eq 'off' or 
!$standby_shutdown_clean;
+
                random_sleep();
                $node_standby_1->start;
                # Start a select-only pgbench in the background on the standby
@@ -287,13 +307,41 @@ for (my $i = 0; $i < $TEST_ITERATIONS; $i++)
        my $result = $node_primary->safe_psql('postgres',
                "SELECT count(*) FROM t WHERE a > 1");
        is($result, '100000', 'ensure data pages can be read back on primary');
+
        random_sleep();
+
        $node_primary->wait_for_catchup($node_standby_1, 'write');
 
-       # Potentially powercycle the cluster
-       $node_primary->stop($stop_modes[ int(rand(100)) ]) if cointoss();
        random_sleep();
-       $node_standby_1->stop($stop_modes[ int(rand(100)) ]) if cointoss();
+
+       # Potentially powercycle the cluster (the nodes independently)
+       # XXX should maybe try stopping nodes in the opposite order too?
+       if (cointoss())
+       {
+               my $mode = $stop_modes[ int(rand(100)) ];
+               $node_primary->stop($mode);
+               $primary_shutdown_clean = ($mode eq 'fast');
+       }
+
+       random_sleep();
+
+       if (cointoss())
+       {
+               my $mode = $stop_modes[ int(rand(100)) ];
+               $node_standby_1->stop($mode);
+               $standby_shutdown_clean = ($mode eq 'fast');
+       }
+}
+
+# make sure the nodes are running
+if (!$node_primary->is_alive)
+{
+       $node_primary->start;
+}
+
+if (!$node_standby_1->is_alive)
+{
+        $node_standby_1->start;
 }
 
 # Testrun is over, ensure that data reads back as expected and perform a final
-- 
2.39.5

Reply via email to