Hi,

On 2022-04-07 13:40:30 -0400, Tom Lane wrote:
> Michael Paquier <mich...@paquier.xyz> writes:
> > Add TAP test for archive_cleanup_command and recovery_end_command
>
> grassquit just showed a non-reproducible failure in this test [1]:

I was just staring at that as well.

> # Postmaster PID for node "standby" is 291160
> ok 1 - check content from archives
> not ok 2 - archive_cleanup_command executed on checkpoint
>
> # Failed test 'archive_cleanup_command executed on checkpoint'
> # at t/002_archiving.pl line 74.
>
> This test is sending a CHECKPOINT command to the standby and
> expecting it to run the archive_cleanup_command, but it looks
> like the standby did not actually run any checkpoint:
>
> 2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG: connection received: host=[local]
> 2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG: connection authorized: user=bf database=postgres application_name=002_archiving.pl
> 2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG: statement: CHECKPOINT
> 2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG: disconnection: session time: 0:00:00.032 user=bf database=postgres host=[local]
>
> I am suspicious that the reason is that ProcessUtility does not
> ask for a forced checkpoint when in recovery:
>
>     RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
>                       (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
>
> The trouble with this theory is that this test has been there for
> nearly six months and this is the first such failure (I scraped the
> buildfarm logs to be sure). Seems like failures should be a lot
> more common than that.

> I wondered if the recent pg_stats changes could have affected this, but I
> don't really see how.

I don't really see either. It's a bit more conceivable that the recovery
prefetching changes could affect the timing sufficiently?

It's also possible that it requires an animal of a certain speed to happen -
we didn't have an -fsanitize=address animal until recently.

I guess we'll have to wait and see what the frequency of the problem is?

Greetings,

Andres Freund
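
PS: To make the quoted theory concrete, here is a minimal standalone sketch of the flag expression from ProcessUtility. It is not the server code: the bit values below are illustrative placeholders (the real definitions are in PostgreSQL's headers), and the in_recovery parameter is a stand-in for RecoveryInProgress(). It only shows that the request built for a standby lacks CHECKPOINT_FORCE, which under that theory would allow the CHECKPOINT to be skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values, for illustration only (assumption, not copied
 * from the PostgreSQL source tree). */
#define CHECKPOINT_IMMEDIATE 0x0004
#define CHECKPOINT_FORCE     0x0008
#define CHECKPOINT_WAIT      0x0020

/*
 * Mirrors the flag expression quoted above.  in_recovery is a hypothetical
 * stand-in for RecoveryInProgress().
 */
int
checkpoint_request_flags(bool in_recovery)
{
    int flags = CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
                (in_recovery ? 0 : CHECKPOINT_FORCE);

    /* IMMEDIATE and WAIT are requested in both cases. */
    assert((flags & CHECKPOINT_IMMEDIATE) && (flags & CHECKPOINT_WAIT));
    return flags;
}

/* True if a CHECKPOINT command would ask for a forced checkpoint. */
bool
requests_forced_checkpoint(bool in_recovery)
{
    return (checkpoint_request_flags(in_recovery) & CHECKPOINT_FORCE) != 0;
}
```

With these definitions, requests_forced_checkpoint(false) is true (primary) and requests_forced_checkpoint(true) is false (standby), matching the asymmetry described in the quoted text.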