Hi,

On 2022-04-16 12:13:09 -0700, Andres Freund wrote:
> What confuses me so far is what already had generated stats before
> reaching pgstat_reset_after_failure() (so that the bug could even be hit
> in t/025_stuck_on_old_timeline.pl).

I see part of a problem - in archiver stats. Even in 14 (and presumably
before), we do work that can generate archiver stats
(e.g. ReadCheckpointRecord()) before pgstat_reset_all(). It's not the end of
the world, but doesn't seem great.

But since archiver stats are fixed-numbered stats (and thus not in the hash
table), they'd not trigger the backtrace we saw here.

One thing that's interesting is that the failing tests have:

2022-04-15 12:07:48.828 UTC [675922][walreceiver][:0] FATAL:  could not link file "pg_wal/xlogtemp.675922" to "pg_wal/00000002.history": File exists

which I haven't seen locally. Looks like we have some race between startup
process and walreceiver? That seems not great. I'm a bit confused that
walreceiver and archiving are both active at the same time in the first
place - that doesn't seem right as things are set up currently.

Greetings,

Andres Freund