On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens there's processes that
> XLogWrite()s some WAL and then sleeps. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep. That's basically the kind of scenario I have been testing to
stress the recycling/removing, with pgbench putting some load into the
server. This has worked for me. Once. But I have little idea why it
gets easier to reproduce in the environments of others, so there may
be an OS-version dependency in the equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right? Yeah, this
matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complaint about
  renaming.

> 1) and 2) seems problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.

Yep. I was thinking about 3) being the actual problem while going
through those docs two days ago.

> If you manage to reproduce, could you check what the link count of the
> all the segments is? Apparently sysinternal's findlinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with
log_checkpoints enabled to catch all concurrent checkpoint activity
and some LOGs. Still no luck unfortunately :(
--
Michael
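
For illustration only, the link-count check and the unlink error reporting discussed above could look roughly like the following standalone POSIX C sketch. This is not the patch mentioned in the message and not PostgreSQL code: the names segment_link_count(), check_single_link() and unlink_reporting() are hypothetical, errors go to stderr rather than through elog(), and it assumes a platform where stat() fills in st_nlink reliably, which may not hold on Windows.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the number of hard links of "path",
 * or -1 if stat() fails (errno is left set by stat()).
 */
long
segment_link_count(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return -1;
	return (long) st.st_nlink;
}

/*
 * Hypothetical check: return 1 if "path" has exactly one link, else 0.
 * "where" names the call site (recycling, open, close, ...) so that the
 * message shows which step saw the unexpected state.
 */
int
check_single_link(const char *path, const char *where)
{
	long		nlinks = segment_link_count(path);

	if (nlinks < 0)
	{
		fprintf(stderr, "%s: could not stat \"%s\": %s\n",
				where, path, strerror(errno));
		return 0;
	}
	if (nlinks != 1)
	{
		fprintf(stderr, "%s: \"%s\" has %ld links, expected 1\n",
				where, path, nlinks);
		return 0;
	}
	return 1;
}

/*
 * Hypothetical wrapper: unlink "path" and report a failure instead of
 * silently ignoring it.  Returns the result of unlink().
 */
int
unlink_reporting(const char *path)
{
	int			rc = unlink(path);

	if (rc != 0)
		fprintf(stderr, "could not unlink \"%s\": %s\n",
				path, strerror(errno));
	return rc;
}
```

Calling check_single_link() at the points listed in the quoted text (recycling, opening, closing a segment) would flag a segment that ended up with two names after a partially failed link-then-unlink sequence.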