Hi,

On 2019-05-14 13:23:28 +0900, Michael Paquier wrote:
> On Mon, May 13, 2019 at 10:37:35AM -0700, Andres Freund wrote:
> > Ugh, this is all such a mess. But, isn't this broken independently of
> > the smgrimmedsync() issue? In a basebackup case, the basebackup could
> > have included the main fork, but not the init fork, and the reverse. WAL
> > replay *solely* needs to be able to recover from that. At the very
> > least we'd have to do the cleanup step after becoming consistent, not
> > just before recovery even started.
>
> Yes, the logic using smgrimmedsync() is race-prone and weaker than the
> index AMs in my opinion, even if the failure window is limited (I
> think that this is mentioned upthread a bit).
How's it limited? On a large database a base backup easily can take
*days*. And e.g. VM and FSM can easily have inodes that are much newer
than the main/init forks, so typical base backups (via OS/glibc
readdir) will sort them at a later point (or the listing will be
hashed, in which case the order is entirely random), so the window
between when the different forks are copied is large.

> What's actually the reason preventing us from delaying the
> checkpointer like the index AMs for the logging of heap init fork?

I'm not following. What do you mean by "delaying the checkpointer"?

Greetings,

Andres Freund
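[Editor's note: the ordering window Andres describes can be sketched with a
toy model. This is an illustration only, not PostgreSQL code — the file
names are a hypothetical relfilenode 12345 with its `_fsm`, `_vm`, and
`_init` fork suffixes, and the lexicographic sort stands in for a sorted
directory listing during a base backup.]

```python
# Toy model of the base-backup fork-ordering window: the backup copies
# relation files in whatever order the directory listing yields them.
# With a sorted listing, the init fork of a relation is copied after its
# main fork, so any modification landing between the two copy times
# leaves the backup with forks taken at different points in time.
files = ["12345", "12345_vm", "12346", "12345_init", "99999", "12345_fsm"]

copy_order = sorted(files)  # sorted listing, as many backup tools produce
main_pos = copy_order.index("12345")       # main fork copied first...
init_pos = copy_order.index("12345_init")  # ...init fork copied later

# Every file copied in between widens the inconsistency window.
window = copy_order[main_pos + 1:init_pos]
print(copy_order)
print("files copied between main and init fork:", window)
```

With a hashed (effectively random) directory order instead of a sorted one,
either fork can come first, which is why replay has to cope with a backup
containing the main fork but not the init fork, and the reverse.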