On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfr...@snowman.net> wrote: > Presuming that we do add to the documentation the language to document > what's assumed (and already done by modern backup tools) that they're > fsync'ing everything they're restoring, do we/can we have an option > which those tools could set that explicitly tells PG "everything in the > cluster has been fsync'd already, you don't need to do anything extra"? > Perhaps also/seperately one for WAL that's restored with restore command > if we think that's necessary?
In the earlier thread, we did contemplate recovery_init_sync_method=none, but it has the problem that after recovery completes you have a cluster running with a setting that is really bad if you eventually crash again and run crash recovery again, this time with a dirty kernel page cache. I was one of the people voting against that feature, but I also wrote a strawman patch just for the visceral experience of every cell in my body suddenly whispering "yeah, nah, I'm not committing that" as I wrote the weaselwordage for the manual. In that thread we also contemplated safe ways for a basebackup-type tool to promise that data has been sync'd, to avoid that problem with the GUC. Maybe something like: the backup label file could contain a "SYNC_DONE" message. But then what if someone copies the whole directory to a new location, how can you invalidate the promise? This is another version of the question of whether it's our problem or the user's to worry about buffering of pgdata files that they copy around with unknown tools. If it's our problem, maybe something like: "SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect cases where we still believe the claim. But there's probably a long tail of weird ways for whatever you come up with to be deceived. In the case of the recovery_init_sync_method=wal patch proposed in *this* thread, here's the thing: there's not much to gain by trying to skip the sync, anyway! For the WAL, you'll be opening those files soon anyway to replay them, and if they're already sync'd then fsync() will return quickly. For the relfile data, you'll be opening all relfiles that are referenced by the WAL soon anyway, and syncing them if required. So that just leaves relfiles that are not referenced in the WAL you're about to replay. Whether we have a duty to sync those too is that central question again, and one of the things I'm trying to get an answer to with this thread. The recovery_init_sync_method=wal patch only makes sense if the answer is "no". > Otherwise, just in general, agree with doing this to address the risks > discussed around regular crash recovery. We have some pretty clear "if > the DB was doing recovery and was interrupted, you need to restore from > backup" messages today in xlog.c, and this patch didn't seem to change > that? Am I missing something or isn't the idea here that these changes > would make it so you aren't going to end up with corruption in those > cases? Specifically looking at- > > xlog.c:6509- > case DB_IN_CRASH_RECOVERY: > ereport(LOG, > (errmsg("database system was interrupted while in > recovery at %s", > str_time(ControlFile->time)), > errhint("This probably means that some data is corrupted > and" > " you will have to use the last backup for > recovery."))); > break; > > case DB_IN_ARCHIVE_RECOVERY: > ereport(LOG, > (errmsg("database system was interrupted while in > recovery at log time %s", > str_time(ControlFile->checkPointCopy.time)), > errhint("If this has occurred more than once some data > might be corrupted" > " and you might need to choose an earlier > recovery target."))); > break; Maybe I missed your point... but I don't think anything changes here? If recovery is crashing in some deterministic way (not just because you lost power the first time, but rather because a particular log record hits the same bug or gets confused by the same corruption and implodes every time) it'll probably keep doing so, and our sync algorithm doesn't seem to make a difference to that.