On Mon, Mar 22, 2021 at 4:31 AM Stephen Frost <sfr...@snowman.net> wrote:
> Presuming that we do add to the documentation the language to document
> what's assumed (and already done by modern backup tools) that they're
> fsync'ing everything they're restoring, do we/can we have an option
> which those tools could set that explicitly tells PG "everything in the
> cluster has been fsync'd already, you don't need to do anything extra"?
> Perhaps also/seperately one for WAL that's restored with restore command
> if we think that's necessary?

In the earlier thread, we did contemplate
recovery_init_sync_method=none, but it has the problem that after
recovery completes you have a cluster running with a setting that is
really bad if you eventually crash again and run crash recovery again,
this time with a dirty kernel page cache.  I was one of the people
voting against that feature, but I also wrote a strawman patch just
for the visceral experience of every cell in my body suddenly
whispering "yeah, nah, I'm not committing that" as I wrote the
weaselwordage for the manual.

In that thread we also contemplated safe ways for a basebackup-type
tool to promise that data has been sync'd, to avoid that problem with
the GUC.  Maybe something like: the backup label file could contain a
"SYNC_DONE" message.  But then what if someone copies the whole
directory to a new location, how can you invalidate the promise?  This
is another version of the question of whether it's our problem or the
user's to worry about buffering of pgdata files that they copy around
with unknown tools.  If it's our problem, maybe something like:
"SYNC_DONE for pgdata_inode=1234, hostname=x, ..." is enough to detect
cases where we still believe the claim.  But there's probably a long
tail of weird ways for whatever you come up with to be deceived.

In the case of the recovery_init_sync_method=wal patch proposed in
*this* thread, here's the thing: there's not much to gain by trying to
skip the sync, anyway!  For the WAL, you'll be opening those files
soon anyway to replay them, and if they're already sync'd then fsync()
will return quickly.  For the relfile data, you'll be opening all
relfiles that are referenced by the WAL soon anyway, and syncing them
if required.  So that just leaves relfiles that are not referenced in
the WAL you're about to replay.  Whether we have a duty to sync those
too is that central question again, and one of the things I'm trying
to get an answer to with this thread.  The
recovery_init_sync_method=wal patch only makes sense if the answer is
"no".

> Otherwise, just in general, agree with doing this to address the risks
> discussed around regular crash recovery.  We have some pretty clear "if
> the DB was doing recovery and was interrupted, you need to restore from
> backup" messages today in xlog.c, and this patch didn't seem to change
> that?  Am I missing something or isn't the idea here that these changes
> would make it so you aren't going to end up with corruption in those
> cases?  Specifically looking at-
>
> xlog.c:6509-
>         case DB_IN_CRASH_RECOVERY:
>             ereport(LOG,
>                     (errmsg("database system was interrupted while in 
> recovery at %s",
>                             str_time(ControlFile->time)),
>                      errhint("This probably means that some data is corrupted 
> and"
>                              " you will have to use the last backup for 
> recovery.")));
>             break;
>
>         case DB_IN_ARCHIVE_RECOVERY:
>             ereport(LOG,
>                     (errmsg("database system was interrupted while in 
> recovery at log time %s",
>                             str_time(ControlFile->checkPointCopy.time)),
>                      errhint("If this has occurred more than once some data 
> might be corrupted"
>                              " and you might need to choose an earlier 
> recovery target.")));
>             break;

Maybe I missed your point... but I don't think anything changes here?
If recovery is crashing in some deterministic way (not just because
you lost power the first time, but rather because a particular log
record hits the same bug or gets confused by the same corruption and
implodes every time) it'll probably keep doing so, and our sync
algorithm doesn't seem to make a difference to that.


Reply via email to