Re: The danger of deleting backup_label

David Steele Mon, 16 Oct 2023 10:00:34 -0700

On 10/16/23 12:25, Robert Haas wrote:

On Mon, Oct 16, 2023 at 11:45 AM David Steele <[email protected]> wrote:

Hmmm, the reason to back patch this is that it would fix [1], which sure
looks like a problem to me even if it is not a "bug". We can certainly
require backup software to retry pg_control until the checksum is valid
but that seems like a pretty big ask, even considering how complicated
backup is.


That seems like a problem with pg_control not being written atomically
when the standby server is updating it during recovery, rather than a
problem with backup_label not being used at the start of recovery.
Unless I'm confused.

You are not confused. But the fact that it practically can be read astorn at all on the standby is a function of how rapidly it is beingwritten to update min recovery point. Writing it atomically via a tempfile may affect performance in this area, but only during the backup,which may cause recovery to run more slowly during a backup.

I don't have proof of this because I don't have any hosts large enoughto test the theory.

Right now the user can remove backup_label and get a "successful"
restore and not realize that they have just corrupted their cluster.
This is independent of the backup/restore tool doing all the right things.


I don't think it's independent of that at all.

I think it is. Imagine the user does backup/recovery using theirfavorite tool and everything is fine. But due to some misconfigurationor a problem in the WAL archive, they get this error:


FATAL:  could not locate required checkpoint record

2023-10-16 16:42:50.132 UTC HINT: If you are restoring from a backup,touch "/home/dev/test/data/recovery.signal" or"/home/dev/test/data/standby.signal" and add required recovery options.If you are not restoring from a backup, try removing the file"/home/dev/test/data/backup_label".Be careful: removing "/home/dev/test/data/backup_label" willresult in a corrupt cluster if restoring from a backup.

I did this by setting restore_command=false, but it could just as easilybe the actual restore command that returns false due to a variety ofreasons. The user has no idea what "could not locate required checkpointrecord" means and if there is enough automation they may not evenrealize they just restored from a backup.

After some agonizing (we hope) they decide to delete backup_label and,wow, it just works! So now they merrily go on their way with a corruptedcluster. They also remember for the next time that deleting backup_labelis definitely a good procedure.

The idea behind this patch is that deleting backup_label would produce ahard error because pg_control would be missing as well (if the backupsoftware did its job). If both pg_control and backup_label are present(but pg_control has not been loaded with the contents of backup_label,i.e. it is the running copy from the backup cluster) we can also error.

It's not perfect, because they could backup (or restore) pg_control butnot backup_label, but we are narrowing the cases where something can gowrong and they have silent corruption, especially if theirbackup/restore software follows the directions.


Regards,
-David

Re: The danger of deleting backup_label

Reply via email to