Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby
On 2014-12-16 18:37:48 +0200, Heikki Linnakangas wrote: On 12/11/2014 04:21 PM, Marco Nenciarini wrote: Il 11/12/14 12:38, Andres Freund ha scritto: On December 11, 2014 9:56:09 AM CET, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/11/2014 05:45 AM, Andres Freund wrote: Yeah. I was not able to reproduce this, but I'm clearly missing something, since both you and Sergey have seen this happening. Can you write a script to reproduce? Not right now, I only have my mobile... Its quite easy though. Create a pg-basebackup from a standby. Create a recovery.conf with a broken primary conninfo. Start. Shutdown. Fix conninfo. Start. Just tested it. There steps are not sufficient to reproduce the issue on a test installation. I suppose because, on small test datadir, the checkpoint location and the redo location on the pg_control are the same present in the backup_label. To trigger this bug you need to have at least a restartpoint happened on standby between the start and the end of the backup. you could simulate it issuing a checkpoint on master, a checkpoint on standby (to force a restartpoint), then copying the pg_control from the standby. This way I've been able to reproduce it. Ok, got it. I was able to reproduce this by using pg_basebackup --max-rate=1024, and issuing CHECKPOINT in the standby while the backup was running. FWIW, I can reproduce it without any such hangups. I've just tested it using my local scripts: # create primary $ reinit-pg-dev-master $ run-pg-dev-master # create first standby $ reinit-pg-dev-master-standby $ run-pg-dev-master-standby # create 2nd standby $ pg_basebackup -h /tmp/ -p 5441 -D /tmp/tree --write-recovery-conf $ PGHOST=frakbar run-pg-dev-master-standby -D /tmp/tree LOG: creating missing WAL directory pg_xlog/archive_status LOG: entering standby mode FATAL: could not connect to the primary server: could not translate host name frakbar to address: Name or service not known $ PGHOST=/tmp run-pg-dev-master-standby -D /tmp/tree LOG: started streaming WAL from primary at 0/200 on timeline 1 FATAL: backup_label contains data inconsistent with control file HINT: This means that the backup is corrupted and you will have to use another backup for recovery. After the fix I just pushed that sequence works. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby
On 12/11/2014 05:45 AM, Andres Freund wrote: A customer recently reported getting backup_label contains data inconsistent with control file after taking a basebackup from a standby and starting it with a typo in primary_conninfo. When starting postgres from a basebackup StartupXLOG() has the follow code to deal with backup labels: if (haveBackupLabel) { ControlFile-backupStartPoint = checkPoint.redo; ControlFile-backupEndRequired = backupEndRequired; if (backupFromStandby) { if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY) ereport(FATAL, (errmsg(backup_label contains data inconsistent with control file), errhint(This means that the backup is corrupted and you will have to use another backup for recovery.))); ControlFile-backupEndPoint = ControlFile-minRecoveryPoint; } } while I'm not enthusiastic about the error message, that bit of code looks sane at first glance. We certainly expect the control file to indicate we're in recovery. Since we're unlinking the backup label shortly afterwards we'd normally not expect to hit that case after a shutdown in recovery. Check. The problem is that after reading the backup label we also have to read the corresponding checkpoing from pg_xlog. If primary_conninfo and/or restore_command are misconfigured and can't restore files that can only be fixed by shutting down the cluster and fixing up recovery.conf - which sets DB_SHUTDOWNED_IN_RECOVERY in the control file. No it doesn't. The state is set to DB_SHUTDOWNED_IN_RECOVERY in CreateRestartPoint(). If you shut down the server before it has even read the initial checkpoint record, it will not attempt to create a restartpoint nor update the control file. The easiest solution seems to be to simply also allow that as a state in the above check. It might be nicer to not allow a ShutdownXLOG to modify the control file et al at that stage, but I think that'd end up being more invasive. A short search shows that that also looks like a credible explanation for #12128... Yeah. I was not able to reproduce this, but I'm clearly missing something, since both you and Sergey have seen this happening. Can you write a script to reproduce? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby
On December 11, 2014 9:56:09 AM CET, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/11/2014 05:45 AM, Andres Freund wrote: A customer recently reported getting backup_label contains data inconsistent with control file after taking a basebackup from a standby and starting it with a typo in primary_conninfo. When starting postgres from a basebackup StartupXLOG() has the follow code to deal with backup labels: if (haveBackupLabel) { ControlFile-backupStartPoint = checkPoint.redo; ControlFile-backupEndRequired = backupEndRequired; if (backupFromStandby) { if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY) ereport(FATAL, (errmsg(backup_label contains data inconsistent with control file), errhint(This means that the backup is corrupted and you will have to use another backup for recovery.))); ControlFile-backupEndPoint = ControlFile-minRecoveryPoint; } } while I'm not enthusiastic about the error message, that bit of code looks sane at first glance. We certainly expect the control file to indicate we're in recovery. Since we're unlinking the backup label shortly afterwards we'd normally not expect to hit that case after a shutdown in recovery. Check. The problem is that after reading the backup label we also have to read the corresponding checkpoing from pg_xlog. If primary_conninfo and/or restore_command are misconfigured and can't restore files that can only be fixed by shutting down the cluster and fixing up recovery.conf - which sets DB_SHUTDOWNED_IN_RECOVERY in the control file. No it doesn't. The state is set to DB_SHUTDOWNED_IN_RECOVERY in CreateRestartPoint(). If you shut down the server before it has even read the initial checkpoint record, it will not attempt to create a restartpoint nor update the control file. Yes, it does. There's a shortcut that just sets the state in the control file and then exits. The easiest solution seems to be to simply also allow that as a state in the above check. It might be nicer to not allow a ShutdownXLOG to modify the control file et al at that stage, but I think that'd end up being more invasive. A short search shows that that also looks like a credible explanation for #12128... Yeah. I was not able to reproduce this, but I'm clearly missing something, since both you and Sergey have seen this happening. Can you write a script to reproduce? Not right now, I only have my mobile... Its quite easy though. Create a pg-basebackup from a standby. Create a recovery.conf with a broken primary conninfo. Start. Shutdown. Fix conninfo. Start. Andres -- Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby
Il 11/12/14 12:38, Andres Freund ha scritto: On December 11, 2014 9:56:09 AM CET, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/11/2014 05:45 AM, Andres Freund wrote: Yeah. I was not able to reproduce this, but I'm clearly missing something, since both you and Sergey have seen this happening. Can you write a script to reproduce? Not right now, I only have my mobile... Its quite easy though. Create a pg-basebackup from a standby. Create a recovery.conf with a broken primary conninfo. Start. Shutdown. Fix conninfo. Start. Just tested it. There steps are not sufficient to reproduce the issue on a test installation. I suppose because, on small test datadir, the checkpoint location and the redo location on the pg_control are the same present in the backup_label. To trigger this bug you need to have at least a restartpoint happened on standby between the start and the end of the backup. you could simulate it issuing a checkpoint on master, a checkpoint on standby (to force a restartpoint), then copying the pg_control from the standby. This way I've been able to reproduce it. Regards, Marco -- Marco Nenciarini - 2ndQuadrant Italy PostgreSQL Training, Services and Support marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it signature.asc Description: OpenPGP digital signature