Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby

2014-12-18 Thread Andres Freund
On 2014-12-16 18:37:48 +0200, Heikki Linnakangas wrote:
 On 12/11/2014 04:21 PM, Marco Nenciarini wrote:
 Il 11/12/14 12:38, Andres Freund ha scritto:
 On December 11, 2014 9:56:09 AM CET, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:
 On 12/11/2014 05:45 AM, Andres Freund wrote:
 
 Yeah. I was not able to reproduce this, but I'm clearly missing
 something, since both you and Sergey have seen this happening. Can you
 write a script to reproduce?
 
 Not right now, I only have my mobile... Its quite easy though. Create a 
 pg-basebackup from a standby. Create a recovery.conf with a broken primary 
 conninfo. Start. Shutdown. Fix conninfo. Start.
 
 
 Just tested it. There steps are not sufficient to reproduce the issue on
 a test installation. I suppose because, on small test datadir, the
 checkpoint location and the redo location on the pg_control are the same
 present in the backup_label.
 
 To trigger this bug you need to have at least a restartpoint happened on
 standby between the start and the end of the backup.
 
 you could simulate it issuing a checkpoint on master, a checkpoint on
 standby (to force a restartpoint), then copying the pg_control from the
 standby.
 
 This way I've been able to reproduce it.
 
 Ok, got it. I was able to reproduce this by using pg_basebackup
 --max-rate=1024, and issuing CHECKPOINT in the standby while the backup
 was running.

FWIW, I can reproduce it without any such hangups. I've just tested it
using my local scripts:
# create primary
$ reinit-pg-dev-master
$ run-pg-dev-master
# create first standby
$ reinit-pg-dev-master-standby
$ run-pg-dev-master-standby
# create 2nd standby
$ pg_basebackup -h /tmp/ -p 5441 -D /tmp/tree --write-recovery-conf
$ PGHOST=frakbar run-pg-dev-master-standby -D /tmp/tree
LOG:  creating missing WAL directory pg_xlog/archive_status
LOG:  entering standby mode
FATAL:  could not connect to the primary server: could not translate host name 
frakbar to address: Name or service not known
$ PGHOST=/tmp run-pg-dev-master-standby -D /tmp/tree
LOG:  started streaming WAL from primary at 0/200 on timeline 1
FATAL:  backup_label contains data inconsistent with control file
HINT:  This means that the backup is corrupted and you will have to use another 
backup for recovery.

After the fix I just pushed that sequence works.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby

2014-12-11 Thread Heikki Linnakangas

On 12/11/2014 05:45 AM, Andres Freund wrote:

A customer recently reported getting backup_label contains data
inconsistent with control file after taking a basebackup from a standby
and starting it with a typo in primary_conninfo.

When starting postgres from a basebackup StartupXLOG() has the follow
code to deal with backup labels:
 if (haveBackupLabel)
 {
 ControlFile-backupStartPoint = checkPoint.redo;
 ControlFile-backupEndRequired = backupEndRequired;

 if (backupFromStandby)
 {
 if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY)
 ereport(FATAL,
 (errmsg(backup_label contains data inconsistent with 
control file),
  errhint(This means that the backup is corrupted and 
you will 
have to use another backup for recovery.)));
 ControlFile-backupEndPoint = ControlFile-minRecoveryPoint;
 }
 }

while I'm not enthusiastic about the error message, that bit of code
looks sane at first glance. We certainly expect the control file to
indicate we're in recovery. Since we're unlinking the backup label
shortly afterwards we'd normally not expect to hit that case after a
shutdown in recovery.


Check.


The problem is that after reading the backup label we also have to read
the corresponding checkpoing from pg_xlog. If primary_conninfo and/or
restore_command are misconfigured and can't restore files that can only
be fixed by shutting down the cluster and fixing up recovery.conf -
which sets DB_SHUTDOWNED_IN_RECOVERY in the control file.


No it doesn't. The state is set to DB_SHUTDOWNED_IN_RECOVERY in 
CreateRestartPoint(). If you shut down the server before it has even 
read the initial checkpoint record, it will not attempt to create a 
restartpoint nor update the control file.



The easiest solution seems to be to simply also allow that as a state in
the above check. It might be nicer to not allow a ShutdownXLOG to modify
the control file et al at that stage, but I think that'd end up being
more invasive.

A short search shows that that also looks like a credible explanation
for #12128...


Yeah. I was not able to reproduce this, but I'm clearly missing 
something, since both you and Sergey have seen this happening. Can you 
write a script to reproduce?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby

2014-12-11 Thread Andres Freund
On December 11, 2014 9:56:09 AM CET, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:
On 12/11/2014 05:45 AM, Andres Freund wrote:
 A customer recently reported getting backup_label contains data
 inconsistent with control file after taking a basebackup from a
standby
 and starting it with a typo in primary_conninfo.

 When starting postgres from a basebackup StartupXLOG() has the follow
 code to deal with backup labels:
  if (haveBackupLabel)
  {
  ControlFile-backupStartPoint = checkPoint.redo;
  ControlFile-backupEndRequired = backupEndRequired;

  if (backupFromStandby)
  {
  if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY)
  ereport(FATAL,
  (errmsg(backup_label contains data
inconsistent with control file),
   errhint(This means that the backup is
corrupted and you will 
 have to use another backup for
recovery.)));
  ControlFile-backupEndPoint =
ControlFile-minRecoveryPoint;
  }
  }

 while I'm not enthusiastic about the error message, that bit of code
 looks sane at first glance. We certainly expect the control file to
 indicate we're in recovery. Since we're unlinking the backup label
 shortly afterwards we'd normally not expect to hit that case after a
 shutdown in recovery.

Check.

 The problem is that after reading the backup label we also have to
read
 the corresponding checkpoing from pg_xlog. If primary_conninfo and/or
 restore_command are misconfigured and can't restore files that can
only
 be fixed by shutting down the cluster and fixing up recovery.conf -
 which sets DB_SHUTDOWNED_IN_RECOVERY in the control file.

No it doesn't. The state is set to DB_SHUTDOWNED_IN_RECOVERY in 
CreateRestartPoint(). If you shut down the server before it has even 
read the initial checkpoint record, it will not attempt to create a 
restartpoint nor update the control file.

Yes, it does. There's a shortcut that just sets the state in the control file 
and then exits.

 The easiest solution seems to be to simply also allow that as a state
in
 the above check. It might be nicer to not allow a ShutdownXLOG to
modify
 the control file et al at that stage, but I think that'd end up being
 more invasive.

 A short search shows that that also looks like a credible explanation
 for #12128...

Yeah. I was not able to reproduce this, but I'm clearly missing 
something, since both you and Sergey have seen this happening. Can you 
write a script to reproduce?

Not right now, I only have my mobile... Its quite easy though. Create a 
pg-basebackup from a standby. Create a recovery.conf with a broken primary 
conninfo. Start. Shutdown. Fix conninfo. Start.

Andres

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Too strict check when starting from a basebackup taken off a standby

2014-12-11 Thread Marco Nenciarini
Il 11/12/14 12:38, Andres Freund ha scritto:
 On December 11, 2014 9:56:09 AM CET, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:
 On 12/11/2014 05:45 AM, Andres Freund wrote:

 Yeah. I was not able to reproduce this, but I'm clearly missing 
 something, since both you and Sergey have seen this happening. Can you 
 write a script to reproduce?
 
 Not right now, I only have my mobile... Its quite easy though. Create a 
 pg-basebackup from a standby. Create a recovery.conf with a broken primary 
 conninfo. Start. Shutdown. Fix conninfo. Start.
 

Just tested it. There steps are not sufficient to reproduce the issue on
a test installation. I suppose because, on small test datadir, the
checkpoint location and the redo location on the pg_control are the same
present in the backup_label.

To trigger this bug you need to have at least a restartpoint happened on
standby between the start and the end of the backup.

you could simulate it issuing a checkpoint on master, a checkpoint on
standby (to force a restartpoint), then copying the pg_control from the
standby.

This way I've been able to reproduce it.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it



signature.asc
Description: OpenPGP digital signature


[HACKERS] Too strict check when starting from a basebackup taken off a standby

2014-12-10 Thread Andres Freund
Hi,

A customer recently reported getting backup_label contains data
inconsistent with control file after taking a basebackup from a standby
and starting it with a typo in primary_conninfo.

When starting postgres from a basebackup StartupXLOG() has the follow
code to deal with backup labels:
if (haveBackupLabel)
{
ControlFile-backupStartPoint = checkPoint.redo;
ControlFile-backupEndRequired = backupEndRequired;

if (backupFromStandby)
{
if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY)
ereport(FATAL,
(errmsg(backup_label contains data inconsistent 
with control file),
 errhint(This means that the backup is corrupted 
and you will 
   have to use another backup for recovery.)));
ControlFile-backupEndPoint = ControlFile-minRecoveryPoint;
}
}

while I'm not enthusiastic about the error message, that bit of code
looks sane at first glance. We certainly expect the control file to
indicate we're in recovery. Since we're unlinking the backup label
shortly afterwards we'd normally not expect to hit that case after a
shutdown in recovery.

The problem is that after reading the backup label we also have to read
the corresponding checkpoing from pg_xlog. If primary_conninfo and/or
restore_command are misconfigured and can't restore files that can only
be fixed by shutting down the cluster and fixing up recovery.conf -
which sets DB_SHUTDOWNED_IN_RECOVERY in the control file.

The easiest solution seems to be to simply also allow that as a state in
the above check. It might be nicer to not allow a ShutdownXLOG to modify
the control file et al at that stage, but I think that'd end up being
more invasive.

A short search shows that that also looks like a credible explanation
for #12128...

I plan to relax that check unless somebody comes up with a different 
better plan.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers