On 11-09-26 10:56 PM, Fujii Masao wrote:

Looks weird. The WAL record starting at 0/6000298 was read successfully, but then re-fetching the same record fails at the end of recovery. One possible cause is corruption of an archived WAL file. Which restore_command on the standby and archive_command on the master are you using? Could you confirm that there is no chance of archived WAL files being overwritten in your environment?

I tried to reproduce this problem several times, but I could not. Could
you provide the test case which reproduces the problem?


This is the test procedure I tried today; I wasn't able to reproduce the crash. What I was doing the other day was similar, but I can't speak to unintentional differences.


I have my master server (data directory "data") with:
port=5439
wal_level=hot_standby
archive_mode=on
archive_command='cp -i %p /usr/local/pgsql92git/archive/%f'
hot_standby=on
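
For completeness, the archive directory has to exist for archive_command's cp to succeed, and the master gets started like the other clusters below (a sketch; the exact start command isn't in my notes, and I'm assuming the current directory is the install's bin directory):

$ mkdir -p /usr/local/pgsql92git/archive
$ postgres -D ../data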

I then run
select pg_start_backup('foo');
$ rm -r ../data2
$ cp -r ../data ../data2
$ rm ../data2/postmaster.pid
select pg_stop_backup();
I edit data2/postgresql.conf so that
port=5438
and I comment out archive_mode and archive_command (or at least today I did).
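
So data2/postgresql.conf ends up roughly like this (a sketch of just the lines I touched):

port=5438
#archive_mode=on
#archive_command='cp -i %p /usr/local/pgsql92git/archive/%f'
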
recovery.conf is

standby_mode='on'
primary_conninfo='host=127.0.0.1 port=5439 user=ssinger dbname=test'
restore_command='cp /usr/local/pgsql92git/archive/%f %p'

I then start up the second cluster. On it I run

select pg_start_backup('1');
$ rm -r ../data3
$ rm -r ../archive2
$ cp -r ../data2 ../data3
$ cp ../data2/global/pg_control ../data3/global

select pg_stop_backup();
I edit ../data3/postgresql.conf
port=5437
archive_mode=on                 # (change requires restart)
archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'

recovery.conf is

standby_mode='on'
primary_conninfo='host=127.0.0.1 port=5439 user=ssinger dbname=test'
restore_command='cp /usr/local/pgsql92git/archive/%f %p'
trigger_file='/tmp/3'

$ postgres -D ../data3

The first time I did this postgres came up quickly.

$ touch /tmp/3

worked fine.

I then stopped data3
$ rm -r ../data3
on data2 I run
select pg_start_backup('1');
$ cp -r ../data2 ../data3
$ cp ../data2/global/pg_control ../data3/global
select pg_stop_backup() # on data2
$ rm ../data3/postmaster.pid
vi ../data3/postgresql.conf # same changes as above for data3
vi ../data3/recovery.conf # same as above for data3
postgres -D ../data3

This time I got
./postgres -D ../data3
LOG:  database system was interrupted while in recovery at log time 2011-09-27 22:04:17 GMT
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG:  entering standby mode
cp: cannot stat `/usr/local/pgsql92git/archive/00000001000000000000000C': No such file or directory
LOG:  redo starts at 0/C000020
LOG:  record with incorrect prev-link 0/9000058 at 0/C0000B0
cp: cannot stat `/usr/local/pgsql92git/archive/00000001000000000000000C': No such file or directory
LOG:  streaming replication successfully connected to primary
FATAL:  the database system is starting up
FATAL:  the database system is starting up
LOG:  consistent recovery state reached at 0/C0000E8
LOG:  database system is ready to accept read only connections

In order to get the database to come up in read-only mode I manually issued a checkpoint on the master (data); shortly after the checkpoint command, the data3 instance went to read-only mode.
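
The checkpoint itself was just (assuming psql against the master on port 5439 and the test database):

$ psql -p 5439 test -c "CHECKPOINT;"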

then

touch /tmp/3

LOG:  trigger file found: /tmp/3
FATAL:  terminating walreceiver process due to administrator command
cp: cannot stat `/usr/local/pgsql92git/archive/00000001000000000000000C': No such file or directory
LOG:  record with incorrect prev-link 0/9000298 at 0/C0002F0
cp: cannot stat `/usr/local/pgsql92git/archive/00000001000000000000000C': No such file or directory
LOG:  redo done at 0/C000298
cp: cannot stat `/usr/local/pgsql92git/archive/00000001000000000000000C': No such file or directory
cp: cannot stat `/usr/local/pgsql92git/archive/00000002.history': No such file or directory
LOG:  selected new timeline ID: 2
cp: cannot stat `/usr/local/pgsql92git/archive/00000001.history': No such file or directory
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started


It looks like data3 is still pulling files with restore_command after it sees the trigger file (is this expected behaviour?)
$ grep archive ../data3/postgresql.conf
#wal_level = minimal            # minimal, archive, or hot_standby
#archive_mode = off        # allows archiving to be done
archive_mode=on
archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'
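
One way to double-check where the promoted data3 is archiving to is to see what lands in archive2 (assuming it has completed at least one WAL segment since promotion):

$ ls /usr/local/pgsql92git/archive2/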


I have NOT been able to make postgres crash during a recovery (today). It is *possible* that on some of my runs the other day I had skipped changing the archive command on data3 to write to archive2 instead of archive.
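
In other words, the suspected mistake on those earlier runs would have been data3 archiving back into the directory it also restores from, i.e. roughly (a hypothetical reconstruction):

archive_command='cp -i %p /usr/local/pgsql92git/archive/%f'     # suspect: same directory as restore_command reads from
archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'    # what I ran today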

I have also today not been able to get it to attempt to restore the same WAL file twice.
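
For what it's worth, here is the whole procedure above collapsed into one rough script, in case it helps as a starting point for a test case. This is only a sketch: the psql and sleep invocations are my guesses at scripting what I did interactively, and it may differ from my original runs in unintentional ways.

# Assumes the master (../data, port 5439) is already running with the settings above,
# and that the current directory is the install's bin directory.
mkdir -p /usr/local/pgsql92git/archive /usr/local/pgsql92git/archive2

# base backup of the master into ../data2
psql -p 5439 test -c "select pg_start_backup('foo');"
rm -r ../data2
cp -r ../data ../data2
rm ../data2/postmaster.pid
psql -p 5439 test -c "select pg_stop_backup();"
# edit ../data2/postgresql.conf: port=5438, comment out archive_mode/archive_command
# create ../data2/recovery.conf: standby_mode, primary_conninfo, restore_command as above
postgres -D ../data2 &
sleep 10        # give it time to reach hot standby

# base backup of the standby (data2) into ../data3
psql -p 5438 test -c "select pg_start_backup('1');"
rm -r ../data3
cp -r ../data2 ../data3
cp ../data2/global/pg_control ../data3/global
psql -p 5438 test -c "select pg_stop_backup();"
rm ../data3/postmaster.pid
# edit ../data3/postgresql.conf: port=5437, archive_mode=on,
#   archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'
# create ../data3/recovery.conf: as above plus trigger_file='/tmp/3'
postgres -D ../data3 &
sleep 10        # let it reach a consistent state (a checkpoint on the master may be needed)

# promote data3
touch /tmp/3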


If a base backup is in progress on a recovery database and that recovery
database is promoted to master, then following the promotion (if you don't
restart the postmaster) I see
select pg_stop_backup();
ERROR:  database system status mismatches between pg_start_backup() and pg_stop_backup()
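
The sequence that gets there is roughly (reconstructed from memory; the trigger file path is the one from my setup):

-- on the standby, while it is still in recovery:
select pg_start_backup('foo');
-- promote it without restarting the postmaster:
$ touch /tmp/3
-- once it has left recovery:
select pg_stop_backup();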

If you restart the postmaster this goes away.  When the postmaster leaves
recovery mode I think it should either abort an existing base backup, so that
pg_stop_backup() will say that no backup is in progress,
I don't think it's a good idea to cancel the backup when promoting the standby,
because if we do so, we need to correctly handle the case where cancellation of
the backup and pg_start_backup/pg_stop_backup are performed at the same time.
We could do that simply by protecting those whole operations, including
pg_start_backup's checkpoint, with an lwlock. But I don't think it's worth
introducing a new lwlock only for that, and it's not good to hold an lwlock
through a time-consuming checkpoint operation. Of course we could avoid such an
lwlock, but that would require more complicated code.

or give an error message on
pg_stop_backup() saying that the base backup won't be usable.  The above
error doesn't really tell the user why there is a mismatch.
What about the following error message?

ERROR:  pg_stop_backup() was executed during normal processing though pg_start_backup() was executed during recovery
HINT:  The database backup will not be usable.

Or do you have a better idea?

I like that error message better. It tells me what is going on versus complaining about a state mismatch.
In my testing, a few times I got into a situation where a standby server
coming from a recovery target took a while to finish recovery (this is on a
database with no activity).  Then when I tried promoting that server to
master I got

LOG:  trigger file found: /tmp/3
FATAL:  terminating walreceiver process due to administrator command
LOG:  restored log file "000000010000000000000009" from archive
LOG:  restored log file "000000010000000000000009" from archive
LOG:  redo done at 0/90000E8
LOG:  restored log file "000000010000000000000009" from archive
PANIC:  unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG:  startup process (PID 1804) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes

It is *possible* I mixed up the order of a step somewhere since my testing
isn't script based. A standby server that 'looks' okay but can't actually be
promoted is dangerous.
This looks like the same problem as the above. Another weird point is that
the same archived WAL file is restored twice before redo is done.
I'm not sure why this happens... Could you provide a test case which
reproduces this problem? I will diagnose it.

Regards,


