[HACKERS] Probable problem with pg_standby

Detlef Ulherr Tue, 04 Nov 2008 03:38:28 -0800

Hi,

First to introduce myself, I am working in Sun Cluster engineering and Iam responsible for the integration (the agent) between PostgreSQL andSun Cluster. The PostgreSQL agent provides a feature which uses WAL fileshipping and pg_standby as a replacement for shared storage.

Let's talk about the problem now. Whenever the primary database serverselects a new timeline, the standby server which is running pg_standbystops applying logs to its database. It comes even worse, after awhile pg_standby terminates the recovery mode and now we have primaryand standby accepting requests. there was no trigger file created, nor asignal sent manually to pg_standby.


Here is some debugging output of pg_standby.

running restore         : OK
removing "/pgs/83_walarchives/0000001A00000000000000B5"
LOG:  restored log file "0000001F00000000000000D4" from archive
LOG:  record with zero length at 0/D4000060
LOG:  redo done at 0/D4000020

Trigger file            : /pgs/data/failover
Waiting for WAL file    : 0000001F00000000000000D4
WAL file path           : /pgs/83_walarchives/0000001F00000000000000D4
Restoring to...         : pg_xlog/RECOVERYXLOG
Sleep interval          : 5 seconds
Max wait interval       : 0 forever

Command for restore : cp"/pgs/83_walarchives/0000001F00000000000000D4" "pg_xlog/RECOVERYXLOG"

Keep archive history    : 0000001F00000000000000B6 and later
running restore         : OK
LOG:  restored log file "0000001F00000000000000D4" from archive

Trigger file            : /pgs/data/failover
Waiting for WAL file    : 00000020.history
WAL file path           : /pgs/83_walarchives/00000020.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever

Command for restore : cp "/pgs/83_walarchives/00000020.history""pg_xlog/RECOVERYHISTORY"

Keep archive history    : No cleanup required

running restore : OKLOG: restored log file "00000020.history"from archive


Trigger file            : /pgs/data/failover
Waiting for WAL file    : 00000021.history
WAL file path           : /pgs/83_walarchives/00000021.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever

Command for restore : cp "/pgs/83_walarchives/00000021.history""pg_xlog/RECOVERYHISTORY"

Keep archive history    : No cleanup required

running restore :cp: cannot access/pgs/83_walarchives/00000021.history

cp: cannot access /pgs/83_walarchives/00000021.history
cp: cannot access /pgs/83_walarchives/00000021.history
not restored            : history file not found
LOG:  selected new timeline ID: 33

Trigger file            : /pgs/data/failover
Waiting for WAL file    : 0000001F.history
WAL file path           : /pgs/83_walarchives/0000001F.history
Restoring to...         : pg_xlog/RECOVERYHISTORY
Sleep interval          : 5 seconds
Max wait interval       : 0 forever

Command for restore : cp "/pgs/83_walarchives/0000001F.history""pg_xlog/RECOVERYHISTORY"

Keep archive history    : No cleanup required

running restore : OKLOG: restored log file "0000001F.history"from archive

LOG:  archive recovery complete
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

And here are the corresponding logs from the primary database server.

LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
building file list ... done
0000001D00000000000000D1

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
0000001F.history

sent 2248 bytes  received 42 bytes  4580.00 bytes/sec
total size is 2119  speedup is 0.93
building file list ... done
0000001F00000000000000D2

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
LOG:  database system was shut down at 2008-10-29 14:07:40 CET
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
building file list ... done
0000001F00000000000000D3

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
building file list ... done
0000001D00000000000000D2

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
0000001E00000000000000D2

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  database system was shut down at 2008-10-29 14:10:59 CET
LOG:  starting archive recovery
LOG:  restore_command = 'cp /pgs/83_walarchives/%f %p'
cp: cannot access /pgs/83_walarchives/0000001F.history
cp: cannot access /pgs/83_walarchives/0000001F00000000000000D4
LOG:  automatic recovery in progress
LOG:  record with zero length at 0/D4000060
LOG:  redo is not required
cp: cannot access /pgs/83_walarchives/0000001F00000000000000D4
cp: cannot access /pgs/83_walarchives/00000020.history
LOG:  selected new timeline ID: 32
cp: cannot access /pgs/83_walarchives/0000001F.history
LOG:  archive recovery complete
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
building file list ... done
0000001D00000000000000D3

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
00000020.history

All I did was forcing the primary in a recovery to generate a newtimeline. The installed version was 8.3.4, but the problem is the samewith earlier versions as well. It occurred in 8.2 also. this problem isreproducible all the times. For my agent code I implemented a workaroundwhich guarantees that during a resilvering process the primary and thestandby start at t the same timeline. But my feeling is that the standbyshould go to the same timeline as the primary when he receives thehistory file without disruption, and by all means it should never stopthe recovery unmotivated. This will make a full synchronizationnecessary and in times of larger databases, this may cause major downtimes.


Kind regards

Detlef



--

*****************************************************************************
Detlef Ulherr
Staff Engineer                                  Tel: (++49 6103) 752-248
Availability Engineering                        Fax: (++49 6103) 752-167

Sun Microsystems GmbHAmperestr. 6 mailto:[EMAIL PROTECTED]

63225 Langen                                    http://www.sun.de/
*****************************************************************************

Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551
Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering

*****************************************************************************



--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Probable problem with pg_standby

Reply via email to