Re: [HACKERS] pg_rewind test race condition..?
On Wed, Apr 29, 2015 at 10:03 PM, Stephen Frost wrote:
> Another thought would be to provide an option to pg_rewind to have it do
> an explicit checkpoint before it reads the control file.. I'm not
> against having it simply always do it as I don't see pg_rewind being a
> commonly run thing, but I know some environments have very heavy
> checkpoints and that might not be ideal.

Waking an old thread. It seems to me that such an option would still be useful when rewinding using a stream, so that an inconsistent control file is not read. The default should definitely be to not issue this checkpoint, but in some environments I believe that this could really matter, say two failovers followed by two rewinds done consecutively. Thoughts?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
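[Editor's note: until such an option exists, the behavior discussed here can be approximated from outside pg_rewind. A minimal sketch, assuming a wrapper of this shape (`checkpoint_cmd`, `rewind_cmd`, and `rewind_with_checkpoint` are hypothetical names; `--target-pgdata` and `--source-server` are pg_rewind's real flags): force a CHECKPOINT on the promoted source via psql, then run pg_rewind.]

```python
import subprocess

def checkpoint_cmd(conninfo):
    # Force a checkpoint on the source so its control file reflects
    # the post-promotion timeline before pg_rewind reads it.
    return ["psql", "-d", conninfo, "-Atc", "CHECKPOINT;"]

def rewind_cmd(target_pgdata, conninfo):
    # The actual rewind, using pg_rewind's documented flags.
    return ["pg_rewind", "--target-pgdata", target_pgdata,
            "--source-server", conninfo]

def rewind_with_checkpoint(target_pgdata, conninfo, run=subprocess.check_call):
    # run() is injectable so the sequencing can be exercised without
    # a live cluster.
    run(checkpoint_cmd(conninfo))
    run(rewind_cmd(target_pgdata, conninfo))
```

[This only narrows the window rather than closing it by protocol, which is why an option inside pg_rewind itself is attractive.]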
Re: [HACKERS] pg_rewind test race condition..?
Hi,

On Wed, Apr 29, 2015, at 01:36 AM, Heikki Linnakangas wrote:
> The problem seems to be that when the standby is promoted, it's a
> so-called "fast promotion", where it writes an end-of-recovery record
> and starts accepting queries before creating a real checkpoint.
> pg_rewind looks at the TLI in the latest checkpoint, as it's in the
> control file, but that isn't updated until the checkpoint completes. I
> don't see it on my laptop normally, but I can reproduce it if I insert a
> "sleep(5)" in StartupXLog, just before it requests the checkpoint:
>
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>      * than is appropriate now that we're not in standby mode anymore.
>      */
>     if (fast_promoted)
> +   {
> +       sleep(5);
>         RequestCheckpoint(CHECKPOINT_FORCE);
> +   }
>  }
>
> The simplest fix would be to force a checkpoint in the regression test,
> before running pg_rewind. It's a bit of a cop out, since you'd still get
> the same issue when you tried to do the same thing in the real world. It
> should be rare in practice - you'd not normally run pg_rewind
> immediately after promoting the standby - but a better error message at
> least would be nice..

I think we came across this issue in production. We run a Python daemon called 'Patroni' (https://github.com/zalando/patroni) in order to automate failovers. It stops the current master before promoting one of the replicas, and runs pg_rewind on the former master to make it "compatible" with the new one. Naturally, since pg_rewind is launched automatically, it is called right after detecting that the new master is running. What we sometimes get is "source and target cluster are on the same timeline". Per the explanation from Heikki, this happens when pg_rewind connects to the new master after promotion but before the first checkpoint on the new timeline.
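[Editor's note: the race is visible directly in pg_controldata output. pg_rewind compares the "Latest checkpoint's TimeLineID" of source and target, and until the post-promotion checkpoint completes the promoted node still reports the old timeline. A small sketch of that comparison, assuming helpers of this shape (the function names are hypothetical; the field label is the real pg_controldata one):]

```python
import re

def checkpoint_tli(controldata_text):
    # Extract "Latest checkpoint's TimeLineID" from pg_controldata
    # output; this is the control-file field pg_rewind consults.
    m = re.search(r"Latest checkpoint's TimeLineID:\s*(\d+)", controldata_text)
    if m is None:
        raise ValueError("no 'Latest checkpoint's TimeLineID' line found")
    return int(m.group(1))

def same_timeline(source_text, target_text):
    # When these values match, pg_rewind reports "source and target
    # cluster are on the same timeline" and exits.
    return checkpoint_tli(source_text) == checkpoint_tli(target_text)
```

[Right after a fast promotion both nodes can still report the pre-promotion timeline here, even though the promoted node is already serving queries on the new one.]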
Since "fast promotion" is the only option for pg_ctl promote, I think it might be beneficial to add an option for pg_rewind to issue a checkpoint (I worked around this problem by adding a checkpoint call to Patroni; another way would be to create the 'fallback_promote' file and do the 'slow' promote without relying on pg_ctl promote functionality).

There is another weird case I cannot explain easily. If I do a single promotion/rewind while executing an insert that takes quite some time (insert into test select id from generate_series(1, 1000) id) and actually interrupt it by the Patroni promotion (which does a checkpoint and pg_ctl -mf stop on the former master, and pg_ctl promote on the replica candidate), I'm getting the following (postgresql0 is the rewound node; note that the latest checkpoint location there is lower than the prior checkpoint one):

$ pg_controldata data/postgresql0
pg_control version number:            942
Catalog version number:               201511071
Database system identifier:           6225948429980199864
Database cluster state:               in archive recovery
pg_control last modified:             Tue Dec 8 17:20:16 2015
Latest checkpoint location:           0/628
Prior checkpoint location:            0/6000138
Latest checkpoint's REDO location:    0/628
Latest checkpoint's REDO WAL file:    00010006
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/944
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:            Tue Dec 8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/6020580
Min recovery ending loc's timeline:   2
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    hot_standby
wal_log_hints setting:                on
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float
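[Editor's note: the Patroni-style workaround amounts to waiting, after promotion, until the promoted node's control file shows the new timeline before invoking pg_rewind. A hedged sketch of such a loop (`wait_for_timeline` is a hypothetical helper; `get_tli` would wrap pg_controldata in practice, and is injected here so the loop stands alone):]

```python
import time

def wait_for_timeline(get_tli, expected_tli, retries=10, delay=1.0,
                      sleep=time.sleep):
    # Poll the control file's checkpoint TimeLineID until the forced
    # post-promotion checkpoint has made the new timeline visible,
    # or give up after `retries` attempts.
    for _ in range(retries):
        if get_tli() >= expected_tli:
            return True
        sleep(delay)
    return False
```

[Issuing an explicit CHECKPOINT on the promoted node first, as Patroni does, makes the wait short; without it the loop may have to ride out the whole checkpoint_timeout-driven spread of a fast promotion's checkpoint.]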
Re: [HACKERS] pg_rewind test race condition..?
Heikki,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 04/29/2015 06:03 AM, Stephen Frost wrote:
> > Forcing a checkpoint in the regression tests and then providing a better
> > error message sounds reasonable to me. I agree that it's very unlikely
> > to happen in the real world, even when you're bouncing between systems
> > for upgrades, etc, you're unlikely to do it fast enough for this issue
> > to exhibit itself, and a better error message would help any users who
> > manage to run into this (perhaps during their own testing).
>
> I've committed this simple fix for now.

Looks good, I'm no longer seeing the failures. Thanks!

Stephen
Re: [HACKERS] pg_rewind test race condition..?
On 04/29/2015 06:03 AM, Stephen Frost wrote:
> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>>      * than is appropriate now that we're not in standby mode anymore.
>>      */
>>     if (fast_promoted)
>> +   {
>> +       sleep(5);
>>         RequestCheckpoint(CHECKPOINT_FORCE);
>> +   }
>>  }
>>
>> The simplest fix would be to force a checkpoint in the regression test,
>> before running pg_rewind. It's a bit of a cop out, since you'd still get
>> the same issue when you tried to do the same thing in the real world. It
>> should be rare in practice - you'd not normally run pg_rewind
>> immediately after promoting the standby - but a better error message at
>> least would be nice..
>
> Forcing a checkpoint in the regression tests and then providing a better
> error message sounds reasonable to me. I agree that it's very unlikely
> to happen in the real world, even when you're bouncing between systems
> for upgrades, etc, you're unlikely to do it fast enough for this issue
> to exhibit itself, and a better error message would help any users who
> manage to run into this (perhaps during their own testing).

I've committed this simple fix for now.

- Heikki
Re: [HACKERS] pg_rewind test race condition..?
Heikki,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> The problem seems to be that when the standby is promoted, it's a
> so-called "fast promotion", where it writes an end-of-recovery
> record and starts accepting queries before creating a real
> checkpoint. pg_rewind looks at the TLI in the latest checkpoint, as
> it's in the control file, but that isn't updated until the
> checkpoint completes. I don't see it on my laptop normally, but I
> can reproduce it if I insert a "sleep(5)" in StartupXLog, just
> before it requests the checkpoint:

Ah, interesting.

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>      * than is appropriate now that we're not in standby mode anymore.
>      */
>     if (fast_promoted)
> +   {
> +       sleep(5);
>         RequestCheckpoint(CHECKPOINT_FORCE);
> +   }
>  }
>
> The simplest fix would be to force a checkpoint in the regression
> test, before running pg_rewind. It's a bit of a cop out, since you'd
> still get the same issue when you tried to do the same thing in the
> real world. It should be rare in practice - you'd not normally run
> pg_rewind immediately after promoting the standby - but a better
> error message at least would be nice..

Forcing a checkpoint in the regression tests and then providing a better error message sounds reasonable to me. I agree that it's very unlikely to happen in the real world, even when you're bouncing between systems for upgrades, etc, you're unlikely to do it fast enough for this issue to exhibit itself, and a better error message would help any users who manage to run into this (perhaps during their own testing).

Another thought would be to provide an option to pg_rewind to have it do an explicit checkpoint before it reads the control file.. I'm not against having it simply always do it as I don't see pg_rewind being a commonly run thing, but I know some environments have very heavy checkpoints and that might not be ideal.

Thanks!

Stephen
Re: [HACKERS] pg_rewind test race condition..?
On 04/28/2015 11:02 AM, Stephen Frost wrote:
> Heikki,
>
> Not sure if anyone else is seeing this, but I'm getting regression test
> failures when running the pg_rewind tests pretty consistently with
> 'make check'. Specifically with "basic remote", I'm getting:
>
>   source and target cluster are on the same timeline
>   Failure, exiting
>
> in regress_log/pg_rewind_log_basic_remote. If I throw a "sleep(5);"
> into t/001_basic.pl before the call to
> RewindTest::run_pg_rewind($test_mode); then everything works fine.

The problem seems to be that when the standby is promoted, it's a so-called "fast promotion", where it writes an end-of-recovery record and starts accepting queries before creating a real checkpoint. pg_rewind looks at the TLI in the latest checkpoint, as it's in the control file, but that isn't updated until the checkpoint completes. I don't see it on my laptop normally, but I can reproduce it if I insert a "sleep(5)" in StartupXLog, just before it requests the checkpoint:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7173,7 +7173,10 @@ StartupXLOG(void)
     * than is appropriate now that we're not in standby mode anymore.
     */
    if (fast_promoted)
+   {
+       sleep(5);
        RequestCheckpoint(CHECKPOINT_FORCE);
+   }
 }

The simplest fix would be to force a checkpoint in the regression test, before running pg_rewind. It's a bit of a cop out, since you'd still get the same issue when you tried to do the same thing in the real world. It should be rare in practice - you'd not normally run pg_rewind immediately after promoting the standby - but a better error message at least would be nice..

- Heikki
[HACKERS] pg_rewind test race condition..?
Heikki,

Not sure if anyone else is seeing this, but I'm getting regression test failures when running the pg_rewind tests pretty consistently with 'make check'. Specifically with "basic remote", I'm getting:

  source and target cluster are on the same timeline
  Failure, exiting

in regress_log/pg_rewind_log_basic_remote. If I throw a "sleep(5);" into t/001_basic.pl before the call to RewindTest::run_pg_rewind($test_mode); then everything works fine.

./configure --silent --prefix="$INSTALL" --with-openssl --with-tcl --with-tclconfig=/usr/lib/tcl8.6 --with-perl --enable-debug --enable-cassert --enable-tap-tests --with-gssapi

Full unsuccessful pg_rewind_log_basic_remote output:
-----
waiting for server to start...
LOG: database system was shut down at 2015-04-28 13:46:34 EDT
LOG: database system is ready to accept connections
done
server started
waiting for server to start...
LOG: database system was interrupted; last known up at 2015-04-28 13:46:35 EDT
LOG: entering standby mode
LOG: redo starts at 0/228
LOG: consistent recovery state reached at 0/2F8
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/300 on timeline 1
done
server started
server promoting
LOG: received promote request
FATAL: terminating walreceiver process due to administrator command
LOG: invalid record length at 0/30EB410
LOG: redo done at 0/30EB3A0
LOG: last completed transaction was at log time 2015-04-28 13:46:37.540952-04
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
waiting for server to shut down...
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: shutting down
LOG: database system is shut down
done
server stopped
source and target cluster are on the same timeline
Failure, exiting
waiting for server to start...
LOG: database system was shut down at 2015-04-28 13:46:39 EDT
LOG: entering standby mode
LOG: consistent recovery state reached at 0/3311438
LOG: invalid record length at 0/3311438
LOG: database system is ready to accept read only connections
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/300 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/30EB410.
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438
LOG: restarted WAL streaming at 0/300 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/30EB410.
LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/3311438
done
server started
LOG: received immediate shutdown request
LOG: received immediate shutdown request
-----

Full successful pg_rewind_log_basic_remote output:
-----
waiting for server to start...
LOG: database system was shut down at 2015-04-28 13:54:04 EDT
LOG: database system is ready to accept connections
done
server started
waiting for server to start...
LOG: database system was interrupted; last known up at 2015-04-28 13:54:05 EDT
LOG: entering standby mode
LOG: redo starts at 0/228
LOG: consistent recovery state reached at 0/2F8
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/300 on timeline 1
done
server started
server promoting
LOG: received promote request
FATAL: terminating walreceiver process due to administrator command
LOG: invalid record length at 0/30EB410
LOG: redo done at 0/30EB3A0
LOG: last completed transaction was at log time 2015-04-28 13:54:07.547533-04
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
waiting for server to shut down...
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: shutting down
LOG: database system is shut down
done
server stopped
The servers diverged at WAL position 0/30EB410 on timeline 1.
Rewinding from last common checkpoint at 0/30EB3A0 on timeline 1
Done!
waiting for server to start...
LOG: database system was interrupted while in recovery at log time 2015-04-28 13:54:08 EDT
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 0/30EB368
LOG: consistent recovery state reached at 0/3113788
LOG: invalid record length at 0/3113788
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/300 on timeline 2
done
server started
LOG: received immediate shutdown request