Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
Hi, On 2017-06-05 15:30:38 +0900, Michael Paquier wrote: > + * This will trigger walsenders to send the remaining WAL, prevent them from > + * accepting further commands. After that they'll wait till the last WAL is > + * written. > s/prevent/preventing/? > I would rephrase the last sentence a bit: > "After that each WAL sender will wait until the end-of-checkpoint > record has been flushed on the receiver side." I didn't like your proposed phrasing much, but I aggree that what I had wasn't good either. Tried to improve it. Thanks for the review. I pushed this series, this should resolve the issue in this thread entirely, and should fix a good chunk of the issues in the 'walsender and parallelism' thread. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Tue, Jun 6, 2017 at 9:47 AM, Andres Freund wrote: > On 2017-06-05 15:30:38 +0900, Michael Paquier wrote: >> I think that it would be interesting to be able to >> trigger a feedback message using SIGHUP in WAL receivers, refactoring >> at the same time SIGHUP handling for WAL receivers. It is possible for >> example to abuse SIGHUP in autovacuum for cost parameters. > > Could you clarify a bit here, I can't follow? Do you think it's > actually a good idea to combine that with the largely mechanical patch? Sort of. The thought here is to be able to trigger XLogWalRcvSendReply() using a SIGHUP, even if force_reply is not enforced. But looking again at the code, XLogWalRcvSendReply() is processed only if data is received so sending multiple times the same message to server would be pointless. Still, don't you think that it would be helpful to wake up the WAL receiver at will on SIGHUP by setting its latch? XLogWalRcvSendHSFeedback() could be triggered at will using that. ProcessWalRcvInterrupts() could include the checks for SIGHUP by the way... -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-06-05 15:30:38 +0900, Michael Paquier wrote: > I have looked at all those patches. The set looks solid to me. Thanks! > Here are some comments about 0003. > + /* > +* Have WalSndLoop() terminate the connection in an orderly > +* manner, after writing out all the pending data. > +*/ > + if (got_STOPPING) > + got_SIGUSR2 = true; > I think that for correctness the state of the WAL sender should be > switched to WALSNDSTATE_STOPPING in XLogSendLogical() as well. No, that would be wrong. If we switched here, checkpointer would finish waiting, even though XLogSendLogical() might get called again. That e.g. could happen the TCP socket was full, and XLogSendLogical() gets called again. > A more appropriate name would be ConfigReloadPending perhaps? Hm, ok. > 0005 looks like a fine one-liner to me. > > For 0006, you could include as well the removal of worker_spi_sighup() > in the refactoring. Ok. I'll leave that patch for now, since I think it's probably better to apply it only to master once v10 branched off. > I think that it would be interesting to be able to > trigger a feedback message using SIGHUP in WAL receivers, refactoring > at the same time SIGHUP handling for WAL receivers. It is possible for > example to abuse SIGHUP in autovacuum for cost parameters. Could you clarify a bit here, I can't follow? Do you think it's actually a good idea to combine that with the largely mechanical patch? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Mon, Jun 5, 2017 at 10:29 AM, Andres Freund wrote: > On 2017-06-02 17:20:23 -0700, Andres Freund wrote: >> Attached is a *preliminary* patch series implementing this. I've first >> reverted the previous patch, as otherwise backpatchable versions of the >> necessary patches would get too complicated, due to the signals used and >> such. That makes sense. > I went again through this, and the only real thing I found that there > was a leftover prototype in walsender.h. I've in interim worked on > backpatch versions of that series, annoying conflicts, but nothing > really problematic. The only real difference is adding SetLatch() calls > to HandleWalSndInitStopping() < 9.6, and guarding SetLatch with an if < > 9.5. > > As an additional patch (based on one by Petr), even though it more > belongs to > http://archives.postgresql.org/message-id/20170421014030.fdzvvvbrz4nckrow%40alap3.anarazel.de > attached is a patch unifying SIGHUP between normal and walsender > backends. This needs to be backpatched all the way. I've also attached > a second patch, again based on Petr's, that unifies SIGHUP handling > across all the remaining backends, but that's something that probably > more appropriate for v11, although I'm still tempted to commit it > earlier. I have looked at all those patches. The set looks solid to me. 0001 and 0002 are straight-forward things. It makes sense to unify the SIGUSR1 handling. Here are some comments about 0003. + * This will trigger walsenders to send the remaining WAL, prevent them from + * accepting further commands. After that they'll wait till the last WAL is + * written. s/prevent/preventing/? I would rephrase the last sentence a bit: "After that each WAL sender will wait until the end-of-checkpoint record has been flushed on the receiver side." + /* +* Have WalSndLoop() terminate the connection in an orderly +* manner, after writing out all the pending data. +*/ + if (got_STOPPING) + got_SIGUSR2 = true; I think that for correctness the state of the WAL sender should be switched to WALSNDSTATE_STOPPING in XLogSendLogical() as well. About 0004... This definitely meritates a backpatch, PostgresMain() is taken by WAL senders as well when executing queries. - if (got_SIGHUP) + if (ConfigRereadPending) { - got_SIGHUP = false; + ConfigRereadPending = false; A more appropriate name would be ConfigReloadPending perhaps? 0005 looks like a fine one-liner to me. For 0006, you could include as well the removal of worker_spi_sighup() in the refactoring. I think that it would be interesting to be able to trigger a feedback message using SIGHUP in WAL receivers, refactoring at the same time SIGHUP handling for WAL receivers. It is possible for example to abuse SIGHUP in autovacuum for cost parameters. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
Hi, On 2017-06-05 10:31:12 +0900, Michael Paquier wrote: > On Mon, Jun 5, 2017 at 10:29 AM, Andres Freund wrote: > > Michael, Peter, Fujii, is either of you planning to review this? I'm > > planning to commit this tomorrow morning PST, unless somebody protest > > till then. > > Yes, I am. It would be nice if you could let me 24 hours to look at it > in details. Sure. Could you let me know when you're done? Noah, I might thus not be able to resolve most of "Query handling in Walsender is somewhat broken" by tomorrow, but it might end up being Tuesday. Even after that there'll be a remaining item after all these. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Mon, Jun 5, 2017 at 10:29 AM, Andres Freund wrote: > Michael, Peter, Fujii, is either of you planning to review this? I'm > planning to commit this tomorrow morning PST, unless somebody protest > till then. Yes, I am. It would be nice if you could let me 24 hours to look at it in details. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-06-02 17:20:23 -0700, Andres Freund wrote: > Attached is a *preliminary* patch series implementing this. I've first > reverted the previous patch, as otherwise backpatchable versions of the > necessary patches would get too complicated, due to the signals used and > such. I went again through this, and the only real thing I found that there was a leftover prototype in walsender.h. I've in interim worked on backpatch versions of that series, annoying conflicts, but nothing really problematic. The only real difference is adding SetLatch() calls to HandleWalSndInitStopping() < 9.6, and guarding SetLatch with an if < 9.5. As an additional patch (based on one by Petr), even though it more belongs to http://archives.postgresql.org/message-id/20170421014030.fdzvvvbrz4nckrow%40alap3.anarazel.de attached is a patch unifying SIGHUP between normal and walsender backends. This needs to be backpatched all the way. I've also attached a second patch, again based on Petr's, that unifies SIGHUP handling across all the remaining backends, but that's something that probably more appropriate for v11, although I'm still tempted to commit it earlier. Michael, Peter, Fujii, is either of you planning to review this? I'm planning to commit this tomorrow morning PST, unless somebody protest till then. - Andres >From 39f95c9e85811d6759a29b293adc97567d895d69 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 2 Jun 2017 14:14:34 -0700 Subject: [PATCH 1/6] Revert "Prevent panic during shutdown checkpoint" This reverts commit 086221cf6b1727c2baed4703c582f657b7c5350e, which was made to master only. The approach implemented in the above commit has some issues. While those could easily be fixed incrementally, doing so would make backpatching considerably harder, so instead first revert this patch. Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxp...@alap3.anarazel.de --- doc/src/sgml/monitoring.sgml| 5 - src/backend/access/transam/xlog.c | 6 -- src/backend/postmaster/postmaster.c | 7 +- src/backend/replication/walsender.c | 143 src/include/replication/walsender.h | 1 - src/include/replication/walsender_private.h | 3 +- 6 files changed, 24 insertions(+), 141 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 79ca45a156..5640c0d84a 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1690,11 +1690,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i backup: This WAL sender is sending a backup. - - - stopping: This WAL sender is stopping. - - diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 399822d3fe..35ee7d1cb6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8324,12 +8324,6 @@ ShutdownXLOG(int code, Datum arg) ereport(IsPostmasterEnvironment ? LOG : NOTICE, (errmsg("shutting down"))); - /* - * Wait for WAL senders to be in stopping state. This prevents commands - * from writing new WAL. - */ - WalSndWaitStopping(); - if (RecoveryInProgress()) CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); else diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 35b4ec88d3..5c79b1e40d 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2918,7 +2918,7 @@ reaper(SIGNAL_ARGS) * Waken walsenders for the last time. No regular backends * should be around anymore. */ -SignalChildren(SIGINT); +SignalChildren(SIGUSR2); pmState = PM_SHUTDOWN_2; @@ -3656,9 +3656,7 @@ PostmasterStateMachine(void) /* * If we get here, we are proceeding with normal shutdown. All * the regular children are gone, and it's time to tell the - * checkpointer to do a shutdown checkpoint. All WAL senders - * are told to switch to a stopping state so that the shutdown - * checkpoint can go ahead. + * checkpointer to do a shutdown checkpoint. */ Assert(Shutdown > NoShutdown); /* Start the checkpointer if not running */ @@ -3667,7 +3665,6 @@ PostmasterStateMachine(void) /* And tell it to shut down */ if (CheckpointerPID != 0) { - SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND); signal_child(CheckpointerPID, SIGUSR2); pmState = PM_SHUTDOWN; } diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 49cce38880..aa705e5b35 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -24,14 +24,11 @@ * are treated as not a crash but approximately normal termination; * the walsender will exit quickly without sending any more XLOG records. * - * If the server is shut down, p
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-06-01 17:29:12 -0700, Andres Freund wrote: > On 2017-06-02 08:38:51 +0900, Michael Paquier wrote: > > On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund wrote: > > > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler. > > > Normally INT is used cancel interrupts, and since walsender is now also > > > working as a normal backend, this overlap is bad. Even for plain > > > walsender backend this seems bad, because now non-superusers replication > > > users will terminate replication connections when they do > > > pg_cancel_backend(). For replication=dbname users it's especially bad > > > because there can be completely legitimate uses of pg_cancel_backend(). > > > > Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN > > for SIGINT now in ~9.6, and StatementCancelHandler does not get set up > > for a non-am_walsender backend. Am I missing something? > > Yes, but nothing in those observeration actually addresses my point? > > Some points: > > 1) 086221cf6b1727c2baed4703c582f657b7c5350e changes things so walsender >backends use SIGINT for WalSndLastCycleHandler(), which is now >triggerable by pg_cancel_backend(). Especially for logical rep >walsenders it's not absurd sending that. > 2) Walsenders now can run normal queries. > 3) Afaict 086221cf6b1727c2baed4703c582f657b7c5350e doesn't really >address the PANIC problem for database connected walsenders at all, >because it doesn't even cancel non-replication commands. I.e. an >already running query can just continue to run. Which afaict just >entirely breaks shutdown. If the connection is idle, or running a >query, we'll just wait forever. > 4) the whole logic introduced in the above commit doesn't actually >appear to handle logical decoding senders properly - wasn't the whole >issue at hand that those can write WAL in some case? But >nevertheless WalSndWaitForWal() does a >WalSndSetState(WALSNDSTATE_STOPPING); *and then continues decoding >and waiting* - which seems to obviate the entire point of that commit. > > I'm working on a patch rejiggering things so: > > a) upon shutdown checkpointer (so we can use procsignal), not >postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so >we don't have to use up a normal signal handler) > b) Upon reception walsenders immediately exit if !replication_active, >otherwise sets got_STOPPING > c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as >currently done, moving to WALSNDSTATE_STOPPING. Not yet quite sure >how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop(). > d) Once all remaining walsenders are in stopping state, postmaster sends >SIGUSR2 to trigger shutdown (basically as before) > > Does that seem to make sense? Attached is a *preliminary* patch series implementing this. I've first reverted the previous patch, as otherwise backpatchable versions of the necessary patches would get too complicated, due to the signals used and such. This also fixes several of the issues from the somewhat related thread at http://archives.postgresql.org/message-id/20170421014030.fdzvvvbrz4nckrow%40alap3.anarazel.de I'm not perfectly happy with the use of XLogBackgroundFlush() but we don't currently expose anything else to flush all pending WAL afaics - it's not too bad either. Without that we can end up waiting forever while if the last XLogInserts are done by an asynchronously committing backend or the relevant backends exited before getting to flush out their records, because walwriter has already been shut down at that point. Comments? - Andres >From 187f99cdf98886be954cd1edda275c51b83da5ef Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 2 Jun 2017 14:14:34 -0700 Subject: [PATCH 1/4] Revert "Prevent panic during shutdown checkpoint" This reverts commit 086221cf6b1727c2baed4703c582f657b7c5350e, which was made to master only. The approach implemented in the above commit has some issues. While this could easily be fixed incrementally, doing so would make backpatching considerably harder, so instead first revert this patch. Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxp...@alap3.anarazel.de --- doc/src/sgml/monitoring.sgml| 5 - src/backend/access/transam/xlog.c | 6 -- src/backend/postmaster/postmaster.c | 7 +- src/backend/replication/walsender.c | 143 src/include/replication/walsender.h | 1 - src/include/replication/walsender_private.h | 3 +- 6 files changed, 24 insertions(+), 141 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 79ca45a156..5640c0d84a 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1690,11 +1690,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i backup: This WAL sender is sending a backup.
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Thu, Jun 1, 2017 at 6:05 PM, Andres Freund wrote: > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler. > Normally INT is used cancel interrupts, and since walsender is now also > working as a normal backend, this overlap is bad. Yep, that's bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-06-02 10:05:21 +0900, Michael Paquier wrote: > On Fri, Jun 2, 2017 at 9:29 AM, Andres Freund wrote: > > On 2017-06-02 08:38:51 +0900, Michael Paquier wrote: > >> On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund wrote: > >> > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler. > >> > Normally INT is used cancel interrupts, and since walsender is now also > >> > working as a normal backend, this overlap is bad. Even for plain > >> > walsender backend this seems bad, because now non-superusers replication > >> > users will terminate replication connections when they do > >> > pg_cancel_backend(). For replication=dbname users it's especially bad > >> > because there can be completely legitimate uses of pg_cancel_backend(). > >> > >> Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN > >> for SIGINT now in ~9.6, and StatementCancelHandler does not get set up > >> for a non-am_walsender backend. Am I missing something? > > > > Yes, but nothing in those observeration actually addresses my point? > > I am still confused by your previous email, which, at least it seems > to me, implies that logical WAL senders have been working correctly > with query cancellations. Now SIGINT is just ignored, which means that > pg_cancel_backend() has never worked for WAL senders until now, and > this behavior has not changed with 086221c. So there is no new > breakage introduced by this commit. I get your point to reuse SIGINT > for query cancellations though, but that's a new feature. The issue is that the commit made a non-existant feature (pg_cancel_backend() to walsenders) into a broken one (pg_cancel_backend terminates walsenders). Additionally v10 added something new (walsenders executing SQL), and that will need at least some signal handling fixes - hard to do if e.g. SIGINT is reused for something else. > > a) upon shutdown checkpointer (so we can use procsignal), not > >postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so > >we don't have to use up a normal signal handler) > > You'll need a second one that wakes up the latch of the WAL senders to > send more WAL records. Don't think so, procsignal_sigusr1_handler serves quite well for that. There's nearby discussion that we need to do so anyway, to fix recovery conflict interrupts, parallelism interrupts and such. > > b) Upon reception walsenders immediately exit if !replication_active, > >otherwise sets got_STOPPING > > Okay, that's what happens now anyway, any new replication command > received results in an error. I actually prefer the way of doing in > HEAD, which at least reports an error. Err, no. What happens right now is that plainly nothing is done if a connection is idle or busy executing things. Only if a new command is sent we error out - that makes very little sense. > > c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as > >currently done, moving to WALSNDSTATE_STOPPING. Not yet quite sure > >how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop(). > > Wouldn't it make sense to have the logical receivers be able to > receive WAL up to the end of checkpoint record? Yea, that's what I'm doing. For that we really only need to change the check in WalSndWaitForWal() check of got_SIGINT to got_STOPPING, and add a XLogSendLogical() check in the WalSndCaughtUp if() that sets got_SIGUSR2 *without* setting WALSNDSTATE_STOPPING (otherwise we'd possibly continue to trigger wal records until the send buffer is emptied). - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, Jun 2, 2017 at 9:29 AM, Andres Freund wrote: > On 2017-06-02 08:38:51 +0900, Michael Paquier wrote: >> On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund wrote: >> > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler. >> > Normally INT is used cancel interrupts, and since walsender is now also >> > working as a normal backend, this overlap is bad. Even for plain >> > walsender backend this seems bad, because now non-superusers replication >> > users will terminate replication connections when they do >> > pg_cancel_backend(). For replication=dbname users it's especially bad >> > because there can be completely legitimate uses of pg_cancel_backend(). >> >> Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN >> for SIGINT now in ~9.6, and StatementCancelHandler does not get set up >> for a non-am_walsender backend. Am I missing something? > > Yes, but nothing in those observeration actually addresses my point? I am still confused by your previous email, which, at least it seems to me, implies that logical WAL senders have been working correctly with query cancellations. Now SIGINT is just ignored, which means that pg_cancel_backend() has never worked for WAL senders until now, and this behavior has not changed with 086221c. So there is no new breakage introduced by this commit. I get your point to reuse SIGINT for query cancellations though, but that's a new feature. > Some points: > > 1) 086221cf6b1727c2baed4703c582f657b7c5350e changes things so walsender >backends use SIGINT for WalSndLastCycleHandler(), which is now >triggerable by pg_cancel_backend(). Especially for logical rep >walsenders it's not absurd sending that. > 2) Walsenders now can run normal queries. > 3) Afaict 086221cf6b1727c2baed4703c582f657b7c5350e doesn't really >address the PANIC problem for database connected walsenders at all, >because it doesn't even cancel non-replication commands. I.e. an >already running query can just continue to run. Which afaict just >entirely breaks shutdown. If the connection is idle, or running a >query, we'll just wait forever. > 4) the whole logic introduced in the above commit doesn't actually >appear to handle logical decoding senders properly - wasn't the whole >issue at hand that those can write WAL in some case? But >nevertheless WalSndWaitForWal() does a >WalSndSetState(WALSNDSTATE_STOPPING); *and then continues decoding >and waiting* - which seems to obviate the entire point of that commit. > > I'm working on a patch rejiggering things so: > > a) upon shutdown checkpointer (so we can use procsignal), not >postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so >we don't have to use up a normal signal handler) You'll need a second one that wakes up the latch of the WAL senders to send more WAL records. > b) Upon reception walsenders immediately exit if !replication_active, >otherwise sets got_STOPPING Okay, that's what happens now anyway, any new replication command received results in an error. I actually prefer the way of doing in HEAD, which at least reports an error. > c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as >currently done, moving to WALSNDSTATE_STOPPING. Not yet quite sure >how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop(). Wouldn't it make sense to have the logical receivers be able to receive WAL up to the end of checkpoint record? > d) Once all remaining walsenders are in stopping state, postmaster sends >SIGUSR2 to trigger shutdown (basically as before) OK. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-06-02 08:38:51 +0900, Michael Paquier wrote: > On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund wrote: > > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler. > > Normally INT is used cancel interrupts, and since walsender is now also > > working as a normal backend, this overlap is bad. Even for plain > > walsender backend this seems bad, because now non-superusers replication > > users will terminate replication connections when they do > > pg_cancel_backend(). For replication=dbname users it's especially bad > > because there can be completely legitimate uses of pg_cancel_backend(). > > Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN > for SIGINT now in ~9.6, and StatementCancelHandler does not get set up > for a non-am_walsender backend. Am I missing something? Yes, but nothing in those observeration actually addresses my point? Some points: 1) 086221cf6b1727c2baed4703c582f657b7c5350e changes things so walsender backends use SIGINT for WalSndLastCycleHandler(), which is now triggerable by pg_cancel_backend(). Especially for logical rep walsenders it's not absurd sending that. 2) Walsenders now can run normal queries. 3) Afaict 086221cf6b1727c2baed4703c582f657b7c5350e doesn't really address the PANIC problem for database connected walsenders at all, because it doesn't even cancel non-replication commands. I.e. an already running query can just continue to run. Which afaict just entirely breaks shutdown. If the connection is idle, or running a query, we'll just wait forever. 4) the whole logic introduced in the above commit doesn't actually appear to handle logical decoding senders properly - wasn't the whole issue at hand that those can write WAL in some case? But nevertheless WalSndWaitForWal() does a WalSndSetState(WALSNDSTATE_STOPPING); *and then continues decoding and waiting* - which seems to obviate the entire point of that commit. I'm working on a patch rejiggering things so: a) upon shutdown checkpointer (so we can use procsignal), not postmaster, sends PROCSIG_WALSND_INIT_STOPPING to all walsenders (so we don't have to use up a normal signal handler) b) Upon reception walsenders immediately exit if !replication_active, otherwise sets got_STOPPING c) Logical decoding will finish if got_STOPPING is sent, *WITHOUT*, as currently done, moving to WALSNDSTATE_STOPPING. Not yet quite sure how to best handle the WALSNDSTATE_STOPPING transition in WalSndLoop(). d) Once all remaining walsenders are in stopping state, postmaster sends SIGUSR2 to trigger shutdown (basically as before) Does that seem to make sense? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, Jun 2, 2017 at 7:05 AM, Andres Freund wrote: > I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler. > Normally INT is used cancel interrupts, and since walsender is now also > working as a normal backend, this overlap is bad. Even for plain > walsender backend this seems bad, because now non-superusers replication > users will terminate replication connections when they do > pg_cancel_backend(). For replication=dbname users it's especially bad > because there can be completely legitimate uses of pg_cancel_backend(). Signals for WAL senders are set in WalSndSignals() which uses SIG_IGN for SIGINT now in ~9.6, and StatementCancelHandler does not get set up for a non-am_walsender backend. Am I missing something? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-05-05 10:50:11 -0400, Peter Eisentraut wrote: > On 5/5/17 01:26, Michael Paquier wrote: > > The only code path doing HOT-pruning and generating WAL is > > heap_page_prune(). Do you think that we need to worry about FPWs as > > well? > > > > Attached is an updated patch, which also forbids the run of any > > replication commands when the stopping state is reached. > > I have committed this without the HOT pruning change. That can be > considered separately, and I think it could use another round of > thinking about it. I'm a unhappy how this is reusing SIGINT for WalSndLastCycleHandler. Normally INT is used cancel interrupts, and since walsender is now also working as a normal backend, this overlap is bad. Even for plain walsender backend this seems bad, because now non-superusers replication users will terminate replication connections when they do pg_cancel_backend(). For replication=dbname users it's especially bad because there can be completely legitimate uses of pg_cancel_backend(). - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, May 26, 2017 at 4:47 PM, Peter Eisentraut wrote: > On 5/26/17 14:16, Michael Paquier wrote: >> So, now that the last round of minor releases has happened and that >> some dust has settled on this patch, shouldn't there be a backpatch? >> If yes, do you need patches for all branches? This problems goes down >> to 9.2 anyway as BASE_BACKUP can generate end-of-backup records. > > Yes, this could be backpatched now. It looks like it will need a bit of > fiddling to get it into all the backbranches. If you want to give it a > closer look, go ahead please. Attached are patches for 9.2~9.6. There are a couple of conflicts across each version. Particularly in 9.2, I have made the choice to not rename walsender_ready_to_stop to got_SIGINT as this is used as well in basebackup.c to make clearer the code. In 9.3~ the use of this flag is located only within walsender.c. -- Michael walsender-shutdown-96.patch Description: Binary data walsender-shutdown-95.patch Description: Binary data walsender-shutdown-93.patch Description: Binary data walsender-shutdown-94.patch Description: Binary data walsender-shutdown-92.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 5/26/17 14:16, Michael Paquier wrote: > So, now that the last round of minor releases has happened and that > some dust has settled on this patch, shouldn't there be a backpatch? > If yes, do you need patches for all branches? This problems goes down > to 9.2 anyway as BASE_BACKUP can generate end-of-backup records. Yes, this could be backpatched now. It looks like it will need a bit of fiddling to get it into all the backbranches. If you want to give it a closer look, go ahead please. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Sat, May 6, 2017 at 6:40 AM, Michael Paquier wrote: > Agreed. Just adding an ERROR message in XLogInsert() is not going to > help much as this leads also to PANIC for critical sections :( > So a patch really needs to be a no-op for all WAL-related operations > within the WAL sender, and that will be quite invasive I am afraid. > >> I will move the open item to "Older Bugs" now, because the user >> experience regression, so to speak, in version 10 has been addressed. >> (This could be a backpatching candidate, but I am not planning on it for >> next week's releases in any case.) > > No issues with all that. So, now that the last round of minor releases has happened and that some dust has settled on this patch, shouldn't there be a backpatch? If yes, do you need patches for all branches? This problems goes down to 9.2 anyway as BASE_BACKUP can generate end-of-backup records. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, May 5, 2017 at 11:50 PM, Peter Eisentraut wrote: > On 5/5/17 01:26, Michael Paquier wrote: >> The only code path doing HOT-pruning and generating WAL is >> heap_page_prune(). Do you think that we need to worry about FPWs as >> well? >> >> Attached is an updated patch, which also forbids the run of any >> replication commands when the stopping state is reached. > > I have committed this without the HOT pruning change. That can be > considered separately, and I think it could use another round of > thinking about it. Agreed. Just adding an ERROR message in XLogInsert() is not going to help much as this leads also to PANIC for critical sections :( So a patch really needs to be a no-op for all WAL-related operations within the WAL sender, and that will be quite invasive I am afraid. > I will move the open item to "Older Bugs" now, because the user > experience regression, so to speak, in version 10 has been addressed. > (This could be a backpatching candidate, but I am not planning on it for > next week's releases in any case.) No issues with all that. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 5/5/17 01:26, Michael Paquier wrote: > The only code path doing HOT-pruning and generating WAL is > heap_page_prune(). Do you think that we need to worry about FPWs as > well? > > Attached is an updated patch, which also forbids the run of any > replication commands when the stopping state is reached. I have committed this without the HOT pruning change. That can be considered separately, and I think it could use another round of thinking about it. I will move the open item to "Older Bugs" now, because the user experience regression, so to speak, in version 10 has been addressed. (This could be a backpatching candidate, but I am not planning on it for next week's releases in any case.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, May 5, 2017 at 5:33 PM, Pavan Deolasee wrote: > > > On Fri, May 5, 2017 at 10:56 AM, Michael Paquier > wrote: >> >> On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut >> >> >> >>> Can we prevent HOT pruning during logical decoding? >> >> >> >> It does not sound much difficult to do, couldn't you just make it a >> >> no-op with am_walsender? >> > >> > That's my hope. >> >> The only code path doing HOT-pruning and generating WAL is >> heap_page_prune(). Do you think that we need to worry about FPWs as >> well? > > > IMO the check should go inside heap_page_prune_opt(). Do we need to worry > about wal_log_hints or checksums producing WAL because of hint bit updates? > While I haven't read the thread, I am assuming if HOT pruning can happen, > surely hint bits can get set too. Yeah, that's as well what I am worrying about. Experts of logical decoding will correct me, but it seems to me that we have to cover all the cases where heap scans can generate WAL. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, May 5, 2017 at 10:56 AM, Michael Paquier wrote: > On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut > > > >>> Can we prevent HOT pruning during logical decoding? > >> > >> It does not sound much difficult to do, couldn't you just make it a > >> no-op with am_walsender? > > > > That's my hope. > > The only code path doing HOT-pruning and generating WAL is > heap_page_prune(). Do you think that we need to worry about FPWs as > well? > IMO the check should go inside heap_page_prune_opt(). Do we need to worry about wal_log_hints or checksums producing WAL because of hint bit updates? While I haven't read the thread, I am assuming if HOT pruning can happen, surely hint bits can get set too. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Wed, May 3, 2017 at 12:25 AM, Peter Eisentraut wrote: > On 5/2/17 10:08, Michael Paquier wrote: >> On Tue, May 2, 2017 at 9:30 PM, Peter Eisentraut >> wrote: >>> On 5/2/17 03:11, Petr Jelinek wrote: logical decoding can theoretically do HOT pruning (even if the chance is really small) so it's not safe to start logical replication either. >>> >>> This seems a bit impossible to resolve. On the one hand, we want to >>> allow streaming until after the shutdown checkpoint. On the other hand, >>> streaming itself might produce new WAL. >> >> It would be nice to split things into two: >> - patch 1 adding the signal handling that wins a backpatch. >> - patch 2 fixing the side cases with logical decoding. > > The side cases with logical decoding are also not new and would need > backpatching, AIUI. Okay, I thought that there was some new concept part of logical replication here. >>> Can we prevent HOT pruning during logical decoding? >> >> It does not sound much difficult to do, couldn't you just make it a >> no-op with am_walsender? > > That's my hope. The only code path doing HOT-pruning and generating WAL is heap_page_prune(). Do you think that we need to worry about FPWs as well? Attached is an updated patch, which also forbids the run of any replication commands when the stopping state is reached. -- Michael From c8a44b6f84926712b7b6f2b36b4f13b0d1b41977 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 5 May 2017 14:10:16 +0900 Subject: [PATCH] Prevent panic during shutdown checkpoint When the checkpointer writes the shutdown checkpoint, it checks afterwards whether any WAL has been written since it started and throws a PANIC if so. At that point, only walsenders are still active, so one might think this could not happen, but WAL senders can generate WAL in the context of certain replication commands that can be run during the shutdown checkpoint: - certain variants of CREATE_REPLICATION_SLOT. - BASE_BACKUP and the backend end WAL record. - START_REPLICATION and logical decoding, able to do HOT-pruning. To fix this, divide the walsender shutdown into two phases. First, the postmaster sends a SIGUSR2 signal to all walsenders. The walsenders then put themselves into the "stopping" state. In this state, they reject any commands that might generate WAL. The checkpointer waits for all walsenders to reach this state before proceeding with the shutdown checkpoint. After the shutdown checkpoint is done, the postmaster sends SIGINT (previously unused) to the walsenders. This triggers the existing shutdown behavior of sending out the shutdown checkpointer and then terminating. Author: Michael Paquier Reported-by: Fujii Masao --- doc/src/sgml/monitoring.sgml| 5 + src/backend/access/heap/pruneheap.c | 9 ++ src/backend/access/transam/xlog.c | 6 ++ src/backend/postmaster/postmaster.c | 7 +- src/backend/replication/walsender.c | 143 src/include/replication/walsender.h | 1 + src/include/replication/walsender_private.h | 3 +- 7 files changed, 150 insertions(+), 24 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 2a83671b53..80d12b26d7 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1690,6 +1690,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i backup: This WAL sender is sending a backup. + + + stopping: This WAL sender is stopping. + + diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index d69a266c36..d510649b18 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -22,6 +22,7 @@ #include "catalog/catalog.h" #include "miscadmin.h" #include "pgstat.h" +#include "replication/walsender.h" #include "storage/bufmgr.h" #include "utils/snapmgr.h" #include "utils/rel.h" @@ -189,6 +190,14 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin, PruneState prstate; /* + * Do nothing in the presence of a WAL sender. This code path can be + * taken when doing logical decoding, and it is better to avoid WAL + * generation as this interferes with shutdown checkpoints. + */ + if (am_walsender) + return ndeleted; + + /* * Our strategy is to scan the page and make lists of items to change, * then apply the changes within a critical section. This keeps as much * logic as possible out of the critical section, and also ensures that diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index a89d99838a..5d6f8b75b8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8325,6 +8325,12 @@ ShutdownXLOG(int code, Datum arg) ereport(IsPostmasterEnvironment ? LOG : NOTICE, (errmsg("shutting d
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 5/2/17 10:08, Michael Paquier wrote: > On Tue, May 2, 2017 at 9:30 PM, Peter Eisentraut > wrote: >> On 5/2/17 03:11, Petr Jelinek wrote: >>> logical decoding can theoretically >>> do HOT pruning (even if the chance is really small) so it's not safe to >>> start logical replication either. >> >> This seems a bit impossible to resolve. On the one hand, we want to >> allow streaming until after the shutdown checkpoint. On the other hand, >> streaming itself might produce new WAL. > > It would be nice to split things into two: > - patch 1 adding the signal handling that wins a backpatch. > - patch 2 fixing the side cases with logical decoding. The side cases with logical decoding are also not new and would need backpatching, AIUI. >> Can we prevent HOT pruning during logical decoding? > > It does not sound much difficult to do, couldn't you just make it a > no-op with am_walsender? That's my hope. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Tue, May 2, 2017 at 9:30 PM, Peter Eisentraut wrote: > On 5/2/17 03:11, Petr Jelinek wrote: >> logical decoding can theoretically >> do HOT pruning (even if the chance is really small) so it's not safe to >> start logical replication either. > > This seems a bit impossible to resolve. On the one hand, we want to > allow streaming until after the shutdown checkpoint. On the other hand, > streaming itself might produce new WAL. It would be nice to split things into two: - patch 1 adding the signal handling that wins a backpatch. - patch 2 fixing the side cases with logical decoding. > Can we prevent HOT pruning during logical decoding? It does not sound much difficult to do, couldn't you just make it a no-op with am_walsender? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Tue, May 2, 2017 at 9:27 PM, Peter Eisentraut wrote: > On 5/2/17 03:43, Michael Paquier wrote: >>> I don't think the code covers all because a) the SQL queries are not >>> covered at all that I can see and b) logical decoding can theoretically >>> do HOT pruning (even if the chance is really small) so it's not safe to >>> start logical replication either. >> >> Ahhh. So START_REPLICATION can also now generate WAL. Good to know. > > And just looking at pg_current_wal_location(), running BASE_BACKUP also > advances the WAL. Indeed. I forgot the backup end record and the segment switch. We are good for a backpatch down to 9.2 here. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 5/2/17 03:11, Petr Jelinek wrote: > logical decoding can theoretically > do HOT pruning (even if the chance is really small) so it's not safe to > start logical replication either. This seems a bit impossible to resolve. On the one hand, we want to allow streaming until after the shutdown checkpoint. On the other hand, streaming itself might produce new WAL. Can we prevent HOT pruning during logical decoding? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 5/2/17 03:43, Michael Paquier wrote: >> I don't think the code covers all because a) the SQL queries are not >> covered at all that I can see and b) logical decoding can theoretically >> do HOT pruning (even if the chance is really small) so it's not safe to >> start logical replication either. > > Ahhh. So START_REPLICATION can also now generate WAL. Good to know. And just looking at pg_current_wal_location(), running BASE_BACKUP also advances the WAL. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Tue, May 2, 2017 at 4:11 PM, Petr Jelinek wrote: > On 02/05/17 05:35, Michael Paquier wrote: >> On Tue, May 2, 2017 at 7:07 AM, Peter Eisentraut >> wrote: >>> On 4/25/17 21:47, Michael Paquier wrote: Attached is an updated patch to reflect that. >>> >>> I edited this a bit, here is a new version. >> >> Thanks, looks fine for me. >> >>> A variant approach would be to prohibit *all* new commands after >>> entering the "stopping" state, just let running commands run. That way >>> we don't have to pick which individual commands are at risk. I'm not >>> sure that we have covered everything here. >> >> It seems to me that everything is covered. We are taking about >> creation and dropping of slots here, where standby snapshots can be >> created and SQL queries can be run when doing a tablesync meaning that >> FPWs could be taken in the context of the WAL sender. Blocking all >> commands would be surely safer I agree, but I see no reason to block >> things more than necessary. >> > > I don't think the code covers all because a) the SQL queries are not > covered at all that I can see and b) logical decoding can theoretically > do HOT pruning (even if the chance is really small) so it's not safe to > start logical replication either. Ahhh. So START_REPLICATION can also now generate WAL. Good to know. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 02/05/17 05:35, Michael Paquier wrote: > On Tue, May 2, 2017 at 7:07 AM, Peter Eisentraut > wrote: >> On 4/25/17 21:47, Michael Paquier wrote: >>> Attached is an updated patch to reflect that. >> >> I edited this a bit, here is a new version. > > Thanks, looks fine for me. > >> A variant approach would be to prohibit *all* new commands after >> entering the "stopping" state, just let running commands run. That way >> we don't have to pick which individual commands are at risk. I'm not >> sure that we have covered everything here. > > It seems to me that everything is covered. We are taking about > creation and dropping of slots here, where standby snapshots can be > created and SQL queries can be run when doing a tablesync meaning that > FPWs could be taken in the context of the WAL sender. Blocking all > commands would be surely safer I agree, but I see no reason to block > things more than necessary. > I don't think the code covers all because a) the SQL queries are not covered at all that I can see and b) logical decoding can theoretically do HOT pruning (even if the chance is really small) so it's not safe to start logical replication either. I wonder if this whole prevent thing should just be called unconditionally on walsender that's connected to database (am_db_walsender), because in the WAL logging will only happen there - CREATE_REPLICATION_SLOT PHYSICAL will not WAL log and CREATE_REPLICATION_SLOT LOGICAL can't be run without being connected to db, neither can logical decoding and SQL queries, so that coves all. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Tue, May 2, 2017 at 7:07 AM, Peter Eisentraut wrote: > On 4/25/17 21:47, Michael Paquier wrote: >> Attached is an updated patch to reflect that. > > I edited this a bit, here is a new version. Thanks, looks fine for me. > A variant approach would be to prohibit *all* new commands after > entering the "stopping" state, just let running commands run. That way > we don't have to pick which individual commands are at risk. I'm not > sure that we have covered everything here. It seems to me that everything is covered. We are taking about creation and dropping of slots here, where standby snapshots can be created and SQL queries can be run when doing a tablesync meaning that FPWs could be taken in the context of the WAL sender. Blocking all commands would be surely safer I agree, but I see no reason to block things more than necessary. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 4/25/17 21:47, Michael Paquier wrote: > Attached is an updated patch to reflect that. I edited this a bit, here is a new version. A variant approach would be to prohibit *all* new commands after entering the "stopping" state, just let running commands run. That way we don't have to pick which individual commands are at risk. I'm not sure that we have covered everything here. More reviews please. Also, this is a possible backpatching candidate. Also, if someone has a test script for reproducing the original issue, please share it, or run it against this patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services v3-0001-Prevent-panic-during-shutdown-checkpoint.patch Description: invalid/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Wed, Apr 26, 2017 at 3:17 AM, Peter Eisentraut wrote: > On 4/21/17 00:11, Michael Paquier wrote: >> Hmm. I have been actually looking at this solution and I am having >> doubts regarding its robustness. In short this would need to be >> roughly a two-step process: >> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to >> make it call ShutdownXLOG(). Prior doing that, a first signal should >> be sent to all the WAL senders with >> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be >> used. >> - At reception of this signal, all WAL senders switch to a stopping >> state, refusing commands that can generate WAL. >> - Checkpointer looks at the state of all WAL senders, looping with a >> sleep call of a couple of ms, refusing to launch the shutdown >> checkpoint as long as all WAL senders have not switched to the >> stopping state. >> - In reaper(), once checkpointer is confirmed as stopped, signal again >> the WAL senders, and tell them to perform the last loop. > > Yeah that looks like a reasonable approach. > > I'm not sure why in your patch you process got_SIGUSR2 in > WalSndErrorCleanup() instead of in the main loop. Yes I was hesitating about this one when hacking it. Thinking an extra time, the similar check in StartReplication() should also not use got_SIGUSR2 to give the WAL sender a chance to do more work while the shutdown checkpoint is running as it could take minutes. Attached is an updated patch to reflect that. -- Michael walsender-chkpt-v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 4/21/17 00:11, Michael Paquier wrote: > Hmm. I have been actually looking at this solution and I am having > doubts regarding its robustness. In short this would need to be > roughly a two-step process: > - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to > make it call ShutdownXLOG(). Prior doing that, a first signal should > be sent to all the WAL senders with > SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be > used. > - At reception of this signal, all WAL senders switch to a stopping > state, refusing commands that can generate WAL. > - Checkpointer looks at the state of all WAL senders, looping with a > sleep call of a couple of ms, refusing to launch the shutdown > checkpoint as long as all WAL senders have not switched to the > stopping state. > - In reaper(), once checkpointer is confirmed as stopped, signal again > the WAL senders, and tell them to perform the last loop. Yeah that looks like a reasonable approach. I'm not sure why in your patch you process got_SIGUSR2 in WalSndErrorCleanup() instead of in the main loop. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Sun, Apr 23, 2017 at 10:15 AM, Petr Jelinek wrote: > On 21/04/17 06:11, Michael Paquier wrote: >> On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut >> wrote: >> Hmm. I have been actually looking at this solution and I am having >> doubts regarding its robustness. In short this would need to be >> roughly a two-step process: >> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to >> make it call ShutdownXLOG(). Prior doing that, a first signal should >> be sent to all the WAL senders with >> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be >> used. >> - At reception of this signal, all WAL senders switch to a stopping >> state, refusing commands that can generate WAL. >> - Checkpointer looks at the state of all WAL senders, looping with a >> sleep call of a couple of ms, refusing to launch the shutdown >> checkpoint as long as all WAL senders have not switched to the >> stopping state. >> - In reaper(), once checkpointer is confirmed as stopped, signal again >> the WAL senders, and tell them to perform the last loop. OK, I have been hacking that, finishing with the attached. In the attached I am using SIGUSR2 to instruct the WAL senders to prepare for stopping, and SIGINT to handle the last WAL flush loop. The shutdown checkpoint moves on only if all active WAL senders are marked with a STOPPING state. Reviews as welcome. >> After that, I got a second, more simple idea. >> CheckpointerShmem->ckpt_flags holds the information about checkpoints >> currently running, so we could have the WAL senders look at this data >> and prevent any commands generating WAL. The checkpointer may be >> already stopped at the moment the WAL senders finish their loop, so we >> need also to check if the checkpointer is running or not on those code >> paths. Such safeguards may actually be enough for the problem of this >> thread. Thoughts? >> > > Hmm but how do we handle statements that are already in progress by the > time ckpt_flags changes? Yup, this does not handle well race conditions. -- Michael walsender-chkpt-v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 21/04/17 06:11, Michael Paquier wrote: > On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut > wrote: >> On 4/20/17 07:52, Petr Jelinek wrote: >>> On 20/04/17 05:57, Michael Paquier wrote: 2nd thoughts here... Ah now I see your point. True that there is no way to ensure that an unwanted command is not running when SIGUSR2 is received as the shutdown checkpoint may have already begun. Here is an idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and the shutdown checkpoint does not run as long as all WAL senders still running do not reach such a state. >>> >>> +1 to this solution >> >> Michael, can you attempt to supply a patch? > > Hmm. I have been actually looking at this solution and I am having > doubts regarding its robustness. In short this would need to be > roughly a two-step process: > - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to > make it call ShutdownXLOG(). Prior doing that, a first signal should > be sent to all the WAL senders with > SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be > used. > - At reception of this signal, all WAL senders switch to a stopping > state, refusing commands that can generate WAL. > - Checkpointer looks at the state of all WAL senders, looping with a > sleep call of a couple of ms, refusing to launch the shutdown > checkpoint as long as all WAL senders have not switched to the > stopping state. > - In reaper(), once checkpointer is confirmed as stopped, signal again > the WAL senders, and tell them to perform the last loop. > > After that, I got a second, more simple idea. > CheckpointerShmem->ckpt_flags holds the information about checkpoints > currently running, so we could have the WAL senders look at this data > and prevent any commands generating WAL. The checkpointer may be > already stopped at the moment the WAL senders finish their loop, so we > need also to check if the checkpointer is running or not on those code > paths. Such safeguards may actually be enough for the problem of this > thread. Thoughts? > Hmm but how do we handle statements that are already in progress by the time ckpt_flags changes? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut wrote: > On 4/20/17 07:52, Petr Jelinek wrote: >> On 20/04/17 05:57, Michael Paquier wrote: >>> 2nd thoughts here... Ah now I see your point. True that there is no >>> way to ensure that an unwanted command is not running when SIGUSR2 is >>> received as the shutdown checkpoint may have already begun. Here is an >>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and >>> the shutdown checkpoint does not run as long as all WAL senders still >>> running do not reach such a state. >> >> +1 to this solution > > Michael, can you attempt to supply a patch? Hmm. I have been actually looking at this solution and I am having doubts regarding its robustness. In short this would need to be roughly a two-step process: - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to make it call ShutdownXLOG(). Prior doing that, a first signal should be sent to all the WAL senders with SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be used. - At reception of this signal, all WAL senders switch to a stopping state, refusing commands that can generate WAL. - Checkpointer looks at the state of all WAL senders, looping with a sleep call of a couple of ms, refusing to launch the shutdown checkpoint as long as all WAL senders have not switched to the stopping state. - In reaper(), once checkpointer is confirmed as stopped, signal again the WAL senders, and tell them to perform the last loop. After that, I got a second, more simple idea. CheckpointerShmem->ckpt_flags holds the information about checkpoints currently running, so we could have the WAL senders look at this data and prevent any commands generating WAL. The checkpointer may be already stopped at the moment the WAL senders finish their loop, so we need also to check if the checkpointer is running or not on those code paths. Such safeguards may actually be enough for the problem of this thread. Thoughts? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 4/20/17 07:52, Petr Jelinek wrote: > On 20/04/17 05:57, Michael Paquier wrote: >> 2nd thoughts here... Ah now I see your point. True that there is no >> way to ensure that an unwanted command is not running when SIGUSR2 is >> received as the shutdown checkpoint may have already begun. Here is an >> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and >> the shutdown checkpoint does not run as long as all WAL senders still >> running do not reach such a state. > > +1 to this solution Michael, can you attempt to supply a patch? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 20/04/17 05:57, Michael Paquier wrote: > On Thu, Apr 20, 2017 at 12:40 PM, Michael Paquier > wrote: >> On Thu, Apr 20, 2017 at 4:57 AM, Peter Eisentraut >> wrote: >>> I think the problem with a signal-based solution is that there is no >>> feedback. Ideally, you would wait for all walsenders to acknowledge the >>> receipt of SIGUSR2 (or similar) and only then proceed with the shutdown >>> checkpoint. >> >> Are you sure that it is necessary to go to such extent? Why wouldn't >> it be enough to prevent any replication commands generating WAL to run >> when the WAL sender knows that the postmaster is in shutdown mode? > > 2nd thoughts here... Ah now I see your point. True that there is no > way to ensure that an unwanted command is not running when SIGUSR2 is > received as the shutdown checkpoint may have already begun. Here is an > idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and > the shutdown checkpoint does not run as long as all WAL senders still > running do not reach such a state. > +1 to this solution -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Thu, Apr 20, 2017 at 12:40 PM, Michael Paquier wrote: > On Thu, Apr 20, 2017 at 4:57 AM, Peter Eisentraut > wrote: >> I think the problem with a signal-based solution is that there is no >> feedback. Ideally, you would wait for all walsenders to acknowledge the >> receipt of SIGUSR2 (or similar) and only then proceed with the shutdown >> checkpoint. > > Are you sure that it is necessary to go to such extent? Why wouldn't > it be enough to prevent any replication commands generating WAL to run > when the WAL sender knows that the postmaster is in shutdown mode? 2nd thoughts here... Ah now I see your point. True that there is no way to ensure that an unwanted command is not running when SIGUSR2 is received as the shutdown checkpoint may have already begun. Here is an idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and the shutdown checkpoint does not run as long as all WAL senders still running do not reach such a state. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Thu, Apr 20, 2017 at 4:57 AM, Peter Eisentraut wrote: > On 4/19/17 01:45, Michael Paquier wrote: >> On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut >> wrote: >>> I'd imagine the postmaster would tell the walsender that it has started >>> shutdown, and then the walsender would reject $certain_things. But I >>> don't see an existing way for the walsender to know that shutdown has >>> been initiated. SIGINT is still free ... >> >> The WAL sender receives SIGUSR2 from the postmaster when shutdown is >> initiated, so why not just rely on that and issue an ERROR when a >> client attempts to create or drop a new slot, setting up >> walsender_ready_to_stop unconditionally? It seems to me that the issue >> here is the delay between the moment SIGTERM is acknowledged by the >> WAL sender and the moment CREATE_SLOT is treated. An idea with the >> attached... > > I think the problem with a signal-based solution is that there is no > feedback. Ideally, you would wait for all walsenders to acknowledge the > receipt of SIGUSR2 (or similar) and only then proceed with the shutdown > checkpoint. Are you sure that it is necessary to go to such extent? Why wouldn't it be enough to prevent any replication commands generating WAL to run when the WAL sender knows that the postmaster is in shutdown mode? -- Michael VMware vCenter Server www.vmware.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 4/19/17 01:45, Michael Paquier wrote: > On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut > wrote: >> I'd imagine the postmaster would tell the walsender that it has started >> shutdown, and then the walsender would reject $certain_things. But I >> don't see an existing way for the walsender to know that shutdown has >> been initiated. SIGINT is still free ... > > The WAL sender receives SIGUSR2 from the postmaster when shutdown is > initiated, so why not just rely on that and issue an ERROR when a > client attempts to create or drop a new slot, setting up > walsender_ready_to_stop unconditionally? It seems to me that the issue > here is the delay between the moment SIGTERM is acknowledged by the > WAL sender and the moment CREATE_SLOT is treater. An idea with the > attached... I think the problem with a signal-based solution is that there is no feedback. Ideally, you would wait for all walsenders to acknowledge the receipt of SIGUSR2 (or similar) and only then proceed with the shutdown checkpoint. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Tue, Apr 18, 2017 at 3:27 AM, Peter Eisentraut wrote: > I'd imagine the postmaster would tell the walsender that it has started > shutdown, and then the walsender would reject $certain_things. But I > don't see an existing way for the walsender to know that shutdown has > been initiated. SIGINT is still free ... The WAL sender receives SIGUSR2 from the postmaster when shutdown is initiated, so why not just rely on that and issue an ERROR when a client attempts to create or drop a new slot, setting up walsender_ready_to_stop unconditionally? It seems to me that the issue here is the delay between the moment SIGTERM is acknowledged by the WAL sender and the moment CREATE_SLOT is treater. An idea with the attached... > The alternative of shutting down logical walsenders earlier also doesn't > look straightforward, since the postmaster doesn't know directly what > kind of walsender a certain process is. So you'd also need additional > signal types or something there. Yup, but is a switchover between a publisher and a subscriber something that can happen? -- Michael walsender-shutdown-fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 4/17/17 12:30, Andres Freund wrote: So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot() and which generates WAL record about snapshot of running transactions. >>> >>> Erroring out in these cases sounds easy enough. Wonder if there's not a >>> bigger problem with WAL records generated e.g. by HOT pruning or such, >>> during decoding. Not super likely, but would probably hit exactly the >>> same, no? >> >> Sounds possible, yes. Sounds like that's going to be nontrivial to fix >> though. >> >> Another problem is that queries can run on walsender now. But that >> should be possible to detect and shutdown just like backend. > > This sounds like a case for s/PANIC/ERROR|FATAL/ to me... I'd imagine the postmaster would tell the walsender that it has started shutdown, and then the walsender would reject $certain_things. But I don't see an existing way for the walsender to know that shutdown has been initiated. SIGINT is still free ... The alternative of shutting down logical walsenders earlier also doesn't look straightforward, since the postmaster doesn't know directly what kind of walsender a certain process is. So you'd also need additional signal types or something there. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-04-17 18:28:16 +0200, Petr Jelinek wrote: > On 17/04/17 18:02, Andres Freund wrote: > > On 2017-04-15 02:33:59 +0900, Fujii Masao wrote: > >> On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek > >> wrote: > >>> On 12/04/17 15:55, Fujii Masao wrote: > Hi, > > When I shut down the publisher while I repeated creating and dropping > the subscription in the subscriber, the publisher emitted the following > PANIC error during shutdown checkpoint. > > PANIC: concurrent transaction log activity while database system is > shutting down > > The cause of this problem is that walsender for logical replication can > generate WAL records even during shutdown checkpoint. > > Firstly walsender keeps running until shutdown checkpoint finishes > so that all the WAL including shutdown checkpoint record can be > replicated to the standby. This was safe because previously walsender > could not generate WAL records. However this assumption became > invalid because of logical replication. That is, currenty walsender for > logical replication can generate WAL records, for example, by executing > CREATE_REPLICATION_SLOT command. This is an oversight in > logical replication patch, I think. > >>> > >>> Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree > >>> that the issue with walsender still exist (since we now allow normal SQL > >>> to run there) but I think it's important to identify what exactly causes > >>> the WAL activity in your case > >> > >> At least in my case, the following CREATE_REPLICATION_SLOT command > >> generated WAL record. > >> > >> BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ; > >> CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput > >> USE_SNAPSHOT; > >> > >> Here is the pg_waldump output of the WAL record that > >> CREATE_REPLICATION_SLOT > >> generated. > >> > >> rmgr: Standby len (rec/tot): 24/50, tx: 0, > >> lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692 > >> latestCompletedXid 691 oldestRunningXid 692 > >> > >> So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot() > >> and which generates WAL record about snapshot of running transactions. > > > > Erroring out in these cases sounds easy enough. Wonder if there's not a > > bigger problem with WAL records generated e.g. by HOT pruning or such, > > during decoding. Not super likely, but would probably hit exactly the > > same, no? > > > > Sounds possible, yes. Sounds like that's going to be nontrivial to fix > though. > > Another problem is that queries can run on walsender now. But that > should be possible to detect and shutdown just like backend. This sounds like a case for s/PANIC/ERROR|FATAL/ to me... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 17/04/17 18:02, Andres Freund wrote: > On 2017-04-15 02:33:59 +0900, Fujii Masao wrote: >> On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek >> wrote: >>> On 12/04/17 15:55, Fujii Masao wrote: Hi, When I shut down the publisher while I repeated creating and dropping the subscription in the subscriber, the publisher emitted the following PANIC error during shutdown checkpoint. PANIC: concurrent transaction log activity while database system is shutting down The cause of this problem is that walsender for logical replication can generate WAL records even during shutdown checkpoint. Firstly walsender keeps running until shutdown checkpoint finishes so that all the WAL including shutdown checkpoint record can be replicated to the standby. This was safe because previously walsender could not generate WAL records. However this assumption became invalid because of logical replication. That is, currenty walsender for logical replication can generate WAL records, for example, by executing CREATE_REPLICATION_SLOT command. This is an oversight in logical replication patch, I think. >>> >>> Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree >>> that the issue with walsender still exist (since we now allow normal SQL >>> to run there) but I think it's important to identify what exactly causes >>> the WAL activity in your case >> >> At least in my case, the following CREATE_REPLICATION_SLOT command >> generated WAL record. >> >> BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ; >> CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT; >> >> Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT >> generated. >> >> rmgr: Standby len (rec/tot): 24/50, tx: 0, >> lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692 >> latestCompletedXid 691 oldestRunningXid 692 >> >> So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot() >> and which generates WAL record about snapshot of running transactions. > > Erroring out in these cases sounds easy enough. Wonder if there's not a > bigger problem with WAL records generated e.g. by HOT pruning or such, > during decoding. Not super likely, but would probably hit exactly the > same, no? > Sounds possible, yes. Sounds like that's going to be nontrivial to fix though. Another problem is that queries can run on walsender now. But that should be possible to detect and shutdown just like backend. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 2017-04-15 02:33:59 +0900, Fujii Masao wrote: > On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek > wrote: > > On 12/04/17 15:55, Fujii Masao wrote: > >> Hi, > >> > >> When I shut down the publisher while I repeated creating and dropping > >> the subscription in the subscriber, the publisher emitted the following > >> PANIC error during shutdown checkpoint. > >> > >> PANIC: concurrent transaction log activity while database system is > >> shutting down > >> > >> The cause of this problem is that walsender for logical replication can > >> generate WAL records even during shutdown checkpoint. > >> > >> Firstly walsender keeps running until shutdown checkpoint finishes > >> so that all the WAL including shutdown checkpoint record can be > >> replicated to the standby. This was safe because previously walsender > >> could not generate WAL records. However this assumption became > >> invalid because of logical replication. That is, currenty walsender for > >> logical replication can generate WAL records, for example, by executing > >> CREATE_REPLICATION_SLOT command. This is an oversight in > >> logical replication patch, I think. > > > > Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree > > that the issue with walsender still exist (since we now allow normal SQL > > to run there) but I think it's important to identify what exactly causes > > the WAL activity in your case > > At least in my case, the following CREATE_REPLICATION_SLOT command > generated WAL record. > > BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ; > CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT; > > Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT > generated. > > rmgr: Standby len (rec/tot): 24/50, tx: 0, > lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692 > latestCompletedXid 691 oldestRunningXid 692 > > So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot() > and which generates WAL record about snapshot of running transactions. Erroring out in these cases sounds easy enough. Wonder if there's not a bigger problem with WAL records generated e.g. by HOT pruning or such, during decoding. Not super likely, but would probably hit exactly the same, no? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 14/04/17 21:05, Peter Eisentraut wrote: > On 4/14/17 14:23, Petr Jelinek wrote: >> Ah yes looking at the code, it does exactly that (on master only). Means >> that backport will be necessary. > > I think these two sentences are contradicting each other. > Hehe, didn't realize master will be taken as master branch, I meant master as in not standby :) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 4/14/17 14:23, Petr Jelinek wrote: > Ah yes looking at the code, it does exactly that (on master only). Means > that backport will be necessary. I think these two sentences are contradicting each other. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 14/04/17 19:33, Fujii Masao wrote: > On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek > wrote: >> On 12/04/17 15:55, Fujii Masao wrote: >>> Hi, >>> >>> When I shut down the publisher while I repeated creating and dropping >>> the subscription in the subscriber, the publisher emitted the following >>> PANIC error during shutdown checkpoint. >>> >>> PANIC: concurrent transaction log activity while database system is >>> shutting down >>> >>> The cause of this problem is that walsender for logical replication can >>> generate WAL records even during shutdown checkpoint. >>> >>> Firstly walsender keeps running until shutdown checkpoint finishes >>> so that all the WAL including shutdown checkpoint record can be >>> replicated to the standby. This was safe because previously walsender >>> could not generate WAL records. However this assumption became >>> invalid because of logical replication. That is, currenty walsender for >>> logical replication can generate WAL records, for example, by executing >>> CREATE_REPLICATION_SLOT command. This is an oversight in >>> logical replication patch, I think. >> >> Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree >> that the issue with walsender still exist (since we now allow normal SQL >> to run there) but I think it's important to identify what exactly causes >> the WAL activity in your case > > At least in my case, the following CREATE_REPLICATION_SLOT command > generated WAL record. > > BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ; > CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT; > > Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT > generated. > > rmgr: Standby len (rec/tot): 24/50, tx: 0, > lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692 > latestCompletedXid 691 oldestRunningXid 692 > > So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot() > and which generates WAL record about snapshot of running transactions. > Ah yes looking at the code, it does exactly that (on master only). Means that backport will be necessary. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, Apr 14, 2017 at 10:33 PM, Petr Jelinek wrote: > On 12/04/17 15:55, Fujii Masao wrote: >> Hi, >> >> When I shut down the publisher while I repeated creating and dropping >> the subscription in the subscriber, the publisher emitted the following >> PANIC error during shutdown checkpoint. >> >> PANIC: concurrent transaction log activity while database system is >> shutting down >> >> The cause of this problem is that walsender for logical replication can >> generate WAL records even during shutdown checkpoint. >> >> Firstly walsender keeps running until shutdown checkpoint finishes >> so that all the WAL including shutdown checkpoint record can be >> replicated to the standby. This was safe because previously walsender >> could not generate WAL records. However this assumption became >> invalid because of logical replication. That is, currenty walsender for >> logical replication can generate WAL records, for example, by executing >> CREATE_REPLICATION_SLOT command. This is an oversight in >> logical replication patch, I think. > > Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree > that the issue with walsender still exist (since we now allow normal SQL > to run there) but I think it's important to identify what exactly causes > the WAL activity in your case At least in my case, the following CREATE_REPLICATION_SLOT command generated WAL record. BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ; CREATE_REPLICATION_SLOT testslot TEMPORARY LOGICAL pgoutput USE_SNAPSHOT; Here is the pg_waldump output of the WAL record that CREATE_REPLICATION_SLOT generated. rmgr: Standby len (rec/tot): 24/50, tx: 0, lsn: 0/01601438, prev 0/01601400, desc: RUNNING_XACTS nextXid 692 latestCompletedXid 691 oldestRunningXid 692 So I guess that CREATE_REPLICATION_SLOT code calls LogStandbySnapshot() and which generates WAL record about snapshot of running transactions. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 12/04/17 15:55, Fujii Masao wrote: > Hi, > > When I shut down the publisher while I repeated creating and dropping > the subscription in the subscriber, the publisher emitted the following > PANIC error during shutdown checkpoint. > > PANIC: concurrent transaction log activity while database system is > shutting down > > The cause of this problem is that walsender for logical replication can > generate WAL records even during shutdown checkpoint. > > Firstly walsender keeps running until shutdown checkpoint finishes > so that all the WAL including shutdown checkpoint record can be > replicated to the standby. This was safe because previously walsender > could not generate WAL records. However this assumption became > invalid because of logical replication. That is, currenty walsender for > logical replication can generate WAL records, for example, by executing > CREATE_REPLICATION_SLOT command. This is an oversight in > logical replication patch, I think. Hmm, but CREATE_REPLICATION_SLOT should not generate WAL afaik. I agree that the issue with walsender still exist (since we now allow normal SQL to run there) but I think it's important to identify what exactly causes the WAL activity in your case - if it's one of the replication commands and not SQL then we'll need to backpatch any solution we come up with to 9.4, if it's not replication command, we only need to fix 10. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Fri, Apr 14, 2017 at 3:03 AM, Fujii Masao wrote: > On Thu, Apr 13, 2017 at 12:36 PM, Michael Paquier > wrote: >> On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao wrote: >>> On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut >>> wrote: On 4/12/17 09:55, Fujii Masao wrote: > To fix this issue, we should terminate walsender for logical replication > before shutdown checkpoint starts. Of course walsender for physical > replication still needs to keep running until shutdown checkpoint ends, > though. Can we turn it into a kind of read-only or no-new-commands mode instead, so it can keep streaming but not accept any new actions? >>> >>> So we make walsenders switch to that mode and wait for all the >>> already-ongoing >>> their "write" commands to finish, and then we start a shutdown checkpoint? >>> This is an idea, but seems a bit complicated. ISTM that it's simpler to >>> terminate only walsenders for logical rep before shutdown checkpoint. >> >> Perhaps my memory is failing me here... But in clean shutdowns we do >> shut down WAL senders after the checkpoint has completed so as we are >> sure that they have flushed the LSN corresponding to the checkpoint >> ending, right? > > Yes. > >> Why introducing an inconsistency for logical workers? >> It seems to me that logical workers should fail under the same rules. > > Could you tell me why? You think that even walsender for logical rep > needs to stream the shutdown checkpoint WAL record to the subscriber? > I was not thinking that's true. For physical replication, the property to wait that standbys have flushed the LSN of the shutdown checkpoint can be important for switchovers. For example, with a primary and a standby, it is possible to stop cleanly the master, promote the standby, and then connect back to the cluster the old primary as a standby to the now-new primary with the guarantee that both are in a consistent state. It seems to me that having similar guarantees for logical replication is important. Now, I have not reviewed the code of logirep in details at the level of Peter, Petr or yourself... -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Thu, Apr 13, 2017 at 12:36 PM, Michael Paquier wrote: > On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao wrote: >> On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut >> wrote: >>> On 4/12/17 09:55, Fujii Masao wrote: To fix this issue, we should terminate walsender for logical replication before shutdown checkpoint starts. Of course walsender for physical replication still needs to keep running until shutdown checkpoint ends, though. >>> >>> Can we turn it into a kind of read-only or no-new-commands mode instead, >>> so it can keep streaming but not accept any new actions? >> >> So we make walsenders switch to that mode and wait for all the >> already-ongoing >> their "write" commands to finish, and then we start a shutdown checkpoint? >> This is an idea, but seems a bit complicated. ISTM that it's simpler to >> terminate only walsenders for logical rep before shutdown checkpoint. > > Perhaps my memory is failing me here... But in clean shutdowns we do > shut down WAL senders after the checkpoint has completed so as we are > sure that they have flushed the LSN corresponding to the checkpoint > ending, right? Yes. > Why introducing an inconsistency for logical workers? > It seems to me that logical workers should fail under the same rules. Could you tell me why? You think that even walsender for logical rep needs to stream the shutdown checkpoint WAL record to the subscriber? I was not thinking that's true. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Thu, Apr 13, 2017 at 12:28 PM, Fujii Masao wrote: > On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut > wrote: >> On 4/12/17 09:55, Fujii Masao wrote: >>> To fix this issue, we should terminate walsender for logical replication >>> before shutdown checkpoint starts. Of course walsender for physical >>> replication still needs to keep running until shutdown checkpoint ends, >>> though. >> >> Can we turn it into a kind of read-only or no-new-commands mode instead, >> so it can keep streaming but not accept any new actions? > > So we make walsenders switch to that mode and wait for all the already-ongoing > their "write" commands to finish, and then we start a shutdown checkpoint? > This is an idea, but seems a bit complicated. ISTM that it's simpler to > terminate only walsenders for logical rep before shutdown checkpoint. Perhaps my memory is failing me here... But in clean shutdowns we do shut down WAL senders after the checkpoint has completed so as we are sure that they have flushed the LSN corresponding to the checkpoint ending, right? Why introducing an inconsistency for logical workers? It seems to me that logical workers should fail under the same rules. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On Thu, Apr 13, 2017 at 5:25 AM, Peter Eisentraut wrote: > On 4/12/17 09:55, Fujii Masao wrote: >> To fix this issue, we should terminate walsender for logical replication >> before shutdown checkpoint starts. Of course walsender for physical >> replication still needs to keep running until shutdown checkpoint ends, >> though. > > Can we turn it into a kind of read-only or no-new-commands mode instead, > so it can keep streaming but not accept any new actions? So we make walsenders switch to that mode and wait for all the already-ongoing their "write" commands to finish, and then we start a shutdown checkpoint? This is an idea, but seems a bit complicated. ISTM that it's simpler to terminate only walsenders for logical rep before shutdown checkpoint. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher
On 4/12/17 09:55, Fujii Masao wrote: > To fix this issue, we should terminate walsender for logical replication > before shutdown checkpoint starts. Of course walsender for physical > replication still needs to keep running until shutdown checkpoint ends, > though. Can we turn it into a kind of read-only or no-new-commands mode instead, so it can keep streaming but not accept any new actions? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers