Re: A WalSnd issue related to state WALSNDSTATE_STOPPING

2018-11-28 Thread Michael Paquier
On Tue, Nov 27, 2018 at 11:40:25AM +0900, Michael Paquier wrote:
> Thanks Paul for double-checking.  I'll come back to this patch in two
> days, look at it again and commit it if there are no objections from
> others.

Committed and back-patched down to 9.4.  Thanks Paul for the report!
--
Michael


signature.asc
Description: PGP signature


Re: A WalSnd issue related to state WALSNDSTATE_STOPPING

2018-11-26 Thread Michael Paquier
On Tue, Nov 27, 2018 at 10:07:04AM +0800, Paul Guo wrote:
> Agree. Your code change is safer.

Thanks Paul for double-checking.  I'll come back to this patch in two
days, look at it again and commit it if there are no objections from
others.
--
Michael


signature.asc
Description: PGP signature


Re: A WalSnd issue related to state WALSNDSTATE_STOPPING

2018-11-26 Thread Paul Guo
>
> On Thu, Nov 22, 2018 at 06:31:04PM +0800, Paul Guo wrote:
> > Yes, please see the attached patch.
>
> Thanks, I have reviewed your patch, and could not resist to change
> SyncRepReleaseWaiters() on top of the rest with a comment to not be
> confused about the WAL sender states allowed to release waiters.
>
> The previous experience is proving that it seems unwise to rely on the
> order of the elements in WalSndState, so I have tweaked things so as the
> state is listed for all the code paths involved.
>
> Paul, what do you think?
> --
> Michael
>

Agree. Your code change is safer.


Re: A WalSnd issue related to state WALSNDSTATE_STOPPING

2018-11-26 Thread Michael Paquier
On Thu, Nov 22, 2018 at 06:31:04PM +0800, Paul Guo wrote:
> Yes, please see the attached patch.

Thanks, I have reviewed your patch, and could not resist to change
SyncRepReleaseWaiters() on top of the rest with a comment to not be
confused about the WAL sender states allowed to release waiters.

The previous experience is proving that it seems unwise to rely on the
order of the elements in WalSndState, so I have tweaked things so as the
state is listed for all the code paths involved.

Paul, what do you think?
--
Michael
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 9a13c50ce8..5b8a268fa1 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -425,10 +425,12 @@ SyncRepReleaseWaiters(void)
 	 * If this WALSender is serving a standby that is not on the list of
 	 * potential sync standbys then we have nothing to do. If we are still
 	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.
+	 * still invalid, then leave quickly also.  Streaming or stopping WAL
+	 * senders are allowed to release waiters.
 	 */
 	if (MyWalSnd->sync_standby_priority == 0 ||
-		MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		(MyWalSnd->state != WALSNDSTATE_STREAMING &&
+		 MyWalSnd->state != WALSNDSTATE_STOPPING) ||
 		XLogRecPtrIsInvalid(MyWalSnd->flush))
 	{
 		announce_next_takeover = true;
@@ -730,8 +732,9 @@ SyncRepGetSyncStandbysQuorum(bool *am_sync)
 		if (pid == 0)
 			continue;
 
-		/* Must be streaming */
-		if (state != WALSNDSTATE_STREAMING)
+		/* Must be streaming or stopping */
+		if (state != WALSNDSTATE_STREAMING &&
+			state != WALSNDSTATE_STOPPING)
 			continue;
 
 		/* Must be synchronous */
@@ -809,8 +812,9 @@ SyncRepGetSyncStandbysPriority(bool *am_sync)
 		if (pid == 0)
 			continue;
 
-		/* Must be streaming */
-		if (state != WALSNDSTATE_STREAMING)
+		/* Must be streaming or stopping */
+		if (state != WALSNDSTATE_STREAMING &&
+			state != WALSNDSTATE_STOPPING)
 			continue;
 
 		/* Must be synchronous */


signature.asc
Description: PGP signature


Re: A WalSnd issue related to state WALSNDSTATE_STOPPING

2018-11-22 Thread Paul Guo
On Thu, Nov 22, 2018 at 1:29 PM Michael Paquier  wrote:

> On Wed, Nov 21, 2018 at 04:09:41PM +0900, Michael Paquier wrote:
> > The checkpointer initializes a shutdown checkpoint where it tells to all
> > the WAL senders to stop once all the children processes are gone, so it
> > seems to me that there is little point in processing
> > SyncRepReleaseWaiters() when a WAL sender is in WALSNDSTATE_STOPPING
> > state as at this stage syncrep makes little sense.  It is still
> > necessary to process standby messages at this stage so as the primary
> > can shut down when it is sure that all the standbys have flushed the
> > shutdown checkpoint record of the primary.
>
> Just refreshed my memory with c6c33343, which is actually at the origin
> of the issue, and my previous argument is flawed.  If a WAL sender has
> reached WALSNDSTATE_STOPPING no regular backends are around but a WAL
> sender could always commit a transaction in parallel which may need to
> make sure that its record is flushed and sync'ed, and this needs to make
> sure that waiters are correctly released.  So it is necessary to patch
> up SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum as
> mentioned upthread, perhaps adding a comment when looking at
> MyWalSnd->state looks adapted. Paul, would you like to write a patch?
> --
> Michael
>

Yes, please see the attached patch.


0001-Allow-stopping-wal-senders-to-be-invovled-in-SyncRep.patch
Description: Binary data


Re: A WalSnd issue related to state WALSNDSTATE_STOPPING

2018-11-21 Thread Michael Paquier
On Wed, Nov 21, 2018 at 04:09:41PM +0900, Michael Paquier wrote:
> The checkpointer initializes a shutdown checkpoint where it tells to all
> the WAL senders to stop once all the children processes are gone, so it
> seems to me that there is little point in processing
> SyncRepReleaseWaiters() when a WAL sender is in WALSNDSTATE_STOPPING 
> state as at this stage syncrep makes little sense.  It is still
> necessary to process standby messages at this stage so as the primary
> can shut down when it is sure that all the standbys have flushed the
> shutdown checkpoint record of the primary.

Just refreshed my memory with c6c33343, which is actually at the origin
of the issue, and my previous argument is flawed.  If a WAL sender has
reached WALSNDSTATE_STOPPING no regular backends are around but a WAL
sender could always commit a transaction in parallel which may need to
make sure that its record is flushed and sync'ed, and this needs to make
sure that waiters are correctly released.  So it is necessary to patch
up SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum as
mentioned upthread, perhaps adding a comment when looking at
MyWalSnd->state looks adapted. Paul, would you like to write a patch?
--
Michael


signature.asc
Description: PGP signature


Re: A WalSnd issue related to state WALSNDSTATE_STOPPING

2018-11-20 Thread Michael Paquier
On Wed, Nov 21, 2018 at 12:54:30PM +0800, Paul Guo wrote:
> The panic reason is that since there is just one wal sender and
> WalSndCtl->walsnds[0].state is WALSNDSTATE_STOPPING so syncWalSnd will be
> NULL and that causes assert failure. Latest postgresql code removes the
> Assert code but the related similar code logic was kept. It looks like that
> we need to modify the logic similar like below (PG 9.4 STABLE) to allow
> WALSNDSTATE_STOPPING which is reasonable here If I understand correctly.

The checkpointer initializes a shutdown checkpoint where it tells to all
the WAL senders to stop once all the children processes are gone, so it
seems to me that there is little point in processing
SyncRepReleaseWaiters() when a WAL sender is in WALSNDSTATE_STOPPING 
state as at this stage syncrep makes little sense.  It is still
necessary to process standby messages at this stage so as the primary
can shut down when it is sure that all the standbys have flushed the
shutdown checkpoint record of the primary.
--
Michael


signature.asc
Description: PGP signature