Hi,

With synchronous replication, a transaction (txn) is typically first committed locally, then streamed to the sync standbys, and the backend that generated it waits for an ack from the sync standbys. While waiting for the ack, the query or the txn may get canceled (QueryCancelPending is true), or the waiting backend may be asked to exit (ProcDiePending is true). In either case, the wait for the ack is canceled, which leaves the txn in an inconsistent state: it is committed locally, while the client may assume it was also replicated to the sync standbys. The server reports this as "The transaction has already committed locally, but might not have been replicated to the standby.".

After a restart following a crash, or in the next txn after the wait of the old locally committed txn was canceled, the client can see txns that were never actually streamed to the sync standbys. Also, if the client fails over to one of the sync standbys after a crash of the primary (either by choice or because of automatic failover management), the txns committed only locally on the crashed primary are lost, which isn't good in a true HA solution.

Here's a proposal (mentioned previously by Satya [1]) to avoid the above problems:

1) Wait a configurable amount of time before canceling the sync replication wait in the backends, i.e. delay the processing of QueryCancelPending and ProcDiePending while waiting for the ack. The patch introduces a new timeout GUC, synchronous_replication_naptime_before_cancel. When set, it lets the backends keep waiting for the ack before canceling the synchronous replication wait, so that the transaction can be available on the sync standbys as well. If the ack isn't received even within this time frame, the backend cancels the wait and goes ahead as it does today. In production HA environments, the GUC can be set to a reasonable value to avoid missing transactions during failovers.

2) Wait for the sync standbys to catch up upon restart after a crash, or in the next txn after the wait of the old locally committed txn was canceled. One way to achieve this is to let the backend that's making the first connection wait for the sync standbys to catch up in ClientAuthentication, right after successful authentication. However, I'm not sure this is the best way to do it at this point.

Thoughts?

Here's a WIP patch implementing (1); I'm yet to code (2). I haven't added tests: I'm yet to figure out how to add one, as there's no way to delay the WAL sender so that we can reliably hit this code. I will think more about this.

[1] https://www.postgresql.org/message-id/CAHg%2BQDdTdPsqtu0QLG8rMg3Xo%3D6Xo23TwHPYsUgGNEK13wTY5g%40mail.gmail.com

Regards,
Bharath Rupireddy.

From d5fe07bbd80b72dfbf06e9b039b9e4a93a7f7a06 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sun, 24 Apr 2022 03:42:59 +0000
Subject: [PATCH v1] Wait specified amount of time before cancelling sync replication

In a PostgreSQL high availability setup with synchronous replication,
typically all the transactions first get committed locally, then get
streamed to the synchronous standbys, and the backends that generated
the transactions wait for acknowledgement from the synchronous
standbys. While waiting for acknowledgement, it may happen that the
query or the transaction gets canceled or the backend that's waiting
for acknowledgement is asked to exit. In either of these cases, the
wait for acknowledgement gets canceled and leaves the transaction in
an inconsistent state, as it got committed locally but not on the
standbys.

When the GUC synchronous_replication_naptime_before_cancel introduced
in this patch is set, it lets the backends wait for the
acknowledgement before canceling the wait, so that the transaction
can be available on the synchronous standbys as well.
---
 doc/src/sgml/config.sgml                      | 30 +++++++++++
 src/backend/replication/syncrep.c             | 50 +++++++++++++++++++
 src/backend/utils/misc/guc.c                  | 12 +++++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/replication/syncrep.h             |  3 ++
 5 files changed, 97 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 03986946a8..1681ea173f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4507,6 +4507,36 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-synchronous-replication-naptime-before-cancel" xreflabel="synchronous_replication_naptime_before_cancel">
+      <term><varname>synchronous_replication_naptime_before_cancel</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>synchronous_replication_naptime_before_cancel</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the amount of time in milliseconds to wait for synchronous
+        replication before cancelling. Default value is 0, a value of -1 or 0
+        disables this feature. In <productname>PostgreSQL</productname> high
+        availability setup with synchronous replication, typically all the
+        transactions first locally get committed, then streamed to the
+        synchronous standbys and the backends that generated the transaction
+        will wait for acknowledgement from synchronous standbys. While waiting
+        for acknowledgement, it may happen that the query or the transaction
+        gets canceled or the backend that's waiting for acknowledgement is
+        asked to exit. In either of these cases, the wait for acknowledgement
+        gets canceled and leaves transaction in an inconsistent state as it got
+        committed locally but not on the standbys. When the
+        <varname>synchronous_replication_naptime_before_cancel</varname>
+        parameter is set, it will let the backends wait for the acknowledgement
+        before canceling the wait for acknowledgement so that the transaction
+        can be available in synchronous standbys as well. This parameter can
+        only be set in the <filename>postgresql.conf</filename> file or on the
+        server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ce163b99e9..0f54d81f2b 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -88,6 +88,7 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+int			SyncRepNapTimeBeforeCancel = 0;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
@@ -119,6 +120,7 @@ static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 static int	SyncRepGetStandbyPriority(void);
 static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
+static bool SyncRepNapBeforeCancel(void);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -130,6 +132,42 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Wait for synchronous replication before cancelling, if requested by user.
+ */
+static bool
+SyncRepNapBeforeCancel(void)
+{
+	int			wait_time;
+
+	if (SyncRepNapTimeBeforeCancel <= 0)
+		return false;
+
+	ereport(WARNING,
+			(errmsg_plural("waiting %d millisecond for synchronous replication before cancelling",
+						   "waiting %d milliseconds for synchronous replication before cancelling",
+						   SyncRepNapTimeBeforeCancel,
+						   SyncRepNapTimeBeforeCancel)));
+
+	wait_time = SyncRepNapTimeBeforeCancel;
+
+	while (wait_time > 0)
+	{
+		/*
+		 * Wait in intervals of 1 millisecond so that we can frequently check
+		 * for the acknowledgement.
+		 */
+		pg_usleep(1 * 1000L);
+
+		wait_time--;
+
+		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+			return true;
+	}
+
+	return true;
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -263,6 +301,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		 */
 		if (ProcDiePending)
 		{
+			if (SyncRepNapBeforeCancel())
+			{
+				if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+					break;
+			}
+
 			ereport(WARNING,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
@@ -280,6 +324,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		 */
 		if (QueryCancelPending)
 		{
+			if (SyncRepNapBeforeCancel())
+			{
+				if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
+					break;
+			}
+
 			QueryCancelPending = false;
 			ereport(WARNING,
 					(errmsg("canceling wait for synchronous replication due to user request"),
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8e9b71375c..547bc2727f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2743,6 +2743,18 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,			/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+
+	{
+		{"synchronous_replication_naptime_before_cancel", PGC_SIGHUP, REPLICATION_PRIMARY,
+			gettext_noop("Sets the amount of time to wait for synchronous replication before cancelling."),
+			gettext_noop("A value of -1 or 0 disables this feature."),
+			GUC_UNIT_MS
+		},
+		&SyncRepNapTimeBeforeCancel,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_failsafe_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should trigger failsafe to avoid a wraparound outage."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 94270eb0ec..4fd4d04804 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -324,6 +324,8 @@
 				# and comma-separated list of application_name
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
+#synchronous_replication_naptime_before_cancel = 0 # amount of time to wait for
+				# synchronous replication before cancelling; 0 or -1 disables
 
 # - Standby Servers -
 
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 4d7c90b9f0..6678f14b93 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -81,6 +81,9 @@ extern PGDLLIMPORT char *syncrep_parse_error_msg;
 /* user-settable parameters for synchronous replication */
 extern PGDLLIMPORT char *SyncRepStandbyNames;
 
+/* user-settable nap time for synchronous replication before cancelling */
+extern PGDLLIMPORT int SyncRepNapTimeBeforeCancel;
+
 /* called by user backend */
 extern void SyncRepWaitForLSN(XLogRecPtr lsn, bool commit);
 
-- 
2.25.1
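
As a rough illustration of the bounded wait that SyncRepNapBeforeCancel performs in the patch above, here is a minimal stand-alone C sketch that can be driven without a server. Everything in it is hypothetical and not part of the patch: WaitState, nap_before_cancel, the poll/nap callbacks and FakeStandby are illustrative names, and the fake standby advances a simulated clock instead of calling pg_usleep. One deliberate difference from the patch: the sketch returns true only when the ack was actually observed, so the caller would not need to recheck the state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MyProc->syncRepState in the patch. */
typedef enum { WAIT_PENDING, WAIT_COMPLETE } WaitState;

/* Hypothetical callbacks so the loop can run without a server. */
typedef WaitState (*poll_state_fn) (void *arg);
typedef void (*nap_ms_fn) (int ms, void *arg);

/*
 * Poll for the acknowledgement in 1 ms steps for at most naptime_ms
 * milliseconds. Returns true only if the ack was seen in time; returns
 * false if the feature is disabled (naptime_ms <= 0) or the time ran out.
 */
static bool
nap_before_cancel(int naptime_ms, poll_state_fn poll, nap_ms_fn nap, void *arg)
{
	int			remaining;

	assert(poll != NULL && nap != NULL);

	if (naptime_ms <= 0)
		return false;

	for (remaining = naptime_ms; remaining > 0; remaining--)
	{
		nap(1, arg);			/* sleep first, as the patch does */
		if (poll(arg) == WAIT_COMPLETE)
			return true;
	}

	return false;
}

/* Simulated standby: the ack "arrives" after ack_after_ms of fake time. */
typedef struct
{
	int			elapsed_ms;
	int			ack_after_ms;
} FakeStandby;

/* Advance the simulated clock instead of really sleeping. */
static void
fake_nap(int ms, void *arg)
{
	FakeStandby *s = arg;

	s->elapsed_ms += ms;
}

/* Report completion once the simulated clock reaches the ack time. */
static WaitState
fake_poll(void *arg)
{
	FakeStandby *s = arg;

	return (s->elapsed_ms >= s->ack_after_ms) ? WAIT_COMPLETE : WAIT_PENDING;
}
```

With a FakeStandby that acks after 3 simulated milliseconds and a nap time of 10, nap_before_cancel(10, fake_poll, fake_nap, &standby) returns true after three polls; with an ack after 50 milliseconds it returns false once the 10 milliseconds are used up. Injecting the clock this way is only one possible approach to making such a loop deterministic in a test; it says nothing about how the WAL sender could be delayed in a real regression test.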