On Mon, May 9, 2022 at 2:50 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz.a...@cybertec.at> wrote:
> >
> > On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
> > > With synchronous replication typically all the transactions (txns)
> > > first locally get committed, then streamed to the sync standbys and
> > > the backend that generated the transaction will wait for ack from sync
> > > standbys. While waiting for ack, it may happen that the query or the
> > > txn gets canceled (QueryCancelPending is true) or the waiting backend
> > > is asked to exit (ProcDiePending is true). In either of these cases,
> > > the wait for ack gets canceled and leaves the txn in an inconsistent
> > > state [...]
> > >
> > > Here's a proposal (mentioned previously by Satya [1]) to avoid the
> > > above problems:
> > > 1) Wait a configurable amount of time before canceling the sync
> > > replication by the backends i.e. delay processing of
> > > QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
> > > synchronous_replication_naptime_before_cancel, when set, it will let
> > > the backends wait for the ack before canceling the synchronous
> > > replication so that the transaction can be available in sync standbys
> > > as well.
> > > 2) Wait for sync standbys to catch up upon restart after the crash or
> > > in the next txn after the old locally committed txn was canceled.
> >
> > While this may mitigate the problem, I don't think it will deal with
> > all the cases which could cause a transaction to end up committed locally,
> > but not on the synchronous standby. I think that only using the full
> > power of two-phase commit can make this bulletproof.
>
> Not sure if it's recommended to use 2PC in postgres HA with sync
> replication where the documentation says that "PREPARE TRANSACTION"
> and other 2PC commands are "intended for use by external transaction
> management systems" and with explicit transactions. Whereas, the txns
> within a postgres HA with sync replication always don't have to be
> explicit txns. Am I missing something here?
>
> > Is it worth adding additional complexity that is not a complete solution?
>
> The proposed approach helps to avoid some common possible problems
> that arise with simple scenarios (like cancelling a long running query
> while in SyncRepWaitForLSN) within sync replication.
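To make the timeout behaviour under discussion concrete, here is a toy model of point 1 of the proposal. It is only a sketch and not PostgreSQL source: the function, the enum and the millisecond parameters are hypothetical names invented for illustration, and the only thing taken from the thread is the idea behind synchronous_replication_naptime_before_cancel (keep waiting for the standby ack for a grace period after a cancel or terminate request).

```c
#include <stdbool.h>

/* Toy model only: NOT PostgreSQL code. All names here are hypothetical. */
typedef enum
{
    TXN_REPLICATED,     /* standby ack arrived before the wait was abandoned */
    TXN_LOCAL_ONLY      /* wait abandoned: committed locally, not acknowledged */
} WaitOutcome;

/*
 * ack_ms:     time at which the sync standby's ack would arrive
 * cancel_ms:  time at which a cancel/terminate request arrives
 *             (negative means no such request ever arrives)
 * naptime_ms: grace period the backend keeps waiting after the request,
 *             standing in for the proposed GUC
 */
static WaitOutcome
wait_for_ack(long ack_ms, long cancel_ms, long naptime_ms)
{
    /* Without a cancel request the backend simply waits for the ack. */
    if (cancel_ms < 0)
        return TXN_REPLICATED;

    /* The grace period helps only if the ack lands inside it. */
    if (ack_ms <= cancel_ms + naptime_ms)
        return TXN_REPLICATED;

    /* Otherwise the txn is still visible locally but not on the standby. */
    return TXN_LOCAL_ONLY;
}
```

In this model, wait_for_ack(50, 10, 100) ends up replicated, while wait_for_ack(500, 10, 100) is still local-only: a slow or disconnected standby defeats any finite grace period.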

IMHO, making it wait for some amount of time based on a GUC is not a
complete solution. It is just a hack to avoid the problem in some cases.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com