On 1/10/23 10:17 AM, Amit Kapila wrote:
On Tue, Jan 10, 2023 at 8:13 AM Jonathan S. Katz <jk...@postgresql.org> wrote:

This consistently created the deadlock in my testing.

Discussing with Masahiko off-list, this is due to a deadlock from 4
processes: the walsenders on A and B, and the apply workers on A and B.
The walsenders are waiting for feedback from the apply workers, and the
apply workers are waiting for the walsenders to synchronize (I may be
oversimplifying).

He suggested I try the above example instead with `synchronous_commit`
set to `local`. In this case, I verified that there is no more deadlock,
but he informed me that we would not be able to use cascading
synchronous replication when "origin=none".


This has nothing to do with the origin feature. I mean this should
happen with origin=any or even in PG15 without using origin at all.
Am, I missing something? One related point to note is that in physical
replication cascading replication is asynchronous. See docs [1]
(Cascading replication is currently asynchronous....)

This is not directly related to the origin feature, but the origin feature makes it apparent. There is a common use-case where people want to replicate data synchronously between two tables, and this is enabled by the "origin" feature.

To be clear, my bigger concern is that it's fairly easy for users to create a deadlock situation based on how they set "synchronous_commit" with the origin feature -- this is the main reason why I brought it up. I'm less concerned about the "cascading" case, though I want to try out sync rep between 3 instances to see what happens.

If we decide that this is a documentation issue, I'd suggest we improve
the guidance around using `synchronous_commit`[1] on the CREATE
SUBSCRIPTION page, as the GUC page[2] warns against using `local`:


Yeah, but on Create Subscription page, we have mentioned that it is
safe to use off for logical replication.

While I think you can infer that it's "safe" for synchronous replication, I don't think it's clear.

We say it's "safe to use `off` for logical replication", but provide a lengthy explanation around synchronous logical replication that concludes that it's "advantageous to set synchronous_commit to local or higher" but does not address safety. The first line in the explanation of the parameter links to the `synchronous_commit` GUC which specifically advises against using "local" for synchronous replication.

Additionally, because we say "local" or higher, we increase the risk of the aforementioned in HEAD with the origin feature.

I know I'm hammering on this point a bit, but it feels like this is relatively easy to misconfigure with the upcoming "origin" change (I did so myself from reading the devel docs) and we should ensure we guide our users appropriately.

One can use local or higher
for reducing the latency for COMMIT when synchronous replication is
used in the publisher. Won't using 'local' while creating subscription
would suffice the need to consistently replicate the data? I mean it
is equivalent to somebody using levels greater than local in case of
physical replication. I think in the case of physical replication, we
won't wait for standby to replicate to another node before sending a
response, so why to wait in the case of logical replication? If this
understanding is correct, then probably it is sufficient to support
'local' for a subscription.

I think this is a good explanation. We should incorporate some version of this into the docs, and add some clarity around the use of `synchronous_commit` option in `CREATE SUBSCRIPTION` in particular with the origin feature. I can make an attempt at this.

Perhaps another question: based on this, should we disallow setting values of `synchronous_commit` greater than `local` when using "origin=none"? That might be too strict, but maybe we should warn around the risk of deadlock either in the logs or in the docs.

Thanks,

Jonathan

Attachment: OpenPGP_signature
Description: OpenPGP digital signature

Reply via email to