On Thu, Jan 6, 2022 at 1:29 PM SATYANARAYANA NARLAPURAM
<satyanarlapu...@gmail.com> wrote:
>
> Consider a cluster formation where we have a Primary (P), a Sync Replica (S1), 
> and multiple async replicas for disaster recovery and read scaling (within 
> the region and outside the region). In this setup, S1 is the preferred 
> failover target in the event of a primary failure. When a transaction is 
> committed on the primary, it is not acknowledged to the client until the 
> primary gets an acknowledgment from the sync standby that the WAL is flushed 
> to disk (assume the synchronous_commit configuration is remote_flush). 
> However, the walsenders corresponding to the async replicas on the primary 
> don't wait for that flush acknowledgment; they send the WAL to the async 
> standbys (and to any logical replication/decoding clients) right away. So it 
> is possible for the async replicas and logical clients to get ahead of the 
> sync replica. If a failover is initiated in such a scenario, to bring the 
> formation into a healthy state we have to either:
>
>  run pg_rewind on the async replicas so that they can reconnect with the 
> new primary, or
> collect the latest WAL across the replicas and feed it to the standby.
>
> Both of these operations are involved, error prone, and can cause multiple 
> minutes of downtime if done manually. In addition, there is a window where 
> the async replicas can show data that was neither acknowledged to the 
> client nor committed on the sync standby. Logical clients, if they are 
> ahead, may need to reseed the data, as there is no easy rewind option for 
> them.
>
> I would like to propose a GUC, send_wal_after_quorum_committed, which, when 
> set to ON, makes the walsenders corresponding to async standbys and logical 
> replication workers wait until the LSN is quorum committed on the primary 
> before sending it to the standby. This not only simplifies the post-failover 
> steps but also avoids unnecessary downtime for the async replicas. Thoughts?

Thanks Satya and others for the inputs. Here's the v1 patch that
basically allows async walsenders to wait until the sync standbys
report their flush LSN back to the primary. Please let me know your
thoughts.
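
In essence, the async walsender's send cutoff becomes the minimum of
the local flush LSN and the flush LSN the sync quorum has reported
back. A rough sketch of that idea follows; the helper and GUC variable
names here are made up for illustration, and the attached patch may
structure this differently:

#include "postgres.h"
#include "access/xlogdefs.h"

/* proposed GUC (hypothetical variable name) */
extern bool send_wal_after_quorum_committed;

/*
 * Hypothetical helper: the lowest flush LSN acknowledged by the
 * current sync-standby quorum.
 */
extern XLogRecPtr GetQuorumSyncFlushLsn(void);

/*
 * Cap the WAL an async walsender may stream at the LSN the sync
 * quorum has flushed, so async standbys and logical clients can
 * never get ahead of the sync replica.
 */
static XLogRecPtr
GetAsyncSendCutoff(XLogRecPtr localFlushPtr)
{
    if (!send_wal_after_quorum_committed)
        return localFlushPtr;   /* current behavior: send up to local flush */

    return Min(localFlushPtr, GetQuorumSyncFlushLsn());
}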

I've done pgbench testing to see if the patch causes any problems. I
ran the tests twice; there isn't much difference in transactions per
second (tps) with and without the patch [1], although there is a delay
in the async standby receiving the WAL, which is, after all, the
behavior the feature introduces.

[1]
HEAD or WITHOUT PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 247.395 ms
latency stddev = 74.409 ms
initial connection time = 13.622 ms
tps = 39.713114 (without initial connection time)

PATCH:
./pgbench -c 10 -t 500 -P 10 testdb
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 500
number of transactions actually processed: 5000/5000
latency average = 251.757 ms
latency stddev = 72.846 ms
initial connection time = 13.025 ms
tps = 39.315862 (without initial connection time)

TEST SETUP:
primary in region 1
async standby 1 in the same region as the primary (region 1), i.e.
close to the primary
sync standby 1 in region 2
sync standby 2 in region 3
an archive location in region 4, different from the regions of the
primary and the standbys
Note that I intentionally kept the sync standbys in regions far from
the primary because that makes the sync standbys receive the WAL a bit
late by default, which works well for this testing.
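
For reference, the sync quorum on the primary is configured via
synchronous_standby_names; something along these lines (the standby
names and the exact quorum policy below are assumptions, shown only to
illustrate the configuration):

# postgresql.conf on the primary
synchronous_commit = on                        # i.e. wait for remote flush
synchronous_standby_names = 'ANY 1 (s1, s2)'   # quorum of the two sync standbys
# the async standby and logical clients are simply not listed here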

PGBENCH SETUP:
./psql -d postgres -c "drop database testdb"
./psql -d postgres -c "create database testdb"
./pgbench -i -s 100 testdb
./psql -d testdb -c "\dt"
./psql -d testdb -c "SELECT pg_size_pretty(pg_database_size('testdb'))"
./pgbench -c 10 -t 500 -P 10 testdb

Regards,
Bharath Rupireddy.

Attachment: v1-0001-Allow-async-standbys-wait-for-sync-replication.patch
