Hi hackers! I want to revive attempts to fix some old edge cases of physical quorum replication.
Please find attached draft patches that demonstrate the ideas. These patches
are not actually proposed code changes; rather, I want to reach a design
consensus first.
1. Allow checking standby sync before making data visible after crash recovery
Problem: a Postgres instance must not allow reading data that is not yet
known to be replicated.
Immediately after a crash we do not know whether we are still the cluster
primary. We can disallow new connections until the standby quorum is
established. Of course, walsenders and superusers must be exempt from this
restriction.
The key change is the following:
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	if (PostAuthDelay > 0)
 		pg_usleep(PostAuthDelay * 1000000L);
 
+	/* Check if we need to wait for startup synchronous replication */
+	if (!am_walsender &&
+		!superuser() &&
+		!StartupSyncRepEstablished())
+	{
+		ereport(FATAL,
+				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+	}
We might also want some kind of cache recording that the quorum has already
been established. Also, the place where the check is done might not be the
most appropriate one.
2. Do not allow canceling a locally written transaction
The problem has been discussed many times [0,1,2,3], with some agreement on
the approach taken. But there were concerns that the solution is incomplete
without the first patch in the current thread.
Problem: a user might cancel a locally committed transaction, and if we allow
that, we will show non-replicated data as committed. This leads to losing
data with UPSERTs.
The key change is how we process cancels in SyncRepWaitForLSN().
3. Allow reading the LSN written by the walreceiver, but not flushed yet
Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might
be ahead of node1 by flush LSN, but behind by written LSN. If we do a
failover, we choose node2 instead of node1 and lose data recently committed
with synchronous_commit = remote_write.
Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact
returns the flushed LSN, not the written one. I propose adding a new function
that returns the LSN actually written. The internals of this function are
already implemented (GetWalRcvWriteRecPtr()), but unused.
Currently we use a separate program, lwaldump [4], which simply reads WAL up
to the last valid record. In case of failover, pg_consul uses the LSNs from
lwaldump. This approach works well, but is cumbersome.
There are other caveats of replication, but IMO these three problems are the
most annoying in terms of data durability.
I'd greatly appreciate any thoughts on this.
Best regards, Andrey Borodin.
[0]
https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1]
https://www.postgresql.org/message-id/flat/caeet0zhg5off7iecby6tzadh1moslmfz1hlm311p9vot7z+...@mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/[email protected]#415dc2f7d41b8a251b419256407bb64d
[3]
https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[4] https://github.com/g0djan/lwaldump
0001-Allow-checking-standby-sync-before-making-data-visib.patch
Description: Binary data
0002-Do-not-allow-to-cancel-locally-written-transaction.patch
Description: Binary data
0003-Allow-reading-LSN-written-by-walreciever-but-not-flu.patch
Description: Binary data
