> On 12 July 2017 at 23:45, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > I renamed it to "synchronous replay", because "causal reads" seemed a bit too > arcane.
Hi I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few tests. I didn't manage to break anything, except one mysterious error that I've got only once on one of my replicas, but I couldn't reproduce it yet. Interesting thing is that this error did not affect another replica or primary. Just in case here is the log for this error (maybe you can see something obvious, that I've not noticed): ``` LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211": Directory not empty CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 LOG: directories for tablespace 47733 could not be removed HINT: You can remove the directories manually if necessary. CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733 FATAL: could not create directory "pg_tblspc/47734/PG_10_201707211/47732": File exists CONTEXT: WAL redo at 0/125F5768 for Storage/CREATE: pg_tblspc/47734/PG_10_201707211/47732/47736 LOG: startup process (PID 8034) exited with exit code 1 LOG: terminating any other active server processes LOG: database system is shut down ``` And speaking about the code, so far I have just a few notes (some of them merely questions): * In general the idea behind this patch sounds interesting for me, but it relies heavily on time synchronization. As mentioned in the documentation: "Current hardware clocks, NTP implementations and public time servers are unlikely to allow the system clocks to differ more than tens or hundreds of milliseconds, and systems synchronized with dedicated local time servers may be considerably more accurate." But as far as I remember from my own experience sometimes it maybe not that trivial on something like AWS because of virtualization. Maybe it's an unreasonable fear, but is it possible to address this problem somehow? * Also I noticed that some time-related values are hardcoded (e.g. 50%/25% time shift when we're dealing with leases). Does it make sense to move them out and make them configurable? * Judging from the `SyncReplayPotentialStandby` function, it's possible to have `synchronous_replay_standby_names = "*, server_name"`, which is basically an equivalent for just `*`, but it looks confusing. Is it worth it to prevent this behaviour? * In the same function `SyncReplayPotentialStandby` there is this code: ``` if (!SplitIdentifierString(rawstring, ',', &elemlist)) { /* syntax error in list */ pfree(rawstring); list_free(elemlist); /* GUC machinery will have already complained - no need to do again */ return false; } ``` Am I right that ideally this (a situation when at this point in the code `synchronous_replay_standby_names` has incorrect value) should not happen, because GUC will prevent us from that? If yes, then it looks for me that it still makes sense to put here a log message, just to give more information in a potentially weird situation. * In the function `SyncRepReleaseWaiters` there is a commentary: ``` /* * If the number of sync standbys is less than requested or we aren't * managing a sync standby or a standby in synchronous replay state that * blocks then just leave. * / if ((!got_recptr || !am_sync) && !walsender_sr_blocker) ``` Is this commentary correct? If I understand everything right `!got_recptr` - the number of sync standbys is less than requested (a), `!am_sync` - we aren't managing a sync standby (b), `walsender_sr_blocker` - a standby in synchronous replay state that blocks (c). Looks like condition is `(a or b) and not c`. * In the function `ProcessStandbyReplyMessage` there is a code that implements this: ``` * If the standby reports that it has fully replayed the WAL for at least * 10 seconds, then let's clear the lag times that were measured when it * last wrote/flushed/applied a WAL record. This way we avoid displaying * stale lag data until more WAL traffic arrives. ``` but I never found any mention of this 10 seconds in the documentation. Is it not that important? Also, question 2 is related to this one. * In the function `WalSndGetSyncReplayStateString` all the states are in lower case except `UNKNOWN`, is there any particular reason for that? There are also few more not that important notes (mostly about some typos and few confusing names), but I'm going to do another round of review and testing anyway so I'll just send them all next time.