On Mon, Dec 19, 2016 at 10:46 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 26 October 2016 at 11:34, Thomas Munro <thomas.mu...@enterprisedb.com>
> wrote:
>
>> It works by taking advantage of the { time, end-of-WAL } samples that
>> sending servers already include in message headers to standbys.  That
>> seems to provide a pretty good proxy for when the WAL was written, if
>> you ignore messages where the LSN hasn't advanced.  The patch
>> introduces a new GUC replay_lag_sample_interval, defaulting to 1s, to
>> control how often the standby should record these timestamped LSNs
>> into a small circular buffer.  When its recovery process eventually
>> replays a timestamped LSN, the timestamp is sent back to the upstream
>> server in a new reply message field.  The value visible in
>> pg_stat_replication.replay_lag can then be updated by comparing with
>> the current time.
>
> Why not just send back the lag as calculated by max_standby_streaming_delay?
> I.e. at the end of replay of each chunk record the current delay in
> shmem, then send it back periodically.
>
> If we have two methods of calculation it would be confusing.
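Before answering, here is roughly what the quoted mechanism amounts to, as
a simplified standalone sketch rather than the actual patch code; all of
the names in it are made up for illustration:

/*
 * Illustrative sketch only, not the patch code: the standby keeps a small
 * circular buffer of { end-of-WAL LSN, send timestamp } pairs taken from
 * message headers, and once recovery has replayed past a sampled LSN the
 * sample's timestamp is reported back so that the sending server can
 * compute "now - send_time" and expose it as replay_lag.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL pointer */
typedef int64_t TimestampTz;    /* stand-in, in microseconds */

#define LAG_BUFFER_SIZE 8192
#define SAMPLE_INTERVAL_US 1000000      /* replay_lag_sample_interval = 1s */

typedef struct
{
    XLogRecPtr  lsn;
    TimestampTz send_time;
} LagSample;

static LagSample buffer[LAG_BUFFER_SIZE];
static int  read_pos = 0;
static int  write_pos = 0;
static int  nsamples = 0;
static XLogRecPtr last_lsn = 0;
static TimestampTz last_sample_time = 0;

/* Walreceiver side: consider recording a sample for each message header. */
static void
maybe_record_sample(XLogRecPtr end_lsn, TimestampTz send_time)
{
    if (end_lsn <= last_lsn)
        return;                 /* ignore messages where the LSN didn't advance */
    if (send_time - last_sample_time < SAMPLE_INTERVAL_US)
        return;                 /* not yet time for another sample */
    if (nsamples == LAG_BUFFER_SIZE)
        return;                 /* full; the real scheme widens the sampling
                                 * interval instead, as described below */

    buffer[write_pos].lsn = end_lsn;
    buffer[write_pos].send_time = send_time;
    write_pos = (write_pos + 1) % LAG_BUFFER_SIZE;
    nsamples++;
    last_lsn = end_lsn;
    last_sample_time = send_time;
}

/*
 * Reply side: when recovery has applied up to replayed_lsn, return the
 * newest sample timestamp that it covers; that timestamp goes back in the
 * reply message and the upstream server turns it into replay_lag.
 */
static bool
consume_replayed(XLogRecPtr replayed_lsn, TimestampTz *send_time)
{
    bool        found = false;

    while (nsamples > 0 && buffer[read_pos].lsn <= replayed_lsn)
    {
        *send_time = buffer[read_pos].send_time;
        read_pos = (read_pos + 1) % LAG_BUFFER_SIZE;
        nsamples--;
        found = true;
    }
    return found;
}

int
main(void)
{
    TimestampTz send_time;

    /* two message headers arriving a second apart, then replay reaches LSN 1500 */
    maybe_record_sample(1000, 1000000);
    maybe_record_sample(2000, 2000000);
    if (consume_replayed(1500, &send_time))
        printf("sender would report now - %lld as replay_lag\n",
               (long long) send_time);
    return 0;
}

The important property is that the timestamp originates on the sending
server, so the comparison with the current time there covers the whole
network, write, flush and replay path.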
Hmm.  If I understand correctly, GetStandbyLimitTime is measuring something
a bit different: it computes how long it has been since the recovery
process received the chunk that it's currently trying to apply, most
interestingly in the case where we are waiting due to conflicts.  It
doesn't include time spent in the walsender, on the network, in the
walreceiver, or writing, flushing and reading before the WAL arrives in the
recovery process.  If I'm reading it correctly, it only updates
XLogReceiptTime when it's completely caught up applying all earlier chunks,
so when it falls behind, its measure of lag has a growing-only phase and a
reset that can only be triggered by catching up to the latest chunk.  That
seems OK for its intended purpose of putting a cap on the delay introduced
by conflicts.  But that's not what I'm trying to provide here.

The purpose of this proposal is to show the replay_lag as judged by the
sending server: in the case of a primary server, that is an indication of
how long commits done here will take to show up for users over there, and
how long a COMMIT with remote_apply will take to come back.  It measures
the WAL's whole journey, and does so in a smooth way that shows accurate
information even if the standby never quite catches up for long periods.

Example 1: Suppose I have two servers right next to each other, and the
primary server has periods of high activity that exceed the standby's
replay rate, perhaps because of slower/busier hardware, or conflicts with
other queries, or because our single-core 'REDO' can't always keep up with
multi-core 'DO'.  By the logic of max_standby_streaming_delay, if the
standby never catches up to the latest chunk but remains a fluctuating
number of chunks behind, then AIUI it will compute a constantly increasing
lag.  By my logic, the primary will tell you quite accurately how far
behind the standby's recovery is at regular intervals, showing replay_lag
fluctuating up and down as appropriate, even if it never quite catches up.
It can do that because it has a buffer full of regularly spaced samples to
work through, and even if you exceed the buffer size (8192 seconds' worth
by default) it will just increase the interval between samples.

Example 2: Suppose I have servers on opposite sides of the world with a
ping time of 300ms.  By the logic used for max_standby_streaming_delay, the
lag computed by the standby would be close to zero when there is no
concurrent activity to conflict with.  I don't think that's what users
other than the recovery-conflict resolution code want to know.  By my
logic, replay_lag computed by the primary would show 300ms plus a tiny bit
more, which is how long it takes for committed transactions to become
visible to user queries on the standby and for us to know that that is the
case.  That's interesting because it tells you how long synchronous_commit
= remote_apply would make you wait (if that server is waited for according
to the syncrep config).

In summary, the max_standby_streaming_delay approach only measures activity
inside the recovery process on the standby, and only uses a single variable
for timestamp tracking, so although it's semi-related it's not what I
wanted to show.

(I suppose there might be an argument that max_standby_streaming_delay
should also track the received-on-standby time for each sampled LSN in a
circular buffer, and then use that information to implement
max_standby_streaming_delay more fairly: we would then only need to cancel
queries that conflict with WAL records that have truly been waiting
max_standby_streaming_delay since receive time, instead of, as we do today,
cancelling everything that conflicts with recovery until we're caught up to
the last chunk as soon as max_standby_streaming_delay is exceeded while
trying to apply *any* WAL record.  This may not make any sense or be worth
doing, just an idea...)
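If that idea were ever worth pursuing, the bookkeeping might look vaguely
like this; again a made-up sketch, assuming a buffer of { LSN, receive
time } samples on the standby, not anything that exists in the tree:

/*
 * Purely illustrative sketch of the parenthetical idea above: remember
 * when each sampled LSN was received on the standby, and when a WAL record
 * conflicts with a query, only cancel the query once that record
 * (approximated by the nearest earlier sample) has itself been waiting
 * longer than max_standby_streaming_delay.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef int64_t TimestampTz;    /* microseconds */

typedef struct
{
    XLogRecPtr  lsn;
    TimestampTz receive_time;   /* when the standby received this WAL */
} ReceiptSample;

#define RECEIPT_BUFFER_SIZE 8192

/* Filled by the walreceiver in ascending LSN order (wraparound omitted). */
static ReceiptSample receipt_buffer[RECEIPT_BUFFER_SIZE];
static int receipt_count = 0;

/*
 * Has this record's conflict been waiting long enough to justify cancelling
 * queries?  Uses the receive time of the newest sample at or before the
 * record, instead of a single XLogReceiptTime that only resets once we are
 * completely caught up.
 */
static bool
conflict_delay_expired(XLogRecPtr record_lsn, TimestampTz now,
                       TimestampTz max_standby_streaming_delay)
{
    TimestampTz received = now; /* no sample yet: treat as just received */
    int         i;

    for (i = 0; i < receipt_count && receipt_buffer[i].lsn <= record_lsn; i++)
        received = receipt_buffer[i].receive_time;

    return now - received > max_standby_streaming_delay;
}

That only sketches the bookkeeping; whether the extra tracking would be
worth the complexity is a separate question.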
> Admittedly the approach here is the same one I advocated some years
> back when Robert and I were discussing time delayed standbys.

That is reassuring!

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers