Re: [HACKERS] max_standby_delay considered harmful

Simon Riggs Tue, 04 May 2010 07:28:14 -0700

Downthread, I said..

On Tue, 2010-05-04 at 14:49 +0100, Simon Riggs wrote:


> The only difference is that max_standby_delay is measured from log
> timestamp. Perhaps it should work from WAL receipt timestamp rather than
> from log timestamp? That would make some of the problems go away without
> significantly changing the definition. I'll look at that.

Patch to implement this idea attached: for discussion, not tested yet.
No docs yet.

The attached patch redefines "standby delay" to be the amount of time
elapsed from point of receipt to point of application. The "point of
receipt" is reset every chunk of data when streaming, or every file when
reading file by file. In all cases this new time is later than the
latest log time we would have used previously.

This addresses all of your points, as shown below.

On Mon, 2010-05-03 at 11:37 -0400, Tom Lane wrote:
> There are three really fundamental problems with it:
> 
> 1. The timestamps we are reading from the log might be historical,
> if we are replaying from archive rather than reading a live SR stream.
> In the current implementation that means zero grace period for standby
> queries.  Now if your only interest is catching up as fast as possible,
> that could be a sane behavior, but this is clearly not the only possible
> interest --- in fact, if that's all you care about, why did you allow
> standby queries at all?

The delay used is from time of receipt of WAL, no longer from log date.
So this would no longer apply.

> 2. There could be clock skew between the master and slave servers.
> If the master's clock is a minute or so ahead of the slave's, again we
> get into a situation where standby queries have zero grace period, even
> though killing them won't do a darn thing to permit catchup.  If the
> master is behind the slave then we have an artificially inflated grace
> period, which is going to slow down the slave.

The timestamp is from standby, not master, so this would no longer
apply.

> 3. There could be significant propagation delay from master to slave,
> if the WAL stream is being transmitted with pg_standby or some such.
> Again this results in cutting into the standby queries' grace period,
> for no defensible reason.

The timestamp is taken immediately at the point the WAL is ready for
replay, so other timing overheads would not be included.

> In addition to these fundamental problems there's a fatal implementation
> problem: the actual comparison is not to the master's current clock
> reading, but to the latest commit, abort, or checkpoint timestamp read
> from the WAL.  Thus, if the last commit was more than max_standby_delay
> seconds ago, zero grace time.  Now if the master is really idle then
> there aren't going to be any conflicts anyway, but what if it's running
> only long-running queries?  Or what happens when it was idle for awhile
> and then starts new queries?  Zero grace period, that's what.
> 
> We could possibly improve matters for the SR case by having walsender
> transmit the master's current clock reading every so often (probably
> once per activity cycle), outside the WAL stream proper.  The receiver
> could subtract off its own clock reading in order to measure the skew,
> and then we could cancel queries if the de-skewed transmission time
> falls too far behind.  However this doesn't do anything to fix the cases
> where we aren't reading (and caught up to) a live SR broadcast.

-- 
 Simon Riggs           www.2ndQuadrant.com

*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 9281,9287 **** retry:
--- 9281,9294 ----
  												  sources);
  					switched_segment = true;
  					if (readFile != -1)
+ 					{
+ 						/*
+ 						 * Nudge forwards the WAL receive timestamp when
+ 						 * we are relying on WAL files.
+ 						 */
+ 						SetWalRcvTimestamp();
  						break;
+ 					}
  
  					/*
  					 * Nope, not found in archive and/or pg_xlog.
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 515,521 **** XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
  	}
  }
  
! /* Flush the log to disk */
  static void
  XLogWalRcvFlush(void)
  {
--- 515,521 ----
  	}
  }
  
! /* Flush the log to disk and update shared memory pointer and timestamp */
  static void
  XLogWalRcvFlush(void)
  {
***************
*** 524,537 **** XLogWalRcvFlush(void)
--- 524,541 ----
  		/* use volatile pointer to prevent code rearrangement */
  		volatile WalRcvData *walrcv = WalRcv;
  		char		activitymsg[50];
+ 		TimestampTz receivedTimestamp;
  
  		issue_xlog_fsync(recvFile, recvId, recvSeg);
  
  		LogstreamResult.Flush = LogstreamResult.Write;
  
+ 		receivedTimestamp = GetCurrentTimestamp();
+ 
  		/* Update shared-memory status */
  		SpinLockAcquire(&walrcv->mutex);
  		walrcv->receivedUpto = LogstreamResult.Flush;
+ 		walrcv->receivedTimestamp = receivedTimestamp;
  		SpinLockRelease(&walrcv->mutex);
  
  		/* Report XLOG streaming progress in PS display */
*** a/src/backend/replication/walreceiverfuncs.c
--- b/src/backend/replication/walreceiverfuncs.c
***************
*** 220,222 **** GetWalRcvWriteRecPtr(void)
--- 220,254 ----
  
  	return recptr;
  }
+ 
+ /*
+  * Returns the timestamp of last walreceiver write
+  */
+ TimestampTz
+ GetWalRcvTimestamp(void)
+ {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile WalRcvData *walrcv = WalRcv;
+ 	TimestampTz receivedTimestamp;
+ 
+ 	SpinLockAcquire(&walrcv->mutex);
+ 	receivedTimestamp = walrcv->receivedTimestamp;
+ 	SpinLockRelease(&walrcv->mutex);
+ 
+ 	return receivedTimestamp;
+ }
+ 
+ /*
+  * Sets the timestamp for when receiving WAL files
+  */
+ void
+ SetWalRcvTimestamp(void)
+ {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile WalRcvData *walrcv = WalRcv;
+ 	TimestampTz receivedTimestamp = GetCurrentTimestamp();
+ 
+ 	SpinLockAcquire(&walrcv->mutex);
+ 	walrcv->receivedTimestamp = receivedTimestamp;
+ 	SpinLockRelease(&walrcv->mutex);
+ }
*** a/src/backend/storage/ipc/standby.c
--- b/src/backend/storage/ipc/standby.c
***************
*** 22,27 ****
--- 22,28 ----
  #include "access/xlog.h"
  #include "miscadmin.h"
  #include "pgstat.h"
+ #include "replication/walreceiver.h"
  #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
  #include "storage/proc.h"
***************
*** 126,132 **** WaitExceedsMaxStandbyDelay(void)
  {
  	/* Are we past max_standby_delay? */
  	if (MaxStandbyDelay >= 0 &&
! 		TimestampDifferenceExceeds(GetLatestXLogTime(), GetCurrentTimestamp(),
  								   MaxStandbyDelay))
  		return true;
  
--- 127,133 ----
  {
  	/* Are we past max_standby_delay? */
  	if (MaxStandbyDelay >= 0 &&
! 		TimestampDifferenceExceeds(GetWalRcvTimestamp(), GetCurrentTimestamp(),
  								   MaxStandbyDelay))
  		return true;
  
*** a/src/include/replication/walreceiver.h
--- b/src/include/replication/walreceiver.h
***************
*** 58,63 **** typedef struct
--- 58,64 ----
  	 * walreceiver updates this whenever it flushes the received WAL.
  	 */
  	XLogRecPtr	receivedUpto;
+ 	TimestampTz	receivedTimestamp;
  
  	slock_t		mutex;			/* locks shared variables shown above */
  } WalRcvData;
***************
*** 83,87 **** extern bool WalRcvInProgress(void);
--- 84,90 ----
  extern XLogRecPtr WaitNextXLogAvailable(XLogRecPtr recptr, bool *finished);
  extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
  extern XLogRecPtr GetWalRcvWriteRecPtr(void);
+ extern TimestampTz GetWalRcvTimestamp(void);
+ extern void SetWalRcvTimestamp(void);
  
  #endif   /* _WALRECEIVER_H */

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] max_standby_delay considered harmful

Reply via email to