On Tue, 2009-12-15 at 20:11 +0900, Hiroyuki Yamada wrote:
> Hot Standby node can freeze when startup process calls LockBufferForCleanup().
> This bug can be reproduced by the following procedure.
> 
> 0. start Hot Standby, with one active node (node A) and one standby node (node B)
> 1. create table X and table Y in node A
> 2. insert several rows in table X in node A
> 3. delete one row from table X in node A
> 4. begin xact 1 in node A, execute following commands, and leave xact 1 open
> 4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
> 5. wait until the WAL records for the above actions are applied in node B
> 6. begin xact 2 in node B, and execute following commands
> 6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
> 6.2 FETCH test_cursor;
> 6.3 SELECT * FROM table Y;
> 7. execute VACUUM FREEZE table X in node A
> 8. commit xact 1 in node A
> 
> ...then the following "deadlock" situation occurs in node B, which is not
> detected by the deadlock check.
>  * startup process waits for xact 2 to release buffers in table X (in LockBufferForCleanup())
>  * xact 2 waits for startup process to release ACCESS EXCLUSIVE lock in table Y

The deadlock was prevented by a stop-gap measure in the December commit.

Attached is a patch that fully resolves Startup process waits on buffer pins.

The Startup process now sets a SIGALRM timer when it waits on a buffer pin.
If it is woken by the alarm, it sends SIGUSR1 to all backends, asking each
one to check whether it is blocking the Startup process. A backend that is
blocking throws ERROR/FATAL, as for other conflict resolutions. The deadlock
stop-gap is removed. The max_standby_delay = -1 option is also removed,
because a wait-forever setting would reintroduce the potential for deadlock.
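
For reviewers, here is a rough standalone sketch of the timer arithmetic in
ResolveRecoveryConflictWithBufferPin(). It is not part of the patch: the
TimerDelay type and the remaining_standby_wait() name are made up for
illustration, and the replay lag is passed in directly rather than taken
from TimestampDifference(). It only shows how the remaining wait is derived
from max_standby_delay and the current lag.

```c
#include <assert.h>

/* Hypothetical type for illustration; the patch passes secs/usecs separately. */
typedef struct TimerDelay
{
	long		secs;
	int			usecs;
} TimerDelay;

/*
 * Hypothetical helper, not in the patch.
 *
 * Returns 0 when the allowed delay is already used up, meaning conflicting
 * backends should be signalled immediately. Returns 1 and fills *out with
 * the time to set the timer for otherwise.
 */
int
remaining_standby_wait(int max_standby_delay, long lag_secs, int lag_usecs,
					   TimerDelay *out)
{
	assert(max_standby_delay >= 0);
	assert(lag_usecs >= 0 && lag_usecs < 1000000);

	/* No patience configured, or replay already lags by the full delay */
	if (max_standby_delay == 0 || lag_secs >= (long) max_standby_delay)
		return 0;

	/* Whole seconds left, then borrow one second for the fractional part */
	out->secs = max_standby_delay - lag_secs;
	out->usecs = 0;
	if (lag_usecs > 0)
	{
		out->secs -= 1;
		out->usecs = 1000000 - lag_usecs;
	}

	/*
	 * Defensive guard mirroring the patch: a zero interval would cancel
	 * the timer rather than set it.
	 */
	if (out->secs == 0 && out->usecs == 0)
		out->usecs = 1;

	return 1;
}
```

With max_standby_delay = 30 and a replay lag of 10.25 seconds, this yields a
timer of 19 seconds and 750000 microseconds. Once the lag reaches 30 seconds
it returns 0 and no timer is set.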

Reviews welcome, otherwise commit at end of week.

-- 
 Simon Riggs           www.2ndQuadrant.com
*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 2399,2405 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
         </listitem>
         <listitem>
          <para>
!          Waiting to acquire buffer cleanup locks (for which there is no time out)
          </para>
         </listitem>
         <listitem>
--- 2399,2405 ----
         </listitem>
         <listitem>
          <para>
!          Waiting to acquire buffer cleanup locks
          </para>
         </listitem>
         <listitem>
***************
*** 2536,2546 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
      Three-way deadlocks are possible between AccessExclusiveLocks arriving from
      the primary, cleanup WAL records that require buffer cleanup locks and
      user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
!     are currently resolved by the cancellation of user processes that would
!     need to wait on a lock. This is heavy-handed and generates more query
!     cancellations than we need to, though does remove the possibility of deadlock.
!     This behaviour is expected to improve substantially for the main release
!     version of 8.5.
     </para>
  
     <para>
--- 2536,2542 ----
      Three-way deadlocks are possible between AccessExclusiveLocks arriving from
      the primary, cleanup WAL records that require buffer cleanup locks and
      user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks
!     are resolved by time-out when we exceed <varname>max_standby_delay</>.
     </para>
  
     <para>
***************
*** 2630,2640 **** LOG:  database system is ready to accept read only connections
      <varname>max_standby_delay</> or even set it to zero, though that is a
      very aggressive setting. If the standby server is tasked as an additional
      server for decision support queries then it may be acceptable to set this
!     to a value of many hours (in seconds).  It is also possible to set
!     <varname>max_standby_delay</> to -1 which means wait forever for queries
!     to complete, if there are conflicts; this will be useful when performing
!     an archive recovery from a backup.
!    </para>
  
     <para>
      Transaction status "hint bits" written on primary are not WAL-logged,
--- 2626,2633 ----
      <varname>max_standby_delay</> or even set it to zero, though that is a
      very aggressive setting. If the standby server is tasked as an additional
      server for decision support queries then it may be acceptable to set this
!     to a value of many hours (in seconds).
!    </para>
  
     <para>
      Transaction status "hint bits" written on primary are not WAL-logged,
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1825,1838 **** archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"'  # Windows
        <listitem>
         <para>
          When server acts as a standby, this parameter specifies a wait policy
!         for queries that conflict with incoming data changes. Valid settings
!         are -1, meaning wait forever, or a wait time of 0 or more seconds.
!         If a conflict should occur the server will delay up to this
!         amount before it begins trying to resolve things less amicably, as
          described in <xref linkend="hot-standby-conflict">. Typically,
          this parameter makes sense only during replication, so when
!         performing an archive recovery to recover from data loss a
!         parameter setting of 0 is recommended.  The default is 30 seconds.
          This parameter can only be set in the <filename>postgresql.conf</>
          file or on the server command line.
         </para>
--- 1825,1839 ----
        <listitem>
         <para>
          When server acts as a standby, this parameter specifies a wait policy
!         for queries that conflict with data changes being replayed by recovery.
!         If a conflict should occur the server will delay up to this number
!         of seconds before it begins trying to resolve things less amicably, as
          described in <xref linkend="hot-standby-conflict">. Typically,
          this parameter makes sense only during replication, so when
!         performing an archive recovery to recover from data loss a very high
!         parameter setting is recommended.  The default is 30 seconds.
!         There is no wait-forever setting because of the potential for deadlock
!         which that setting would introduce.
          This parameter can only be set in the <filename>postgresql.conf</>
          file or on the server command line.
         </para>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 8759,8767 **** StartupProcessMain(void)
  	 */
  	pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
  	pqsignal(SIGINT, SIG_IGN);	/* ignore query cancel */
! 	pqsignal(SIGTERM, StartupProcShutdownHandler);		/* request shutdown */
! 	pqsignal(SIGQUIT, startupproc_quickdie);	/* hard crash time */
! 	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, SIG_IGN);
  	pqsignal(SIGUSR2, SIG_IGN);
--- 8759,8770 ----
  	 */
  	pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
  	pqsignal(SIGINT, SIG_IGN);	/* ignore query cancel */
! 	pqsignal(SIGTERM, StartupProcShutdownHandler);	/* request shutdown */
! 	pqsignal(SIGQUIT, startupproc_quickdie);		/* hard crash time */
! 	if (XLogRequestRecoveryConnections)
! 		pqsignal(SIGALRM, handle_standby_sig_alarm); /* ignored unless InHotStandby */
! 	else
! 		pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
  	pqsignal(SIGUSR1, SIG_IGN);
  	pqsignal(SIGUSR2, SIG_IGN);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 44,49 ****
--- 44,50 ----
  #include "storage/ipc.h"
  #include "storage/proc.h"
  #include "storage/smgr.h"
+ #include "storage/standby.h"
  #include "utils/rel.h"
  #include "utils/resowner.h"
  
***************
*** 2417,2430 **** LockBufferForCleanup(Buffer buffer)
  		PinCountWaitBuf = bufHdr;
  		UnlockBufHdr(bufHdr);
  		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
  		/* Wait to be signaled by UnpinBuffer() */
! 		ProcWaitForSignal();
  		PinCountWaitBuf = NULL;
  		/* Loop back and try again */
  	}
  }
  
  /*
   * ConditionalLockBufferForCleanup - as above, but don't wait to get the lock
   *
   * We won't loop, but just check once to see if the pin count is OK.  If
--- 2418,2459 ----
  		PinCountWaitBuf = bufHdr;
  		UnlockBufHdr(bufHdr);
  		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ 
  		/* Wait to be signaled by UnpinBuffer() */
! 		if (InHotStandby)
! 		{
! 			/* Share the bufid that Startup process waits on */
! 			SetStartupBufferPinWaitBufId(buffer - 1);
! 			/* Set alarm and then wait to be signaled by UnpinBuffer() */
! 			ResolveRecoveryConflictWithBufferPin();
! 			SetStartupBufferPinWaitBufId(-1);
! 		}
! 		else
! 			ProcWaitForSignal();
! 
  		PinCountWaitBuf = NULL;
  		/* Loop back and try again */
  	}
  }
  
  /*
+  * Check called from RecoveryConflictInterrupt handler when Startup
+  * process requests cancelation of all pin holders that are blocking it.
+  */
+ bool
+ HoldingBufferPinThatDelaysRecovery(void)
+ {
+ 	int		bufid = GetStartupBufferPinWaitBufId();
+ 
+ 	Assert(bufid >= 0);
+ 
+ 	if (PrivateRefCount[bufid] > 0)
+ 		return true;
+ 
+ 	return false;
+ }
+ 
+ /*
   * ConditionalLockBufferForCleanup - as above, but don't wait to get the lock
   *
   * We won't loop, but just check once to see if the pin count is OK.  If
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 1620,1634 **** GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0,
   * latestCompletedXid since doing so would be a performance issue during
   * normal running, so we check it essentially for free on the standby.
   *
!  * If dbOid is valid we skip backends attached to other databases. Some
!  * callers choose to skipExistingConflicts.
   *
   * Be careful to *not* pfree the result from this function. We reuse
   * this array sufficiently often that we use malloc for the result.
   */
  VirtualTransactionId *
! GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
! 						  bool skipExistingConflicts)
  {
  	static VirtualTransactionId *vxids;
  	ProcArrayStruct *arrayP = procArray;
--- 1620,1632 ----
   * latestCompletedXid since doing so would be a performance issue during
   * normal running, so we check it essentially for free on the standby.
   *
!  * If dbOid is valid we skip backends attached to other databases.
   *
   * Be careful to *not* pfree the result from this function. We reuse
   * this array sufficiently often that we use malloc for the result.
   */
  VirtualTransactionId *
! GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid)
  {
  	static VirtualTransactionId *vxids;
  	ProcArrayStruct *arrayP = procArray;
***************
*** 1667,1675 **** GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid,
  		if (proc->pid == 0)
  			continue;
  
- 		if (skipExistingConflicts && proc->recoveryConflictPending)
- 			continue;
- 
  		if (!OidIsValid(dbOid) ||
  			proc->databaseId == dbOid)
  		{
--- 1665,1670 ----
***************
*** 1826,1832 **** CountDBBackends(Oid databaseid)
   * CancelDBBackends --- cancel backends that are using specified database
   */
  void
! CancelDBBackends(Oid databaseid)
  {
  	ProcArrayStruct *arrayP = procArray;
  	int			index;
--- 1821,1827 ----
   * CancelDBBackends --- cancel backends that are using specified database
   */
  void
! CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending)
  {
  	ProcArrayStruct *arrayP = procArray;
  	int			index;
***************
*** 1839,1851 **** CancelDBBackends(Oid databaseid)
  	{
  		volatile PGPROC *proc = arrayP->procs[index];
  
! 		if (proc->databaseId == databaseid)
  		{
  			VirtualTransactionId procvxid;
  
  			GET_VXID_FROM_PGPROC(procvxid, *proc);
  
! 			proc->recoveryConflictPending = true;
  			pid = proc->pid;
  			if (pid != 0)
  			{
--- 1834,1846 ----
  	{
  		volatile PGPROC *proc = arrayP->procs[index];
  
! 		if (databaseid == InvalidOid || proc->databaseId == databaseid)
  		{
  			VirtualTransactionId procvxid;
  
  			GET_VXID_FROM_PGPROC(procvxid, *proc);
  
! 			proc->recoveryConflictPending = conflictPending;
  			pid = proc->pid;
  			if (pid != 0)
  			{
***************
*** 1853,1860 **** CancelDBBackends(Oid databaseid)
  				 * Kill the pid if it's still here. If not, that's what we wanted
  				 * so ignore any errors.
  				 */
! 				(void) SendProcSignal(pid, PROCSIG_RECOVERY_CONFLICT_DATABASE,
! 										procvxid.backendId);
  			}
  		}
  	}
--- 1848,1854 ----
  				 * Kill the pid if it's still here. If not, that's what we wanted
  				 * so ignore any errors.
  				 */
! 				(void) SendProcSignal(pid, sigmode, procvxid.backendId);
  			}
  		}
  	}
*** a/src/backend/storage/ipc/procsignal.c
--- b/src/backend/storage/ipc/procsignal.c
***************
*** 272,276 **** procsignal_sigusr1_handler(SIGNAL_ARGS)
--- 272,279 ----
  	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT))
  		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
  
+ 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
+ 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+ 
  	errno = save_errno;
  }
*** a/src/backend/storage/ipc/standby.c
--- b/src/backend/storage/ipc/standby.c
***************
*** 126,135 **** WaitExceedsMaxStandbyDelay(void)
  	long	delay_secs;
  	int		delay_usecs;
  
- 	/* max_standby_delay = -1 means wait forever, if necessary */
- 	if (MaxStandbyDelay < 0)
- 		return false;
- 
  	/* Are we past max_standby_delay? */
  	TimestampDifference(GetLatestXLogTime(), GetCurrentTimestamp(),
  						&delay_secs, &delay_usecs);
--- 126,131 ----
***************
*** 241,248 **** ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid)
  	VirtualTransactionId *backends;
  
  	backends = GetConflictingVirtualXIDs(latestRemovedXid,
! 										 InvalidOid,
! 										 true);
  
  	ResolveRecoveryConflictWithVirtualXIDs(backends,
  										   PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
--- 237,243 ----
  	VirtualTransactionId *backends;
  
  	backends = GetConflictingVirtualXIDs(latestRemovedXid,
! 										 InvalidOid);
  
  	ResolveRecoveryConflictWithVirtualXIDs(backends,
  										   PROCSIG_RECOVERY_CONFLICT_SNAPSHOT);
***************
*** 273,280 **** ResolveRecoveryConflictWithTablespace(Oid tsid)
  	 * non-transactional.
  	 */
  	temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
! 												InvalidOid,
! 												false);
  	ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
  										   PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
  }
--- 268,274 ----
  	 * non-transactional.
  	 */
  	temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId,
! 												InvalidOid);
  	ResolveRecoveryConflictWithVirtualXIDs(temp_file_users,
  										   PROCSIG_RECOVERY_CONFLICT_TABLESPACE);
  }
***************
*** 295,301 **** ResolveRecoveryConflictWithDatabase(Oid dbid)
  	 */
  	while (CountDBBackends(dbid) > 0)
  	{
! 		CancelDBBackends(dbid);
  
  		/*
  		 * Wait awhile for them to die so that we avoid flooding an
--- 289,295 ----
  	 */
  	while (CountDBBackends(dbid) > 0)
  	{
! 		CancelDBBackends(dbid, PROCSIG_RECOVERY_CONFLICT_DATABASE, true);
  
  		/*
  		 * Wait awhile for them to die so that we avoid flooding an
***************
*** 331,338 **** ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
  		else
  		{
  			backends = GetConflictingVirtualXIDs(InvalidTransactionId,
! 												 InvalidOid,
! 												 true);
  			report_memory_error = true;
  		}
  
--- 325,331 ----
  		else
  		{
  			backends = GetConflictingVirtualXIDs(InvalidTransactionId,
! 												 InvalidOid);
  			report_memory_error = true;
  		}
  
***************
*** 346,351 **** ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
--- 339,451 ----
  }
  
  /*
+  * ResolveRecoveryConflictWithBufferPin is called from LockBufferForCleanup()
+  * to resolve conflicts with other backends holding buffer pins.
+  *
+  * We either resolve conflicts immediately or set a SIGALRM to wake us at
+  * the limit of our patience. The sleep in LockBufferForCleanup() is
+  * performed here, for code clarity.
+  *
+  * Resolve conflict by sending a SIGUSR1 reason to all backends to check if
+  * they hold one of the buffer pins that is blocking Startup process. If so,
+  * backends will take an appropriate error action, ERROR or FATAL.
+  *
+  * A secondary purpose of this is to avoid deadlocks that might occur between
+  * the Startup process and lock waiters. Deadlocks occur because if queries
+  * wait on a lock, that must be behind an AccessExclusiveLock, which can only
+  * be cleared if the Startup process replays a transaction completion record.
+  * If Startup process is waiting then that is a deadlock. If we allowed a
+  * setting of max_standby_delay that meant "wait forever" we would then need
+  * special code to protect against deadlock. Such deadlocks are rare, so the
+  * code would be almost certainly buggy, so we avoid both long waits and
+  * deadlocks using the same mechanism.
+  */
+ void
+ ResolveRecoveryConflictWithBufferPin(void)
+ {
+ 	bool	sig_alarm_enabled = false;
+ 
+ 	Assert(InHotStandby);
+ 
+ 	/*
+ 	 * Signal immediately or set alarm for later.
+ 	 */
+ 	if (MaxStandbyDelay == 0)
+ 		SendRecoveryConflictWithBufferPin();
+ 	else
+ 	{
+ 		TimestampTz now;
+ 		long	standby_delay_secs;		/* How far Startup process is lagging */
+ 		int		standby_delay_usecs;
+ 
+ 		now = GetCurrentTimestamp();
+ 
+ 		/* Are we past max_standby_delay? */
+ 		TimestampDifference(GetLatestXLogTime(), now,
+ 							&standby_delay_secs, &standby_delay_usecs);
+ 
+ 		if (standby_delay_secs >= (long) MaxStandbyDelay)
+ 			SendRecoveryConflictWithBufferPin();
+ 		else
+ 		{
+ 			TimestampTz fin_time;			/* Expected wake-up time by timer */
+ 			long	timer_delay_secs;		/* Amount of time we set timer for */
+ 			int		timer_delay_usecs = 0;
+ 
+ 			/*
+ 			 * How much longer we should wait?
+ 			 */
+ 			timer_delay_secs = MaxStandbyDelay - standby_delay_secs;
+ 			if (standby_delay_usecs > 0)
+ 			{
+ 				timer_delay_secs -= 1;
+ 				timer_delay_usecs = 1000000 - standby_delay_usecs;
+ 			}
+ 
+ 			/*
+ 			 * It's possible that the difference is less than a microsecond;
+ 			 * ensure we don't cancel, rather than set, the interrupt.
+ 			 */
+ 			if (timer_delay_secs == 0 && timer_delay_usecs == 0)
+ 				timer_delay_usecs = 1;
+ 
+ 			/*
+ 			 * When is the finish time? We recheck this if we are woken early.
+ 			 */
+ 			fin_time = TimestampTzPlusMilliseconds(now,
+ 													(timer_delay_secs * 1000) +
+ 													(timer_delay_usecs / 1000));
+ 
+ 			if (enable_standby_sig_alarm(timer_delay_secs, timer_delay_usecs, fin_time))
+ 				sig_alarm_enabled = true;
+ 			else
+ 				elog(FATAL, "could not set timer for process wakeup");
+ 		}
+ 	}
+ 
+ 	/* Wait to be signaled by UnpinBuffer() */
+ 	ProcWaitForSignal();
+ 
+ 	if (sig_alarm_enabled)
+ 	{
+ 		if (!disable_standby_sig_alarm())
+ 			elog(FATAL, "could not disable timer for process wakeup");
+ 	}
+ }
+ 
+ void
+ SendRecoveryConflictWithBufferPin(void)
+ {
+ 	/*
+ 	 * We send signal to all backends to ask them if they are holding
+ 	 * the buffer pin which is delaying the Startup process. We must
+ 	 * not set the conflict flag yet, since most backends will be innocent.
+ 	 * Let the SIGUSR1 handling in each backend decide their own fate.
+ 	 */
+ 	CancelDBBackends(InvalidOid, PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, false);
+ }
+ 
+ /*
   * -----------------------------------------------------
   * Locking in Recovery Mode
   * -----------------------------------------------------
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***************
*** 815,839 **** LockAcquireExtended(const LOCKTAG *locktag,
  		}
  
  		/*
- 		 * In Hot Standby we abort the lock wait if Startup process is waiting
- 		 * since this would result in a deadlock. The deadlock occurs because
- 		 * if we are waiting it must be behind an AccessExclusiveLock, which
- 		 * can only clear when a transaction completion record is replayed.
- 		 * If Startup process is waiting we never will clear that lock, so to
- 		 * wait for it just causes a deadlock.
- 		 */
- 		if (RecoveryInProgress() && !InRecovery &&
- 			locktag->locktag_type == LOCKTAG_RELATION)
- 		{
- 			LWLockRelease(partitionLock);
- 			ereport(ERROR,
- 					(errcode(ERRCODE_T_R_DEADLOCK_DETECTED),
- 					 errmsg("possible deadlock detected"),
- 					 errdetail("process conflicts with recovery - please resubmit query later"),
- 					 errdetail_log("process conflicts with recovery")));
- 		}
- 
- 		/*
  		 * Set bitmask of locks this process already holds on this object.
  		 */
  		MyProc->heldLocks = proclock->holdMask;
--- 815,820 ----
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 73,78 **** NON_EXEC_STATIC PGPROC *AuxiliaryProcs = NULL;
--- 73,79 ----
  static LOCALLOCK *lockAwaited = NULL;
  
  /* Mark these volatile because they can be changed by signal handler */
+ static volatile bool standby_timeout_active = false;
  static volatile bool statement_timeout_active = false;
  static volatile bool deadlock_timeout_active = false;
  static volatile DeadLockState deadlock_state = DS_NOT_YET_CHECKED;
***************
*** 89,94 **** static void RemoveProcFromArray(int code, Datum arg);
--- 90,96 ----
  static void ProcKill(int code, Datum arg);
  static void AuxiliaryProcKill(int code, Datum arg);
  static bool CheckStatementTimeout(void);
+ static bool CheckStandbyTimeout(void);
  
  
  /*
***************
*** 107,112 **** ProcGlobalShmemSize(void)
--- 109,116 ----
  	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
  	/* ProcStructLock */
  	size = add_size(size, sizeof(slock_t));
+ 	/* startupBufferPinWaitBufId */
+ 	size = add_size(size, sizeof(NBuffers));
  
  	return size;
  }
***************
*** 487,497 **** PublishStartupProcessInformation(void)
--- 491,540 ----
  
  	procglobal->startupProc = MyProc;
  	procglobal->startupProcPid = MyProcPid;
+ 	procglobal->startupBufferPinWaitBufId = 0;
  
  	SpinLockRelease(ProcStructLock);
  }
  
  /*
+  * Used from bufmgr to share the value of the buffer that Startup waits on,
+  * or to reset the value to "not waiting" (-1). This allows processing
+  * of recovery conflicts for buffer pins.
+  */
+ void
+ SetStartupBufferPinWaitBufId(int bufid)
+ {
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile PROC_HDR *procglobal = ProcGlobal;
+ 
+ 	SpinLockAcquire(ProcStructLock);
+ 
+ 	procglobal->startupBufferPinWaitBufId = bufid;
+ 
+ 	SpinLockRelease(ProcStructLock);
+ }
+ 
+ /*
+  * Used by backends when they receive a request to check for buffer pin waits.
+  */
+ int
+ GetStartupBufferPinWaitBufId(void)
+ {
+ 	int bufid;
+ 
+ 	/* use volatile pointer to prevent code rearrangement */
+ 	volatile PROC_HDR *procglobal = ProcGlobal;
+ 
+ 	SpinLockAcquire(ProcStructLock);
+ 
+ 	bufid = procglobal->startupBufferPinWaitBufId;
+ 
+ 	SpinLockRelease(ProcStructLock);
+ 
+ 	return bufid;
+ }
+ 
+ /*
   * Check whether there are at least N free PGPROC objects.
   *
   * Note: this is designed on the assumption that N will generally be small.
***************
*** 1542,1548 **** CheckStatementTimeout(void)
  
  
  /*
!  * Signal handler for SIGALRM
   *
   * Process deadlock check and/or statement timeout check, as needed.
   * To avoid various edge cases, we must be careful to do nothing
--- 1585,1591 ----
  
  
  /*
!  * Signal handler for SIGALRM for normal user backends
   *
   * Process deadlock check and/or statement timeout check, as needed.
   * To avoid various edge cases, we must be careful to do nothing
***************
*** 1565,1567 **** handle_sig_alarm(SIGNAL_ARGS)
--- 1608,1714 ----
  
  	errno = save_errno;
  }
+ 
+ /*
+  * Signal handler for SIGALRM in Startup process
+  *
+  * To avoid various edge cases, we must be careful to do nothing
+  * when there is nothing to be done.  We also need to be able to
+  * reschedule the timer interrupt if called before end of statement.
+  */
+ bool
+ enable_standby_sig_alarm(long delay_s, int delay_us, TimestampTz fin_time)
+ {
+ 	struct itimerval timeval;
+ 
+ 	Assert(delay_s >= 0 && delay_us >= 0);
+ 
+ 	statement_fin_time = fin_time;
+ 
+ 	standby_timeout_active = true;
+ 
+ 	MemSet(&timeval, 0, sizeof(struct itimerval));
+ 	timeval.it_value.tv_sec = delay_s;
+ 	timeval.it_value.tv_usec = delay_us;
+ 	if (setitimer(ITIMER_REAL, &timeval, NULL))
+ 		return false;
+ 	return true;
+ }
+ 
+ bool
+ disable_standby_sig_alarm(void)
+ {
+ 	/*
+ 	 * Always disable the interrupt if it is active; this avoids being
+ 	 * interrupted by the signal handler and thereby possibly getting
+ 	 * confused.
+ 	 *
+ 	 * We will re-enable the interrupt if necessary in CheckStandbyTimeout.
+ 	 */
+ 	if (standby_timeout_active)
+ 	{
+ 		struct itimerval timeval;
+ 
+ 		MemSet(&timeval, 0, sizeof(struct itimerval));
+ 		if (setitimer(ITIMER_REAL, &timeval, NULL))
+ 		{
+ 			standby_timeout_active = false;
+ 			return false;
+ 		}
+ 	}
+ 
+ 	return true;
+ }
+ 
+ /*
+  * CheckStandbyTimeout() runs unconditionally in the Startup process
+  * SIGALRM handler. Timers will only be set when InHotStandby.
+  * We simply ignore any signals unless the timer has been set.
+  */
+ static bool
+ CheckStandbyTimeout(void)
+ {
+ 	TimestampTz now;
+ 
+ 	if (!standby_timeout_active)
+ 		return true;			/* do nothing if not active */
+ 
+ 	now = GetCurrentTimestamp();
+ 
+ 	if (now >= statement_fin_time)
+ 		SendRecoveryConflictWithBufferPin();
+ 	else
+ 	{
+ 		/* Not time yet, so (re)schedule the interrupt */
+ 		long		secs;
+ 		int			usecs;
+ 		struct itimerval timeval;
+ 
+ 		TimestampDifference(now, statement_fin_time,
+ 							&secs, &usecs);
+ 
+ 		/*
+ 		 * It's possible that the difference is less than a microsecond;
+ 		 * ensure we don't cancel, rather than set, the interrupt.
+ 		 */
+ 		if (secs == 0 && usecs == 0)
+ 			usecs = 1;
+ 		MemSet(&timeval, 0, sizeof(struct itimerval));
+ 		timeval.it_value.tv_sec = secs;
+ 		timeval.it_value.tv_usec = usecs;
+ 		if (setitimer(ITIMER_REAL, &timeval, NULL))
+ 			return false;
+ 	}
+ 
+ 	return true;
+ }
+ 
+ void
+ handle_standby_sig_alarm(SIGNAL_ARGS)
+ {
+ 	int save_errno = errno;
+ 
+ 	(void) CheckStandbyTimeout();
+ 
+ 	errno = save_errno;
+ }
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 2718,2723 **** RecoveryConflictInterrupt(ProcSignalReason reason)
--- 2718,2735 ----
  	{
  		switch (reason)
  		{
+ 			case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+ 					/*
+ 					 * If we aren't blocking the Startup process there is
+ 					 * nothing more to do.
+ 					 */
+ 					if (!HoldingBufferPinThatDelaysRecovery())
+ 						return;
+ 
+ 					MyProc->recoveryConflictPending = true;
+ 
+ 					/* Intentional drop through to error handling */
+ 
  			case PROCSIG_RECOVERY_CONFLICT_LOCK:
  			case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
  			case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 1383,1389 **** static struct config_int ConfigureNamesInt[] =
  			NULL
  		},
  		&MaxStandbyDelay,
! 		30, -1, INT_MAX, NULL, NULL
  	},
  
  	{
--- 1383,1389 ----
  			NULL
  		},
  		&MaxStandbyDelay,
! 		30, 0, INT_MAX, NULL, NULL
  	},
  
  	{
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 198,203 **** extern void LockBuffer(Buffer buffer, int mode);
--- 198,204 ----
  extern bool ConditionalLockBuffer(Buffer buffer);
  extern void LockBufferForCleanup(Buffer buffer);
  extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+ extern bool HoldingBufferPinThatDelaysRecovery(void);
  
  extern void AbortBufferIO(void);
  
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 16,22 ****
  
  #include "storage/lock.h"
  #include "storage/pg_sema.h"
! 
  
  /*
   * Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
--- 16,22 ----
  
  #include "storage/lock.h"
  #include "storage/pg_sema.h"
! #include "utils/timestamp.h"
  
  /*
   * Each backend advertises up to PGPROC_MAX_CACHED_SUBXIDS TransactionIds
***************
*** 145,150 **** typedef struct PROC_HDR
--- 145,152 ----
  	/* The proc of the Startup process, since not in ProcArray */
  	PGPROC	   *startupProc;
  	int			startupProcPid;
+ 	/* Buffer id of the buffer that Startup process waits for pin on */
+ 	int			startupBufferPinWaitBufId;
  } PROC_HDR;
  
  /*
***************
*** 177,182 **** extern void InitProcessPhase2(void);
--- 179,186 ----
  extern void InitAuxiliaryProcess(void);
  
  extern void PublishStartupProcessInformation(void);
+ extern void SetStartupBufferPinWaitBufId(int bufid);
+ extern int GetStartupBufferPinWaitBufId(void);
  
  extern bool HaveNFreeProcs(int n);
  extern void ProcReleaseLocks(bool isCommit);
***************
*** 194,197 **** extern bool enable_sig_alarm(int delayms, bool is_statement_timeout);
--- 198,205 ----
  extern bool disable_sig_alarm(bool is_statement_timeout);
  extern void handle_sig_alarm(SIGNAL_ARGS);
  
+ extern bool enable_standby_sig_alarm(long delay_s, int delay_us, TimestampTz fin_time);
+ extern bool disable_standby_sig_alarm(void);
+ extern void handle_standby_sig_alarm(SIGNAL_ARGS);
+ 
  #endif   /* PROC_H */
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 57,69 **** extern bool IsBackendPid(int pid);
  extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
  					  bool excludeXmin0, bool allDbs, int excludeVacuum,
  					  int *nvxids);
! extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin,
! 					Oid dbOid, bool skipExistingConflicts);
  extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
  
  extern int	CountActiveBackends(void);
  extern int	CountDBBackends(Oid databaseid);
! extern void	CancelDBBackends(Oid databaseid);
  extern int	CountUserBackends(Oid roleid);
  extern bool CountOtherDBBackends(Oid databaseId,
  					 int *nbackends, int *nprepared);
--- 57,68 ----
  extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin,
  					  bool excludeXmin0, bool allDbs, int excludeVacuum,
  					  int *nvxids);
! extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid);
  extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode);
  
  extern int	CountActiveBackends(void);
  extern int	CountDBBackends(Oid databaseid);
! extern void CancelDBBackends(Oid databaseid, ProcSignalReason sigmode, bool conflictPending);
  extern int	CountUserBackends(Oid roleid);
  extern bool CountOtherDBBackends(Oid databaseId,
  					 int *nbackends, int *nprepared);
*** a/src/include/storage/procsignal.h
--- b/src/include/storage/procsignal.h
***************
*** 37,42 **** typedef enum
--- 37,43 ----
  	PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
  	PROCSIG_RECOVERY_CONFLICT_LOCK,
  	PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
+ 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
  
  	NUM_PROCSIGNALS				/* Must be last! */
  } ProcSignalReason;
*** a/src/include/storage/standby.h
--- b/src/include/storage/standby.h
***************
*** 19,30 ****
  
  extern int	vacuum_defer_cleanup_age;
  
  extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid);
  extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
  extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
  
! extern void InitRecoveryTransactionEnvironment(void);
! extern void ShutdownRecoveryTransactionEnvironment(void);
  
  /*
   * Standby Rmgr (RM_STANDBY_ID)
--- 19,33 ----
  
  extern int	vacuum_defer_cleanup_age;
  
+ extern void InitRecoveryTransactionEnvironment(void);
+ extern void ShutdownRecoveryTransactionEnvironment(void);
+ 
  extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid);
  extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
  extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
  
! extern void ResolveRecoveryConflictWithBufferPin(void);
! extern void SendRecoveryConflictWithBufferPin(void);
  
  /*
   * Standby Rmgr (RM_STANDBY_ID)
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)