On 24.11.2010 12:48, Heikki Linnakangas wrote:
When recovery starts, we fetch the oldestActiveXid from the checkpoint
record. Let's say that it's 100. We then start replaying WAL records
from the Redo pointer, and the first record (heap insert in your case)
contains an Xid that's much larger than 100, say 10000. We call
RecordKnownAssignedTransactionIds() to make note that all xids in that
range are in-progress, but there isn't enough room in the array for that.

We normally get away with a smallish array because the array is trimmed
at commit and abort records, and a special xid-assignment record
handles the case of a lot of subtransactions. We initialize the array
from the running-xacts record that's written at a checkpoint. That
mechanism fails in this case because the heap insert record is seen
before the running-xacts record, causing all those xids in the range
100-10000 to be considered in-progress. The running-xacts record that
comes later would prune them, but we don't have enough slots to hold
them until then.

Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash
the xids we see during WAL replay in private memory instead of putting
them in the KnownAssignedXids array until we see the running-xacts record.

So, here's a patch using that approach.
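
To make the idea easier to follow before reading the diff, here is a toy model of it. This is only a sketch under simplifying assumptions, not the patch itself: xids are plain integers with no wraparound, there is no clog lookup to filter out completed transactions, and all names (record_xid, apply_running_xacts, known_add, MAX_KNOWN) are hypothetical stand-ins rather than PostgreSQL functions. Before the running-xacts record is seen, only the highest xid is remembered; once the record arrives, the array is seeded from it and the xids seen in the meantime are backfilled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;           /* toy xid: plain integer, no wraparound */
#define INVALID_XID 0
#define MAX_KNOWN   64          /* toy capacity of the known-xids array */

static Xid  known[MAX_KNOWN];
static int  n_known = 0;
static Xid  latest_observed = INVALID_XID;
static bool snapshot_applied = false;

/* Append one xid; returns false if the array is full. */
static bool known_add(Xid xid)
{
    if (n_known >= MAX_KNOWN)
        return false;
    known[n_known++] = xid;
    return true;
}

/* Called for every replayed WAL record that carries an xid. */
bool record_xid(Xid xid)
{
    if (!snapshot_applied)
    {
        /* Machinery not up yet: only remember the highest xid seen. */
        if (latest_observed == INVALID_XID || xid > latest_observed)
            latest_observed = xid;
        return true;
    }

    /* Treat every xid between latest_observed and xid as running. */
    while (latest_observed < xid)
    {
        if (!known_add(latest_observed + 1))
            return false;
        latest_observed++;
    }
    return true;
}

/* Called when the running-xacts record is replayed (xids sorted). */
bool apply_running_xacts(const Xid *xids, int n, Xid next_xid)
{
    assert(n_known == 0);       /* nothing was added before this point */

    for (int i = 0; i < n; i++)
        if (!known_add(xids[i]))
            return false;

    if (latest_observed == INVALID_XID || latest_observed < next_xid - 1)
    {
        /* Nothing newer than the snapshot was seen (a choice of this sketch). */
        latest_observed = next_xid - 1;
    }
    else
    {
        /* Backfill xids that started after the snapshot was taken. */
        for (Xid x = next_xid; x <= latest_observed; x++)
            if (!known_add(x))
                return false;
    }
    snapshot_applied = true;
    return true;
}
```

In this model, seeing xid 10000 before the running-xacts record costs no array slots at all; only the xids from nextXid up to 10000 are added once the record is applied.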

Another approach would be to revisit the way the running-xacts snapshot is taken. Currently, we first take a snapshot, and then WAL-log it. There is a small window between the steps where backends can begin/end transactions, and recovery has to deal with that. When this was designed, there was a long discussion on whether we should instead grab WALInsertLock and ProcArrayLock at the same time, to ensure that the running-xacts snapshot represents an up-to-date situation at the point in WAL where it's inserted.

We didn't want to do that because both locks can be heavily contended. But maybe we should after all. It would make the recovery code simpler.

If we want to get fancy, we wouldn't necessarily need to hold both locks for the whole duration. We could first grab ProcArrayLock and construct the snapshot. Then grab WALInsertLock and release ProcArrayLock, and finally write the WAL record and release WALInsertLock. But that would require small changes to XLogInsert.
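
The lock hand-over described above could look roughly like the following. This is a sketch only: it uses pthread mutexes as stand-ins for the two LWLocks, a fixed array in place of the proc array, and a buffer in place of the WAL. None of these names (proc_array_lock, wal_insert_lock, log_running_xacts_handoff) exist in PostgreSQL, and the real change would need the XLogInsert adjustments mentioned above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t Xid;
#define MAX_RUNNING 8

/* Hypothetical stand-ins for ProcArrayLock and WALInsertLock. */
static pthread_mutex_t proc_array_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wal_insert_lock = PTHREAD_MUTEX_INITIALIZER;

/* Toy "proc array": the xids currently running. */
static Xid running_xids[MAX_RUNNING] = {100, 101, 105};
static int n_running = 3;

/* Toy "WAL": holds the last running-xacts record written. */
static Xid wal_buf[MAX_RUNNING];
static int wal_len = 0;

/* Returns the number of xids written into the toy WAL record. */
int log_running_xacts_handoff(void)
{
    Xid snapshot[MAX_RUNNING];
    int n;

    /* Step 1: build the snapshot under the proc array lock. */
    pthread_mutex_lock(&proc_array_lock);
    n = n_running;
    memcpy(snapshot, running_xids, sizeof(Xid) * (size_t) n);

    /*
     * Step 2: take the WAL insert lock before letting go of the proc
     * array lock, so there is no instant at which neither is held.
     */
    pthread_mutex_lock(&wal_insert_lock);
    pthread_mutex_unlock(&proc_array_lock);

    /* Step 3: write the record, then release the WAL insert lock. */
    memcpy(wal_buf, snapshot, sizeof(Xid) * (size_t) n);
    wal_len = n;
    pthread_mutex_unlock(&wal_insert_lock);

    return n;
}
```

The point of the ordering is that ProcArrayLock is only held while the snapshot is built, not while the record is written, which should keep the extra contention on it short.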

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 6095,6102 **** StartupXLOG(void)
  			StartupSUBTRANS(oldestActiveXID);
  			StartupMultiXact();
  
- 			ProcArrayInitRecoveryInfo(oldestActiveXID);
- 
  			/*
  			 * If we're beginning at a shutdown checkpoint, we know that
  			 * nothing was running on the master at this point. So fake-up an
--- 6095,6100 ----
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 435,453 **** ProcArrayClearTransaction(PGPROC *proc)
  }
  
  /*
-  * ProcArrayInitRecoveryInfo
-  *
-  * When trying to assemble our snapshot we only care about xids after this value.
-  * See comments for LogStandbySnapshot().
-  */
- void
- ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid)
- {
- 	latestObservedXid = oldestActiveXid;
- 	TransactionIdRetreat(latestObservedXid);
- }
- 
- /*
   * ProcArrayApplyRecoveryInfo -- apply recovery info about xids
   *
   * Takes us through 3 states: Initialized, Pending and Ready.
--- 435,440 ----
***************
*** 519,533 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	Assert(standbyState == STANDBY_INITIALIZED);
  
  	/*
! 	 * OK, we need to initialise from the RunningTransactionsData record
! 	 */
! 
! 	/*
! 	 * Remove all xids except xids later than the snapshot. We don't know
! 	 * exactly which ones that is until precisely now, so that is why we allow
! 	 * xids to be added only to remove most of them again here.
  	 */
- 	ExpireOldKnownAssignedTransactionIds(running->nextXid);
  	StandbyReleaseOldLocks(running->nextXid);
  
  	/*
--- 506,514 ----
  	Assert(standbyState == STANDBY_INITIALIZED);
  
  	/*
! 	 * Release any locks belonging to old transactions that are not
! 	 * running according to the running-xacts record.
  	 */
  	StandbyReleaseOldLocks(running->nextXid);
  
  	/*
***************
*** 536,544 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
  
  	/*
- 	 * Combine the running xact data with already known xids, if any exist.
  	 * KnownAssignedXids is sorted so we cannot just add new xids, we have to
! 	 * combine them first, sort them and then re-add to KnownAssignedXids.
  	 *
  	 * Some of the new xids are top-level xids and some are subtransactions.
  	 * We don't call SubtransSetParent because it doesn't matter yet. If we
--- 517,524 ----
  	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
  
  	/*
  	 * KnownAssignedXids is sorted so we cannot just add new xids, we have to
! 	 * sort them first.
  	 *
  	 * Some of the new xids are top-level xids and some are subtransactions.
  	 * We don't call SubtransSetParent because it doesn't matter yet. If we
***************
*** 547,597 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	 * xids to subtrans. If RunningXacts is overflowed then we don't have
  	 * enough information to correctly update subtrans anyway.
  	 */
  
  	/*
! 	 * Allocate a temporary array so we can combine xids. The total of both
! 	 * arrays should never normally exceed TOTAL_MAX_CACHED_SUBXIDS.
! 	 */
! 	xids = palloc(sizeof(TransactionId) * TOTAL_MAX_CACHED_SUBXIDS);
! 
! 	/*
! 	 * Get the remaining KnownAssignedXids. In most cases there won't be any
! 	 * at all since this exists only to catch a theoretical race condition.
  	 */
! 	nxids = KnownAssignedXidsGet(xids, InvalidTransactionId);
! 	if (nxids > 0)
! 		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
! 
! 	/*
! 	 * Now we have a copy of any KnownAssignedXids we can zero the array
! 	 * before we re-insert combined snapshot.
! 	 */
! 	KnownAssignedXidsRemovePreceding(InvalidTransactionId);
  
  	/*
! 	 * Add to the temp array any xids which have not already completed, taking
! 	 * care not to overflow in extreme cases.
  	 */
  	for (i = 0; i < running->xcnt; i++)
  	{
  		TransactionId xid = running->xids[i];
  
  		/*
  		 * The running-xacts snapshot can contain xids that were running at
! 		 * the time of the snapshot, yet complete before the snapshot was
! 		 * written to WAL. They're running now, so ignore them.
  		 */
  		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
  			continue;
  
  		xids[nxids++] = xid;
- 
- 		/*
- 		 * Test for overflow only after we have filtered out already complete
- 		 * transactions.
- 		 */
- 		if (nxids > TOTAL_MAX_CACHED_SUBXIDS)
- 			elog(ERROR, "too many xids to add into KnownAssignedXids");
  	}
  
  	if (nxids > 0)
--- 527,557 ----
  	 * xids to subtrans. If RunningXacts is overflowed then we don't have
  	 * enough information to correctly update subtrans anyway.
  	 */
+ 	Assert(procArray->numKnownAssignedXids == 0);
  
  	/*
! 	 * Allocate a temporary array to avoid modifying the array passed as
! 	 * argument.
  	 */
! 	xids = palloc(sizeof(TransactionId) * running->xcnt);
  
  	/*
! 	 * Add to the temp array any xids which have not already completed.
  	 */
+ 	nxids = 0;
  	for (i = 0; i < running->xcnt; i++)
  	{
  		TransactionId xid = running->xids[i];
  
  		/*
  		 * The running-xacts snapshot can contain xids that were running at
! 		 * the time of the snapshot, yet completed before the snapshot was
! 		 * written to WAL. They're not running anymore, so ignore them.
  		 */
  		if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
  			continue;
  
  		xids[nxids++] = xid;
  	}
  
  	if (nxids > 0)
***************
*** 603,621 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  		qsort(xids, nxids, sizeof(TransactionId), xidComparator);
  
  		/*
- 		 * Re-initialise latestObservedXid to the highest xid we've seen.
- 		 */
- 		latestObservedXid = xids[nxids - 1];
- 
- 		/*
  		 * Add the sorted snapshot into KnownAssignedXids
  		 */
  		for (i = 0; i < nxids; i++)
! 		{
! 			TransactionId xid = xids[i];
! 
! 			KnownAssignedXidsAdd(xid, xid, true);
! 		}
  
  		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
  	}
--- 563,572 ----
  		qsort(xids, nxids, sizeof(TransactionId), xidComparator);
  
  		/*
  		 * Add the sorted snapshot into KnownAssignedXids
  		 */
  		for (i = 0; i < nxids; i++)
! 			KnownAssignedXidsAdd(xids[i], xids[i], true);
  
  		KnownAssignedXidsDisplay(trace_recovery(DEBUG3));
  	}
***************
*** 623,630 **** ProcArrayApplyRecoveryInfo(RunningTransactions running)
  	pfree(xids);
  
  	/*
! 	 * Now we've got the running xids we need to set the global values thare
! 	 * used to track snapshots as they evolve further
  	 *
  	 * * latestCompletedXid which will be the xmax for snapshots *
  	 * lastOverflowedXid which shows whether snapshots overflow * nextXid
--- 574,613 ----
  	pfree(xids);
  
  	/*
! 	 * Ok, KnownAssignedXids has now been initialized with the xids from
! 	 * the snapshot. But we still have to add any XIDs we saw before the
! 	 * running-xacts record that belong to transactions that started
! 	 * between taking the snapshot and writing it to WAL.
! 	 * RecordKnownAssignedTransactionIds() has been updating
! 	 * latestObservedXid to keep track of those.
! 	 *
! 	 * XXX: This can lead to overflowing the KnownAssignedXids array, if
! 	 * a large number of new subtransactions started in that window. That
! 	 * should be rare in practice; we try to keep the window small, so that
! 	 * there isn't time to create many subtransactions. We could avoid the
! 	 * problem by keeping track of xact-assignment records that we see
! 	 * before the running-xact record, but it doesn't seem worth it.
! 	 */
! 	if (!TransactionIdIsValid(latestObservedXid))
! 	{
! 		latestObservedXid = running->nextXid;
! 		TransactionIdRetreat(latestObservedXid);
! 	}
! 	else
! 	{
! 		TransactionId xid = running->nextXid;
! 
! 		while (TransactionIdPrecedesOrEquals(xid, latestObservedXid))
! 		{
! 			if (!TransactionIdDidCommit(xid) && !TransactionIdDidAbort(xid))
! 				KnownAssignedXidsAdd(xid, xid, true);
! 			TransactionIdAdvance(xid);
! 		}
! 	}
! 
! 	/*
! 	 * Now we've got the running xids we need to set the global values that
! 	 * are used to track snapshots as they evolve further.
  	 *
  	 * * latestCompletedXid which will be the xmax for snapshots *
  	 * lastOverflowedXid which shows whether snapshots overflow * nextXid
***************
*** 2337,2346 **** DisplayXidCache(void)
   *		unobserved XIDs.
   *
   * RecordKnownAssignedTransactionIds() should be run for *every* WAL record
!  * type apart from XLOG_RUNNING_XACTS (since that initialises the first
!  * snapshot so that RecordKnownAssignedTransactionIds() can be called). Must
!  * be called for each record after we have executed StartupCLOG() et al,
!  * since we must ExtendCLOG() etc..
   *
   * Called during recovery in analogy with and in place of GetNewTransactionId()
   */
--- 2320,2327 ----
   *		unobserved XIDs.
   *
   * RecordKnownAssignedTransactionIds() should be run for *every* WAL record
!  * associated with a transaction. Must be called for each record after we
!  * have executed StartupCLOG() et al, since we must ExtendCLOG() etc..
   *
   * Called during recovery in analogy with and in place of GetNewTransactionId()
   */
***************
*** 2348,2360 **** void
  RecordKnownAssignedTransactionIds(TransactionId xid)
  {
  	Assert(standbyState >= STANDBY_INITIALIZED);
- 	Assert(TransactionIdIsValid(latestObservedXid));
  	Assert(TransactionIdIsValid(xid));
  
  	elog(trace_recovery(DEBUG4), "record known xact %u latestObservedXid %u",
  		 xid, latestObservedXid);
  
  	/*
  	 * When a newly observed xid arrives, it is frequently the case that it is
  	 * *not* the next xid in sequence. When this occurs, we must treat the
  	 * intervening xids as running also.
--- 2329,2354 ----
  RecordKnownAssignedTransactionIds(TransactionId xid)
  {
  	Assert(standbyState >= STANDBY_INITIALIZED);
  	Assert(TransactionIdIsValid(xid));
  
  	elog(trace_recovery(DEBUG4), "record known xact %u latestObservedXid %u",
  		 xid, latestObservedXid);
  
  	/*
+ 	 * If the KnownAssignedXids machinery isn't up yet, just update
+ 	 * latestObservedXid.
+ 	 */
+ 	if (standbyState == STANDBY_INITIALIZED)
+ 	{
+ 		if (!TransactionIdIsValid(latestObservedXid) ||
+ 			TransactionIdFollows(xid, latestObservedXid))
+ 			latestObservedXid = xid;
+ 		return;
+ 	}
+ 
+ 	Assert(TransactionIdIsValid(latestObservedXid));
+ 
+ 	/*
  	 * When a newly observed xid arrives, it is frequently the case that it is
  	 * *not* the next xid in sequence. When this occurs, we must treat the
  	 * intervening xids as running also.
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 28,34 **** extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid);
  extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid);
  extern void ProcArrayClearTransaction(PGPROC *proc);
  
- extern void ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid);
  extern void ProcArrayApplyRecoveryInfo(RunningTransactions running);
  extern void ProcArrayApplyXidAssignment(TransactionId topxid,
  							int nsubxids, TransactionId *subxids);
--- 28,33 ----