Re: [HACKERS] Race condition in GetOldestActiveTransactionId()
On 08/22/2016 01:46 PM, Heikki Linnakangas wrote: While hacking on the CSN patch, I spotted a race condition between GetOldestActiveTransactionId() and GetNewTransactionId(). GetOldestActiveTransactionId() calculates the oldest XID that's still running, by doing: 1. Read nextXid, without a lock. This is the upper bound, if there are no active XIDs in the proc array. 2. Loop through the proc array, making note of the oldest XID. While GetNewTransactionId() does: 1. Read and Increment nextXid 2. Set MyPgXact->xid. It seems possible that if you call GetNewTransactionId() concurrently with GetOldestActiveTransactionId(), GetOldestActiveTransactionId() sees the new nextXid value that the concurrent GetNewTransactionId() set, but doesn't see the old XID in the proc array. It will return a value that doesn't cover the old XID, i.e. it won't consider the just-assigned XID as in-progress. Am I missing something? Commit c6d76d7c added a comment to GetOldestActiveTransactionId() explaining why it's not necessary to acquire XidGenLock there, but I think it missed the above race condition. GetOldestActiveTransactionId() is not performance-critical, it's only called when performing a checkpoint, so I think we should just bite the bullet and grab the lock. Per attached patch. I did some testing, and was able to indeed construct a case where oldestActiveXid was off-by-one in an online checkpoint record. However, looking at how it's used, I think it happened to not have any user-visible effect. The oldestActiveXid value of an online checkpoint is only used to determine the point where pg_subtrans is initialized, and the XID missed in the computation could not be a subtransaction. Nevertheless, it's clearly a bug and the fix is simple, so I committed a fix. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Race condition in GetOldestActiveTransactionId()
While hacking on the CSN patch, I spotted a race condition between GetOldestActiveTransactionId() and GetNewTransactionId(). GetOldestActiveTransactionId() calculates the oldest XID that's still running, by doing: 1. Read nextXid, without a lock. This is the upper bound, if there are no active XIDs in the proc array. 2. Loop through the proc array, making note of the oldest XID. While GetNewTransactionId() does: 1. Read and Increment nextXid 2. Set MyPgXact->xid. It seems possible that if you call GetNewTransactionId() concurrently with GetOldestActiveTransactionId(), GetOldestActiveTransactionId() sees the new nextXid value that the concurrent GetNewTransactionId() set, but doesn't see the old XID in the proc array. It will return a value that doesn't cover the old XID, i.e. it won't consider the just-assigned XID as in-progress. Am I missing something? Commit c6d76d7c added a comment to GetOldestActiveTransactionId() explaining why it's not necessary to acquire XidGenLock there, but I think it missed the above race condition. GetOldestActiveTransactionId() is not performance-critical, it's only called when performing a checkpoint, so I think we should just bite the bullet and grab the lock. Per attached patch. - Heikki diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index e5d487d..de45c16 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -2097,13 +2097,13 @@ GetOldestActiveTransactionId(void) LWLockAcquire(ProcArrayLock, LW_SHARED); /* - * It's okay to read nextXid without acquiring XidGenLock because (1) we - * assume TransactionIds can be read atomically and (2) we don't care if - * we get a slightly stale value. It can't be very stale anyway, because - * the LWLockAcquire above will have done any necessary memory - * interlocking. + * Acquire XidGenLock while we read nextXid, to make sure that all + * XIDs < nextXid are already present in the proc array (or have + * already completed), when we spin over it. */ + LWLockAcquire(XidGenLock, LW_SHARED); oldestRunningXid = ShmemVariableCache->nextXid; + LWLockRelease(XidGenLock); /* * Spin over procArray collecting all xids and subxids. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers