I spent a lot of time looking at this issue. Your problem scenario can be reduced to the following test case:
CREATE TABLE t (a INT); INSERT INTO t VALUES (1); The three sessions need to execute the following commands, in this sequence: -- Session 1 /* 1 */ BEGIN; SELECT * FROM t FOR UPDATE; -- Session 2 /* 2 */ BEGIN; UPDATE t SET a = 2; -- blocks waiting on S1 -- Session 3 /* 3 */ BEGIN; UPDATE t SET a = 3; -- blocks -- this is waiting on S1, but since S2 is also waiting on S1 and will acquire a -- conflicting lock first, this session will only be released after S2 releases -- its own lock. Session 1 /* 4 */ UPDATE t SET a = 4; COMMIT; -- releases session 2 BEGIN; SELECT * FROM t FOR UPDATE; -- blocks waiting on S2 (*) Session 2: /* 5 */ COMMIT; -- releases session 1 BEGIN; UPDATE t SET a = 5; -- blocks waiting on S3 Session 1: /* 6 */ UPDATE t SET a = 6; -- blocks waiting on S3 At this point, all sessions are blocked on their UPDATE commands, and deadlock is reported. (For debugging, it's useful to set a very large deadlock_timeout and examine pg_locks, etc.) The point at which things started to go wrong was where the session 1's SELECT FOR UPDATE with the (*) was allowed to continue after session 2 commits. In 9.2 that session remains blocked, and instead we release session 3. The problem is the handling of following of the update chain during a lock command; the question is should sessions be blocked waiting for locks on future versions of the row? They currently don't block, which is what leads the SELECT FOR UPDATE to continue; but if we make them block, other things break. I initially toyed with the following patch, which removes the deadlock the above report and also your original test case: --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -4832,15 +4832,12 @@ l4: xmax = HeapTupleHeaderGetRawXmax(mytup.t_data); /* - * If this tuple is updated and the key has been modified (or - * deleted), what we do depends on the status of the updating - * transaction: if it's live, we sleep until it finishes; if it has - * committed, we have to fail (i.e. return HeapTupleUpdated); if it - * aborted, we ignore it. For updates that didn't touch the key, we - * can just plough ahead. + * If this tuple is updated, what we do depends on the status of the + * updating transaction: if it's live, we sleep until it finishes; if + * it has committed, we have to fail (i.e. return HeapTupleUpdated); if + * it aborted, we ignore it. */ - if (!(old_infomask & HEAP_XMAX_INVALID) && - (mytup.t_data->t_infomask2 & HEAP_KEYS_UPDATED)) + if (!(old_infomask & HEAP_XMAX_INVALID)) { TransactionId update_xid; However, the differences in behavior this causes (visible by running the isolation tests) don't look good to me. It's quite possible that there are other bugs elsewhere. Sadly, I have to look at other things now and I won't have time to research this further until after next week's minor releases; so whatever we decide to do, if anything, will not be in 9.3.1. Thanks for the report. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs