Using slony1-1.1.5-RC2, in testing failover, we are getting deadlock errors like the following from the db log:
 
[ copy table statements ]
2006-03-07 15:06:28 EST-[29406] NOTICE:  truncate of "monkeys"."session" succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT:  select "_tzreplic".prepareTableForCopy(151); copy "monkeys"."session" ("id","accessadress","accessid","entryaddress","entryid","properties","starttime","updatetime") from stdin;
2006-03-07 15:06:28 EST-[29406] NOTICE:  truncate of "monkeys"."vc" succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT:  select "_tzreplic".prepareTableForCopy(152); copy "monkeys"."vc" ("id","used","capacity") from stdin;
2006-03-07 15:06:28 EST-[29406] NOTICE:  truncate of "monkeys"."vp" succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT:  select "_tzreplic".prepareTableForCopy(153); copy "monkeys"."vp" ("id","used","capacity") from stdin;
2006-03-07 15:06:31 EST-[29438] NOTICE:  failedNode: set 1 has no other direct receivers - move now
2006-03-07 15:06:31 EST-[29438] STATEMENT:  select "_tzreplic".failedNode(2, 3);
2006-03-07 15:06:37 EST-[29438] NOTICE:  failedNode: set 118 has no other direct receivers - move now
2006-03-07 15:06:37 EST-[29438] STATEMENT:  select "_tzreplic".failedNode(2, 3);
2006-03-07 15:06:38 EST-[29438] ERROR:  deadlock detected
2006-03-07 15:06:38 EST-[29438] DETAIL:  Process 29438 waits for ShareLock on transaction 3856; blocked by process 29406.
 Process 29406 waits for ShareLock on transaction 3851; blocked by process 29438.
2006-03-07 15:06:38 EST-[29438] CONTEXT:  SQL statement "delete from "_tzreplic".sl_setsync where ssy_setid =  $1 "
 PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-07 15:06:38 EST-[29438] STATEMENT:  select "_tzreplic".failedNode(2, 3);
 
So, we establish replication for 2 sets, then we issue a slonik FAILOVER (pid 29438). The deadlock seems to be between the slonik connection and the remote_worker connection in the slon. Another example of the same type of problem:
 
2006-03-07 14:26:39 EST-[2629] NOTICE:  failedNode: set 74 has no other direct receivers - move now
2006-03-07 14:26:39 EST-[2629] STATEMENT:  select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:42 EST-[2629] NOTICE:  failedNode: set 111 has no other direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT:  select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:42 EST-[2629] NOTICE:  failedNode: set 123 has no other direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT:  select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:43 EST-[2629] NOTICE:  failedNode: set 135 has no other direct receivers - move now
2006-03-07 14:26:43 EST-[2629] STATEMENT:  select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:43 EST-[2635] NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=974
2006-03-07 14:26:43 EST-[2635] STATEMENT:  select "_tzreplic".cleanupNodelock(); insert into "_tzreplic".sl_nodelock values (    2, 0, "pg_catalog".pg_backend_pid());
2006-03-07 14:26:44 EST-[2629] NOTICE:  failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:44 EST-[2629] STATEMENT:  select "_tzreplic".failedNode2(1,2,74,'650','651');
2006-03-07 14:26:46 EST-[2629] NOTICE:  failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT:  select "_tzreplic".failedNode2(1,2,111,'650','652');
2006-03-07 14:26:46 EST-[2629] NOTICE:  failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT:  select "_tzreplic".failedNode2(1,2,123,'650','653');
2006-03-07 14:26:46 EST-[2629] NOTICE:  failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT:  select "_tzreplic".failedNode2(1,2,135,'650','654');
2006-03-07 14:26:48 EST-[2636] ERROR:  deadlock detected
2006-03-07 14:26:48 EST-[2636] DETAIL:  Process 2636 waits for ShareLock on transaction 613918; blocked by process 2629.
 Process 2629 waits for ExclusiveLock on relation 20809 of database 16384; blocked by process 2636.
2006-03-07 14:26:48 EST-[2636] STATEMENT:  notify "_tzreplic_Event"; notify "_tzreplic_Confirm"; insert into "_tzreplic".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_minxid, ev_maxxid, ev_xip, ev_type     ) values ('1', '651', '2006-03-07 14:26:33.770196', '867522', '867523', '', 'SYNC'); insert into "_tzreplic".sl_confirm  (con_origin, con_received, con_seqno, con_timestamp)    values (1, 2, '651', now()); commit transaction;
2006-03-07 14:26:48 EST-[2636] LOG:  unexpected EOF on client connection
We didn't have the these sorts of deadlock issues with slony1-1.0.5, but we were also only using one replication set at that point, and postgres 7.4.5, so too many variables have changed, obviously. The common theme appears to be the interaction of the slonik failover command and the slon running on the node being failed-over-to....
 
Has anybody else run into this kind of behavior?
 
- DAP
======================================================
David Parker    Tazz Networks
 
_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general

Reply via email to