Using
slony1-1.1.5-RC2, in testing failover, we are getting deadlock errors like the
following from the db log:
[ copy table
statements ]
2006-03-07 15:06:28
EST-[29406] NOTICE: truncate of "monkeys"."session"
succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT: select "_tzreplic".prepareTableForCopy(151); copy "monkeys"."session" ("id","accessadress","accessid","entryaddress","entryid","properties","starttime","updatetime") from stdin;
2006-03-07 15:06:28 EST-[29406] NOTICE: truncate of "monkeys"."vc" succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT: select "_tzreplic".prepareTableForCopy(152); copy "monkeys"."vc" ("id","used","capacity") from stdin;
2006-03-07 15:06:28 EST-[29406] NOTICE: truncate of "monkeys"."vp" succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT: select "_tzreplic".prepareTableForCopy(153); copy "monkeys"."vp" ("id","used","capacity") from stdin;
2006-03-07 15:06:31 EST-[29438] NOTICE: failedNode: set 1 has no other direct receivers - move now
2006-03-07 15:06:31 EST-[29438] STATEMENT: select "_tzreplic".failedNode(2, 3);
2006-03-07 15:06:37 EST-[29438] NOTICE: failedNode: set 118 has no other direct receivers - move now
2006-03-07 15:06:37 EST-[29438] STATEMENT: select "_tzreplic".failedNode(2, 3);
2006-03-07 15:06:38 EST-[29438] ERROR: deadlock detected
2006-03-07 15:06:38 EST-[29438] DETAIL: Process 29438 waits for ShareLock on transaction 3856; blocked by process 29406.
Process 29406 waits for ShareLock on transaction 3851; blocked by process 29438.
2006-03-07 15:06:38 EST-[29438] CONTEXT: SQL statement "delete from "_tzreplic".sl_setsync where ssy_setid = $1 "
PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-07 15:06:38 EST-[29438] STATEMENT: select "_tzreplic".failedNode(2, 3);
2006-03-07 15:06:28 EST-[29406] STATEMENT: select "_tzreplic".prepareTableForCopy(151); copy "monkeys"."session" ("id","accessadress","accessid","entryaddress","entryid","properties","starttime","updatetime") from stdin;
2006-03-07 15:06:28 EST-[29406] NOTICE: truncate of "monkeys"."vc" succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT: select "_tzreplic".prepareTableForCopy(152); copy "monkeys"."vc" ("id","used","capacity") from stdin;
2006-03-07 15:06:28 EST-[29406] NOTICE: truncate of "monkeys"."vp" succeeded
2006-03-07 15:06:28 EST-[29406] STATEMENT: select "_tzreplic".prepareTableForCopy(153); copy "monkeys"."vp" ("id","used","capacity") from stdin;
2006-03-07 15:06:31 EST-[29438] NOTICE: failedNode: set 1 has no other direct receivers - move now
2006-03-07 15:06:31 EST-[29438] STATEMENT: select "_tzreplic".failedNode(2, 3);
2006-03-07 15:06:37 EST-[29438] NOTICE: failedNode: set 118 has no other direct receivers - move now
2006-03-07 15:06:37 EST-[29438] STATEMENT: select "_tzreplic".failedNode(2, 3);
2006-03-07 15:06:38 EST-[29438] ERROR: deadlock detected
2006-03-07 15:06:38 EST-[29438] DETAIL: Process 29438 waits for ShareLock on transaction 3856; blocked by process 29406.
Process 29406 waits for ShareLock on transaction 3851; blocked by process 29438.
2006-03-07 15:06:38 EST-[29438] CONTEXT: SQL statement "delete from "_tzreplic".sl_setsync where ssy_setid = $1 "
PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-07 15:06:38 EST-[29438] STATEMENT: select "_tzreplic".failedNode(2, 3);
So, we establish
replication for 2 sets, then we issue a slonik FAILOVER (pid 29438). The
deadlock seems to be between the slonik connection and the remote_worker
connection in the slon. Another example of the same type of
problem:
2006-03-07 14:26:39
EST-[2629] NOTICE: failedNode: set 74 has no other direct receivers - move
now
2006-03-07 14:26:39 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:42 EST-[2629] NOTICE: failedNode: set 111 has no other direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:42 EST-[2629] NOTICE: failedNode: set 123 has no other direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:43 EST-[2629] NOTICE: failedNode: set 135 has no other direct receivers - move now
2006-03-07 14:26:43 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:43 EST-[2635] NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=974
2006-03-07 14:26:43 EST-[2635] STATEMENT: select "_tzreplic".cleanupNodelock(); insert into "_tzreplic".sl_nodelock values ( 2, 0, "pg_catalog".pg_backend_pid());
2006-03-07 14:26:44 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:44 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,74,'650','651');
2006-03-07 14:26:46 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,111,'650','652');
2006-03-07 14:26:46 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,123,'650','653');
2006-03-07 14:26:46 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,135,'650','654');
2006-03-07 14:26:48 EST-[2636] ERROR: deadlock detected
2006-03-07 14:26:48 EST-[2636] DETAIL: Process 2636 waits for ShareLock on transaction 613918; blocked by process 2629.
Process 2629 waits for ExclusiveLock on relation 20809 of database 16384; blocked by process 2636.
2006-03-07 14:26:48 EST-[2636] STATEMENT: notify "_tzreplic_Event"; notify "_tzreplic_Confirm"; insert into "_tzreplic".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip, ev_type ) values ('1', '651', '2006-03-07 14:26:33.770196', '867522', '867523', '', 'SYNC'); insert into "_tzreplic".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (1, 2, '651', now()); commit transaction;
2006-03-07 14:26:48 EST-[2636] LOG: unexpected EOF on client connection
2006-03-07 14:26:39 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:42 EST-[2629] NOTICE: failedNode: set 111 has no other direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:42 EST-[2629] NOTICE: failedNode: set 123 has no other direct receivers - move now
2006-03-07 14:26:42 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:43 EST-[2629] NOTICE: failedNode: set 135 has no other direct receivers - move now
2006-03-07 14:26:43 EST-[2629] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-07 14:26:43 EST-[2635] NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=974
2006-03-07 14:26:43 EST-[2635] STATEMENT: select "_tzreplic".cleanupNodelock(); insert into "_tzreplic".sl_nodelock values ( 2, 0, "pg_catalog".pg_backend_pid());
2006-03-07 14:26:44 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:44 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,74,'650','651');
2006-03-07 14:26:46 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,111,'650','652');
2006-03-07 14:26:46 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,123,'650','653');
2006-03-07 14:26:46 EST-[2629] NOTICE: failedNode2(): faking FAILOVER_SET event
2006-03-07 14:26:46 EST-[2629] STATEMENT: select "_tzreplic".failedNode2(1,2,135,'650','654');
2006-03-07 14:26:48 EST-[2636] ERROR: deadlock detected
2006-03-07 14:26:48 EST-[2636] DETAIL: Process 2636 waits for ShareLock on transaction 613918; blocked by process 2629.
Process 2629 waits for ExclusiveLock on relation 20809 of database 16384; blocked by process 2636.
2006-03-07 14:26:48 EST-[2636] STATEMENT: notify "_tzreplic_Event"; notify "_tzreplic_Confirm"; insert into "_tzreplic".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip, ev_type ) values ('1', '651', '2006-03-07 14:26:33.770196', '867522', '867523', '', 'SYNC'); insert into "_tzreplic".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (1, 2, '651', now()); commit transaction;
2006-03-07 14:26:48 EST-[2636] LOG: unexpected EOF on client connection
We didn't have the
these sorts of deadlock issues with slony1-1.0.5, but we were also only using
one replication set at that point, and postgres 7.4.5, so too many
variables have changed, obviously. The common theme appears to be the
interaction of the slonik failover command and the slon running on the node
being failed-over-to....
Has anybody else run
into this kind of behavior?
- DAP
======================================================
David Parker Tazz
Networks
_______________________________________________ Slony1-general mailing list [email protected] http://gborg.postgresql.org/mailman/listinfo/slony1-general
