slony1-1.1.5-rc2
postgres
8.1.2
With multiple
replication sets defined (typically 2-4), in testing failover, I am consistently
getting deadlocks on the sl_setsync table between the slonik failover connection
and the local connection used by remote_worker.c in the sync_event
method.
I have node A as the
origin of, say, 3 sets, and node B has a subscription to each set. The deadlock
errors are showing up in the db log on node B after I attempt run a slonik
failover on node B.
Is anybody else
using multiple replication sets? We have to dynamically add schemas to our
database, which need to be replicated. Rather than do a MERGE SET for each
schema added, we created a separate replication set for each one. This now
appears to have been a big mistake, and we are re-working the code to do a MERGE
SET when we add new tables, but it seems that multiple replication sets should
work.
Here is yet another
example from the log:
2006-03-09 10:01:36
EST-[19314] LOG: duration: 0.148 ms statement: begin transaction;
set transaction isolation level serializable;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 0.534 ms statement: select SSY.ssy_setid, SSY.ssy_seqno, SSY.ssy_minxid, SSY.ssy_maxxid, SSY.ssy_xip, SSY.ssy_action_list from "_tzreplic".sl_setsync SSY where SSY.ssy_seqno < '180' and SSY.ssy_setid in (271,388,400,412);
2006-03-09 10:01:36 EST-[19314] LOG: duration: 1.144 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 388 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 0.908 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 400 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 0.886 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 412 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 4.225 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 271 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19316] LOG: duration: 0.246 ms statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:36 EST-[19316] LOG: duration: 0.047 ms statement: rollback transaction;
2006-03-09 10:01:37 EST-[19316] LOG: duration: 0.267 ms statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:37 EST-[19316] LOG: duration: 0.050 ms statement: rollback transaction;
2006-03-09 10:01:37 EST-[19445] NOTICE: failedNode: set 388 has no other direct receivers - move now
2006-03-09 10:01:37 EST-[19445] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-09 10:01:38 EST-[19316] LOG: duration: 0.320 ms statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:38 EST-[19316] LOG: duration: 0.049 ms statement: rollback transaction;
2006-03-09 10:01:38 EST-[19445] ERROR: deadlock detected
2006-03-09 10:01:38 EST-[19445] DETAIL: Process 19445 waits for ShareLock on transaction 5609; blocked by process 19314.
Process 19314 waits for ShareLock on transaction 5596; blocked by process 19445.
2006-03-09 10:01:38 EST-[19445] CONTEXT: SQL statement "delete from "_tzreplic".sl_setsync where ssy_setid = $1 "
PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-09 10:01:38 EST-[19445] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-09 10:01:38 EST-[19314] LOG: duration: 2599.338 ms statement: update "_tzreplic".sl_setsync set ssy_seqno = '180', ssy_minxid = '6111', ssy_maxxid = '6112', ssy_xip = '', ssy_action_list = '' where ssy_setid in (271,388,400,412) and ssy_seqno < '180';
2006-03-09 10:01:38 EST-[19445] LOG: duration: 0.064 ms statement: rollback transaction;
2006-03-09 10:01:38 EST-[19314] LOG: duration: 0.238 ms statement: insert into "_tzreplic".sl_seqlog (seql_seqid, seql_origin, seql_ev_seqno, seql_last_value) values (0, 1, '180', '1000000000000000');
2006-03-09 10:01:38 EST-[19314] LOG: duration: 3.842 ms statement: notify "_tzreplic_Event"; notify "_tzreplic_Confirm"; insert into "_tzreplic".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip, ev_type ) values ('1', '180', '2006-03-09 10:01:36.232092', '6111', '6112', '', 'SYNC'); insert into "_tzreplic".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (1, 2, '180', now()); commit transaction;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 0.534 ms statement: select SSY.ssy_setid, SSY.ssy_seqno, SSY.ssy_minxid, SSY.ssy_maxxid, SSY.ssy_xip, SSY.ssy_action_list from "_tzreplic".sl_setsync SSY where SSY.ssy_seqno < '180' and SSY.ssy_setid in (271,388,400,412);
2006-03-09 10:01:36 EST-[19314] LOG: duration: 1.144 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 388 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 0.908 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 400 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 0.886 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 412 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG: duration: 4.225 ms statement: select T.tab_id, T.tab_set, "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' || PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T, "pg_catalog".pg_class PGC, "pg_catalog".pg_namespace PGN where T.tab_set = 271 and PGC.oid = T.tab_reloid and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19316] LOG: duration: 0.246 ms statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:36 EST-[19316] LOG: duration: 0.047 ms statement: rollback transaction;
2006-03-09 10:01:37 EST-[19316] LOG: duration: 0.267 ms statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:37 EST-[19316] LOG: duration: 0.050 ms statement: rollback transaction;
2006-03-09 10:01:37 EST-[19445] NOTICE: failedNode: set 388 has no other direct receivers - move now
2006-03-09 10:01:37 EST-[19445] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-09 10:01:38 EST-[19316] LOG: duration: 0.320 ms statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:38 EST-[19316] LOG: duration: 0.049 ms statement: rollback transaction;
2006-03-09 10:01:38 EST-[19445] ERROR: deadlock detected
2006-03-09 10:01:38 EST-[19445] DETAIL: Process 19445 waits for ShareLock on transaction 5609; blocked by process 19314.
Process 19314 waits for ShareLock on transaction 5596; blocked by process 19445.
2006-03-09 10:01:38 EST-[19445] CONTEXT: SQL statement "delete from "_tzreplic".sl_setsync where ssy_setid = $1 "
PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-09 10:01:38 EST-[19445] STATEMENT: select "_tzreplic".failedNode(1, 2);
2006-03-09 10:01:38 EST-[19314] LOG: duration: 2599.338 ms statement: update "_tzreplic".sl_setsync set ssy_seqno = '180', ssy_minxid = '6111', ssy_maxxid = '6112', ssy_xip = '', ssy_action_list = '' where ssy_setid in (271,388,400,412) and ssy_seqno < '180';
2006-03-09 10:01:38 EST-[19445] LOG: duration: 0.064 ms statement: rollback transaction;
2006-03-09 10:01:38 EST-[19314] LOG: duration: 0.238 ms statement: insert into "_tzreplic".sl_seqlog (seql_seqid, seql_origin, seql_ev_seqno, seql_last_value) values (0, 1, '180', '1000000000000000');
2006-03-09 10:01:38 EST-[19314] LOG: duration: 3.842 ms statement: notify "_tzreplic_Event"; notify "_tzreplic_Confirm"; insert into "_tzreplic".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_minxid, ev_maxxid, ev_xip, ev_type ) values ('1', '180', '2006-03-09 10:01:36.232092', '6111', '6112', '', 'SYNC'); insert into "_tzreplic".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (1, 2, '180', now()); commit transaction;
My
interpretation of this log is that the sync_event in the remote_worker thread
started running, then a slonik failover command came in, deadlocked and got
kicked out, and the sync_event transaction was allowed to finish. Does that make
sense?
Thanks.
- DAP
======================================================
David Parker Tazz
Networks
_______________________________________________ Slony1-general mailing list [email protected] http://gborg.postgresql.org/mailman/listinfo/slony1-general
