slony1-1.1.5-rc2
postgres 8.1.2
 
With multiple replication sets defined (typically 2-4), in testing failover, I am consistently getting deadlocks on the sl_setsync table between the slonik failover connection and the local connection used by remote_worker.c in the sync_event method.
 
I have node A as the origin of, say, 3 sets, and node B has a subscription to each set. The deadlock errors are showing up in the db log on node B after I attempt run a slonik failover on node B.
 
Is anybody else using multiple replication sets? We have to dynamically add schemas to our database, which need to be replicated. Rather than do a MERGE SET for each schema added, we created a separate replication set for each one. This now appears to have been a big mistake, and we are re-working the code to do a MERGE SET when we add new tables, but it seems that multiple replication sets should work.
 
Here is yet another example from the log:
 
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.148 ms  statement: begin transaction; set transaction isolation level serializable;
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.534 ms  statement: select SSY.ssy_setid, SSY.ssy_seqno,     SSY.ssy_minxid, SSY.ssy_maxxid, SSY.ssy_xip,     SSY.ssy_action_list from "_tzreplic".sl_setsync SSY where SSY.ssy_seqno < '180'     and SSY.ssy_setid in (271,388,400,412);
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 1.144 ms  statement: select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,     "pg_catalog".pg_namespace PGN where T.tab_set = 388     and PGC.oid = T.tab_reloid     and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.908 ms  statement: select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,     "pg_catalog".pg_namespace PGN where T.tab_set = 400     and PGC.oid = T.tab_reloid     and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 0.886 ms  statement: select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,     "pg_catalog".pg_namespace PGN where T.tab_set = 412     and PGC.oid = T.tab_reloid     and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19314] LOG:  duration: 4.225 ms  statement: select T.tab_id, T.tab_set,     "_tzreplic".slon_quote_input('"' || PGN.nspname || '"."' ||     PGC.relname || '"') as tab_fqname from "_tzreplic".sl_table T,     "pg_catalog".pg_class PGC,     "pg_catalog".pg_namespace PGN where T.tab_set = 271     and PGC.oid = T.tab_reloid     and PGC.relnamespace = PGN.oid;
2006-03-09 10:01:36 EST-[19316] LOG:  duration: 0.246 ms  statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:36 EST-[19316] LOG:  duration: 0.047 ms  statement: rollback transaction;
2006-03-09 10:01:37 EST-[19316] LOG:  duration: 0.267 ms  statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:37 EST-[19316] LOG:  duration: 0.050 ms  statement: rollback transaction;
2006-03-09 10:01:37 EST-[19445] NOTICE:  failedNode: set 388 has no other direct receivers - move now
2006-03-09 10:01:37 EST-[19445] STATEMENT:  select "_tzreplic".failedNode(1, 2);
2006-03-09 10:01:38 EST-[19316] LOG:  duration: 0.320 ms  statement: start transaction;set transaction isolation level serializable;select last_value from "_tzreplic".sl_action_seq;
2006-03-09 10:01:38 EST-[19316] LOG:  duration: 0.049 ms  statement: rollback transaction;
2006-03-09 10:01:38 EST-[19445] ERROR:  deadlock detected
2006-03-09 10:01:38 EST-[19445] DETAIL:  Process 19445 waits for ShareLock on transaction 5609; blocked by process 19314.
 Process 19314 waits for ShareLock on transaction 5596; blocked by process 19445.
2006-03-09 10:01:38 EST-[19445] CONTEXT:  SQL statement "delete from "_tzreplic".sl_setsync where ssy_setid =  $1 "
 PL/pgSQL function "failednode" line 115 at SQL statement
2006-03-09 10:01:38 EST-[19445] STATEMENT:  select "_tzreplic".failedNode(1, 2);
2006-03-09 10:01:38 EST-[19314] LOG:  duration: 2599.338 ms  statement: update "_tzreplic".sl_setsync set     ssy_seqno = '180', ssy_minxid = '6111', ssy_maxxid = '6112',     ssy_xip = '', ssy_action_list = '' where ssy_setid in (271,388,400,412) and ssy_seqno < '180';
2006-03-09 10:01:38 EST-[19445] LOG:  duration: 0.064 ms  statement: rollback transaction;
2006-03-09 10:01:38 EST-[19314] LOG:  duration: 0.238 ms  statement: insert into "_tzreplic".sl_seqlog  (seql_seqid, seql_origin, seql_ev_seqno, seql_last_value)  values (0, 1, '180', '1000000000000000');
2006-03-09 10:01:38 EST-[19314] LOG:  duration: 3.842 ms  statement: notify "_tzreplic_Event"; notify "_tzreplic_Confirm"; insert into "_tzreplic".sl_event     (ev_origin, ev_seqno, ev_timestamp,      ev_minxid, ev_maxxid, ev_xip, ev_type     ) values ('1', '180', '2006-03-09 10:01:36.232092', '6111', '6112', '', 'SYNC'); insert into "_tzreplic".sl_confirm  (con_origin, con_received, con_seqno, con_timestamp)    values (1, 2, '180', now()); commit transaction;
 
My interpretation of this log is that the sync_event in the remote_worker thread started running, then a slonik failover command came in, deadlocked and got kicked out, and the sync_event transaction was allowed to finish. Does that make sense?
 
Thanks.
 
- DAP
======================================================
David Parker    Tazz Networks
 
_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general

Reply via email to