Hi
I am sending a small patch for the REL_1_2_STABLE branch.
With this patch applied, the "FAILOVER/MOVE_SET" problem was solved.
The patch only moves the execution of
"begin transaction; set transaction isolation level serializable; lock table
"_testdbcluster".sl_config_lock;"
so that, for ACCEPT_SET events, it runs after
the 'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet -
sleep' wait loop in remote_worker.c instead of before it.
This "ACCEPT_SET" loop issues only a SELECT query. I don't know why it runs at
the serializable isolation level or why the "lock table" is necessary there.
This patch does not take the "archive log" into account, so please take care.
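
The idea of the change, reduced to a stand-alone C sketch. Only the strcmp()
test on the event type comes from the patch; lock_up_front() is a hypothetical
helper name used for illustration and does not exist in remote_worker.c.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Slony-I.
 * Returns 1 if the "begin; ...; lock table sl_config_lock" query should be
 * executed before the event is processed, and 0 if it should be deferred
 * until the event's own wait loop has finished (the ACCEPT_SET case).
 */
int lock_up_front(const char *ev_type)
{
    assert(ev_type != NULL);
    return strcmp(ev_type, "ACCEPT_SET") != 0;
}
```

In the patch below, the deferred query is then executed at the two points
where the ACCEPT_SET wait is known to be over.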
-------------------
Index: remote_worker.c
===================================================================
RCS file: /slony1/slony1-engine/src/slon/remote_worker.c,v
retrieving revision 1.124.2.31
diff -u -r1.124.2.31 remote_worker.c
--- remote_worker.c 6 Feb 2008 20:23:52 -0000 1.124.2.31
+++ remote_worker.c 4 Mar 2008 02:42:30 -0000
@@ -677,9 +677,12 @@
slon_appendquery(&query1,
"lock table %s.sl_config_lock; ",
rtcfg_namespace);
- if (query_execute(node, local_dbconn, &query1) < 0)
- slon_retry();
- dstring_reset(&query1);
+ if (strcmp(event->ev_type, "ACCEPT_SET") != 0)
+ {
+ if (query_execute(node, local_dbconn, &query1) < 0)
+ slon_retry();
+ dstring_reset(&query1);
+ }
/*
* For all non-SYNC events, we write at least a standard
@@ -1017,6 +1020,10 @@
PQclear(res);
slon_log(SLON_DEBUG2, "ACCEPT_SET - MOVE_SET or FAILOVER_SET exists - adjusting setsync status\n");
+ if (query_execute(node, local_dbconn, &query1) < 0)
+ slon_retry();
+ dstring_reset(&query1);
+
/*
* Finalize the setsync status to mave the ACCEPT_SET's
* seqno and snapshot info.
@@ -1056,6 +1063,10 @@
else
{
slon_log(SLON_DEBUG2, "ACCEPT_SET - on origin node...\n");
+
+ if (query_execute(node, local_dbconn, &query1) < 0)
+ slon_retry();
+ dstring_reset(&query1);
}
}
--------------------
> Hello.
>
> > >> [...]
> > >> Hey, I should test failover before updating to 1.2.13...
> > >
> > > I have some strange periodic problems with 'ACCEPT_SET - MOVE_SET or
> > > FAILOVER_SET not received yet - sleep' on 1.2.12 and 1.2.13. Looks
> > > similar to this one.
> > >
> > > I should try to downgrade to 1.2.11 and try if my 'move set' problems
> > > will disappear. Here is the initial problem description:
> > > http://lists.slony.info/pipermail/slony1-general/2008-February/007445.html
> >
> > There's something about this that isn't making sense...
> >
> > I just did a CVS diff between 1.2.11 and REL_1_2_STABLE, and didn't
> > see anything that ought to have anything to do with this.
> >
> > I haven't yet done any testing of this case, out of the samples
> I think this problem is caused not by a difference between versions but by
> "remoteWorkerThread".
>
> When the 'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet -
> sleep' problem occurs,
> pg_locks looks as follows.
>
> ----
> testdb=# SELECT relname,granted,pid,mode from pg_locks as l , pg_class as c
> where c.oid = l.relation and locktype='relation';
>           relname           | granted |  pid  |        mode
> ----------------------------+---------+-------+---------------------
>  pg_class_oid_index         | t       | 15778 | AccessShareLock
>  pg_class_relname_nsp_index | t       | 15778 | AccessShareLock
>  pg_locks                   | t       | 15778 | AccessShareLock
>  pg_class                   | t       | 15778 | AccessShareLock
>  sl_event                   | t       | 15771 | AccessShareLock
>  sl_event-pkey              | t       | 15771 | AccessShareLock
>  sl_config_lock             | f       | 15770 | AccessExclusiveLock  <-- attention!
>  sl_config_lock             | t       | 15771 | AccessExclusiveLock
> ----
>
> Next, I examined why "lock table sl_config_lock" was executed twice.
>
> In the case of failover or move set, two events are generated.
> One is "FAILOVER/MOVE_SET"; the other is "ACCEPT_SET".
> The "FAILOVER/MOVE_SET" event is processed by remoteWorkerThread_1,
> which does an INSERT INTO the sl_event table,
> and the "ACCEPT_SET" event is processed by remoteWorkerThread_2, which does a
> SELECT ev_type FROM sl_event.
>
> Both events lock the sl_config_lock table as follows.
> ---
> "begin transaction; set transaction isolation level serializable; lock table
> "_testdbcluster".sl_config_lock;
> ---
>
> If they are executed in the order remoteWorkerThread_1 (INSERT), then
> remoteWorkerThread_2 (SELECT), the problem does not occur, as shown below.
>
> ---- PostgreSQL statement log, SUCCESS CASE (note pid=15407) ---
> 2008-03-03 18:56:15 JST[15407]LOG: statement: begin transaction; set
> transaction isolation level serializable; /* FAILOVER_SET */ lock table
> "_testdbcluster".sl_config_lock;
> 2008-03-03 18:56:15 JST[15408]LOG: statement: begin transaction; set
> transaction isolation level serializable; /* ACCEPT_SET */ lock table
> "_testdbcluster".sl_config_lock;
> 2008-03-03 18:56:15 JST[15407]LOG: statement: select
> "_testdbcluster".failoverSet_int(1, 2, 1, 16); notify "_testdbcluster_Event";
> insert into "_testdbcluster".sl_event (ev_origin, ev_seqno, ev_timestamp,
> ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1, ev_data2, ev_data3
> ) values ('1', '16', '2008-03-03 18:56:14.173481', '798269', '798271',
> '''798270''', 'FAILOVER_SET', '1', '2', '1'); insert into
> "_testdbcluster".sl_confirm (con_origin, con_received, con_seqno,
> con_timestamp) values (1, 3, '16', now()); commit transaction;
> -------------------------------
>
> But if they are executed in the order remoteWorkerThread_2 (SELECT), then
> remoteWorkerThread_1 (INSERT),
> we get the 'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep'
> loop.
>
> ---- PostgreSQL statement log, FAILED CASE (note pid=15771) ---
> 2008-03-03 19:13:51 JST[15771]LOG: statement: begin transaction; set
> transaction isolation level serializable; /* ACCEPT_SET */ lock table
> "_testdbcluster".sl_config_lock;
> 2008-03-03 19:13:51 JST[15770]LOG: statement: begin transaction; set
> transaction isolation level serializable; /* FAILOVER_SET */ lock table
> "_testdbcluster".sl_config_lock;
> 2008-03-03 19:13:51 JST[15771]LOG: statement: select 1 from
> "_testdbcluster".sl_event where (ev_origin = 1 and ev_seqno = 22
> and ev_type = 'MOVE_SET' and ev_data1 = '1' and ev_data2 =
> '1' and ev_data3 = '2') or (ev_origin = 1 and ev_seqno = 22
> and ev_type = 'FAILOVER_SET' and ev_data1 = '1' and
> ev_data2 = '2' and ev_data3 = '1');
> ----------------------------------------------
>
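> Because remoteWorkerThread_2 holds the "lock table sl_config_lock" lock while
> it waits, remoteWorkerThread_1 cannot insert the "FAILOVER/MOVE_SET" event
> into sl_event, so the event that remoteWorkerThread_2 is waiting for never
> arrives!!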
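
To make the ordering problem in the quoted analysis concrete, here is a small
single-threaded model in C. It is only a sketch under my own assumptions, not
Slony-I code: struct model, the enum, and all function names are made up for
illustration, and the two worker threads are simulated by plain function
calls. It shows that polling for the event while already holding the lock can
never succeed, while taking the lock only after the event is visible does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one exclusive lock and one "event is in sl_event" flag. */
enum holder { NOBODY, WORKER_1, WORKER_2 };

struct model {
    enum holder lock_holder;   /* who holds sl_config_lock */
    bool        event_stored;  /* has FAILOVER_SET/MOVE_SET reached sl_event? */
};

/* Worker 1 must get the lock before it can insert the event and commit. */
static bool worker1_try_insert(struct model *m)
{
    if (m->lock_holder != NOBODY)
        return false;              /* blocked, like pid 15770 in pg_locks */
    m->lock_holder = WORKER_1;
    m->event_stored = true;        /* insert into sl_event */
    m->lock_holder = NOBODY;       /* commit releases the lock */
    return true;
}

/* Worker 2 handles ACCEPT_SET: it polls until the event is visible. */
static bool worker2_accept_set(struct model *m, bool lock_before_polling,
                               int max_polls)
{
    assert(max_polls > 0);
    if (lock_before_polling)
        m->lock_holder = WORKER_2; /* old behaviour: lock first, then poll */

    for (int i = 0; i < max_polls; i++) {
        worker1_try_insert(m);     /* worker 1 keeps trying concurrently */
        if (m->event_stored) {
            m->lock_holder = WORKER_2; /* patched behaviour: lock only now */
            /* ... the setsync adjustment would happen here ... */
            m->lock_holder = NOBODY;
            return true;
        }
        /* "not received yet - sleep" */
    }
    m->lock_holder = NOBODY;       /* give up so the model terminates */
    return false;
}

/* Runs the ACCEPT_SET-first ordering; returns true if ACCEPT_SET completes. */
bool simulate(bool lock_before_polling)
{
    struct model m = { NOBODY, false };
    return worker2_accept_set(&m, lock_before_polling, 10);
}
```

In this model simulate(true) returns false (the sleep loop never ends on its
own) and simulate(false) returns true, which matches the behaviour before and
after the patch.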