Hi

I am sending a small patch for the REL_1_2_STABLE branch.

With this patch applied, the "FAILOVER/MOVE_SET" problem was solved.

For ACCEPT_SET events, this patch only moves the statement

  "begin transaction; set transaction isolation level serializable; lock table "_testdbcluster".sl_config_lock;"

so that it is executed after

  the processing of 'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep' in remote_worker.c.

This "ACCEPT_SET" loop uses only SELECT queries. I don't know why it runs at
isolation level serializable or why the "lock table" is necessary there.
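
To illustrate why the ordering matters, here is a minimal sketch in C. It is a hypothetical, single-threaded model and not Slony-I code: the names model, try_lock, worker1_step, worker2_step and simulate are invented for this example, and the lock is assumed to stay held while the ACCEPT_SET loop sleeps, as in the failed case quoted below.

```c
#include <stdbool.h>

/* Hypothetical model of the two remote worker threads; not Slony-I code. */
enum { NO_OWNER = 0, WORKER_1 = 1, WORKER_2 = 2 };

struct model {
    int  lock_owner;      /* who holds sl_config_lock, or NO_OWNER */
    bool event_inserted;  /* has FAILOVER_SET/MOVE_SET reached sl_event? */
};

/* Try to take the exclusive lock; fails if the other worker holds it. */
static bool try_lock(struct model *m, int who)
{
    if (m->lock_owner != NO_OWNER && m->lock_owner != who)
        return false;
    m->lock_owner = who;
    return true;
}

/* Release the lock (stands in for "commit transaction"). */
static void unlock(struct model *m, int who)
{
    if (m->lock_owner == who)
        m->lock_owner = NO_OWNER;
}

/* Worker 1: lock, insert the FAILOVER_SET/MOVE_SET event, commit. */
static bool worker1_step(struct model *m)
{
    if (!try_lock(m, WORKER_1))
        return false;              /* blocked behind worker 2 */
    m->event_inserted = true;
    unlock(m, WORKER_1);
    return true;
}

/* Worker 2: ACCEPT_SET. lock_first = true models the unpatched ordering,
 * where the lock is taken before polling sl_event and kept while sleeping. */
static bool worker2_step(struct model *m, bool lock_first)
{
    if (lock_first && !try_lock(m, WORKER_2))
        return false;
    if (!m->event_inserted)
        return false;              /* "not received yet - sleep" */
    if (!try_lock(m, WORKER_2))    /* patched ordering: lock only now */
        return false;
    unlock(m, WORKER_2);
    return true;
}

/* Worker 2 runs first in every round. Returns the number of rounds until
 * both workers finish, or -1 if they are still stuck after max_rounds. */
int simulate(bool lock_first, int max_rounds)
{
    struct model m = { NO_OWNER, false };
    bool done1 = false, done2 = false;

    for (int round = 1; round <= max_rounds; round++) {
        if (!done2)
            done2 = worker2_step(&m, lock_first);
        if (!done1)
            done1 = worker1_step(&m);
        if (done1 && done2)
            return round;
    }
    return -1;
}
```

In this model, simulate(true, 10) never finishes because worker 2 keeps the lock while it waits for an event that worker 1 cannot insert. simulate(false, 10) finishes in the second round, because worker 2 polls without the lock and takes it only after the event exists.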

This patch does not take the "archive log" into account, so please take care of that.

-------------------
Index: remote_worker.c
===================================================================
RCS file: /slony1/slony1-engine/src/slon/remote_worker.c,v
retrieving revision 1.124.2.31
diff -u -r1.124.2.31 remote_worker.c
--- remote_worker.c 6 Feb 2008 20:23:52 -0000 1.124.2.31
+++ remote_worker.c 4 Mar 2008 02:42:30 -0000
@@ -677,9 +677,12 @@
      slon_appendquery(&query1,
               "lock table %s.sl_config_lock; ",
               rtcfg_namespace);
-     if (query_execute(node, local_dbconn, &query1) < 0)
-       slon_retry();
-     dstring_reset(&query1);
+     if (strcmp(event->ev_type, "ACCEPT_SET") != 0)
+     {
+       if (query_execute(node, local_dbconn, &query1) < 0)
+         slon_retry();
+       dstring_reset(&query1);
+     }

      /*
       * For all non-SYNC events, we write at least a standard
@@ -1017,6 +1020,10 @@
          PQclear(res);
          slon_log(SLON_DEBUG2, "ACCEPT_SET - MOVE_SET or FAILOVER_SET exists - adjusting setsync status\n");

+         if (query_execute(node, local_dbconn, &query1) < 0)
+           slon_retry();
+         dstring_reset(&query1);
+
          /*
           * Finalize the setsync status to mave the ACCEPT_SET's
           * seqno and snapshot info.
@@ -1056,6 +1063,10 @@
        else
        {
          slon_log(SLON_DEBUG2, "ACCEPT_SET - on origin node...\n");
+
+         if (query_execute(node, local_dbconn, &query1) < 0)
+           slon_retry();
+         dstring_reset(&query1);
        }

      }
--------------------
> Hello.
> 
> > >> [...]
> > >> Hey, I should test failover before updating to 1.2.13...
> > >
> > > I have some strange periodic problems with 'ACCEPT_SET - MOVE_SET or
> > > FAILOVER_SET not received yet - sleep' on 1.2.12 and 1.2.13. Looks
> > > similar to this one.
> > >
> > > I should try to downgrade to 1.2.11 and try if my 'move set' problems
> > > will disappear. Here is the initial problem description:
> > > http://lists.slony.info/pipermail/slony1-general/2008-February/007445.html
> > 
> > There's something about this that isn't making sense...
> > 
> > I just did a CVS diff between 1.2.11 and REL_1_2_STABLE, and didn't
> > see anything that ought to have anything to do with this.
> > 
> > I haven't yet done any testing of this case, out of the samples

 
> I think this problem is caused not by the difference between versions but by
> "remoteWorkerThread".
> 
> When the problem of 'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - 
> sleep' occurs,
> pg_locks looks as follows.
> 
> ----
> testdb=# SELECT relname,granted,pid,mode from pg_locks as l , pg_class as c 
> where c.oid = l.relation and locktype='relation';
>           relname           | granted |  pid  |        mode
> ----------------------------+---------+-------+---------------------
>  pg_class_oid_index         | t       | 15778 | AccessShareLock
>  pg_class_relname_nsp_index | t       | 15778 | AccessShareLock
>  pg_locks                   | t       | 15778 | AccessShareLock
>  pg_class                   | t       | 15778 | AccessShareLock
>  sl_event                   | t       | 15771 | AccessShareLock
>  sl_event-pkey              | t       | 15771 | AccessShareLock
>  sl_config_lock             | f       | 15770 | AccessExclusiveLock <-- attention!
>  sl_config_lock             | t       | 15771 | AccessExclusiveLock
> ----
> 
> Next, I examined why "lock table sl_config_lock" was executed twice.
> 
> In the case of failover or move set, two events are generated.
> One is "FAILOVER/MOVE_SET", the other is "ACCEPT_SET".
> The "FAILOVER/MOVE_SET" event is processed by remoteWorkerThread_1, 
> which does INSERT INTO the sl_event table,
> and the "ACCEPT_SET" event is processed by remoteWorkerThread_2, which does 
> SELECT ev_type FROM sl_event.
> 
> Both events lock the sl_config_lock table as follows.
> ---
> "begin transaction; set transaction isolation level serializable; lock table 
> "_testdbcluster".sl_config_lock;
> ---
> 
> If they are executed in the order remoteWorkerThread_1 (INSERT), then 
> remoteWorkerThread_2 (SELECT), the problem doesn't occur, as shown below.
> 
> ----this is postgresql SQL-log SUCCESS  CASE: attention pid=15407 ---
> 2008-03-03 18:56:15 JST[15407]LOG:  statement: begin transaction; set 
> transaction isolation level serializable; /* FAILOVER_SET */ lock table 
> "_testdbcluster".sl_config_lock;
> 2008-03-03 18:56:15 JST[15408]LOG:  statement: begin transaction; set 
> transaction isolation level serializable; /* ACCEPT_SET */ lock table 
> "_testdbcluster".sl_config_lock;
> 2008-03-03 18:56:15 JST[15407]LOG:  statement: select 
> "_testdbcluster".failoverSet_int(1, 2, 1, 16); notify "_testdbcluster_Event"; 
> insert into "_testdbcluster".sl_event     (ev_origin, ev_seqno, ev_timestamp, 
>      ev_minxid, ev_maxxid, ev_xip, ev_type , ev_data1, ev_data2, ev_data3    
> ) values ('1', '16', '2008-03-03 18:56:14.173481', '798269', '798271', 
> '''798270''', 'FAILOVER_SET', '1', '2', '1'); insert into 
> "_testdbcluster".sl_confirm   (con_origin, con_received, con_seqno, 
> con_timestamp)    values (1, 3, '16', now()); commit transaction;
> -------------------------------
> 
> But, if it is executed in order of remoteWorkerThread_2(SELECT) and 
> remoteWorkerThread_2(INSERT),
Sorry, typo in the line above: remoteWorkerThread_2(INSERT) should be remoteWorkerThread_1(INSERT).

> we have  'ACCEPT_SET - MOVE_SET or FAILOVER_SET not received yet - sleep' 
> loops.
> 
> -- this is postgresql SQL-log FAILED CASE: attention pid = 15771 ---
> 2008-03-03 19:13:51 JST[15771]LOG:  statement: begin transaction; set 
> transaction isolation level serializable; /* ACCEPT_SET */ lock table 
> "_testdbcluster".sl_config_lock;
> 2008-03-03 19:13:51 JST[15770]LOG:  statement: begin transaction; set 
> transaction isolation level serializable; /* FAILOVER_SET */ lock table 
> "_testdbcluster".sl_config_lock;
> 2008-03-03 19:13:51 JST[15771]LOG:  statement: select 1 from 
> "_testdbcluster".sl_event where      (ev_origin = 1 and       ev_seqno = 22 
> and       ev_type = 'MOVE_SET' and       ev_data1 = '1' and      ev_data2 = 
> '1' and       ev_data3 = '2') or      (ev_origin = 1 and       ev_seqno = 22 
> and       ev_type = 'FAILOVER_SET' and       ev_data1 = '1' and       
> ev_data2 = '2' and       ev_data3 = '1');
> ----------------------------------------------
> 
> Because of "lock table sl_config_lock", remoteWorkerThread_1 cannot insert 
> the "FAILOVER/MOVE_SET" event into sl_event!!



