At Mon, 7 Nov 2011 10:03:19 -0500,
Shawn Moore wrote:
> 
> When I checked on the cluster this morning I see the following from
> cluster info.  A sheep and corosync process was found on all nodes
> except blade162 which didn't have a sheep process but did have a
> corosync one.  I'm not sure what has happened.  We have not had a

In blade162.log:

  Nov 05 00:06:30 sd_leave_handler(1222) Network Patition Bug: I should have exited.

This is probably a corosync bug, and Yunkai is trying to fix it.

  http://lists.wpkg.org/pipermail/sheepdog/2011-November/001835.html


> network interruption that we are aware of as all nodes are on the same
> switch (along with countless other production systems).  Logs from
> each node can be found
> http://www.stormpoint.com/files/sd_2011-11-07.zip.  Total
> un-compressed size is ~ 254MB and this download size is around 21MB.
> When I left Friday, this is how our cluster looked:
> 
> All nodes were running version 0.2.4_63_gd56e3b6
> 
>    Idx - Host:Port          Vnodes       Zone
> ---------------------------------------------
>      0 - 192.168.217.152:7000         64          1
>      1 - 192.168.217.153:7000         64          1
>      2 - 192.168.217.154:7000         64          1
>      3 - 192.168.217.155:7000         64          1
>      4 - 192.168.217.156:7000         64          1
>      5 - 192.168.217.157:7000         64          2
>      6 - 192.168.217.159:7000         64          2
>      7 - 192.168.217.160:7000         64          2
>      8 - 192.168.217.161:7000         64          2
>      9 - 192.168.217.162:7000         64          2
> 
> [root@blade152 sheep]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-04 17:26:22     14 [192.168.217.152:7000]
> 2011-11-04 17:26:22     13 [192.168.217.152:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:22     12 [192.168.217.152:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:22     11 [192.168.217.152:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:22     10 [192.168.217.152:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.162:7000]
> 2011-11-04 17:26:21      9 [192.168.217.152:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:21      8 [192.168.217.152:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:21      7 [192.168.217.152:7000,
> 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.162:7000]
> 
> 
> [root@blade153 ~]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-05 00:05:19     14 [192.168.217.153:7000]
> 2011-11-05 00:05:19     13 [192.168.217.153:7000, 192.168.217.162:7000]
> 2011-11-05 00:05:19     12 [192.168.217.153:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-05 00:05:19     11 [192.168.217.153:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-05 00:05:19     10 [192.168.217.153:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.162:7000]
> 2011-11-05 00:05:19      9 [192.168.217.153:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-05 00:05:18      8 [192.168.217.153:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-05 00:05:18      7 [192.168.217.153:7000,
> 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.162:7000]
> 
> 
> [root@blade154 ~]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-04 13:25:06      6 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 06:58:12      5 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 05:57:43      4 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-02 10:49:34      3 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 2011-11-02 10:33:44      2 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-02 07:01:26      1 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 
> 
> [root@blade155 ~]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-04 13:24:42      6 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 06:57:48      5 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 05:57:19      4 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-02 10:49:07      3 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 2011-11-02 10:33:17      2 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-02 07:00:59      1 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 
> 
> [root@blade156 ~]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-05 07:39:11      9 [192.168.217.154:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000]
> 2011-11-05 07:39:11      8 [192.168.217.154:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 18:47:30      7 [192.168.217.153:7000,
> 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.162:7000]
> 2011-11-04 17:26:26      6 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 10:59:30      5 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 09:59:03      4 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 09:59:03      3 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 2011-11-02 10:33:44      2 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 
> 
> [root@blade157 ~]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-05 07:39:11      9 [192.168.217.154:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000]
> 2011-11-05 07:39:11      8 [192.168.217.154:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 18:47:30      7 [192.168.217.153:7000,
> 192.168.217.154:7000, 192.168.217.155:7000, 192.168.217.156:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.162:7000]
> 2011-11-04 17:26:26      6 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 10:59:32      5 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 10:59:32      4 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-02 10:49:34      3 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 2011-11-02 10:33:44      2 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 
> 
> [root@blade159 ~]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-04 17:26:11      6 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 10:59:17      5 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 09:58:48      4 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-02 14:50:37      3 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 2011-11-02 14:34:46      2 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-02 11:02:28      1 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 
> 
> [root@blade160 ~]# collie cluster info
> Cluster status: running
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-04 17:26:26      6 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-04 10:59:30      5 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 09:59:02      4 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-02 14:50:46      3 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 2011-11-02 14:34:55      2 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.162:7000]
> 2011-11-02 11:02:37      1 [192.168.217.152:7000,
> 192.168.217.153:7000, 192.168.217.154:7000, 192.168.217.155:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 
> 
> [root@blade161 ~]# collie cluster info
> Cluster status: The sheepdog is stopped doing IO, short of living nodes
> 
> Cluster created at Wed Nov  2 11:02:26 2011
> 
> Epoch Time           Version
> 2011-11-04 17:26:51     14 [192.168.217.161:7000]
> 2011-11-04 17:26:51     13 [192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:51     12 [192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:51     11 [192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:48     10 [192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 2011-11-04 17:26:48      9 [192.168.217.156:7000,
> 192.168.217.157:7000, 192.168.217.159:7000, 192.168.217.160:7000,
> 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:48      8 [192.168.217.155:7000,
> 192.168.217.156:7000, 192.168.217.157:7000, 192.168.217.159:7000,
> 192.168.217.160:7000, 192.168.217.161:7000, 192.168.217.162:7000]
> 2011-11-04 17:26:48      7 [192.168.217.154:7000,
> 192.168.217.155:7000, 192.168.217.156:7000, 192.168.217.157:7000,
> 192.168.217.159:7000, 192.168.217.160:7000, 192.168.217.161:7000,
> 192.168.217.162:7000]
> 
> 
> [root@blade162 ~]# collie cluster info
> failed to connect to localhost:7000, Connection refused
> failed to connect to localhost:7000, Connection refused

It seems that a network partition was wrongly detected.

To make the explanation simpler, I'll use the following labels for each
node:

    n0: 192.168.217.152
    n1: 192.168.217.153
    n2: 192.168.217.154
    n3: 192.168.217.155
    n4: 192.168.217.156
    n5: 192.168.217.157
    n6: 192.168.217.159
    n7: 192.168.217.160
    n8: 192.168.217.161
    n9: 192.168.217.162

I guess your cluster was split into 5 groups:
{n0}, {n1}, {n2, n3, n4, n5, n6, n7}, {n8}, {n9}.

 - n0 received a notification that n[1-9] had left.
 - n1 received a notification that n0 and n[2-9] had left.
 - n[2-7] received a notification that n0, n1, n8, and n9 had left.
 - n8 received a notification that n[0-7] and n9 had left.
 - n9 received a notification that n[0-8] had left (and aborted due to
   the above bug).
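
The grouping above can be checked with a small sketch.  This is only a
model of the leave notifications I listed, not Sheepdog or corosync
code; the names ALL_NODES, left_seen_by and partition_groups are made
up for illustration.  It assumes that two nodes are in the same group
exactly when they end up with the same surviving membership view.

```python
# Hypothetical model of the leave notifications described above.
# Node i stands for n<i>; nothing here calls Sheepdog or corosync.
ALL_NODES = frozenset(range(10))  # n0 .. n9

# For each node: the set of peers it was notified had left.
left_seen_by = {
    0: set(range(1, 10)),                      # n0 saw n[1-9] leave
    1: {0} | set(range(2, 10)),                # n1 saw n0 and n[2-9] leave
    **{n: {0, 1, 8, 9} for n in range(2, 8)},  # n[2-7] saw n0, n1, n8, n9 leave
    8: set(range(0, 8)) | {9},                 # n8 saw n[0-7] and n9 leave
    9: set(range(0, 9)),                       # n9 saw n[0-8] leave
}


def partition_groups(left_seen_by):
    """Group nodes whose surviving membership views are identical."""
    groups = {}
    for node, left in left_seen_by.items():
        view = frozenset(ALL_NODES - left)     # nodes this node still sees
        groups.setdefault(view, []).append(node)
    return sorted(sorted(members) for members in groups.values())


groups = partition_groups(left_seen_by)
for members in groups:
    print("{" + ", ".join(f"n{n}" for n in members) + "}")
```

Each group's view contains exactly the members of that group, so the
five views are at least consistent with each other, which is what we
would expect from a (false) partition rather than from random node
failures.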

Currently, Sheepdog cannot handle this kind of false detection.

We may be able to avoid this problem by setting appropriate values in
corosync.conf (totem.merge or totem.seqno_unchanged_const?), but I'm
not sure.  Does anyone know more about this?


Thanks,

Kazutaka
-- 
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog
