Hi,

It looks to me like, given the way the transition from Recovery to Operational
works, we can't guarantee that all nodes in the ring have entered Operational
before a node processes another Memb-Join message from a new node. I.e. we
can't guarantee the token has rotated all the way around the ring first.

When this happens, the nodes still in Recovery will still use the older ring
ID. So they won't get added to the transitional membership, and CLM will report
leave events for these nodes. (Plus there might be other side-effects, like the
FAILED TO RECEIVE problem - I haven't quite worked out why that's happening).

We are currently using CLM to check the health of a node, i.e. so we can detect
if it locks up. My questions are:
i) Are there config settings we could change to improve this, like increasing
the 'join' timeout? (There's a rough corosync.conf sketch of what I mean below.)
ii) Should I try to make a code change to fix the problem? E.g. delay
processing the Memb-Join message if the node has only just entered Operational.
iii) Should we not be using CLM like this? I.e. should we just learn to live
with CLM/CPG sometimes reporting nodes as leaving when they're perfectly
healthy?
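
For (i), something like the following totem settings in corosync.conf is roughly
what I had in mind. The option names are the standard ones from corosync.conf(5)
(join, consensus, fail_recv_const); the values below are just illustrative
guesses on my part, not tested recommendations - the 1.3.1 defaults are
join: 50, consensus: 1.2 * token and fail_recv_const: 2500.

    totem {
            version: 2
            # Illustrative only: longer token/join/consensus to give slow nodes
            # more time to finish forming the membership before another node's
            # Memb-Join arrives.
            token: 3000
            join: 200
            consensus: 3600
            # Tolerate more token rotations before hitting FAILED TO RECEIVE.
            fail_recv_const: 5000
    }

Raising fail_recv_const presumably only papers over the FAILED TO RECEIVE
symptom rather than fixing whatever is causing it, but it might at least tell
us whether the two problems are related.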

Thanks for your help.
Tim

On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <tlbe...@gmail.com> wrote:
> Hi,
>
> We're booting up a 10-node cluster (with all nodes starting corosync at roughly
> the same time) and approx 1 in 10 times we see some problems:
> a) CLM is reporting nodes as leaving and then immediately rejoining (not sure
> if this is valid behaviour?)
> b) Probably an unrelated oddity, but we're getting flow control enabled on a
> client daemon using CLM that's only sending one request (saClmClusterTrack()).
> c) A node is hitting the FAILED TO RECEIVE case
> d) After c) there seems to be a lot of churn as the cluster tries to reform
> e) During the processing of node leave events, the CPG client can sometimes get
> broken so it no longer processes *any* CPG events
>
> Corosync debug is attached (I commented out some of the noisier debug around
> message delivery). We don't really know enough about corosync to tell what
> exactly is incorrect behaviour and what should be fixed. But here's what we've
> noticed:
> 1) Node-4 joins soon after node-1. When this happens all nodes except node-12
> have entered operational state (see node-12.txt line 235). It looks like maybe
> node-12 hasn't received enough rotations of the token to enter operational yet.
> Node-12's resulting transitional config consists of just itself. All nodes then
> report node-1 and node-12 as leaving and immediately rejoining.
> 2) After this config change, node-3 eventually hits the FAILED TO RECEIVE case
> (node-3.txt line 380). At this point node-1 and node-12 have an ARU matching
> the high_seq_received, all other nodes have an ARU of zero.
> 3) Node-3 entering gather seems to result in a lot of config change churn
> across the cluster.
> 4) While processing the config changes on node-3, the CPG downlist it uses
> contains itself. When node-3 sends leave events for the nodes in the downlist
> (including itself), it sets its own cpd state to CPD_STATE_UNJOINED and clears
> the cpd->group_name. This means it no longer sends any CPG events to the CPG
> client.
>
> We tried cherry-picking this commit to fix the problem (#4) with the CPG client:
> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
> It helped a bit, but didn't fix it completely. We've made an interim change
> (attached) to avoid this problem.
>
> We're using corosync v1.3.1 on an embedded linux system (with a low-spec CPU).
> Corosync is running over a basic ethernet interface (no hubs/routers/etc).
>
> Any help would be appreciated. Let me know if there's any other debug I can
> provide.
>
> Thanks,
> Tim
>
