On 08/08/2011 12:10 AM, Tim Beale wrote:
> Hi Steve,
> 
> Thanks for your help. I tried out your patch but the problem still
> occurs. The problem looks to me to be due to the ring-IDs used when
> forming the transitional memb-list, rather than the memb-list itself.
> The ring-ID of the nodes still in Recovery is older than that of the
> nodes that have already shifted to Operational.
> 
> Attached is my attempt at fixing the problem. The idea is to delay a
> node's processing of a Memb-Join immediately after shifting to
> Operational, until the token has rotated the ring once.
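For anyone following the thread, the gating described above amounts to
roughly the sketch below.  The names are hypothetical, not the actual
totemsrp fields and not the attached patch:

/* Sketch only: defer Memb-Join handling until the token has made one
 * full rotation after this node entered operational. */
struct join_gate {
        int is_operational;             /* set on entering operational */
        int tokens_seen_since_op;       /* counted in the token handler */
};

static void gate_enter_operational (struct join_gate *gate)
{
        gate->is_operational = 1;
        gate->tokens_seen_since_op = 0;
}

static void gate_token_received (struct join_gate *gate)
{
        if (gate->is_operational && gate->tokens_seen_since_op < 1) {
                gate->tokens_seen_since_op += 1;
        }
}

static int gate_memb_join_allowed (const struct join_gate *gate)
{
        /* Hold off (or drop and rely on the joiner retransmitting) until
         * the token has rotated the ring at least once. */
        return (!gate->is_operational || gate->tokens_seen_since_op >= 1);
}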
> 
> It doesn't quite work either, though. The nodes are still re-entering
> gather before all of them have left recovery. This time it's due to
> processing a Merge-Detect message: one node has just started up and
> made itself the rep, and it sends out a Merge-Detect which triggers
> the other nodes to enter gather and reform the ring.
> 
> Let me know if you have any other advice.
> 

The problem is clear from the blackbox: 8 nodes enter operational while
the 1 node still in recovery is interrupted by a join message.  This
interrupted node then proceeds with a transitional membership of 1 node
(which is correct).

The joined and left lists are derived from the transitional list, which
is not correct and results in incorrect data being delivered to CLM.
Please try the follow-up patch, which should calculate the joined and
left lists correctly.
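Conceptually the calculation should compare the previous regular
membership against the new regular membership, rather than walking the
transitional list.  A minimal sketch of that idea (hypothetical names
only - this is not the follow-up patch itself):

#include <stddef.h>

/* Return the members of list a that do not appear in list b. */
static size_t member_list_diff (
        const unsigned int *a, size_t a_len,
        const unsigned int *b, size_t b_len,
        unsigned int *out)
{
        size_t i, j, n = 0;

        for (i = 0; i < a_len; i++) {
                int found = 0;

                for (j = 0; j < b_len; j++) {
                        if (a[i] == b[j]) {
                                found = 1;
                                break;
                        }
                }
                if (found == 0) {
                        out[n++] = a[i];
                }
        }
        return (n);
}

/*
 * joined = new regular membership \ old regular membership
 * left   = old regular membership \ new regular membership
 *
 * joined_entries = member_list_diff (new_members, new_entries,
 *         old_members, old_entries, joined_list);
 * left_entries = member_list_diff (old_members, old_entries,
 *         new_members, new_entries, left_list);
 */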


> Thanks,
> Tim
> 
> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake <sd...@redhat.com> wrote:
>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>> Hi,
>>>
>>> It looks to me that, the way the transition from Recovery to
>>> Operational works, we can't guarantee that all nodes in the ring have
>>> entered Operational before a node processes another Memb-Join message
>>> from a new node, i.e. we can't guarantee the token has rotated all the
>>> way around the ring.
>>>
>>> When this happens, the nodes still in Recovery are using the older
>>> ring ID, so they won't get added to the transitional membership, and
>>> CLM will report leave events for these nodes. (Plus there might be
>>> other side-effects, like the FAILED TO RECEIVE problem - I haven't
>>> quite worked out why that's happening.)
>>>
>>
>> Thanks for the pointer here - patch on ml.
>>
>>> We are currently using CLM to check the health of a node, i.e. so we
>>> can detect if it locks up. My questions are:
>>> i) Are there config settings we could change to improve this, like
>>> increasing the 'join' timeout?
>>> ii) Should I try to make a code change to fix the problem? E.g. delay
>>> processing the Memb-Join message if the node has only just entered
>>> operational.
>>> iii) Should we not be using CLM like this? I.e. should we just learn
>>> to live with CLM/CPG sometimes reporting nodes as leaving when they're
>>> perfectly healthy?
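On (i): the membership timing knobs live in the totem { } section of
corosync.conf - 'join', 'token' and 'consensus' are the relevant ones
here.  The numbers below are only illustrative; check corosync.conf(5)
for the defaults and exact semantics before changing anything:

totem {
        version: 2

        # How long (in ms) to wait for join messages in the membership
        # protocol.  Raising this gives slow nodes more time to be seen
        # while the ring is forming.
        join: 100

        # Token rotation timeout (ms); consensus must be larger than token.
        token: 1000
        consensus: 1200
}

That said, timer tuning can only narrow the window here; the patches
discussed below are the real fix.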
>>>
>>> Thanks for your help.
>>> Tim
>>>
>>
>> Tim please try the patch I have recently posted:
>> [PATCH] Set my_new_memb_list in recovery enter
>>
>> First and foremost, let me know if it resolves your 10 node startup case
>> which fails 10% of the time.  Then let me know if it treats other symptoms.
>>
>> Regards
>> -steve
>>
>>
>>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <tlbe...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> We're booting up a 10-node cluster (with all nodes starting corosync
>>>> at roughly the same time) and approx 1 in 10 times we see some
>>>> problems:
>>>> a) CLM is reporting nodes as leaving and then immediately rejoining
>>>> (not sure if this is valid behaviour?)
>>>> b) Probably an unrelated oddity, but we're getting flow control
>>>> enabled on a client daemon using CLM that's only sending one request
>>>> (saClmClusterTrack()).
>>>> c) A node is hitting the FAILED TO RECEIVE case
>>>> d) After c) there seems to be a lot of churn as the cluster tries to
>>>> reform
>>>> e) During the processing of node leave events, the CPG client can
>>>> sometimes get broken so it no longer processes *any* CPG events
>>>>
>>>> Corosync debug is attached (I commented out some of the noisier debug
>>>> around message delivery). We don't really know enough about corosync
>>>> to tell what exactly is incorrect behaviour and what should be fixed.
>>>> But here's what we've noticed:
>>>> 1) Node-4 joins soon after node-1. When this happens all nodes except
>>>> node-12 have entered operational state (see node-12.txt line 235). It
>>>> looks like maybe node-12 hasn't received enough rotations of the token
>>>> to enter operational yet. Node-12's resulting transitional config
>>>> consists of just itself. All nodes then report node-1 and node-12 as
>>>> leaving and immediately rejoining.
>>>> 2) After this config change, node-3 eventually hits the FAILED TO
>>>> RECEIVE case (node-3.txt line 380). At this point node-1 and node-12
>>>> have an ARU matching the high_seq_received; all other nodes have an
>>>> ARU of zero.
>>>> 3) Node-3 entering gather seems to result in a lot of config change
>>>> churn across the cluster.
>>>> 4) While processing the config changes on node-3, the CPG downlist it
>>>> uses contains itself. When node-3 sends leave events for the nodes in
>>>> the downlist (including itself), it sets its own cpd state to
>>>> CPD_STATE_UNJOINED and clears the cpd->group_name. This means it no
>>>> longer sends any CPG events to the CPG client.
>>>>
>>>> We tried cherry-picking this commit to fix the problem (#4) with the
>>>> CPG client:
>>>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>>>> It helped a bit, but didn't fix it completely. We've made an interim
>>>> change (attached) to avoid this problem.
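On #4: one way to avoid the self-unjoin described there is a guard
roughly like the one below - never process the local node as a member of
its own downlist.  The names are simplified and hypothetical, not the
actual exec/cpg.c code and not the attached interim change:

#include <stddef.h>

/* Sketch only.  When delivering leave events for a downlist, skip the
 * local node: treating ourselves as having left is what flips the local
 * cpd state to CPD_STATE_UNJOINED and clears cpd->group_name, after
 * which no further CPG events reach the client. */
static void downlist_deliver_left_events (
        const unsigned int *downlist, size_t downlist_entries,
        unsigned int local_nodeid)
{
        size_t i;

        for (i = 0; i < downlist_entries; i++) {
                if (downlist[i] == local_nodeid) {
                        continue;       /* this node clearly has not left */
                }
                /* deliver a leave event for downlist[i] here */
        }
}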
>>>>
>>>> We're using corosync v1.3.1 on an embedded linux system (with a
>>>> low-spec CPU).
>>>> Corosync is running over a basic ethernet interface (no hubs/routers/etc).
>>>>
>>>> Any help would be appreciated. Let me know if there's any other debug I can
>>>> provide.
>>>>
>>>> Thanks,
>>>> Tim
>>>>
>>
>>

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
