On 08/09/2011 09:56 PM, Tim Beale wrote:
> Hi Steve,
> 
> Thanks for your patch.
> 
> 1. I don't see the initial CLM leave events. But I still see the FAILED TO
> RECEIVE hit on node-3. A couple of nodes don't enter operational on ring 20,
> then after the ring next reforms (ring 24), the FAILED TO RECEIVE happens.
> Attached is the latest debug.
> 

Keep in mind there are two problems here: (1) the CLM membership is wrong
and (2) the fail-to-recv problem.  They are independent issues.

I definitely want to look into this failed to receive issue.  Can you
try changing "fail_recv_const" on all the nodes to some large value,
such as 5000?
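
If it helps, a minimal corosync.conf sketch of where that knob lives
(totem section; the 2500 default is from memory of the man page, so
check it against your version):

totem {
        version: 2
        # default is 2500 token rotations without receiving a message we
        # are expecting; a much larger value delays the FAILED TO RECEIVE
        # transition
        fail_recv_const: 5000
        # keep your existing interface/timeout settings as they are
}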

One of 3 things should happen:
1. the protocol blocks forever
2. the protocol enters operational after some short period
3. fail to recv is printed after a long period of time (1-10 minutes).

Please report back which one happens with this tuning.
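
Roughly speaking, fail_recv_const bounds how many consecutive token
rotations a node will tolerate without receiving a message it knows is
outstanding before it declares FAILED TO RECEIVE and forms a new
configuration.  The toy standalone model below (illustration only, not
the totemsrp.c logic) shows why raising it to 5000 just pushes the trip
point out, which is what should separate outcomes 2 and 3 above:

/*
 * Toy model of the fail_recv_const check (illustration only, not the
 * totemsrp.c implementation).  A rotation in which an outstanding
 * message still fails to arrive bumps a counter; any progress resets
 * it; crossing the constant triggers FAILED TO RECEIVE.
 */
#include <stdbool.h>
#include <stdio.h>

struct recv_state {
    unsigned int fail_recv_const;    /* tunable, e.g. 5000 */
    unsigned int fail_to_recv_count; /* consecutive stalled rotations */
};

/* Returns true when the node would give up and re-form the ring. */
static bool token_rotation(struct recv_state *s, bool made_progress,
                           bool still_missing_msgs)
{
    if (!still_missing_msgs || made_progress) {
        s->fail_to_recv_count = 0;
        return false;
    }
    if (++s->fail_to_recv_count >= s->fail_recv_const) {
        printf("FAILED TO RECEIVE\n");
        return true;
    }
    return false;
}

int main(void)
{
    struct recv_state s = { .fail_recv_const = 5000,
                            .fail_to_recv_count = 0 };
    unsigned int rotations = 0;

    /* Simulate a ring that never recovers the missing message. */
    while (!token_rotation(&s, false, true))
        rotations++;
    printf("gave up after %u stalled rotations\n", rotations + 1);
    return 0;
}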


> I think the problem is some nodes end up missing a message/sequence-number,
> although I'm not sure exactly why. E.g. the token sequence starts off at one
> when they enter operational, but not all nodes receive this.
> 2011 Aug  9 10:07:18 daemon.debug node-3 corosync[1575]:   [TOTEM ]
> totemsrp.c:3785 retrans flag count 4 token aru 0 install seq 0 aru 0 1
> 
> The nodes that were still in recovery will be using different values for
> old_ring_state_high_seq_received and my_old_ring_id. It seems these nodes
> receive msg seq #1, but the others don't and hit the FAILED TO RECEIVE.
> 
> The debug attached has your first memb-list patch popped off, but I've
> seen the same problem happen with it applied too.
> 
> 2. Note that I don't see any CLM leave events at all now, even though
> after the FAILED TO RECEIVE, node-3 kicks all other nodes out of its
> ring. I think this is due to the logic:
>  diff = my_new_memb_list - my_memb_list

This isn't how the difference operation works.  It produces the list of
nodes that are not in both my_new_memb_list and my_memb_list, so the
current logic should be correct.  I wrote the patch at 2am and was quite
tired, though, so I'll double-check it.
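
To make the set arithmetic concrete, here is a small standalone
illustration (hypothetical helpers, not the actual memb_set_* functions):
subtracting in each direction, as you suggest further down, gives the
joined and left nodes separately, and their union covers every node that
changed.

/*
 * Standalone illustration of the membership difference being discussed
 * (hypothetical helpers, not the memb_set_* code): new \ old gives the
 * joined nodes, old \ new gives the left nodes.
 */
#include <stdio.h>

static int member(int id, const int *set, int n)
{
    for (int i = 0; i < n; i++)
        if (set[i] == id)
            return 1;
    return 0;
}

/* out = a minus b; returns the number of entries written to out */
static int subtract(int *out, const int *a, int an, const int *b, int bn)
{
    int n = 0;

    for (int i = 0; i < an; i++)
        if (!member(a[i], b, bn))
            out[n++] = a[i];
    return n;
}

int main(void)
{
    int my_memb_list[]     = { 1, 2, 3, 12 }; /* previous ring */
    int my_new_memb_list[] = { 1, 2, 3, 4 };  /* new ring */
    int joined[4], left[4];

    int nj = subtract(joined, my_new_memb_list, 4, my_memb_list, 4);
    int nl = subtract(left, my_memb_list, 4, my_new_memb_list, 4);

    printf("joined:");          /* -> "joined: 4" */
    for (int i = 0; i < nj; i++)
        printf(" %d", joined[i]);
    printf("\nleft:");          /* -> "left: 12" */
    for (int i = 0; i < nl; i++)
        printf(" %d", left[i]);
    printf("\n");
    return 0;
}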

Regards
-steve

> The diff doesn't include any nodes that are in my_memb_list but not in
> my_new_memb_list, i.e. left nodes. I guess you could get all the differences
> by doing the following:
>  memb_set_subtract( diff1, my_new_memb_list, my_memb_list )
>  memb_set_subtract( diff2, my_memb_list, my_new_memb_list )
>  memb_set_and( diff1, diff2, diff )
> 
> Thanks,
> Tim
> 
> On Mon, Aug 8, 2011 at 9:45 PM, Steven Dake <sd...@redhat.com> wrote:
>> On 08/08/2011 12:10 AM, Tim Beale wrote:
>>> Hi Steve,
>>>
>>> Thanks for your help. I tried out your patch but the problem still
>>> occurs. The problem looks to me to be due to the ring-IDs used when
>>> forming the transitional memb-list, rather than the memb-list itself.
>>> The ring-ID of the nodes still in Recovery is older than the rest of the
>>> nodes who have already shifted to Operational.
>>>
>>> Attached is my attempt at fixing the problem. The idea is to delay the
>>> nodes processing a Memb-Join immediately after shifting to
>>> Operational, until the token has rotated the ring once.
>>>
>>> It doesn't quite work either though. The nodes are still re-entering
>>> gather before all have left recovery. This time it's due to processing
>>> a Merge-Detect message. One node has just started up and set itself to
>>> the rep, and sends out a Merge-Detect which triggers the other nodes
>>> to enter gather and reform the ring.
>>>
>>> Let me know if you have any other advice.
>>>
>>
>> The problem is clear from the blackbox - 8 nodes enter operational while
>> 1 in recovery is interrupted by a join message.  This interrupted node
>> then proceeds with a transitional membership of 1 node (which is correct).
>>
>> The joined and left lists use the transitional list to determine their
>> contents, which is not correct.  This results in incorrect data
>> delivered to clm.  Try the follow-up patch which should correctly
>> calculate the joined and left lists.
>>
>>
>>> Thanks,
>>> Tim
>>>
>>> On Mon, Aug 8, 2011 at 6:08 AM, Steven Dake <sd...@redhat.com> wrote:
>>>> On 08/03/2011 10:32 PM, Tim Beale wrote:
>>>>> Hi,
>>>>>
>>>>> It looks to me that the way the transition from Recovery to
>>>>> Operational works, we can't guarantee that all nodes in the ring have
>>>>> entered Operational before a node processes another Memb-Join message
>>>>> from a new node. E.g. we can't guarantee the token has rotated right
>>>>> the way around the ring.
>>>>>
>>>>> When this happens, the nodes still in Recovery will still use the
>>>>> older ring ID. So they won't get added to the transitional membership,
>>>>> and CLM will report leave events for these nodes. (Plus there might be
>>>>> other side-effects, like the FAILED TO RECEIVE problem - I haven't
>>>>> quite worked out why that's happening).
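
For illustration of the point above only (a deliberately simplified
standalone model; the real ring ID is more than a single counter and this
is not the totemsrp code), a node still carrying the older ring ID never
makes it into the transitional configuration, which is exactly what CLM
then reports as a leave:

/*
 * Simplified model of the effect described above (not the totemsrp
 * implementation): only nodes whose previous ring id matches the local
 * node's make it into the transitional configuration, so a node that
 * was still in Recovery on the older ring gets reported as leaving.
 */
#include <stdio.h>

struct node {
    int nodeid;
    unsigned int old_ring_id; /* ring the node last completed */
};

int main(void)
{
    const unsigned int my_old_ring_id = 24;
    const struct node new_memb[] = {
        { 1, 24 }, { 2, 24 }, { 3, 24 },
        { 12, 20 },  /* still in Recovery, carrying the older ring */
    };
    int trans_memb[4], n = 0;

    for (int i = 0; i < 4; i++)
        if (new_memb[i].old_ring_id == my_old_ring_id)
            trans_memb[n++] = new_memb[i].nodeid;

    printf("transitional membership:"); /* node 12 missing -> CLM leave */
    for (int i = 0; i < n; i++)
        printf(" %d", trans_memb[i]);
    printf("\n");
    return 0;
}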
>>>>>
>>>>
>>>> Thanks for the pointer here - patch on ml.
>>>>
>>>>> We are currently using CLM to check the health of a node, i.e. so we
>>>>> can detect if it locks up. My questions are:
>>>>> i) Are there config settings we could change to improve this, like
>>>>> increasing the 'join' timeout?
>>>>> ii) Should I try to make a code change to fix the problem? E.g. delay
>>>>> processing the Memb-Join message if the node's only just entered
>>>>> operational.
>>>>> iii) Should we not be using CLM like this? I.e. should we just learn
>>>>> to live with CLM/CPG sometimes reporting nodes as leaving when they're
>>>>> perfectly healthy?
>>>>>
>>>>> Thanks for your help.
>>>>> Tim
>>>>>
>>>>
>>>> Tim please try the patch I have recently posted:
>>>> [PATCH] Set my_new_memb_list in recovery enter
>>>>
>>>> First and foremost, let me know if it resolves your 10 node startup case
>>>> which fails 10% of the time.  Then let me know if it treats other symptoms.
>>>>
>>>> Regards
>>>> -steve
>>>>
>>>>
>>>>> On Wed, Aug 3, 2011 at 3:28 PM, Tim Beale <tlbe...@gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We're booting up a 10-node cluster (with all nodes starting corosync
>>>>>> at roughly the same time) and approx 1 in 10 times we see some
>>>>>> problems:
>>>>>> a) CLM is reporting nodes as leaving and then immediately rejoining
>>>>>> (not sure if this is valid behaviour?)
>>>>>> b) Probably an unrelated oddity, but we're getting flow control
>>>>>> enabled on a client daemon using CLM that's only sending one request
>>>>>> (saClmClusterTrack()).
>>>>>> c) A node is hitting the FAILED TO RECEIVE case
>>>>>> d) After c) there seems to be a lot of churn as the cluster tries to
>>>>>> reform
>>>>>> e) During the processing of node leave events, the CPG client can
>>>>>> sometimes get broken so it no longer processes *any* CPG events
>>>>>>
>>>>>> Corosync debug is attached (I commented out some of the noisier
>>>>>> debug around message delivery). We don't really know enough about
>>>>>> corosync to tell what exactly is incorrect behaviour and what should
>>>>>> be fixed. But here's what we've noticed:
>>>>>> 1) Node-4 joins soon after node-1. When this happens all nodes except
>>>>>> node-12 have entered operational state (see node-12.txt line 235). It
>>>>>> looks like maybe node-12 hasn't received enough rotations of the
>>>>>> token to enter operational yet. Node-12's resulting transitional
>>>>>> config consists of just itself. All nodes then report node-1 and
>>>>>> node-12 as leaving and immediately rejoining.
>>>>>> 2) After this config change, node-3 eventually hits the FAILED TO
>>>>>> RECEIVE case (node-3.txt line 380). At this point node-1 and node-12
>>>>>> have an ARU matching the high_seq_received, all other nodes have an
>>>>>> ARU of zero.
>>>>>> 3) Node-3 entering gather seems to result in a lot of config change
>>>>>> churn across the cluster.
>>>>>> 4) While processing the config changes on node-3, the CPG downlist it
>>>>>> uses contains itself. When node-3 sends leave events for the nodes in
>>>>>> the downlist (including itself), it sets its own cpd state to
>>>>>> CPD_STATE_UNJOINED and clears the cpd->group_name. This means it no
>>>>>> longer sends any CPG events to the CPG client.
>>>>>>
>>>>>> We tried cherry-picking this commit to fix the problem (#4) with the
>>>>>> CPG client.
>>>>>> http://www.corosync.org/git/?p=corosync.git;a=commit;h=956a1dcb4236acbba37c07e2ac0b6c9ffcb32577
>>>>>> It helped a bit, but didn't fix it completely. We've made an interim
>>>>>> change (attached) to avoid this problem.
>>>>>>
>>>>>> We're using corosync v1.3.1 on an embedded linux system (with a
>>>>>> low-spec CPU). Corosync is running over a basic ethernet interface
>>>>>> (no hubs/routers/etc).
>>>>>>
>>>>>> Any help would be appreciated. Let me know if there's any other
>>>>>> debug I can provide.
>>>>>>
>>>>
>>>>
>>>>
>>>>>> Thanks,
>>>>>> Tim
>>>>>>
>>>>
>>>>
>>
>>

