Re: [Openais] FAILED TO RECEIVE followed by cluster failure

Steven Dake Mon, 18 Jul 2011 22:20:42 -0700

On 07/18/2011 07:55 PM, Keisuke MORI wrote:
> Hi,
> 
> 2011/7/19 Steven Dake <[email protected]>:
>> On 07/18/2011 10:38 AM, Jed Smith wrote:
>>> Thank you for your reply.
>>>
>>> On Mon, Jul 18, 2011 at 1:18 PM, Digimer <[email protected]> wrote:
>>>> Is it possible that the switch dropped the multicast group, and didn't
>>>> reform it fast enough to prevent the cluster from partitioning?
>>>
>>> Our network guy says that the switches do not look at multicast
>>> traffic, they merely broadcast it in our environment.
>>>
>>
>> Unlikely.  I expect what is happening is that your switch is delaying
>> multicast packets relative to the unicast token.  This causes
>> retransmits.  There is a bug in older versions of our totem
>> implementation that increases the fail to recv counter incorrectly.  In
>> newer versions we have worked around this flaw in the original totem
>> specification (which expects that multicast can be flushed before a token
>> receipt, which is an invalid assertion).
>>
>> My recommendation to you is to update to the 1.3 or 1.4 series.  Both of
>> these have very tight maintenance rules around what goes in (i.e., it's
>> not tip development work).
>>
>> Once you have a version that doesn't have known bugs, I'd recommend
>> increasing fail_recv_const to some large value, such as 5000.  See:
>>
>> http://www.mail-archive.com/[email protected]/msg05924.html
> 
> We discovered that the issue in that report was caused by misbehavior
> of the IGMP snooping feature in the bridge interface:
> http://www.spinics.net/lists/netdev/msg166960.html
> 
> Because of this, the bridge interface sometimes fails to handle IGMP
> packets properly, and multicast traffic may not be forwarded for a
> while even though unicast traffic gets through fine, which confuses
> corosync.
> 
> RHEL6.0 at least is affected, but RHEL5 is not, because the RHEL5 kernel
> does not implement IGMP snooping yet.
> 
> 
> You can work around it by either:
> 1) disabling the IGMP snooping feature
>      e.g. echo 0 > /sys/class/net/br0/bridge/multicast_snooping
> 2) not using a bridge interface for corosync multicast traffic
> 
> 
> When we encountered this issue, we had assigned a multicast address to
> a bridge interface on top of a bonding interface.
> Assigning the IP address to the bonding interface instead did solve it.
> Increasing fail_recv_const did not actually solve it; it only
> delayed the failure.
> 
> Hope it helps.
>


Thanks for the report.  I believe our workarounds for delayed multicast
packets will mask that kernel oddness, but can't guarantee it.  I'm
certain someone will find that information of value.
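
For anyone applying the first workaround quoted above, here is a minimal
sketch that checks the snooping knob before and after changing it.  The
sysfs path is the one given in the quoted message; the SYSFS_NET variable
and the two function names are hypothetical helpers added only for
illustration, not part of corosync or the kernel tooling.

```shell
#!/bin/sh
# Sketch only: inspect and disable IGMP snooping on a Linux bridge.
# SYSFS_NET is an assumption (an overridable root so the helpers can be
# tried against a scratch directory); the default is the real sysfs path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the current multicast_snooping value for bridge $1 (1 = on, 0 = off).
snooping_state() {
    cat "$SYSFS_NET/$1/bridge/multicast_snooping"
}

# Write 0 to the knob, as in the quoted workaround.  Needs root on a real
# system.
disable_snooping() {
    echo 0 > "$SYSFS_NET/$1/bridge/multicast_snooping"
}
```

Typical use would be "snooping_state br0", then "disable_snooping br0",
then "snooping_state br0" again to confirm it now reads 0.  A value
written through sysfs is a runtime setting, so it would need to be
reapplied after a reboot or after the bridge is recreated.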

Regards
-steve
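
For reference, fail_recv_const as discussed in this thread is set in the
totem section of corosync.conf.  The fragment below is only a sketch:
5000 is the value suggested earlier in the thread, and the other
directives a real totem section needs (interface and so on) are left out
and assumed to be in place already.

```
totem {
        version: 2
        # Value suggested in this thread.  Per the report quoted above,
        # raising it only delayed the failure when IGMP snooping on a
        # bridge was the actual cause.
        fail_recv_const: 5000
}
```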
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
