Re: [Openais] FAILED TO RECEIVE followed by cluster failure

Keisuke MORI Mon, 18 Jul 2011 19:58:00 -0700

Hi,

2011/7/19 Steven Dake <sd...@redhat.com>:
> On 07/18/2011 10:38 AM, Jed Smith wrote:
>> Thank you for your reply.
>>
>> On Mon, Jul 18, 2011 at 1:18 PM, Digimer <li...@alteeve.com> wrote:
>>> Is it possible that the switch dropped the multicast group, and didn't
>>> reform it fast enough to prevent the cluster from partitioning?
>>
>> Our network guy says that the switches do not look at multicast
>> traffic, they merely broadcast it in our environment.
>>
>
> unlikely.  I expect what is happening is your switch is delaying
> multicast packets compared to the unicast token.  This causes
> retransmits.  There is a bug in older versions of our totem
> implementation that increase the fail to recv counter incorrectly.  In
> newer versions we have worked around this flaw in the original totem
> specification (which expects multicast can be flushed before a token
> receipt, which is an invalid assertion).
>
> My recommendation to you is to update to a 1.3 or 1.4 series.   Both of
> these have very tight maintenance rules around what goes in (ie: its not
> tip development work).
>
> Once you have a version that doesn't have known bugs, I'd recommend
> increasing fail recv const to some large value, such as 5000.  See:
>
> http://www.mail-archive.com/openais@lists.linux-foundation.org/msg05924.html


We found that the issue in that report was caused by a bug in the
IGMP snooping feature of the bridge interface:
http://www.spinics.net/lists/netdev/msg166960.html

Because of this, the bridge interface sometimes fails to handle IGMP
packets properly, and multicast traffic may not be forwarded for a while
even though unicast traffic passes normally. This confuses corosync.

RHEL 6.0, at least, is affected. RHEL 5 is not, because the RHEL 5 kernel
does not implement IGMP snooping yet.


You can work around it by either:
1) disabling the IGMP snooping feature
     ex. echo 0 > /sys/class/net/br0/bridge/multicast_snooping
2) not using a bridge interface for corosync multicast traffic
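
For option 1, here is a minimal sketch of how one might check the current
setting before changing it. It uses the same sysfs path as the example above.
The function names and the SYSFS_ROOT variable are my own, for illustration
only (SYSFS_ROOT just lets you try the functions against a scratch directory),
and br0 is only an example bridge name.

```shell
#!/bin/sh
# Sketch only: report and disable IGMP snooping on a Linux bridge.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.

snooping_file() {
    # $1 = bridge name, e.g. br0
    printf '%s/class/net/%s/bridge/multicast_snooping\n' "${SYSFS_ROOT:-/sys}" "$1"
}

show_snooping() {
    # Print the current value (1 = snooping enabled, 0 = disabled).
    f=$(snooping_file "$1")
    [ -r "$f" ] || { echo "$1: not a bridge, or no snooping support" >&2; return 1; }
    echo "$1: multicast_snooping=$(cat "$f")"
}

disable_snooping() {
    # Write 0 to the sysfs file; this normally requires root.
    f=$(snooping_file "$1")
    [ -w "$f" ] || { echo "$1: cannot write $f" >&2; return 1; }
    echo 0 > "$f"
}
```

After sourcing the file, run "show_snooping br0", then "disable_snooping br0"
as root. Note that a value written to sysfs does not survive a reboot, so it
has to be reapplied at boot time if you rely on it.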


When we encountered this issue, we had assigned the IP address used for
corosync multicast to a bridge interface on top of a bonding interface.
Assigning the IP address to the bonding interface instead solved it.
Increasing fail_recv_const did not actually solve it; it only
delayed the failure.

Hope it helps.

-- 
Keisuke MORI
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
