On 07/18/2011 07:55 PM, Keisuke MORI wrote:
> Hi,
>
> 2011/7/19 Steven Dake <[email protected]>:
>> On 07/18/2011 10:38 AM, Jed Smith wrote:
>>> Thank you for your reply.
>>>
>>> On Mon, Jul 18, 2011 at 1:18 PM, Digimer <[email protected]> wrote:
>>>> Is it possible that the switch dropped the multicast group, and didn't
>>>> reform it fast enough to prevent the cluster from partitioning?
>>>
>>> Our network guy says that the switches do not look at multicast
>>> traffic, they merely broadcast it in our environment.
>>>
>>
>> unlikely.  I expect what is happening is your switch is delaying
>> multicast packets compared to the unicast token.  This causes
>> retransmits.  There is a bug in older versions of our totem
>> implementation that increases the fail to recv counter incorrectly.  In
>> newer versions we have worked around this flaw in the original totem
>> specification (which expects multicast can be flushed before a token
>> receipt, which is an invalid assertion).
>>
>> My recommendation to you is to update to a 1.3 or 1.4 series.  Both of
>> these have very tight maintenance rules around what goes in (ie: it's not
>> tip development work).
>>
>> Once you have a version that doesn't have known bugs, I'd recommend
>> increasing fail recv const to some large value, such as 5000.  See:
>>
>> http://www.mail-archive.com/[email protected]/msg05924.html
>
> We discovered that the issue in that report was caused by a misbehavior
> of the IGMP snooping feature in the bridge interface:
> http://www.spinics.net/lists/netdev/msg166960.html
>
> Because of this, the bridge interface sometimes fails to handle IGMP
> packets properly, and multicast traffic may not be forwarded for a while
> although unicast traffic goes through fine, which confuses corosync.
>
> RHEL6.0 is affected at least, but RHEL5 is not affected because the RHEL5
> kernel does not implement IGMP snooping yet.
>
>
> You can work around it by either:
> 1) disabling the IGMP snooping feature
>    ex. echo 0 > /sys/class/net/br0/bridge/multicast_snooping
> 2) not using a bridge interface for corosync multicast traffic
>
>
> When we encountered this issue, we had assigned a multicast address to
> a bridge interface on top of a bonding interface.
> Assigning the IP address to the bonding interface instead did solve it.
> Increasing fail_recv_const did not actually solve it; it just
> delayed the occurrence.
>
> Hope it helps.
>
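For anyone who wants to check a host before applying workaround 1), here is a rough sketch built around the same sysfs file quoted above. The function names and the SYSFS_ROOT variable are made up for illustration (SYSFS_ROOT only exists so the functions can be tried against a scratch directory); the only thing taken from the report is the multicast_snooping path itself. Treat it as a starting point, not a tested procedure.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for dry runs; on a real host it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the current multicast_snooping value for one bridge
# (1 = snooping enabled, 0 = snooping disabled).
snooping_state() {
    cat "$SYSFS_ROOT/class/net/$1/bridge/multicast_snooping"
}

# Write 0 to the same file, as in the quoted workaround.
# On a real host this needs root.
disable_snooping() {
    echo 0 > "$SYSFS_ROOT/class/net/$1/bridge/multicast_snooping"
}

# List every interface that exposes the knob, with its current value.
list_bridges() {
    for f in "$SYSFS_ROOT"/class/net/*/bridge/multicast_snooping; do
        [ -e "$f" ] || continue                 # glob matched nothing
        br=${f%/bridge/multicast_snooping}      # strip the trailing path
        echo "${br##*/} $(cat "$f")"            # keep only the interface name
    done
}
```

After sourcing the file, `list_bridges` shows which bridges still have snooping on, and `disable_snooping br0` applies the workaround to br0. A value written through sysfs does not survive a reboot, so it has to be reapplied at boot by whatever mechanism the distribution provides.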
Thanks for the report.  I believe our workarounds for delayed multicast
packets will mask that kernel oddness, but can't guarantee it.  I'm
certain someone will find that information of value.

Regards
-steve

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
