Re: [Openais] FAILED TO RECEIVE followed by cluster failure

Steven Dake Mon, 18 Jul 2011 16:01:14 -0700

On 07/18/2011 10:38 AM, Jed Smith wrote:
> Thank you for your reply.
> 
> On Mon, Jul 18, 2011 at 1:18 PM, Digimer <li...@alteeve.com> wrote:
>> Is it possible that the switch dropped the multicast group, and didn't
>> reform it fast enough to prevent the cluster from partitioning?
> 
> Our network guy says that the switches do not look at multicast
> traffic, they merely broadcast it in our environment.
>


Unlikely.  I expect what is happening is that your switch is delaying
multicast packets relative to the unicast token.  This causes
retransmits.  There is a bug in older versions of our totem
implementation that increments the fail to recv counter incorrectly.  In
newer versions we have worked around this flaw in the original totem
specification (which expects that multicast can be flushed before a token
receipt, which is an invalid assertion).

My recommendation to you is to update to the 1.3 or 1.4 series.  Both of
these have very tight maintenance rules around what goes in (i.e., it's not
tip development work).

Once you have a version that doesn't have known bugs, I'd recommend
increasing fail_recv_const to some large value, such as 5000.  See:

http://www.mail-archive.com/openais@lists.linux-foundation.org/msg05924.html
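
As a rough sketch, assuming a stock corosync.conf, the setting lives in
the totem section as fail_recv_const.  Only that one line is the change
being suggested here; keep whatever else your totem section already
contains (interface, token timeouts, and so on):

```
# /etc/corosync/corosync.conf (path is the usual default; yours may differ)
totem {
        version: 2

        # Number of token rotations without receiving expected messages
        # before a new configuration is formed.  5000 is the "large value"
        # suggested above, not a tuned number.
        fail_recv_const: 5000

        # ... existing interface and timeout settings stay as they are ...
}
```

The same value should be set on every node, and corosync restarted, for
it to take effect across the cluster.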

It would be nice if the Debian maintainers would update their packages
to the latest upstream.  We release z streams for a reason (usually the
reason being that someone has had a field failure resulting in a complete
cluster outage).  Y stream releases are a bit more liberal in terms of
additional features.

File a bug with your distro and ask them to use an upstream release
that is recent and still supported upstream (1.2.y upstream support fell
off once we released 1.4.y; we support two y streams).

Thanks
-steve

> Thanks,
> 

_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
