>>> Lars Marowsky-Bree <l...@suse.com> wrote on 10.07.2013 at 23:56 in message
<20130710215655.ge5...@suse.de>:
> On 2013-07-10T14:33:12, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> > Network problems in hypervisors though also have a tendency to be, well,
>> > due to the hypervisor, or some network cards (broadcom?).
>>
>> Yes:
>> driver: bnx2
>> version: 2.1.11
>> firmware-version: bc 5.2.3 NCSI 2.0.12
>
> For a really silly idea, but can you swap the network cards for a test?
> Say, with Intel NICs, or even another Broadcom model?
Unfortunately not: the 4-port NIC is onboard, and all slots are full.

>
>> > Can this be reproduced with another high network load pattern? Packet
>> > loss etc?
>> No, but TCP handles packet loss more gracefully than the cluster, it seems.
>
> A single lost packet shouldn't cause that, I think. (There may, of
> course, also be more problems hidden in corosync.) Anything showing up
> on the ifconfig stats or with a ping flood?

I noticed a significant number of "dropped" RX frames, about 8%. But that was
on a bonding device. Interestingly, the physical interfaces involved did not
all have that many RX packet drops. Maybe the bug is in the bonding code...

bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
eth1:  RX packets:192885954 errors:0 dropped:21       overruns:0 frame:0
eth4:  RX packets:18841956  errors:0 dropped:18841956 overruns:0 frame:0

Both cards are identical. I wonder: if the bonding mode is "fault-tolerance
(active-backup)", is it normal to see such statistics? ethtool -S reports a
high number for "rx_filtered_packets"...

>
> Some network card and hypervisor combos apparently don't play well with
> multicast, either. You could also try switching to unicast
> communication.

I would only do that as a last resort.

>
> And if this reproduces, you could try the SP3 update which ought to be
> mirrored out now (which includes a corosync update and a kernel refresh;
> corosync 1.4.6 is already in the maintenance queue).

In the near future, yes.

Regards,
Ulrich

>
> Sometimes I think the worst part about distributed processes is that they
> have to rely on networking. But then I remember they rely on human
> programmers too and the network isn't looking so bad any more ;-)
>
>
>> >> Is there any perspective to see the light at the end of the tunnel? The
>> >> problems should be easily reproducible.
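As a side note, the "like 8%" estimate can be checked directly from the
counters quoted above; here is a minimal shell sketch using the bond0
numbers from that one ifconfig snapshot (so the percentage only reflects
that moment, not a steady-state rate):

```shell
# Compute the RX drop percentage from the bond0 counters quoted above.
# Both values are copied verbatim from the ifconfig snapshot in this thread.
rx_packets=211727910
rx_dropped=18996906

# awk does the floating-point division; POSIX shell arithmetic is integer-only.
awk -v p="$rx_packets" -v d="$rx_dropped" \
    'BEGIN { printf "bond0 RX drop rate: %.2f%%\n", 100 * d / p }'
```

This prints a drop rate of 8.97%, consistent with the rough 8% figure above.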
>> > Bugs that get reported have a chance of being fixed ;-)
>> One more bug and my support engineer kills me ;-)
>
> There's already a bounty on your head, it can't get any worse ;-)
>
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems