>>> Lars Marowsky-Bree <l...@suse.com> wrote on 10.07.2013 at 23:56 in message
<20130710215655.ge5...@suse.de>:
> On 2013-07-10T14:33:12, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> > Network problems in hypervisors though also have a tendency to be, well,
>> > due to the hypervisor, or some network cards (broadcom?).
>>
>> Yes:
>> driver: bnx2
>> version: 2.1.11
>> firmware-version: bc 5.2.3 NCSI 2.0.12
>
> For a really silly idea, but can you swap the network cards for a test?
> Say, with Intel NICs, or even another Broadcom model?
Unfortunately not: the 4-port NIC is onboard, and all slots are full.

>
>> > Can this be reproduced with another high network load pattern? Packet
>> > loss etc?
>> No, but TCP handles packet loss more gracefully than the cluster, it seems.
>
> A single lost packet shouldn't cause that, I think. (There may, of
> course, also be more problems hidden in corosync.) Anything showing up
> on the ifconfig stats or with a ping flood?

I noticed a significant number of "dropped" RX frames, about 8%. But that was
on a bonding device. Interestingly, the physical interfaces involved did not
all have that many RX packet drops. Maybe the bug is in the bonding code...

bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
eth1:  RX packets:192885954 errors:0 dropped:21       overruns:0 frame:0
eth4:  RX packets:18841956  errors:0 dropped:18841956 overruns:0 frame:0

Both cards are identical. I wonder: if the bonding mode is "fault-tolerance
(active-backup)", is it normal to see such statistics? ethtool -S reports a
high number for "rx_filtered_packets"...

>
> Some network card and hypervisor combos apparently don't play well with
> multicast, either. You could also try switching to unicast
> communication.

I would only do that as a last resort.

>
> And if this reproduces, you could try the SP3 update which ought to be
> mirrored out now (which includes a corosync update and a kernel refresh;
> corosync 1.4.6 is already in the maintenance queue).

In the near future, yes.

Regards,
Ulrich

>
> Sometimes I think the worst part about distributed processes is that they
> have to rely on networking. But then I remember they rely on human
> programmers too and the network isn't looking so bad any more ;-)
>
>
>> >> Is there any perspective to see the light at the end of the tunnel? The
>> >> problems should be easily reproducible.
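As a side note, the "like 8%" estimate can be checked directly from the
counters quoted above; here is a minimal shell sketch using the bond0
numbers from that one ifconfig snapshot (so the percentage only reflects
that moment, not a steady-state rate):

```shell
# Compute the RX drop percentage from the bond0 counters quoted above.
# Both values are copied verbatim from the ifconfig snapshot in this thread.
rx_packets=211727910
rx_dropped=18996906

# awk does the floating-point division; POSIX shell arithmetic is integer-only.
awk -v p="$rx_packets" -v d="$rx_dropped" \
    'BEGIN { printf "bond0 RX drop rate: %.2f%%\n", 100 * d / p }'
```

This prints a drop rate of 8.97%, consistent with the rough 8% figure above.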
>> > Bugs that get reported have a chance of being fixed ;-)
>> One more bug and my support engineer kills me ;-)
>
> There's already a bounty on your head, it can't get any worse ;-)
>
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems