Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
Seeing the high drop rate... (just compare this to the other NIC) - have you tried a new cable? Maybe it's a cheap hardware problem...

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
Sent: Thursday, 11 July 2013 11:20
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

On 2013-07-11T08:41:33, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

>> For a really silly idea, but can you swap the network cards for a test? Say, with Intel NICs, or even another Broadcom model?
>
> Unfortunately no: The 4-way NIC is onboard, and all slots are full.

Too bad. But then you could really try raising a support request about the network driver; perhaps one of the kernel/networking gurus has an idea.

> RX packet drops. Maybe the bug is in the bonding code...
>
> bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
> eth1:  RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
> eth4:  RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0
>
> Both cards are identical. I wonder: if the bonding mode is fault-tolerance (active-backup), is it normal to see such statistics? ethtool -S reports a high number for rx_filtered_packets...

Possibly. It'd be interesting to know what packets get dropped; this means you have approx. 10% of your traffic on the backup link. I wonder if all the nodes/switches/etc. agree on which port is the backup and which isn't...?

If 10% of the communication ends up on the wrong NIC, that would surely mess up a number of recovery protocols.

An alternative test case would be to see how the system behaves if you disable bonding - or, if the names should stay the same, with only one NIC in the bond.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes."
-- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
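As an aside on the numbers quoted above: the drop counters from the ifconfig output can be sanity-checked with a little arithmetic. A sketch, with the figures simply copied from the thread:

```shell
#!/bin/sh
# Counters copied verbatim from the ifconfig output quoted in this thread.
BOND_RX=211727910 BOND_DROP=18996906
ETH1_RX=192885954 ETH4_RX=18841956

# Drop rate on the bond: roughly the share of traffic on the backup link
# (comes out near the "approx. 10%" mentioned in the thread).
awk -v rx="$BOND_RX" -v d="$BOND_DROP" \
    'BEGIN { printf "bond0 drop rate: %.1f%%\n", 100 * d / rx }'

# The two slaves together account for exactly the packets the bond saw.
awk -v rx="$BOND_RX" -v e1="$ETH1_RX" -v e4="$ETH4_RX" \
    'BEGIN { if (e1 + e4 == rx) print "eth1+eth4 RX == bond0 RX"; else print "mismatch" }'
```

Interestingly, eth1+eth4 RX adds up to bond0 RX exactly, while bond0 reports slightly more drops than the two slaves combined.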
Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:

> Seeing the high drop rate... (just compare this to the other NIC) - have you tried a new cable? Maybe it's a cheap hardware problem...

The drop rate is normal. A slave NIC in a bonded active/passive configuration will drop all packets. I do wonder why there's so much traffic on a supposedly passive NIC, though.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
Hmmm. Please correct me if I'm wrong: as I understand it, you have a number of packets that go to BOTH NICs. Depending on which one is the active and which the passive one, the sum of all dropped packets should equal the number of received packets (plus or minus some drops for other reasons). So if one card drops 10% of the packets, the other should drop 90% of the packets. That is not the case here.

Regards,
Herbert

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
Sent: Friday, 12 July 2013 11:09
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:

> Seeing the high drop rate... (just compare this to the other NIC) - have you tried a new cable? Maybe it's a cheap hardware problem...

The drop rate is normal. A slave NIC in a bonded active/passive configuration will drop all packets. I do wonder why there's so much traffic on a supposedly passive NIC, though.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
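The 10%/90% split argued for here doesn't show up in the quoted counters because, per Lars's explanation, the split is not proportional: the backup slave drops everything it receives, while the active slave keeps (almost) everything. A quick check against the numbers quoted earlier in the thread (a sketch; figures copied as-is):

```shell
#!/bin/sh
# Per-slave counters from the ifconfig output quoted earlier in the thread.
ETH1_RX=192885954 ETH1_DROP=21
ETH4_RX=18841956  ETH4_DROP=18841956

# On the backup slave of an active-backup bond, every received frame is
# dropped, so RX and dropped are identical there...
[ "$ETH4_RX" -eq "$ETH4_DROP" ] && echo "eth4 (backup) drops all of its RX"

# ...while the active slave drops almost nothing.
awk -v rx="$ETH1_RX" -v d="$ETH1_DROP" \
    'BEGIN { printf "eth1 (active) drops %.6f%% of its RX\n", 100 * d / rx }'
```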
Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
>>> Lars Marowsky-Bree l...@suse.com wrote on 12.07.2013 at 11:08 in message 20130712090853.gm19...@suse.de:
> On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:
>
>> Seeing the high drop rate... (just compare this to the other NIC) - have you tried a new cable? Maybe it's a cheap hardware problem...
>
> The drop rate is normal. A slave NIC in a bonded active/passive configuration will drop all packets. I do wonder why there's so much traffic on a supposedly passive NIC, though.

Lars, that depends on the uptime. I think our network guys had updated the firmware of some switches, causing a switch reboot and, I guess, a failover to a different bonding slave.

Regards,
Ulrich

> Regards,
>     Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM (dropped packets on bonding device)
>>> Wengatz Herbert herbert.weng...@baaderbank.de wrote on 12.07.2013 at 11:19 in message e0a8d3556d452c42977202b2d60934660431fae...@msx2.baag:
> Hmmm. Please correct me if I'm wrong: as I understand it, you have a number of packets that go to BOTH NICs. Depending on which one is the active and which the passive one, the sum of all dropped packets should equal the number of received packets (plus or minus some drops for other reasons). So if one card drops 10% of the packets, the other should drop 90% of the packets. That is not the case here.

I haven't added up all the numbers, but it's also quite confusing that the dropped packets are pushed up to the bonding master: if dropping packets is part of the bonding implementation, the number of dropped packets should be hidden at the bonding level. And if you have a bonding device with four slaves in active/passive (being paranoid), you should see three times as many dropped packets as received packets, right?

(I adjusted the subject for this discussion)

> Regards,
> Herbert
>
> -----Original Message-----
> From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
> Sent: Friday, 12 July 2013 11:09
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
>
> On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:
>
>> Seeing the high drop rate... (just compare this to the other NIC) - have you tried a new cable? Maybe it's a cheap hardware problem...
>
> The drop rate is normal. A slave NIC in a bonded active/passive configuration will drop all packets. I do wonder why there's so much traffic on a supposedly passive NIC, though.
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes."
-- Oscar Wilde
Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
>>> Lars Marowsky-Bree l...@suse.com wrote on 10.07.2013 at 23:56 in message 20130710215655.ge5...@suse.de:
> On 2013-07-10T14:33:12, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
>
>>> Network problems in hypervisors though also have a tendency to be, well, due to the hypervisor, or some network cards (broadcom?).
>> Yes: driver: bnx2 version: 2.1.11 firmware-version: bc 5.2.3 NCSI 2.0.12
>
> For a really silly idea, but can you swap the network cards for a test? Say, with Intel NICs, or even another Broadcom model?

Unfortunately no: the 4-way NIC is onboard, and all slots are full.

>>> Can this be reproduced with another high network load pattern? Packet loss etc?
>> No, but TCP handles packet loss more gracefully than the cluster, it seems.
>
> A single lost packet shouldn't cause that, I think. (There may, of course, also be more problems hidden in corosync.) Anything showing up on the ifconfig stats or with a ping flood?

I noticed significant dropped RX frames, like 8%. But that was on a bonding device; interestingly, the physical interfaces involved did not all have so many RX packet drops. Maybe the bug is in the bonding code...

bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
eth1:  RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
eth4:  RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0

Both cards are identical. I wonder: if the bonding mode is fault-tolerance (active-backup), is it normal to see such statistics? ethtool -S reports a high number for rx_filtered_packets...

> Some network card and hypervisor combos apparently don't play well with multicast, either. You could also try switching to unicast communication.

I would only do that as a last resort.

> And if this reproduces, you could try the SP3 update which ought to be mirrored out now (which includes a corosync update and a kernel refresh; corosync 1.4.6 is already in the maintenance queue).

In the near future, yes.

Regards,
Ulrich

> Sometimes I think the worst part about distributed processes is that it has to rely on networking. But then I remember it relies on human programmers too, and the network isn't looking so bad any more ;-)
>
>> Is there any perspective to see the light at the end of the tunnel? The problems should be easily reproducible.
>
> Bugs that get reported have a chance of being fixed ;-)
>
>> One more bug and my support engineer kills me ;-)
>
> There's already a bounty on your head, it can't get any worse ;-)
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
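Switching corosync from multicast to unicast, as suggested in the quoted message, amounts to setting the udpu transport and listing the cluster members explicitly. A minimal sketch for corosync 1.x; the bind network and member addresses below are made-up placeholders, not values from this thread:

```
totem {
    version: 2
    # unicast UDP instead of the default multicast
    transport: udpu
    interface {
        ringnumber: 0
        # placeholder network address
        bindnetaddr: 192.168.100.0
        member {
            # placeholder node addresses; one member block per node
            memberaddr: 192.168.100.1
        }
        member {
            memberaddr: 192.168.100.2
        }
    }
}
```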
Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
On 2013-07-11T08:41:33, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

>> For a really silly idea, but can you swap the network cards for a test? Say, with Intel NICs, or even another Broadcom model?
>
> Unfortunately no: The 4-way NIC is onboard, and all slots are full.

Too bad. But then you could really try raising a support request about the network driver; perhaps one of the kernel/networking gurus has an idea.

> RX packet drops. Maybe the bug is in the bonding code...
>
> bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
> eth1:  RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
> eth4:  RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0
>
> Both cards are identical. I wonder: if the bonding mode is fault-tolerance (active-backup), is it normal to see such statistics? ethtool -S reports a high number for rx_filtered_packets...

Possibly. It'd be interesting to know what packets get dropped; this means you have approx. 10% of your traffic on the backup link. I wonder if all the nodes/switches/etc. agree on which port is the backup and which isn't...?

If 10% of the communication ends up on the wrong NIC, that would surely mess up a number of recovery protocols.

An alternative test case would be to see how the system behaves if you disable bonding - or, if the names should stay the same, with only one NIC in the bond.

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
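On the Linux side, which port the kernel currently considers active can be read from /proc/net/bonding/bond0, which is one way to check whether a node and its switch agree on the backup port. A sketch that parses a canned, abridged sample of that file, since the real file only exists on a host with bonding configured:

```shell
#!/bin/sh
# Abridged sample of /proc/net/bonding/bond0; on a real node you would
# read the file itself instead of this here-doc.
sample=$(cat <<'EOF'
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth1
MII Status: up
Slave Interface: eth1
MII Status: up
Slave Interface: eth4
MII Status: up
EOF
)

# Extract the slave the kernel currently considers active; sustained
# traffic on any other slave is landing on the backup link.
active=$(printf '%s\n' "$sample" | sed -n 's/^Currently Active Slave: //p')
echo "active slave: $active"
```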
Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
On 2013-07-10T14:33:12, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

>> Network problems in hypervisors though also have a tendency to be, well, due to the hypervisor, or some network cards (broadcom?).
>
> Yes: driver: bnx2 version: 2.1.11 firmware-version: bc 5.2.3 NCSI 2.0.12

For a really silly idea, but can you swap the network cards for a test? Say, with Intel NICs, or even another Broadcom model?

>> Can this be reproduced with another high network load pattern? Packet loss etc?
>
> No, but TCP handles packet loss more gracefully than the cluster, it seems.

A single lost packet shouldn't cause that, I think. (There may, of course, also be more problems hidden in corosync.)

Anything showing up on the ifconfig stats or with a ping flood?

Some network card and hypervisor combos apparently don't play well with multicast, either. You could also try switching to unicast communication.

And if this reproduces, you could try the SP3 update which ought to be mirrored out now (which includes a corosync update and a kernel refresh; corosync 1.4.6 is already in the maintenance queue).

Sometimes I think the worst part about distributed processes is that it has to rely on networking. But then I remember it relies on human programmers too, and the network isn't looking so bad any more ;-)

> Is there any perspective to see the light at the end of the tunnel? The problems should be easily reproducible.

Bugs that get reported have a chance of being fixed ;-)

> One more bug and my support engineer kills me ;-)

There's already a bounty on your head, it can't get any worse ;-)

Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
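For the ping-flood check suggested above, the loss figure can be pulled out of ping's summary line. A sketch parsing a canned summary; the 0.9% figure is an invented example, and a real flood ping (e.g. ping -f -c 10000 to the peer) needs root:

```shell
#!/bin/sh
# Example summary line in the format printed by iputils ping; on a real
# node this would come from the actual ping run.
summary='10000 packets transmitted, 9910 received, 0.9% packet loss, time 8123ms'

# Extract the percentage preceding "% packet loss".
loss=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
echo "packet loss: ${loss}%"
```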