Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Wengatz Herbert
Seeing the high drop rate... (just compare this to the other NIC) - have 
you tried a new cable? Maybe it's a cheap hardware problem...

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org 
[mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
Sent: Thursday, 11 July 2013 11:20
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

On 2013-07-11T08:41:33, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  For a really silly idea, but can you swap the network cards for a test?
  Say, with Intel NICs, or even another Broadcom model?
 Unfortunately no: The 4-way NIC is onboard, and all slots are full.

Too bad.

But then you could really try raising a support request about the network 
driver; perhaps one of the kernel/networking gurus has an idea.

 RX packet drops. Maybe the bug is in the bonding code...
 bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 
 frame:0
 eth1: RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
 eth4: RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0
 
 Both cards are identical. I wonder: if the bonding mode is 
 fault-tolerance (active-backup), is it then normal to see such 
 statistics? ethtool -S reports a high number for rx_filtered_packets...

Possibly. It'd be interesting to know what packets get dropped; this means you 
have approx. 10% of your traffic on the backup link. I wonder if all the 
nodes/switches/etc agree on what is the backup port and what isn't ...?

If 10% of the communication ends up on the wrong NIC, that surely would mess up 
a number of recovery protocols.

An alternative test case would be to see how the system behaves if you disable 
bonding - or, if the interface names should stay the same, with only one NIC in 
the bond.



Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Lars Marowsky-Bree
On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:

 Seeing the high drop rate... (just compare this to the other NIC) - have 
 you tried a new cable? Maybe it's a cheap hardware problem...

The drop rate is normal. A slave NIC in a bonded active/passive
configuration will drop all packets.

I do wonder why there's so much traffic on a supposedly passive NIC,
though.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Wengatz Herbert
Hmmm.

Please correct me if I'm wrong:
As I understand it, a number of packets go to BOTH NICs. Depending on which one
is the active and which the passive one, the sum of all dropped packets should
equal the number of received packets (plus or minus some drops for other
reasons). So if one card drops 10% of the packets, the other should drop 90% of
them. - This is not the case here.

Regards,
Herbert

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org 
[mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
Sent: Friday, 12 July 2013 11:09
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:

 Seeing the high drop rate... (just compare this to the other NIC) - have 
 you tried a new cable? Maybe it's a cheap hardware problem...

The drop rate is normal. A slave NIC in a bonded active/passive configuration 
will drop all packets.

I do wonder why there's so much traffic on a supposedly passive NIC, though.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Ulrich Windl
 Lars Marowsky-Bree l...@suse.com wrote on 12.07.2013 at 11:08 in message
20130712090853.gm19...@suse.de:
 On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:
 
 Seeing the high drop rate... (just compare this to the other NIC) - have
 you tried a new cable? Maybe it's a cheap hardware problem...
 
 The drop rate is normal. A slave NIC in a bonded active/passive
 configuration will drop all packets.
 
 I do wonder why there's so much traffic on a supposedly passive NIC,
 though.

Lars,

that depends on the uptime. I think our network guys updated the firmware of
some switches, causing a switch reboot and, I guess, a failover to a different
bonding slave.
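If so, the switchover should be visible in the bonding status; a quick check
(a rough sketch, assuming the bond device is bond0) would be something like:

    grep -E 'Currently Active Slave|Slave Interface|Link Failure Count' \
        /proc/net/bonding/bond0
    # "Link Failure Count" per slave shows how often that link went down,
    # and "Currently Active Slave" shows where the traffic goes right now.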

Regards,
Ulrich

 
 
 Regards,
 Lars
 
 -- 
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde
 

Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM (dropped packets on bonding device)

2013-07-12 Thread Ulrich Windl
 Wengatz Herbert herbert.weng...@baaderbank.de wrote on 12.07.2013 at 11:19
in message e0a8d3556d452c42977202b2d60934660431fae...@msx2.baag:
 Hmmm.
 
 Please correct me if I'm wrong:
 As I understand it, a number of packets go to BOTH NICs. Depending on which
 one is the active and which the passive one, the sum of all dropped packets
 should equal the number of received packets (plus or minus some drops for
 other reasons). So if one card drops 10% of the packets, the other should
 drop 90% of them. - This is not the case here.

I haven't added up all the numbers, but it's also quite confusing that the
dropped packets are pushed up to the bonding master: if the dropping is part of
the bonding implementation, the dropped packets should be hidden at the bonding
level. If you have a bonding device with four slaves in active/passive (being
paranoid), you should see three times as many dropped packets as received
packets, right?
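As a rough cross-check against the figures posted earlier in the thread (just a
sketch; the fourth receive column in /proc/net/dev is the drop counter):

    grep -E 'bond0|eth1|eth4' /proc/net/dev
    # eth1 dropped 21 and eth4 dropped 18841956, i.e. 18841977 together,
    # while bond0 reports 18996906 dropped - roughly 155000 more than the
    # slaves combined, so not all of bond0's drops are simply copied up
    # from a slave interface.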

(I adjusted the subject for this discussion)

 
 Regards,
 Herbert
 
 -----Original Message-----
 From: linux-ha-boun...@lists.linux-ha.org 
 [mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
 Sent: Friday, 12 July 2013 11:09
 To: General Linux-HA mailing list
 Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
 
 On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:
 
 Seeing the high drop rate... (just compare this to the other NIC) - have
 you tried a new cable? Maybe it's a cheap hardware problem...
 
 The drop rate is normal. A slave NIC in a bonded active/passive 
 configuration will drop all packets.
 
 I do wonder why there's so much traffic on a supposedly passive NIC,
though.
 
 
 Regards,
 Lars
 
 --
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde
 

Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-11 Thread Ulrich Windl
 Lars Marowsky-Bree l...@suse.com wrote on 10.07.2013 at 23:56 in message
20130710215655.ge5...@suse.de:
 On 2013-07-10T14:33:12, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de
wrote:
 
  Network problems in hypervisors though also have a tendency to be, well,
  due to the hypervisor, or some network cards (broadcom?).
 
 Yes:
 driver: bnx2
 version: 2.1.11
 firmware-version: bc 5.2.3 NCSI 2.0.12
 
 For a really silly idea, but can you swap the network cards for a test?
 Say, with Intel NICs, or even another Broadcom model?

Unfortunately no: The 4-way NIC is onboard, and all slots are full.

 
  Can this be reproduced with another high network load pattern? Packet
  loss etc?
 No, but TCP handles packet loss more gracefully than the cluster, it
seems.
 
 A single lost packet shouldn't cause that, I think. (There may, of
 course, also be more problems hidden in corosync.) Anything showing up
 on the ifconfig stats or with a ping flood?

I noticed a significant number of dropped RX frames, around 8%. But that was on
a bonding device. Interestingly, the physical interfaces involved did not all
have that many RX packet drops. Maybe the bug is in the bonding code...
bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
eth1: RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
eth4: RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0

Both cards are identical. I wonder: if the bonding mode is fault-tolerance
(active-backup), is it then normal to see such statistics? ethtool -S reports
a high number for rx_filtered_packets...
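For reference, the counters can be lined up per slave roughly like this (a
sketch, assuming the interface names above; rx_filtered_packets is a bnx2
driver statistic exposed via ethtool):

    for nic in eth1 eth4; do
        echo "== $nic =="
        ip -s link show "$nic" | grep -A1 'RX:'    # kernel RX packets/drops
        ethtool -S "$nic" | grep -i filtered       # driver-level filtered counters
    done
    cat /proc/net/bonding/bond0                    # bonding mode and active slave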

 
 Some network card and hypervisor combos apparently don't play well with
 multicast, either. You could also try switching to unicast
 communication.

I would only do that as a last resort.

 
 And if this reproduces, you could try the SP3 update which ought to be
 mirrored out now (which includes a corosync update and a kernel refresh;
 corosync 1.4.6 is already in the maintenance queue).

In the near future, yes.

Regards,
Ulrich

 
 Sometimes I think the worst part about distributed processes is that they
 have to rely on networking. But then I remember they rely on human
 programmers too, and the network isn't looking so bad any more ;-)
 
 
  Is there any prospect of seeing the light at the end of the tunnel? The
  problems should be easily reproducible.
  Bugs that get reported have a chance of being fixed ;-)
 One more bug and my support engineer kills me ;-)
 
 There's already a bounty on your head, it can't get any worse ;-)
 
 
 Regards,
 Lars
 
 -- 
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde
 

Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-11 Thread Lars Marowsky-Bree
On 2013-07-11T08:41:33, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  For a really silly idea, but can you swap the network cards for a test?
  Say, with Intel NICs, or even another Broadcom model?
 Unfortunately no: The 4-way NIC is onboard, and all slots are full.

Too bad.

But then you could really try raising a support request about the
network driver; perhaps one of the kernel/networking gurus has an idea.

 RX packet drops. Maybe the bug is in the bonding code...
 bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
 eth1: RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
 eth4: RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0
 
 Both cards are identical. I wonder: if the bonding mode is fault-tolerance
 (active-backup), is it then normal to see such statistics? ethtool -S reports
 a high number for rx_filtered_packets...

Possibly. It'd be interesting to know what packets get dropped; this
means you have approx. 10% of your traffic on the backup link. I wonder
if all the nodes/switches/etc agree on what is the backup port and what
isn't ...?

If 10% of the communication ends up on the wrong NIC, that surely would
mess up a number of recovery protocols.
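One way to see what is actually arriving on the supposedly passive slave would
be something like this (a sketch, assuming eth4 is the backup as in the
statistics above):

    # -p: don't force promiscuous mode, so the capture reflects what the
    #     slave would normally receive; -e prints link-level headers
    tcpdump -p -e -n -i eth4 -c 50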

An alternative test case would be to see how the system behaves if you
disable bonding - or, if the interface names should stay the same, with only
one NIC in the bond.
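For the single-NIC variant, the sysfs bonding interface can do that on the fly
(a sketch, assuming bond0 with eth1/eth4 as above):

    echo -eth4 > /sys/class/net/bond0/bonding/slaves   # detach the backup slave
    cat /proc/net/bonding/bond0                        # verify eth1 is now the only slave
    echo +eth4 > /sys/class/net/bond0/bonding/slaves   # re-add it afterwards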



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-10 Thread Lars Marowsky-Bree
On 2013-07-10T14:33:12, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  Network problems in hypervisors though also have a tendency to be, well,
  due to the hypervisor, or some network cards (broadcom?).
 
 Yes:
 driver: bnx2
 version: 2.1.11
 firmware-version: bc 5.2.3 NCSI 2.0.12

For a really silly idea, but can you swap the network cards for a test?
Say, with Intel NICs, or even another Broadcom model?

  Can this be reproduced with another high network load pattern? Packet
  loss etc?
 No, but TCP handles packet loss more gracefully than the cluster, it seems.

A single lost packet shouldn't cause that, I think. (There may, of
course, also be more problems hidden in corosync.) Anything showing up
on the ifconfig stats or with a ping flood?
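Something along these lines would exercise it (a sketch; the peer address is a
placeholder for the other node's ring address, and flood ping needs root):

    ifconfig bond0 | grep 'RX packets'     # note the dropped counter before the test
    ping -f -c 100000 192.168.1.2          # flood ping the peer; summary shows % loss
    ifconfig bond0 | grep 'RX packets'     # compare the counters afterwards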

Some network card and hypervisor combos apparently don't play well with
multicast, either. You could also try switching to unicast
communication.
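For corosync 1.4 that means the udpu transport with an explicit member list,
roughly like this in corosync.conf (a sketch; the addresses are placeholders
for the real ring0 network and nodes):

    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastport: 5405
            member {
                memberaddr: 192.168.1.1
            }
            member {
                memberaddr: 192.168.1.2
            }
        }
    }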

And if this reproduces, you could try the SP3 update which ought to be
mirrored out now (which includes a corosync update and a kernel refresh;
corosync 1.4.6 is already in the maintenance queue).

Sometimes I think the worst part about distributed processes is that they
have to rely on networking. But then I remember they rely on human
programmers too, and the network isn't looking so bad any more ;-)


  Is there any prospect of seeing the light at the end of the tunnel? The 
  problems should be easily reproducible.
  Bugs that get reported have a chance of being fixed ;-)
 One more bug and my support engineer kills me ;-)

There's already a bounty on your head, it can't get any worse ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde
