Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Wengatz Herbert
Seeing that high drop rate (just compare it to the other NIC) - have you tried
a new cable? Maybe it's a cheap hardware problem...

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org
[mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
Sent: Thursday, 11 July 2013 11:20
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

On 2013-07-11T08:41:33, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

  For a really silly idea, but can you swap the network cards for a test?
  Say, with Intel NICs, or even another Broadcom model?
 Unfortunately no: The 4-way NIC is onboard, and all slots are full.

Too bad.

But then you really could try raising a support request about the network
driver; perhaps one of the kernel/networking gurus has an idea.

 RX packet drops. Maybe the bug is in the bonding code...
 bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 
 frame:0
 eth1: RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
 eth4: RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0
 
 Both cards are identical. I wonder: if the bonding mode is 
 fault-tolerance (active-backup), is it normal to see such 
 statistics? ethtool -S reports a high number for rx_filtered_packets...

Possibly. It'd be interesting to know which packets get dropped; this means you 
have approx. 10% of your traffic on the backup link. I wonder if all the 
nodes/switches/etc. agree on which port is the backup and which isn't...?
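
(To narrow that down - a minimal sketch, assuming the bond and slave names
from the output quoted above - each node can report which slave it currently
considers active, plus the Broadcom counter already mentioned:

# grep -iE 'currently active slave|mii status' /proc/net/bonding/bond0
# ethtool -S eth4 | grep rx_filtered_packets

If the nodes disagree on the active slave, that would explain traffic arriving
on the "backup" port.)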

If 10% of the communication ends up on the wrong NIC, that surely would mess up 
a number of recovery protocols.

An alternative test would be to see how the system behaves if you disable 
bonding - or, if the interface names should stay the same, with only one NIC in 
the bond.



Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Vladislav Bogdanov
Hi Dejan,

It seems like resource restart does not work any longer.

# crm resource restart test01-vm
INFO: ordering test01-vm to stop
Traceback (most recent call last):
  File "/usr/sbin/crm", line 44, in <module>
    main.run()
  File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 442, in run
  File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 349, in do_work
  File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 150, in parse_line
  File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 149, in <lambda>
  File "/usr/lib64/python2.6/site-packages/crmsh/ui.py", line 894, in restart
  File "/usr/lib64/python2.6/site-packages/crmsh/utils.py", line 429, in wait4dc
  File "/usr/lib64/python2.6/site-packages/crmsh/utils.py", line 544, in crm_msec
  File "/usr/lib64/python2.6/re.py", line 137, in match
TypeError: expected string or buffer
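
(The TypeError suggests crm_msec was handed a non-string value - presumably
None - by wait4dc.)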


Vladislav
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Vladislav Bogdanov
12.07.2013 12:06, Vladislav Bogdanov wrote:
 Hi Dejan,
 
 It seems like resource restart does not work any longer.

Ah, this seems to be fixed by bb39cce17f20. Sorry for the noise.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Lars Marowsky-Bree
On 2013-07-12T11:05:32, Wengatz Herbert <herbert.weng...@baaderbank.de> wrote:

 Seeing that high drop rate (just compare it to the other NIC) - have you 
 tried a new cable? Maybe it's a cheap hardware problem...

The drop rate is normal. A slave NIC in a bonded active/passive
configuration will drop all packets.

I do wonder why there's so much traffic on a supposedly passive NIC,
though.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Wengatz Herbert
Hmmm.

Please correct me if I'm wrong:
As I understand it, a number of packets go to BOTH NICs. Depending on which 
one is active and which is passive, the sum of all dropped packets should be 
equal to the number of received packets (plus or minus some drops for other 
reasons). So if one card drops 10% of the packets, the other should drop 90% 
of the packets. - This is not the case here.
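
(Checking that against the numbers posted earlier: eth1 received 192,885,954
packets and dropped 21; eth4 received 18,841,956 and dropped all 18,841,956 of
them - and 192,885,954 + 18,841,956 is exactly bond0's 211,727,910. So the
backup NIC drops 100% of what it sees, but it sees only about 9% of the total
traffic, not 90%.)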

Regards,
Herbert

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org
[mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
Sent: Friday, 12 July 2013 11:09
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

On 2013-07-12T11:05:32, Wengatz Herbert <herbert.weng...@baaderbank.de> wrote:

 Seeing that high drop rate (just compare it to the other NIC) - have you 
 tried a new cable? Maybe it's a cheap hardware problem...

The drop rate is normal. A slave NIC in a bonded active/passive configuration 
will drop all packets.

I do wonder why there's so much traffic on a supposedly passive NIC, though.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Dejan Muhamedagic
Hi,

On Fri, Jul 12, 2013 at 12:08:12PM +0300, Vladislav Bogdanov wrote:
 12.07.2013 12:06, Vladislav Bogdanov wrote:
  Hi Dejan,
  
  It seems like resource restart does not work any longer.

Yes, I still wonder how that one slipped through.

 Ah, this seems to be fixed by bb39cce17f20. Sorry for the noise.

No need to be sorry, I appreciate every bug report.

Cheers,

Dejan

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Antw: crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Ulrich Windl
Vladislav Bogdanov <bub...@hoster-ok.com> wrote on 12.07.2013 at 11:06 in
message <51dfc715.2030...@hoster-ok.com>:
 Hi Dejan,
 
 It seems like resource restart does not work any longer.

BTW: The way resource restart is implemented (i.e.: stop & wait, then start) 
has a major problem: if the stop causes the node where the crm command is 
running to be fenced, the resource will remain stopped even after the node 
restarts.

 
 # crm resource restart test01-vm
 INFO: ordering test01-vm to stop
 Traceback (most recent call last):
   File "/usr/sbin/crm", line 44, in <module>
     main.run()
   File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 442, in run
   File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 349, in do_work
   File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 150, in parse_line
   File "/usr/lib64/python2.6/site-packages/crmsh/main.py", line 149, in <lambda>
   File "/usr/lib64/python2.6/site-packages/crmsh/ui.py", line 894, in restart
   File "/usr/lib64/python2.6/site-packages/crmsh/utils.py", line 429, in wait4dc
   File "/usr/lib64/python2.6/site-packages/crmsh/utils.py", line 544, in crm_msec
   File "/usr/lib64/python2.6/re.py", line 137, in match
 TypeError: expected string or buffer
 
 
 Vladislav



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Ulrich Windl
Lars Marowsky-Bree <l...@suse.com> wrote on 12.07.2013 at 11:08 in message
<20130712090853.gm19...@suse.de>:
 On 2013-07-12T11:05:32, Wengatz Herbert <herbert.weng...@baaderbank.de> wrote:
 
 Seeing that high drop rate (just compare it to the other NIC) - have you 
 tried a new cable? Maybe it's a cheap hardware problem...
 
 The drop rate is normal. A slave NIC in a bonded active/passive
 configuration will drop all packets.
 
 I do wonder why there's so much traffic on a supposedly passive NIC,
 though.

Lars,

that depends on the uptime. I think our network guys updated the firmware of
some switches, causing a switch reboot and a failover to a different bonding
slave.

Regards,
Ulrich

 
 
 Regards,
 Lars
 


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] crm node delete

2013-07-12 Thread Vladislav Bogdanov
01.07.2013 17:29, Vladislav Bogdanov wrote:
 Hi,
 
 I'm trying to see whether it is now safe to delete non-running nodes
 (corosync 2.3, pacemaker HEAD, crmsh tip).
 
 # crm node delete v02-d
 WARNING: 2: crm_node bad format: 7 v02-c
 WARNING: 2: crm_node bad format: 8 v02-d
 WARNING: 2: crm_node bad format: 5 v02-a
 WARNING: 2: crm_node bad format: 6 v02-b
 INFO: 2: node v02-d not found by crm_node
 INFO: 2: node v02-d deleted
 #
 
 So I suspect that crmsh still doesn't follow the latest changes to 'crm_node
 -l', although the node seems to be deleted correctly.
 
 For reference, output of crm_node -l is:
 7 v02-c
 8 v02-d
 5 v02-a
 6 v02-b
 

With the latest merge of Andrew's public and private trees and crmsh tip,
everything works as expected.
The only (minor but confusing) issue is:

[root@vd01-a ~]# crm_node -l
3 vd01-c
4 vd01-d
1 vd01-a
2 vd01-b
[root@vd01-a ~]# crm_node -p
vd01-c vd01-a vd01-b
[root@vd01-a ~]# crm node delete vd01-d
WARNING: crm_node --force -R vd01-d failed, rc=1

Looks like a missing crm_exit(pcmk_ok) for -R in try_corosync().

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Lars Marowsky-Bree
On 2013-07-12T12:19:40, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

 BTW: The way resource restart is implemented (i.e.: stop & wait, then 
 start) has a major problem: if the stop causes the node where the crm 
 command is running to be fenced, the resource will remain stopped even 
 after the node restarts.

Yes. That's a limitation that's difficult to overcome. restart is a
multi-phase command, so if something happens to the node where it runs,
you have a problem.

But the need for a manual restart should be rare. If the resource is
running and healthy according to monitor, why should it be necessary?
;-)

(Another way to trigger a restart is to modify the instance parameters.
Set __manual_restart=1 and it'll restart.)
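
(For example - a minimal sketch using crmsh's resource param command, with the
resource name from the other thread; __manual_restart is just an arbitrary,
otherwise-unused attribute name:

# crm resource param test01-vm set __manual_restart 1

Any change to a non-reloadable instance parameter normally makes the cluster
schedule a stop/start of the resource.)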


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM (dropped packets on bonding device)

2013-07-12 Thread Ulrich Windl
Wengatz Herbert <herbert.weng...@baaderbank.de> wrote on 12.07.2013 at 11:19
in message <e0a8d3556d452c42977202b2d60934660431fae...@msx2.baag>:
 Hmmm.
 
 Please correct me if I'm wrong:
 As I understand it, a number of packets go to BOTH NICs. Depending on which 
 one is active and which is passive, the sum of all dropped packets should be 
 equal to the number of received packets (plus or minus some drops for other 
 reasons). So if one card drops 10% of the packets, the other should drop 90% 
 of the packets. - This is not the case here.

I haven't added up all the numbers, but it's also quite confusing that the
dropped packets are propagated up to the bonding master: if dropping packets
is part of the bonding implementation, those drops should be hidden at the
bonding level. If you have a bonding device with four slaves in active/passive
(being paranoid), you would see three times as many dropped packets as
received packets, right?

(I adjusted the subject for this discussion)

 
 Regards,
 Herbert
 
 -----Original Message-----
 From: linux-ha-boun...@lists.linux-ha.org
 [mailto:linux-ha-boun...@lists.linux-ha.org] On behalf of Lars Marowsky-Bree
 Sent: Friday, 12 July 2013 11:09
 To: General Linux-HA mailing list
 Subject: Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM
 
 On 2013-07-12T11:05:32, Wengatz Herbert <herbert.weng...@baaderbank.de> wrote:
 
 Seeing that high drop rate (just compare it to the other NIC) - have you 
 tried a new cable? Maybe it's a cheap hardware problem...
 
 The drop rate is normal. A slave NIC in a bonded active/passive 
 configuration will drop all packets.
 
 I do wonder why there's so much traffic on a supposedly passive NIC, though.
 
 
 Regards,
 Lars
 


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Antw: crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Ulrich Windl
Lars Marowsky-Bree <l...@suse.com> wrote on 12.07.2013 at 12:23 in message
<20130712102340.go19...@suse.de>:
 On 2013-07-12T12:19:40, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
 
 BTW: The way resource restart is implemented (i.e.: stop & wait, then 
 start) has a major problem: if the stop causes the node where the crm 
 command is running to be fenced, the resource will remain stopped even 
 after the node restarts.
 
 Yes. That's a limitation that's difficult to overcome. restart is a
 multi-phase command, so if something happens to the node where it runs,
 you have a problem.
 
 But the need for a manual restart should be rare. If the resource is
 running and healthy according to monitor, why should it be necessary?
 ;-)
 
 (Another way to trigger a restart is to modify the instance parameters.
 Set __manual_restart=1 and it'll restart.)

once? ;-)



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Lars Marowsky-Bree
On 2013-07-12T12:26:18, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:

  (Another way to trigger a restart is to modify the instance parameters.
  Set __manual_restart=1 and it'll restart.)
 once? ;-)

Keep increasing it. ;-)


-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] crm node delete

2013-07-12 Thread Andrew Beekhof

On 12/07/2013, at 8:23 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:

 01.07.2013 17:29, Vladislav Bogdanov wrote:
 Hi,
 
  I'm trying to see whether it is now safe to delete non-running nodes
  (corosync 2.3, pacemaker HEAD, crmsh tip).
 
 # crm node delete v02-d
 WARNING: 2: crm_node bad format: 7 v02-c
 WARNING: 2: crm_node bad format: 8 v02-d
 WARNING: 2: crm_node bad format: 5 v02-a
 WARNING: 2: crm_node bad format: 6 v02-b
 INFO: 2: node v02-d not found by crm_node
 INFO: 2: node v02-d deleted
 #
 
  So I suspect that crmsh still doesn't follow the latest changes to 'crm_node
  -l', although the node seems to be deleted correctly.
 
 For reference, output of crm_node -l is:
 7 v02-c
 8 v02-d
 5 v02-a
 6 v02-b
 
 
 With the latest merge of Andrew's public and private trees and crmsh tip,
 everything works as expected.
 The only (minor but confusing) issue is:
 
 [root@vd01-a ~]# crm_node -l
 3 vd01-c
 4 vd01-d
 1 vd01-a
 2 vd01-b
 [root@vd01-a ~]# crm_node -p
 vd01-c vd01-a vd01-b
 [root@vd01-a ~]# crm node delete vd01-d
 WARNING: crm_node --force -R vd01-d failed, rc=1
 
 Looks like a missing crm_exit(pcmk_ok) for -R in try_corosync().

Done. Thanks for testing.

 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Adding node in advance

2013-07-12 Thread Vladislav Bogdanov
Hi,

I wanted to add a new node to the CIB in advance, before it is powered on
(to power it on in standby mode while cl#5169 is not yet implemented).

So, I did
==
[root@vd01-a tmp]# cat u
node $id="4" vd01-d \
        attributes standby="on" virtualization="true"
[root@vd01-a tmp]# crm configure load update u
ERROR: 4: invalid object id
==

Exactly the same syntax is accepted for an already-known node.

This is corosync-2.3.1 with nodelist/udpu, pacemaker master, and crmsh tip.
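
(A possible workaround - an untested sketch, reusing the node name and id from
above - might be to create the bare node entry with cibadmin first, and only
then load the attributes:

# cibadmin --create -o nodes -X '<node id="4" uname="vd01-d"/>'
# crm configure load update u
)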

Vladislav
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Andrew Beekhof

On 12/07/2013, at 8:23 PM, Lars Marowsky-Bree <l...@suse.com> wrote:

 On 2013-07-12T12:19:40, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
 
 BTW: The way resource restart is implemented (i.e.: stop & wait, then 
 start) has a major problem: if the stop causes the node where the crm 
 command is running to be fenced, the resource will remain stopped even 
 after the node restarts.
 
 Yes. That's a limitation that's difficult to overcome. restart is a
 multi-phase command, so if something happens to the node where it runs,
 you have a problem.

crm_resource --force-stop -r resource_name and look for the recovery?

You can even add -V if you want to go blind.
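
(For example, with the resource name from the earlier thread - a sketch:
--force-stop executes the agent's stop action outside the cluster's control,
so the next recurring monitor reports the resource as failed and the cluster
recovers it:

# crm_resource --force-stop -r test01-vm -V
)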

 
 But the need for a manual restart should be rare. If the resource is
 running and healthy according to monitor, why should it be necessary?
 ;-)
 
 (Another way to trigger a restart is to modify the instance parameters.
 Set __manual_restart=1 and it'll restart.)
 
 
 Regards,
Lars
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems