Re: [Pacemaker] heartbeat:anything resource not stop/monitoring after reboot

2013-09-05 Thread Andrew Beekhof

On 06/09/2013, at 1:23 AM, David Coulson  wrote:

> We patched and rebooted one of our clusters this morning - I verified that 
> pacemaker is the same version as before, and that it matches another similar cluster.
> 
> There is a resource in the cluster defined as:
> 
> primitive re-named-reload ocf:heartbeat:anything \
>params binfile="/usr/sbin/rndc" cmdline_options="reload"
> 
> This is the last resource in a group after the named:lsb and an ipaddr 
> resource, so named binds to the VIP
> 
> After the reboot the re-named-reload resource is all screwed up. The start 
> seems to work, but the monitor is failing and the stop doesn't work:
> 
> Sep  5 11:14:14 dresproddns02 lrmd[82091]:   notice: operation_finished: 
> re-named-reload_stop_0:582081 [ /usr/lib/ocf/resource.d/heartbeat/anything: 
> line 60: kill: (580334) - No such process ]
> Sep  5 11:14:14 dresproddns02 crmd[82092]:   notice: process_lrm_event: LRM 
> operation re-named-reload_stop_0 (call=33446, rc=0, cib-update=11044, 
> confirmed=true) ok
> Sep  5 11:14:15 dresproddns02 crmd[82092]:   notice: process_lrm_event: LRM 
> operation re-named-reload_start_0 (call=33450, rc=0, cib-update=11045, 
> confirmed=true) ok
> Sep  5 11:14:15 dresproddns02 lrmd[82091]:   notice: operation_finished: 
> re-named-reload_monitor_6:582121 [ 
> /usr/lib/ocf/resource.d/heartbeat/anything: line 60: kill: (582109) - No such 
> process ]
> Sep  5 11:14:15 dresproddns02 crmd[82092]:   notice: process_lrm_event: LRM 
> operation re-named-reload_monitor_6 (call=33453, rc=1, cib-update=11046, 
> confirmed=false) unknown error
> 
> The ocf-tester fails on both clusters
> 
> ocf-tester -n reload -o binfile="/usr/sbin/rndc" -o cmdline_options="reload" 
> /usr/lib/ocf/resource.d/heartbeat/anything
> Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
> * rc=1: Monitoring an active resource should return 0
> * rc=1: Probing an active resource should return 0
> * Your agent does not support the notify action (optional)
> * Your agent does not support the demote action (optional)
> * Your agent does not support the promote action (optional)
> * Your agent does not support master/slave (optional)
> * rc=1: Monitoring an active resource should return 0
> * rc=1: Monitoring an active resource should return 0
> * Your agent does not support the reload action (optional)
> Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 4 tests
> 
> So, I guess the question is really - Why is it working at all on the cluster 
> it is working on? The rndc process doesn't hang around for more than a few 
> seconds, so the monitor should never really see it running.
> 
> I did copy over the heartbeat/anything script from the working environment to 
> the broken one, and we have the same issue.
> 
> Short of writing a resource that does a start and forces a rc=0 for 
> stop/monitor, any ideas why this is behaving the way it is?

I'm guessing there is a stale pid file lying around, or the way the agent 
calculates the pid of binfile is not smart enough.
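
A quick way to test that theory - a hedged sketch, assuming the agent keeps its 
pid in a file such as /var/run/anything-<resource>.pid (the actual path depends 
on the agent's pidfile parameter):

    # Look for a stale pid file left behind by ocf:heartbeat:anything
    PIDFILE=/var/run/anything-re-named-reload.pid   # assumed location
    if [ -f "$PIDFILE" ]; then
        PID=$(cat "$PIDFILE")
        kill -0 "$PID" 2>/dev/null || echo "stale pid file: $PIDFILE (pid $PID is gone)"
    fi

If the recorded pid no longer exists, removing the file (with the resource 
stopped) should clear the "kill: ... No such process" errors seen above.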


> 
> David
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org





[Pacemaker] heartbeat:anything resource not stop/monitoring after reboot

2013-09-05 Thread David Coulson
We patched and rebooted one of our clusters this morning - I verified 
that pacemaker is the same version as before, and that it matches another 
similar cluster.


There is a resource in the cluster defined as:

primitive re-named-reload ocf:heartbeat:anything \
params binfile="/usr/sbin/rndc" cmdline_options="reload"

This is the last resource in a group after the named:lsb and an ipaddr 
resource, so named binds to the VIP


After the reboot the re-named-reload resource is all screwed up. The 
start seems to work, but the monitor is failing and the stop doesn't work:


Sep  5 11:14:14 dresproddns02 lrmd[82091]:   notice: operation_finished: 
re-named-reload_stop_0:582081 [ 
/usr/lib/ocf/resource.d/heartbeat/anything: line 60: kill: (580334) - No 
such process ]
Sep  5 11:14:14 dresproddns02 crmd[82092]:   notice: process_lrm_event: 
LRM operation re-named-reload_stop_0 (call=33446, rc=0, 
cib-update=11044, confirmed=true) ok
Sep  5 11:14:15 dresproddns02 crmd[82092]:   notice: process_lrm_event: 
LRM operation re-named-reload_start_0 (call=33450, rc=0, 
cib-update=11045, confirmed=true) ok
Sep  5 11:14:15 dresproddns02 lrmd[82091]:   notice: operation_finished: 
re-named-reload_monitor_6:582121 [ 
/usr/lib/ocf/resource.d/heartbeat/anything: line 60: kill: (582109) - No 
such process ]
Sep  5 11:14:15 dresproddns02 crmd[82092]:   notice: process_lrm_event: 
LRM operation re-named-reload_monitor_6 (call=33453, rc=1, 
cib-update=11046, confirmed=false) unknown error


The ocf-tester fails on both clusters

ocf-tester -n reload -o binfile="/usr/sbin/rndc" -o 
cmdline_options="reload" /usr/lib/ocf/resource.d/heartbeat/anything

Beginning tests for /usr/lib/ocf/resource.d/heartbeat/anything...
* rc=1: Monitoring an active resource should return 0
* rc=1: Probing an active resource should return 0
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* rc=1: Monitoring an active resource should return 0
* rc=1: Monitoring an active resource should return 0
* Your agent does not support the reload action (optional)
Tests failed: /usr/lib/ocf/resource.d/heartbeat/anything failed 4 tests

So, I guess the question is really - Why is it working at all on the 
cluster it is working on? The rndc process doesn't hang around for more 
than a few seconds, so the monitor should never really see it running.


I did copy over the heartbeat/anything script from the working 
environment to the broken one, and we have the same issue.


Short of writing a resource that does a start and forces a rc=0 for 
stop/monitor, any ideas why this is behaving the way it is?
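
For reference, a minimal sketch of such a fire-and-forget agent (Dummy-style 
state file; the meta-data/validate actions and the state-file path are 
assumptions, so treat this as an illustration rather than a drop-in agent):

    #!/bin/sh
    # start: run the one-shot command and record success in a state file
    # monitor: report "running" while the state file exists
    # stop: remove the state file; always succeeds
    STATEFILE="${HA_VARRUN:-/var/run}/re-named-reload.state"   # assumed path
    case "$1" in
      start)   /usr/sbin/rndc reload || exit 1   # OCF_ERR_GENERIC
               touch "$STATEFILE"; exit 0 ;;     # OCF_SUCCESS
      stop)    rm -f "$STATEFILE"; exit 0 ;;     # OCF_SUCCESS
      monitor) [ -f "$STATEFILE" ] && exit 0     # OCF_SUCCESS
               exit 7 ;;                         # OCF_NOT_RUNNING
      *)       exit 3 ;;                         # OCF_ERR_UNIMPLEMENTED
    esac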


David



Re: [Pacemaker] Corosync quorum not updating on split node

2013-09-05 Thread Mark Round
Just a quick follow up - I had this answered on the Corosync mailing list 
(which I guess should have been the place for this anyway). As I was blocking 
all traffic with iptables, it was also blocking lo, which caused all sorts of 
things to break. As soon as I only blocked on eth0, things started working as 
expected.
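
For anyone reproducing this, an interface-scoped block looks something like the 
following (eth0 as the cluster interface is an assumption; use your ring 
interface):

    # Drop cluster traffic on eth0 only - loopback stays untouched
    iptables -A INPUT  -i eth0 -j DROP
    iptables -A OUTPUT -o eth0 -j DROP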

-Original Message-
From: Mark Round [mailto:mark.ro...@nccgroup.com]
Sent: 05 September 2013 11:44
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Corosync quorum not updating on split node

Hi all,

I have a problem whereby when I create a network split/partition (by dropping 
traffic with iptables), the victim node for some reason does not realise it has 
split from the network.

It seems to recognise that it can't form a cluster due to network issues, but 
the status is not reflected in the output from corosync-quorumtool, and cluster 
services (via pacemaker) still continue to run. However, the other nodes in the 
rest of the cluster do realise they have lost contact with a node, no longer 
have quorum and correctly shut down services.

When I block traffic on the victim node's eth0, the remaining nodes see that 
they cannot communicate with it and shut down:

# corosync-quorumtool -s
Version:  1.4.5
Nodes:3
Ring ID:  696
Quorum type:  corosync_votequorum
Quorate:  No
Node votes:   1
Expected votes:   7
Highest expected: 7
Total votes:  3
Quorum:   4 Activity blocked
Flags:

However, the victim node still thinks everything is fine, and maintains a view 
of the cluster prior to the split :

# corosync-quorumtool -s
Version:  1.4.5
Nodes:4
Ring ID:  716
Quorum type:  corosync_votequorum
Quorate:  Yes
Node votes:   1
Expected votes:   7
Highest expected: 7
Total votes:  4
Quorum:   4
Flags:Quorate

However, it does notice in the logs that it cannot now form cluster, as the 
following messages repeat constantly :

corosync [MAIN  ] Totem is unable to form a cluster because of an operating 
system or network fault. The most common cause of this message is that the 
local firewall is configured improperly.

I would expect at this point for it to be in its own network partition with a 
total of 1 vote, and block activity. However, this does not seem to happen 
until just after it rejoins the cluster. When I unblock traffic and it rejoins, 
I see the victim finally realise it had lost quorum :

Sep 05 09:52:21 corosync [pcmk  ] notice: pcmk_peer_update: Transitional 
membership event on ring 720: memb=1, new=0, lost=3
Sep 05 09:52:21 corosync [VOTEQ ] quorum lost, blocking activity
Sep 05 09:52:21 corosync [QUORUM] This node is within the non-primary component 
and will NOT provide any services.
Sep 05 09:52:21 corosync [QUORUM] Members[1]: 358898186

And a second or so later it regains quorum :

crmd:   notice: ais_dispatch_message: Membership 736: quorum acquired

So my question is why, when it realises it cannot form a cluster ("Totem is 
unable to form..."), does it not lose quorum, update the status as reflected 
by quorumtool and shutdown cluster services ?

Configuration file example and package versions/environment listed below. I'm 
using "updu" protocol as we need to avoid multicast in this environment; it 
will eventually be using a routed network. This behaviour also persists when I 
disable the pacemaker plugin and just test with corosync.

compatibility: whitetank
totem {
    version: 2
    secauth: off
    interface {
        member {
            memberaddr: 10.90.100.20
        }
        member {
            memberaddr: 10.90.100.21
        }
        ...
        ... more nodes snipped
        ...
        ringnumber: 0
        bindnetaddr: 10.90.100.20
        mcastport: 5405
    }
    transport: udpu
}
amf {
    mode: disabled
}
aisexec {
    user: root
    group: root
}
quorum {
    provider: corosync_votequorum
    expected_votes: 7
}
service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}

Environment : CentOS 6.4
Packages from OpenSUSE : 
http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/
# rpm -qa | egrep "^(cluster|corosync|crm|libqb|pacemaker|resource-agents)" | 
sort
cluster-glue-1.0.11-3.1.x86_64
cluster-glue-libs-1.0.11-3.1.x86_64
corosync-1.4.5-2.2.x86_64
corosynclib-1.4.5-2.2.x86_64
crmsh-1.2.6-0.rc3.3.1.x86_64
libqb0-0.14.4-1.2.x86_64
pacemaker-1.1.9-2.1.x86_64
pacemaker-cli-1.1.9-2.1.x86_64
pacemaker-cluster-libs-1.1.9-2.1.x86_64
pacemaker-libs-1.1.9-2.1.x86_64
resource-agents-3.9.5-3.1.x86_64

Regards,

-Mark





Mark Round
Senior Systems Administrator
NCC Group
Kings Court
Kingston Road
Leatherhead, KT22 7SL

Telephone: +44 1372 383815
Mobile: +44 7790 770413
Fax:
Website: www.nccgroup.com
Email:  mark.ro...@nccgroup.com

Re: [Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)

2013-09-05 Thread Andreas Mock
Hi Heikki,

just some comments to help you debug this yourself.

1) The second output of crm_mon shows a resource IP_database
which is not shown in the initial crm_mon output and also
not in the config. => Reduce your problem/config to a
minimal reproducible case.

2) Enable logging and find out which node is the DC.
The logs there contain a great deal of information showing
what is going on. Hint: Open a terminal session with a
running tail -f on the log file. Watch it while issuing commands.
You'll get used to it.

3) The status of a drbd resource shown by crm_mon doesn't give
you all the information about the drbd devices. Have a look at
drbd-overview on both nodes (e.g. for the syncing status).

4) This setup CRIES for stonithing. Even in a test environment.
When a stonith happens (and you see it immediately) you
know something went wrong. This is a good indicator of
errors in agents or in the config. Believe me: as tedious as
stonithing is, it is equally valuable for getting hints about a
bad cluster state. On virtual machines stonithing is not as
painful as on real servers.

5) Is the drbd fencing script enabled? If so, under certain circumstances
-INF rules are inserted to prevent promotion on the "wrong" nodes.
You should grep for them: 'cibadmin -Q | grep '

6) crm_simulate -L -v gives you an output of the scores of
the resources on each node. I really don't know how to read it
exactly (is there documentation of that anywhere?), but it
gives you a hint where to look when resources don't start.
Especially the aggregation of stickiness values in groups is
sometimes misleading. (See the sketch after this list.)


7) Sometimes the behaviour of pacemaker changes between releases,
and it is possible that you hit a bug. But this is hard to find out.
One possibility: check a newer version.
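
For point 6, two invocations that make the scores easier to find (the 
-s/--show-scores flag is an assumption - check crm_simulate --help on your 
version):

    # Print allocation scores alongside the simulated cluster state
    crm_simulate -sL
    # Or filter the verbose output down to the score lines
    crm_simulate -L -v 2>&1 | grep -i score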

Hope this helps.

Best regards
Andreas Mock




-Original Message-
From: Heikki Manninen [mailto:h...@iki.fi] 
Sent: Thursday, 5 September 2013 14:08
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)

Hello,

I'm having a bit of a problem understanding what's going on with my simple
two-node demo cluster here. My resources come up correctly after restarting
the whole cluster but the LVM and Filesystem resources fail to start after a
single node restart or standby/unstandby (after node comes back online - why
do they even stop/start after the second node comes back?).

OS: CentOS 6.4 (cman stack)
Pacemaker: pacemaker-1.1.8-7.el6.x86_64
DRBD: drbd84-utils-8.4.3-1.el6.elrepo.x86_64

Everything is configured using: pcs-0.9.26-10.el6_4.1.noarch

Two DRBD resources configured and working: data01 & data02
Two nodes: pgdbsrv01.cl1.local & pgdbsrv02.cl1.local

Configuration:

node pgdbsrv01.cl1.local
node pgdbsrv02.cl1.local
primitive DRBD_data01 ocf:linbit:drbd \
 params drbd_resource="data01" \
 op monitor interval="30s"
primitive DRBD_data02 ocf:linbit:drbd \
 params drbd_resource="data02" \
 op monitor interval="30s"
primitive FS_data01 ocf:heartbeat:Filesystem \
 params device="/dev/mapper/vgdata01-lvdata01" directory="/data01"
fstype="ext4" \
 op monitor interval="30s"
primitive FS_data02 ocf:heartbeat:Filesystem \
 params device="/dev/mapper/vgdata02-lvdata02" directory="/data02"
fstype="ext4" \
 op monitor interval="30s"
primitive LVM_vgdata01 ocf:heartbeat:LVM \
 params volgrpname="vgdata01" exclusive="true" \
 op monitor interval="30s"
primitive LVM_vgdata02 ocf:heartbeat:LVM \
 params volgrpname="vgdata02" exclusive="true" \
 op monitor interval="30s"
group GRP_data01 LVM_vgdata01 FS_data01
group GRP_data02 LVM_vgdata02 FS_data02
ms DRBD_ms_data01 DRBD_data01 \
 meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
ms DRBD_ms_data02 DRBD_data02 \
 meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
colocation colocation-GRP_data01-DRBD_ms_data01-INFINITY inf: GRP_data01
DRBD_ms_data01:Master
colocation colocation-GRP_data02-DRBD_ms_data02-INFINITY inf: GRP_data02
DRBD_ms_data02:Master
order order-DRBD_data01-GRP_data01-mandatory : DRBD_data01:promote
GRP_data01:start
order order-DRBD_data02-GRP_data02-mandatory : DRBD_data02:promote
GRP_data02:start
property $id="cib-bootstrap-options" \
 dc-version="1.1.8-7.el6-394e906" \
 cluster-infrastructure="cman" \
 stonith-enabled="false" \
 no-quorum-policy="ignore" \
 migration-threshold="1"
rsc_defaults $id="rsc_defaults-options" \
 resource-stickiness="100"


1) After starting the cluster, everything runs happily:

Last updated: Tue Sep  3 00:11:13 2013
Last change: Tue Sep  3 00:05:15 2013 via cibadmin on pgdbsrv01.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.

Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]

Full list of resources:

Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
 Masters: [ pgdbsrv01.cl1.local ]

[Pacemaker] Resource ordering/colocating question (DRBD + LVM + FS)

2013-09-05 Thread Heikki Manninen
Hello,

I'm having a bit of a problem understanding what's going on with my simple 
two-node demo cluster here. My resources come up correctly after restarting the 
whole cluster but the LVM and Filesystem resources fail to start after a single 
node restart or standby/unstandby (after node comes back online - why do they 
even stop/start after the second node comes back?).

OS: CentOS 6.4 (cman stack)
Pacemaker: pacemaker-1.1.8-7.el6.x86_64
DRBD: drbd84-utils-8.4.3-1.el6.elrepo.x86_64

Everything is configured using: pcs-0.9.26-10.el6_4.1.noarch

Two DRBD resources configured and working: data01 & data02
Two nodes: pgdbsrv01.cl1.local & pgdbsrv02.cl1.local

Configuration:

node pgdbsrv01.cl1.local
node pgdbsrv02.cl1.local
primitive DRBD_data01 ocf:linbit:drbd \
 params drbd_resource="data01" \
 op monitor interval="30s"
primitive DRBD_data02 ocf:linbit:drbd \
 params drbd_resource="data02" \
 op monitor interval="30s"
primitive FS_data01 ocf:heartbeat:Filesystem \
 params device="/dev/mapper/vgdata01-lvdata01" directory="/data01" 
fstype="ext4" \
 op monitor interval="30s"
primitive FS_data02 ocf:heartbeat:Filesystem \
 params device="/dev/mapper/vgdata02-lvdata02" directory="/data02" 
fstype="ext4" \
 op monitor interval="30s"
primitive LVM_vgdata01 ocf:heartbeat:LVM \
 params volgrpname="vgdata01" exclusive="true" \
 op monitor interval="30s"
primitive LVM_vgdata02 ocf:heartbeat:LVM \
 params volgrpname="vgdata02" exclusive="true" \
 op monitor interval="30s"
group GRP_data01 LVM_vgdata01 FS_data01
group GRP_data02 LVM_vgdata02 FS_data02
ms DRBD_ms_data01 DRBD_data01 \
 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" 
notify="true"
ms DRBD_ms_data02 DRBD_data02 \
 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" 
notify="true"
colocation colocation-GRP_data01-DRBD_ms_data01-INFINITY inf: GRP_data01 
DRBD_ms_data01:Master
colocation colocation-GRP_data02-DRBD_ms_data02-INFINITY inf: GRP_data02 
DRBD_ms_data02:Master
order order-DRBD_data01-GRP_data01-mandatory : DRBD_data01:promote 
GRP_data01:start
order order-DRBD_data02-GRP_data02-mandatory : DRBD_data02:promote 
GRP_data02:start
property $id="cib-bootstrap-options" \
 dc-version="1.1.8-7.el6-394e906" \
 cluster-infrastructure="cman" \
 stonith-enabled="false" \
 no-quorum-policy="ignore" \
 migration-threshold="1"
rsc_defaults $id="rsc_defaults-options" \
 resource-stickiness="100"
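
One thing worth double-checking in the constraints above (an assumption, not 
something confirmed in this thread): the order constraints name the DRBD 
primitives (DRBD_data01), while the promote action belongs to the master/slave 
resources (DRBD_ms_data01) that the colocation constraints already reference. 
A corrected pair in crm syntax might look like:

order order-DRBD_data01-GRP_data01-mandatory inf: DRBD_ms_data01:promote GRP_data01:start
order order-DRBD_data02-GRP_data02-mandatory inf: DRBD_ms_data02:promote GRP_data02:start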


1) After starting the cluster, everything runs happily:

Last updated: Tue Sep  3 00:11:13 2013
Last change: Tue Sep  3 00:05:15 2013 via cibadmin on pgdbsrv01.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.

Online: [ pgdbsrv01.cl1.local pgdbsrv02.cl1.local ]

Full list of resources:

Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
 Masters: [ pgdbsrv01.cl1.local ]
 Slaves: [ pgdbsrv02.cl1.local ]
Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
 Masters: [ pgdbsrv01.cl1.local ]
 Slaves: [ pgdbsrv02.cl1.local ]
Resource Group: GRP_data01
 LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
 FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local
Resource Group: GRP_data02
 LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv01.cl1.local
 FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv01.cl1.local

2) Putting node #1 to standby mode - after which everything runs happily on 
node pgdbsrv02.cl1.local

# pcs cluster standby pgdbsrv01.cl1.local
# pcs status
Last updated: Tue Sep  3 00:16:01 2013
Last change: Tue Sep  3 00:15:55 2013 via crm_attribute on pgdbsrv02.cl1.local
Stack: cman
Current DC: pgdbsrv02.cl1.local - partition with quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, unknown expected votes
9 Resources configured.


Node pgdbsrv01.cl1.local: standby
Online: [ pgdbsrv02.cl1.local ]

Full list of resources:

 IP_database (ocf::heartbeat:IPaddr2): Started pgdbsrv02.cl1.local
 Master/Slave Set: DRBD_ms_data01 [DRBD_data01]
 Masters: [ pgdbsrv02.cl1.local ]
 Stopped: [ DRBD_data01:1 ]
 Master/Slave Set: DRBD_ms_data02 [DRBD_data02]
 Masters: [ pgdbsrv02.cl1.local ]
 Stopped: [ DRBD_data02:1 ]
 Resource Group: GRP_data01
 LVM_vgdata01 (ocf::heartbeat:LVM): Started pgdbsrv02.cl1.local
 FS_data01 (ocf::heartbeat:Filesystem): Started pgdbsrv02.cl1.local
 Resource Group: GRP_data02
 LVM_vgdata02 (ocf::heartbeat:LVM): Started pgdbsrv02.cl1.local
 FS_data02 (ocf::heartbeat:Filesystem): Started pgdbsrv02.cl1.local

3) Putting node #1 back online - it seems that all the resources stop (?) and 
then DRBD gets promoted successfully on node #2 but LVM and FS resources never 
start

# pcs cluster unstandby pgdbsrv01.cl1.local
# pcs status
Last updated: Tue Sep  3 00:17:00 2013
Last change: Tue Sep

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-05 Thread Christine Caulfield

On 05/09/13 11:33, Andrew Beekhof wrote:


On 05/09/2013, at 6:37 PM, Christine Caulfield  wrote:


On 03/09/13 22:03, Andrew Beekhof wrote:


On 03/09/2013, at 11:49 PM, Christine Caulfield  wrote:


On 03/09/13 05:20, Andrew Beekhof wrote:


On 02/09/2013, at 5:27 PM, Andrey Groshev  wrote:




30.08.2013, 07:18, "Andrew Beekhof" :

On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:


  29.08.2013, 12:25, "Andrey Groshev" :

  29.08.2013, 02:55, "Andrew Beekhof" :

   On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:

28.08.2013, 04:06, "Andrew Beekhof" :

On 27/08/2013, at 1:13 PM, Andrey Groshev  wrote:

 27.08.2013, 05:39, "Andrew Beekhof" :

 On 26/08/2013, at 3:09 PM, Andrey Groshev  wrote:

  26.08.2013, 03:34, "Andrew Beekhof" :

  On 23/08/2013, at 9:39 PM, Andrey Groshev  wrote:

   Hello,

   Today I tried to remake my test cluster from cman to corosync2.
   I noticed the following:
   If I reset the cluster with cman through cibadmin --erase --force,
   the names of the nodes still exist in the cib.

  Yes, the cluster puts back entries for all the nodes it knows about 
automagically.

   cibadmin -Ql
   .
   <nodes>
     <node ... uname="dev-cluster2-node2"/>
     <node ... uname="dev-cluster2-node4"/>
     <node ... uname="dev-cluster2-node3"/>
   </nodes>

   Even if cman and pacemaker are running on only one node.

  I'm assuming all three are configured in cluster.conf?

  Yes, the list of nodes is there.

   And if I do the same on a cluster with corosync2,
   I see only the names of the nodes which run corosync and pacemaker.

  Since you've not included your config, I can only guess that your 
corosync.conf does not have a nodelist.
  If it did, you should get the same behaviour.

  I tried both expected_node and nodelist.

 And it didn't work? What version of pacemaker?

 It does not work as I expected.

That's because you've used IP addresses in the node list.
ie.

node {
  ring0_addr: 10.76.157.17
}

try including the node name as well, eg.

node {
  name: dev-cluster2-node2
  ring0_addr: 10.76.157.17
}

The same thing.

   I don't know what to say.  I tested it here yesterday and it worked as 
expected.

  I found the reason that you and I have different results - I did not 
have a reverse DNS zone for these nodes.
  I know there should be one, but (PACEMAKER + CMAN) worked without a reverse 
zone!

  Hasty. Deleted all. Reinstalled. Configured. Not working again. Damn!


It would have surprised me... pacemaker 1.1.11 doesn't do any dns lookups - 
reverse or otherwise.
Can you set

  PCMK_trace_files=corosync.c

in your environment and retest?

On RHEL6 that means putting the following in /etc/sysconfig/pacemaker
   export PCMK_trace_files=corosync.c

It should produce additional logging[1] that will help diagnose the issue.

[1] http://blog.clusterlabs.org/blog/2013/pacemaker-logging/



Hello, Andrew.

You have misunderstood me a little.


No, I understood you fine.


I wrote that I rushed to judgment.
After I did the reverse DNS zone, the cluster behaved correctly.
BUT after I took the cluster apart, dropped the configs and restarted as a new 
cluster,
the cluster again didn't show all the nodes (only the node with 
pacemaker running).

A small portion of the log (full log not included), in which (I thought) there 
is something interesting.

Aug 30 12:31:11 [9986] dev-cluster2-node4cib: (  corosync.c:423   )   trace: 
check_message_sanity:  Verfied message 4: (dest=:cib, 
from=dev-cluster2-node4:cib.9986, compressed=0, size=1551, total=2143)
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (  corosync.c:96)   
trace: corosync_node_name:Checking 172793107 vs 0 from 
nodelist.node.0.nodeid
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (  ipcc.c:378   )   
debug: qb_ipcc_disconnect:qb_ipcc_disconnect()
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (ringbuffer.c:294   )   
debug: qb_rb_close:   Closing ringbuffer: 
/dev/shm/qb-cmap-request-9616-9989-27-header
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (ringbuffer.c:294   )   
debug: qb_rb_close:   Closing ringbuffer: 
/dev/shm/qb-cmap-response-9616-9989-27-header
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (ringbuffer.c:294   )   
debug: qb_rb_close:   Closing ringbuffer: 
/dev/shm/qb-cmap-event-9616-9989-27-header
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (  corosync.c:134   )  
notice: corosync_node_name:Unable to get node name for nodeid 172793107


I wonder if you need to be including the nodeid too. ie.

node {
  name: dev-cluster2-node2
  ring0_addr: 10.76.157.17
  nodeid: 2
}

I _thought_ that was implicit.
Chrissie: is "nodelist.node.%d.nodeid" always available for corosync2 or only 
if explicitly defined in the config?




You do need to specify a nodeid if you don't want corosync to imply it from the 
IP address (or you're using IPv6). corosync won't imply a nodeid from the order 
of the nodes in corosync.conf - that's not reliable enough.

[Pacemaker] Corosync quorum not updating on split node

2013-09-05 Thread Mark Round
Hi all,

I have a problem whereby when I create a network split/partition (by dropping 
traffic with iptables), the victim node for some reason does not realise it has 
split from the network.

It seems to recognise that it can't form a cluster due to network issues, but 
the status is not reflected in the output from corosync-quorumtool, and cluster 
services (via pacemaker) still continue to run. However, the other nodes in the 
rest of the cluster do realise they have lost contact with a node, no longer 
have quorum and correctly shut down services.

When I block traffic on the victim node's eth0, the remaining nodes see that 
they cannot communicate with it and shut down:

# corosync-quorumtool -s
Version:  1.4.5
Nodes:3
Ring ID:  696
Quorum type:  corosync_votequorum
Quorate:  No
Node votes:   1
Expected votes:   7
Highest expected: 7
Total votes:  3
Quorum:   4 Activity blocked
Flags:

However, the victim node still thinks everything is fine, and maintains a view 
of the cluster prior to the split :

# corosync-quorumtool -s
Version:  1.4.5
Nodes:4
Ring ID:  716
Quorum type:  corosync_votequorum
Quorate:  Yes
Node votes:   1
Expected votes:   7
Highest expected: 7
Total votes:  4
Quorum:   4
Flags:Quorate

However, it does notice in the logs that it cannot now form cluster, as the 
following messages repeat constantly :

corosync [MAIN  ] Totem is unable to form a cluster because of an operating 
system or network fault. The most common cause of this message is that the 
local firewall is configured improperly.

I would expect at this point for it to be in its own network partition with a 
total of 1 vote, and block activity. However, this does not seem to happen 
until just after it rejoins the cluster. When I unblock traffic and it rejoins, 
I see the victim finally realise it had lost quorum :

Sep 05 09:52:21 corosync [pcmk  ] notice: pcmk_peer_update: Transitional 
membership event on ring 720: memb=1, new=0, lost=3
Sep 05 09:52:21 corosync [VOTEQ ] quorum lost, blocking activity
Sep 05 09:52:21 corosync [QUORUM] This node is within the non-primary component 
and will NOT provide any services.
Sep 05 09:52:21 corosync [QUORUM] Members[1]: 358898186

And a second or so later it regains quorum :

crmd:   notice: ais_dispatch_message: Membership 736: quorum acquired

So my question is why, when it realises it cannot form a cluster ("Totem is 
unable to form..."), does it not lose quorum, update the status as reflected 
by quorumtool and shutdown cluster services ?

Configuration file example and package versions/environment listed below. I'm 
using "updu" protocol as we need to avoid multicast in this environment; it 
will eventually be using a routed network. This behaviour also persists when I 
disable the pacemaker plugin and just test with corosync.

compatibility: whitetank
totem {
    version: 2
    secauth: off
    interface {
        member {
            memberaddr: 10.90.100.20
        }
        member {
            memberaddr: 10.90.100.21
        }
        ...
        ... more nodes snipped
        ...
        ringnumber: 0
        bindnetaddr: 10.90.100.20
        mcastport: 5405
    }
    transport: udpu
}
amf {
    mode: disabled
}
aisexec {
    user: root
    group: root
}
quorum {
    provider: corosync_votequorum
    expected_votes: 7
}
service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}

Environment : CentOS 6.4
Packages from OpenSUSE : 
http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/
# rpm -qa | egrep "^(cluster|corosync|crm|libqb|pacemaker|resource-agents)" | 
sort
cluster-glue-1.0.11-3.1.x86_64
cluster-glue-libs-1.0.11-3.1.x86_64
corosync-1.4.5-2.2.x86_64
corosynclib-1.4.5-2.2.x86_64
crmsh-1.2.6-0.rc3.3.1.x86_64
libqb0-0.14.4-1.2.x86_64
pacemaker-1.1.9-2.1.x86_64
pacemaker-cli-1.1.9-2.1.x86_64
pacemaker-cluster-libs-1.1.9-2.1.x86_64
pacemaker-libs-1.1.9-2.1.x86_64
resource-agents-3.9.5-3.1.x86_64

Regards,

-Mark





Mark Round
Senior Systems Administrator
NCC Group
Kings Court
Kingston Road
Leatherhead, KT22 7SL

Telephone: +44 1372 383815
Mobile: +44 7790 770413
Fax:
Website: www.nccgroup.com
Email:  mark.ro...@nccgroup.com




Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-05 Thread Andrew Beekhof

On 05/09/2013, at 6:37 PM, Christine Caulfield  wrote:

> On 03/09/13 22:03, Andrew Beekhof wrote:
>> 
>> On 03/09/2013, at 11:49 PM, Christine Caulfield  wrote:
>> 
>>> On 03/09/13 05:20, Andrew Beekhof wrote:
 
 On 02/09/2013, at 5:27 PM, Andrey Groshev  wrote:
 
> 
> 
> 30.08.2013, 07:18, "Andrew Beekhof" :
>> On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:
>> 
>>>  29.08.2013, 12:25, "Andrey Groshev" :
  29.08.2013, 02:55, "Andrew Beekhof" :
>   On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:
>>28.08.2013, 04:06, "Andrew Beekhof" :
>>>On 27/08/2013, at 1:13 PM, Andrey Groshev  
>>> wrote:
 27.08.2013, 05:39, "Andrew Beekhof" :
> On 26/08/2013, at 3:09 PM, Andrey Groshev  
> wrote:
>>  26.08.2013, 03:34, "Andrew Beekhof" :
>>>  On 23/08/2013, at 9:39 PM, Andrey Groshev 
>>>  wrote:
   Hello,
 
   Today I tried to remake my test cluster from cman to 
 corosync2.
   I noticed the following:
   If I reset the cluster with cman through cibadmin --erase 
 --force,
   the names of the nodes still exist in the cib.
>>>  Yes, the cluster puts back entries for all the nodes it 
>>> knows about automagically.
   cibadmin -Ql
   .
   <nodes>
>>> <node ... uname="dev-cluster2-node2"/>
>>> <node ... uname="dev-cluster2-node4"/>
>>> <node ... uname="dev-cluster2-node3"/>
   </nodes>
 
   Even if cman and pacemaker are running on only one node.
>>>  I'm assuming all three are configured in cluster.conf?
>>  Yes, the list of nodes is there.
   And if I do the same on a cluster with corosync2,
   I see only the names of the nodes which run corosync and 
 pacemaker.
>>>  Since you've not included your config, I can only guess 
>>> that your corosync.conf does not have a nodelist.
>>>  If it did, you should get the same behaviour.
>>  I tried both expected_node and nodelist.
> And it didn't work? What version of pacemaker?
 It does not work as I expected.
>>>That's because you've used IP addresses in the node list.
>>>ie.
>>> 
>>>node {
>>>  ring0_addr: 10.76.157.17
>>>}
>>> 
>>>try including the node name as well, eg.
>>> 
>>>node {
>>>  name: dev-cluster2-node2
>>>  ring0_addr: 10.76.157.17
>>>}
>>The same thing.
>   I don't know what to say.  I tested it here yesterday and it worked 
> as expected.
  I found the reason that you and I have different results - I did 
 not have a reverse DNS zone for these nodes.
  I know there should be one, but (PACEMAKER + CMAN) worked without a 
 reverse zone!
>>>  Hasty. Deleted all. Reinstalled. Configured. Not working again. Damn!
>> 
>> It would have surprised me... pacemaker 1.1.11 doesn't do any dns 
>> lookups - reverse or otherwise.
>> Can you set
>> 
>>  PCMK_trace_files=corosync.c
>> 
>> in your environment and retest?
>> 
>> On RHEL6 that means putting the following in /etc/sysconfig/pacemaker
>>   export PCMK_trace_files=corosync.c
>> 
>> It should produce additional logging[1] that will help diagnose the 
>> issue.
>> 
>> [1] http://blog.clusterlabs.org/blog/2013/pacemaker-logging/
>> 
> 
> Hello, Andrew.
> 
> You have misunderstood me a little.
 
 No, I understood you fine.
 
> I wrote that I rushed to judgment.
> After I did the reverse DNS zone, the cluster behaved correctly.
> BUT after I took the cluster apart, dropped the configs and restarted as a 
> new cluster,
> the cluster again didn't show all the nodes (only the node with 
> pacemaker running).
> 
> A small portion of the log (full log not included), in which (I thought) 
> there is something interesting.
> 
> Aug 30 12:31:11 [9986] dev-cluster2-node4cib: (  corosync.c:423   
> )   trace: check_message_sanity:  Verfied message 4: (dest=:cib, 
> from=dev-cluster2-node4:cib.9986, compressed=0, size=1551, total=2143)
> Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (  corosync.c:96
> )   trace: corosync_node_name:Checking 172793107 vs 0 from 
> nodelist.node.0.nodeid
> Aug 30 12:31:11 [9989] dev-cluster2-node4  a

Re: [Pacemaker] Howto recover from node state UNCLEAN (online)

2013-09-05 Thread Lars Marowsky-Bree
On 2013-09-05T12:23:23, Andreas Mock  wrote:

> - resource monitoring failed on node 1
>   => stop of resource on node 1 failed 
>   => stonith off node 1 worked
> - more or less in parallel, as the resource is a clone,
>   resource monitoring failed on node 2
>   => stop of resource on node 2 failed
> => stonith of node 2 failed, as the stonith resource agent on
>  node 1 is unreachable due to the stonithing of node 1
> 
> - Error message stating it was giving up stonithing.
> => node 2 in the state above
> 
> Interestingly: a "service pacemaker stop" doesn't work
> as pacemaker seems to be blocked by this node state.
> 
> The questions:
> 1) How to recover from this state without rebooting?

A cleanup on the failed resource(s) (after fixing the problem with them,
that is) should do it.
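
For example (resource and node names are placeholders):

    # After fixing the underlying fault, clear the failed operations
    crm_resource --cleanup --resource <rsc-id> --node <node>
    # or, with crmsh:
    crm resource cleanup <rsc-id>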


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




[Pacemaker] Howto recover from node state UNCLEAN (online)

2013-09-05 Thread Andreas Mock
Hi all,

is there a way to recover from node state UNCLEAN (online) without
rebooting?

Background: 
- RHEL6.4
- cman-cluster with pacemaker
- stonith enabled and working

- resource monitoring failed on node 1
  => stop of resource on node 1 failed 
  => stonith off node 1 worked
- more or less in parallel, as the resource is a clone,
  resource monitoring failed on node 2
  => stop of resource on node 2 failed
=> stonith of node 2 failed, as the stonith resource agent on
   node 1 is unreachable due to the stonithing of node 1

- Error message stating it was giving up stonithing.
=> node 2 in the state above

Interestingly: a "service pacemaker stop" doesn't work
as pacemaker seems to be blocked by this node state.

The questions:
1) How to recover from this state without rebooting?
2) Is self-stonithing allowed these days, so that
a self-stonithing device could be added to a fencing
topology?
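
For question 2, a sketch of what a two-level topology could look like in crmsh 
syntax (the device names are hypothetical, and whether a node may appear in its 
own fencing path depends on the pacemaker version and devices in use):

    # Try the IPMI device first, then fall back to a second device per node
    fencing_topology \
        node1: fence_ipmi_node1 fence_backup_node1 \
        node2: fence_ipmi_node2 fence_backup_node2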

Best regards
Andreas Mock

   




Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-05 Thread Christine Caulfield

On 03/09/13 22:03, Andrew Beekhof wrote:


On 03/09/2013, at 11:49 PM, Christine Caulfield  wrote:


On 03/09/13 05:20, Andrew Beekhof wrote:


On 02/09/2013, at 5:27 PM, Andrey Groshev  wrote:




30.08.2013, 07:18, "Andrew Beekhof" :

On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:


  29.08.2013, 12:25, "Andrey Groshev" :

  29.08.2013, 02:55, "Andrew Beekhof" :

   On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:

28.08.2013, 04:06, "Andrew Beekhof" :

On 27/08/2013, at 1:13 PM, Andrey Groshev  wrote:

 27.08.2013, 05:39, "Andrew Beekhof" :

 On 26/08/2013, at 3:09 PM, Andrey Groshev  wrote:

  26.08.2013, 03:34, "Andrew Beekhof" :

  On 23/08/2013, at 9:39 PM, Andrey Groshev  wrote:

   Hello,

   Today I tried to remake my test cluster from cman to corosync2.
   I noticed the following:
   If I reset the cluster with cman through cibadmin --erase --force,
   the names of the nodes still exist in the cib.

  Yes, the cluster puts back entries for all the nodes it knows about 
automagically.

   cibadmin -Ql
   .
   <nodes>
     <node ... uname="dev-cluster2-node2"/>
     <node ... uname="dev-cluster2-node4"/>
     <node ... uname="dev-cluster2-node3"/>
   </nodes>

   Even if cman and pacemaker are running on only one node.

  I'm assuming all three are configured in cluster.conf?

  Yes, the list of nodes is there.

   And if I do the same on a cluster with corosync2,
   I see only the names of the nodes which run corosync and pacemaker.

  Since you've not included your config, I can only guess that your 
corosync.conf does not have a nodelist.
  If it did, you should get the same behaviour.

  I tried both expected_node and nodelist.

 And it didn't work? What version of pacemaker?

 It does not work as I expected.

That's because you've used IP addresses in the node list.
ie.

node {
  ring0_addr: 10.76.157.17
}

try including the node name as well, eg.

node {
  name: dev-cluster2-node2
  ring0_addr: 10.76.157.17
}

The same thing.

   I don't know what to say.  I tested it here yesterday and it worked as 
expected.

  I found the reason that you and I have different results - I did not 
have a reverse DNS zone for these nodes.
  I know there should be one, but (PACEMAKER + CMAN) worked without a reverse 
zone!

  Hasty. Deleted all. Reinstalled. Configured. Not working again. Damn!


It would have surprised me... pacemaker 1.1.11 doesn't do any dns lookups - 
reverse or otherwise.
Can you set

  PCMK_trace_files=corosync.c

in your environment and retest?

On RHEL6 that means putting the following in /etc/sysconfig/pacemaker
   export PCMK_trace_files=corosync.c

It should produce additional logging[1] that will help diagnose the issue.

[1] http://blog.clusterlabs.org/blog/2013/pacemaker-logging/



Hello, Andrew.

You have misunderstood me a little.


No, I understood you fine.


I wrote that I rushed to judgment.
After I did the reverse DNS zone, the cluster behaved correctly.
BUT after I took the cluster apart, dropped the configs and restarted as a new 
cluster,
the cluster again didn't show all the nodes (only the node with 
pacemaker running).

A small portion of the log (full log not included), in which (I thought) there 
is something interesting.

Aug 30 12:31:11 [9986] dev-cluster2-node4cib: (  corosync.c:423   )   trace: 
check_message_sanity:  Verfied message 4: (dest=:cib, 
from=dev-cluster2-node4:cib.9986, compressed=0, size=1551, total=2143)
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (  corosync.c:96)   
trace: corosync_node_name:Checking 172793107 vs 0 from 
nodelist.node.0.nodeid
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (  ipcc.c:378   )   
debug: qb_ipcc_disconnect:qb_ipcc_disconnect()
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (ringbuffer.c:294   )   
debug: qb_rb_close:   Closing ringbuffer: 
/dev/shm/qb-cmap-request-9616-9989-27-header
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (ringbuffer.c:294   )   
debug: qb_rb_close:   Closing ringbuffer: 
/dev/shm/qb-cmap-response-9616-9989-27-header
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (ringbuffer.c:294   )   
debug: qb_rb_close:   Closing ringbuffer: 
/dev/shm/qb-cmap-event-9616-9989-27-header
Aug 30 12:31:11 [9989] dev-cluster2-node4  attrd: (  corosync.c:134   )  
notice: corosync_node_name:Unable to get node name for nodeid 172793107


I wonder if you need to be including the nodeid too. ie.

node {
  name: dev-cluster2-node2
  ring0_addr: 10.76.157.17
  nodeid: 2
}

I _thought_ that was implicit.
Chrissie: is "nodelist.node.%d.nodeid" always available for corosync2 or only 
if explicitly defined in the config?




You do need to specify a nodeid if you don't want corosync to imply it from the 
IP address (or you're using IPv6). corosync won't imply a nodeid from the order 
of the nodes in corosync.conf - that's not reliable enough.
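
A quick way to verify what corosync actually registered at runtime (corosync 
2.x; treat the exact key names as an assumption):

    # Dump the nodelist keys, including any derived nodeids
    corosync-cmapctl | grep nodelist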


Right, but is that implied nodeid available as "nodelist.n