Re: [ClusterLabs] Create ressource to monitor each IPSEC VPN

2017-03-24 Thread Ken Gaillot
On 03/09/2017 01:44 AM, Damien Bras wrote:
> Hi,
> 
>  
> 
> We have a 2-node cluster with IPsec (libreswan).
> 
> Currently we have a resource to monitor the ipsec service (via systemd).
> 
>  
> 
> But now I would like to monitor each VPN. Is there a way to do that?
> Which agent could I use for that?
> 
>  
> 
> Thanks in advance for your help.
> 
> Damien

I'm not aware of any existing OCF agent for libreswan. You can always
manage any service via its OS launcher (systemd or lsb). If the OS's
status check isn't sufficient, you could additionally use
ocf:pacemaker:ping to monitor an IP address that is only reachable across
the VPN; that sets a node attribute which you can then use in rules or
constraints.
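
For example, a rough sketch with pcs (untested; the 10.0.1.1 address is a
placeholder for a host reachable only through one tunnel, and the resource
names are made up -- you'd want one ping clone per VPN you care about):

  # Clone a ping monitor on all nodes; it sets the node attribute "pingd"
  pcs resource create vpn1_ping ocf:pacemaker:ping \
      host_list="10.0.1.1" multiplier=1000 dampen=30s \
      op monitor interval=30s clone

  # Optionally keep a dependent resource off nodes where the tunnel is down
  pcs constraint location my_service rule score=-INFINITY \
      not_defined pingd or pingd lt 1

If you monitor several tunnels, give each ping clone its own attribute via
the agent's "name" parameter so the rules don't collide.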

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pending actions

2017-03-24 Thread Ken Gaillot
On 03/07/2017 04:13 PM, Jehan-Guillaume de Rorthais wrote:
> Hi,
> 
> Occasionally, I find my cluster with one pending action not being executed for
> some minutes (I guess until the "PEngine Recheck Timer" elapses).
> 
> Running "crm_simulate -SL" shows the pending actions.
> 
> I'm still confused about how it can happen, why it happens, and how to avoid
> this.

It's most likely a bug in the crmd, which schedules PE runs.

> Earlier today, I started my test cluster with 3 nodes and a master/slave
> resource[1], all with positive master score (1001, 1000 and 990), and the
> cluster kept the promote action as a pending action for 15 minutes. 
> 
> You will find in attachment the first 3 pengine inputs executed after the
> cluster startup.
> 
> What are the consequences if I set cluster-recheck-interval to 30s, for
> instance?

The cluster would consume more CPU and I/O continually recalculating the
cluster state.

It would be nice to have some guidelines for cluster-recheck-interval
based on real-world usage, but it's just going by gut feeling at this
point. The cluster automatically recalculates when something
"interesting" happens -- a node comes or goes, a monitor fails, a node
attribute changes, etc. The cluster-recheck-interval is (1) a failsafe
for buggy situations like this, and (2) the maximum granularity of many
time-based checks such as rules. I would personally use at least 5
minutes, though less is probably reasonable, especially with simple
configurations (few nodes, resources, and constraints).
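
As a concrete sketch (the 5-minute value is just the gut-feeling guideline
above, not an official recommendation):

  # Set the failsafe recheck interval cluster-wide
  pcs property set cluster-recheck-interval=5min

  # Confirm the value the cluster is actually using
  pcs property list --all | grep cluster-recheck-interval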

> Thanks in advance for your insights :)
> 
> Regards,
> 
> [1] here is the setup:
> http://dalibo.github.io/PAF/Quick_Start-CentOS-7.html#cluster-resource-creation-and-management

Feel free to open a bug report and include some logs around the time of
the incident (most importantly from the DC).
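
crm_report can bundle those up; something like this (the timestamps are
placeholders for the window around the pending promote):

  # Collect logs, PE inputs and config from all nodes into one archive
  crm_report --from "2017-03-07 15:30:00" --to "2017-03-07 16:30:00" \
      /tmp/pending-promote-report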



Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-24 Thread Seth Reid
> On 24/03/17 04:44 PM, Seth Reid wrote:
> > I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> > production yet because I'm having a problem during fencing. When I
> > disable the network interface of any one machine, the disabled machine
> > is properly fenced, leaving me, briefly, with a two node cluster. A
> > second node is then fenced off immediately, and the remaining node
> > appears to try to fence itself off. This leaves two nodes with
> > corosync/pacemaker stopped, and the remaining machine still in the
> > cluster but showing an offline node and an UNCLEAN node. What could be
> > causing this behavior?
>
> It looks like the fence attempt failed, leaving the cluster hung. When
> you say all nodes were fenced, did all nodes actually reboot? Or did the
> two surviving nodes just lock up? If the latter, then that is the proper
> response to a failed fence (DLM stays blocked).
>

The action is "off", so we aren't rebooting. The logs do still say reboot,
though. In terms of actual fencing, only node 2 gets fenced, in that its
keys get removed from the shared volume. Node 1's keys don't get removed, so
that is the failed fence. Node 2's fence succeeds.

Of the remaining nodes, node 1 is offline in that corosync and pacemaker
are no longer running, so it can't access cluster resources. Node 3 shows
node 1 as online but in an unclean state. Neither node 1 nor node 3 can
write to the cluster, but node 3 still has corosync and pacemaker running.

Here are the commands I used to build the cluster. I meant to put these in
the original post.

(single machine)$> pcs property set no-quorum-policy=freeze
(single machine)$> pcs property set stonith-enabled=true
(single machine)$> pcs property set symmetric-cluster=true
(single machine)$> pcs cluster enable --all
(single machine)$> pcs stonith create fence_wh fence_scsi
debug="/var/log/cluster/fence-debug.log" vgs_path="/sbin/vgs"
sg_persist_path="/usr/bin/sg_persist" sg_turs_path="/usr/bin/sg_turs"
pcmk_reboot_action="off" pcmk_host_list="b013-cl b014-cl b015-cl"
pcmk_monitor_action="metadata" meta provides="unfencing" --force
(single machine)$> pcs resource create dlm ocf:pacemaker:controld op
monitor interval=30s on-fail=fence clone interleave=true ordered=true
(single machine)$> pcs resource create clvmd ocf:heartbeat:clvm op monitor
interval=30s on-fail=fence clone interleave=true ordered=true
(single machine)$> pcs constraint order start dlm-clone then clvmd-clone
(single machine)$> pcs constraint colocation add clvmd-clone with dlm-clone
(single machine)$> mkfs.gfs2 -p lock_dlm -t webhosts:share_data -j 3
/dev/mapper/share-data
(single machine)$> pcs resource create gfs2share Filesystem
device="/dev/mapper/share-data" directory="/share" fstype="gfs2"
options="noatime,nodiratime" op monitor interval=10s on-fail=fence clone
interleave=true
(single machine)$> pcs constraint order start clvmd-clone then
gfs2share-clone
(single machine)$> pcs constraint colocation add gfs2share-clone with
clvmd-clone
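
For reference, a rough way to sanity-check the fence_scsi piece by hand
(device path and node name as above; assumes sg3_utils and pacemaker's
stonith_admin are installed):

  # List the SCSI-3 reservation keys currently registered on the shared LV
  sg_persist --in --read-keys --device=/dev/mapper/share-data

  # Fence one node manually and check that only its key disappears
  stonith_admin --fence b014-cl
  sg_persist --in --read-keys --device=/dev/mapper/share-data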


>
> > Each machine has a dedicated network interface for the cluster, and
> > there is a VLAN on the switch devoted to just these interfaces.
> > In the following, I disabled the interface on node id 2 (b014). Node 1
> > (b013) is fenced as well. Node 3 (b015) is still up.
> >
> > Logs from b013:
> > Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> > /dev/null && debian-sa1 1 1)
> > Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> > failed, forming new configuration.
> > Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> > forming new configuration.
> > Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
> > (192.168.100.13:576) was formed. Members left: 2
> > Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
> > the leave message. failed: 2
> > Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> > (192.168.100.13:576) was formed. Members left: 2
> > Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> > leave message. failed: 2
> > Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> > b014-cl[2] - state is now lost (was member)
> > Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> > b014-cl[2] - state is now lost (was member)
> > Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> > membership list
> > Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> > and/or uname=b014-cl from the membership cache
> > Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> > Node b014-cl[2] - state is now lost (was member)
> > Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> > membership list
> > Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> > and/or uname=b014-cl from the membership cache
> > Mar 24 16:35:17 b013 stonith-ng[2221]:

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-24 Thread Ken Gaillot
On 03/24/2017 03:52 PM, Digimer wrote:
> On 24/03/17 04:44 PM, Seth Reid wrote:
>> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
>> production yet because I'm having a problem during fencing. When I
>> disable the network interface of any one machine, the disabled machine
>> is properly fenced, leaving me, briefly, with a two node cluster. A
>> second node is then fenced off immediately, and the remaining node
>> appears to try to fence itself off. This leaves two nodes with
>> corosync/pacemaker stopped, and the remaining machine still in the
>> cluster but showing an offline node and an UNCLEAN node. What could be
>> causing this behavior?
> 
> It looks like the fence attempt failed, leaving the cluster hung. When
> you say all nodes were fenced, did all nodes actually reboot? Or did the
> two surviving nodes just lock up? If the latter, then that is the proper
> response to a failed fence (DLM stays blocked).

See comments inline ...

> 
>> Each machine has a dedicated network interface for the cluster, and
>> there is a VLAN on the switch devoted to just these interfaces.
>> In the following, I disabled the interface on node id 2 (b014). Node 1
>> (b013) is fenced as well. Node 3 (b015) is still up.
>>
>> Logs from b013:
>> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
>> /dev/null && debian-sa1 1 1)
>> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
>> failed, forming new configuration.
>> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
>> forming new configuration.
>> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
>> (192.168.100.13:576 ) was formed. Members left: 2
>> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
>> the leave message. failed: 2
>> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
>> (192.168.100.13:576 ) was formed. Members left: 2
>> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
>> leave message. failed: 2
>> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
>> b014-cl[2] - state is now lost (was member)
>> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
>> b014-cl[2] - state is now lost (was member)
>> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
>> membership list
>> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
>> and/or uname=b014-cl from the membership cache
>> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
>> Node b014-cl[2] - state is now lost (was member)
>> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
>> membership list
>> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
>> and/or uname=b014-cl from the membership cache
>> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc:
>> Node b014-cl[2] - state is now lost (was member)
>> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
>> the membership list
>> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
>> id=2 and/or uname=b014-cl from the membership cache
>> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
>> nodedown time 1490387717 fence_all dlm_stonith
>> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to
>> node 2
>> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
>> b014-cl[2] - state is now lost (was member)
>> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
>> 2/(null): 0 in progress, 0 completed
>> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
>> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
>> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
>> kicked: reboot

It looks like the fencing of b014-cl is reported as successful above ...

>> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
>> node 3
>> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
>> node 1
>> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
>> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
>> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
>> lockspaces
>> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
>> code=exited, status=255/n/a

... but then DLM and corosync exit on this node. Pacemaker can only
exit, and the node gets fenced.

What does your fencing configuration look like?
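
For example, the output of something like the following, assuming pcs:

  # Fence device definitions with all parameters
  pcs stonith show --full
  # Cluster-wide properties (stonith-enabled, no-quorum-policy, etc.)
  pcs property list

would show what's in play.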

>> Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API
>> failed: Library error (2)
>> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
>> state.
>> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
>> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
>> 'exit-code'

Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-24 Thread Digimer
On 24/03/17 04:44 PM, Seth Reid wrote:
> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> production yet because I'm having a problem during fencing. When I
> disable the network interface of any one machine, the disabled machine
> is properly fenced, leaving me, briefly, with a two node cluster. A
> second node is then fenced off immediately, and the remaining node
> appears to try to fence itself off. This leaves two nodes with
> corosync/pacemaker stopped, and the remaining machine still in the
> cluster but showing an offline node and an UNCLEAN node. What could be
> causing this behavior?

It looks like the fence attempt failed, leaving the cluster hung. When
you say all nodes were fenced, did all nodes actually reboot? Or did the
two surviving nodes just lock up? If the latter, then that is the proper
response to a failed fence (DLM stays blocked).

> Each machine has a dedicated network interface for the cluster, and
> there is a VLAN on the switch devoted to just these interfaces.
> In the following, I disabled the interface on node id 2 (b014). Node 1
> (b013) is fenced as well. Node 3 (b015) is still up.
> 
> Logs from b013:
> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> /dev/null && debian-sa1 1 1)
> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
> failed, forming new configuration.
> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
> forming new configuration.
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive
> the leave message. failed: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
> (192.168.100.13:576 ) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
> leave message. failed: 2
> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
> the membership list
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
> id=2 and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
> nodedown time 1490387717 fence_all dlm_stonith
> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to
> node 2
> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
> 2/(null): 0 in progress, 0 completed
> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
> kicked: reboot
> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
> node 3
> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
> node 1
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
> lockspaces
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
> code=exited, status=255/n/a
> Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API
> failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
> state.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
> 'exit-code'.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to
> cib_rw[0x560754147990] closed (I/O condition=17)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process exited,
> code=exited, status=107/n/a
> Mar 24 16:35:18 b013 pacemakerd[2187]:error: Connection to the CPG
> API failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit 

[ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-24 Thread Seth Reid
I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
production yet because I'm having a problem during fencing. When I disable
the network interface of any one machine, the disabled machine is properly
fenced, leaving me, briefly, with a two node cluster. A second node is then
fenced off immediately, and the remaining node appears to try to fence
itself off. This leaves two nodes with corosync/pacemaker stopped, and the
remaining machine still in the cluster but showing an offline node and an
UNCLEAN node. What could be causing this behavior?

Each machine has a dedicated network interface for the cluster, and there
is a VLAN on the switch devoted to just these interfaces.
In the following, I disabled the interface on node id 2 (b014). Node 1
(b013) is fenced as well. Node 3 (b015) is still up.

Logs from b013:
Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
/dev/null && debian-sa1 1 1)
Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor failed,
forming new configuration.
Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed, forming
new configuration.
Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership (
192.168.100.13:576) was formed. Members left: 2
Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive the
leave message. failed: 2
Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership (
192.168.100.13:576) was formed. Members left: 2
Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the leave
message. failed: 2
Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
b014-cl[2] - state is now lost (was member)
Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
b014-cl[2] - state is now lost (was member)
Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
membership list
Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2 and/or
uname=b014-cl from the membership cache
Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
Node b014-cl[2] - state is now lost (was member)
Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
membership list
Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2 and/or
uname=b014-cl from the membership cache
Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc: Node
b014-cl[2] - state is now lost (was member)
Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
the membership list
Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with id=2
and/or uname=b014-cl from the membership cache
Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
nodedown time 1490387717 fence_all dlm_stonith
Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to node
2
Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
b014-cl[2] - state is now lost (was member)
Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
2/(null): 0 in progress, 0 completed
Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
kicked: reboot
Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to node
3
Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to node
1
Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
lockspaces
Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
code=exited, status=255/n/a
Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API failed:
Library error (2)
Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
state.
Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
'exit-code'.
Mar 24 16:35:18 b013 attrd[2223]:error: Connection to
cib_rw[0x560754147990] closed (I/O condition=17)
Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process exited,
code=exited, status=107/n/a
Mar 24 16:35:18 b013 pacemakerd[2187]:error: Connection to the CPG API
failed: Library error (2)
Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit entered failed
state.
Mar 24 16:35:18 b013 attrd[2223]:   notice: Disconnecting client
0x560754149000, pid=2227...
Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Failed with result
'exit-code'.
Mar 24 16:35:18 b013 lrmd[]:  warning: new_event_notification
(-2227-8): Bad file descriptor (9)
Mar 24 16:35:18 b013 stonith-ng[2221]:error: Connection to cib_rw failed
Mar 24 16:35:18 b013 stonith-ng[2221]:error: Connection to
cib_rw[0x5579c03ecdd0] closed (I/O condition=17)
Mar 24 16:35:18 b013 lrmd[]:   

Re: [ClusterLabs] error: The cib process (17858) exited: Key has expired (127)

2017-03-24 Thread Ken Gaillot
On 03/24/2017 11:06 AM, Rens Houben wrote:
> I activated debug=cib, and retried.
> 
> New log file up at
> http://proteus.systemec.nl/~shadur/pacemaker/pacemaker_2.log.txt ;
> unfortunately, while that *is* more information, I'm not seeing anything
> that looks like it could be the cause, although it shouldn't be reading
> any config files yet because there shouldn't be any *to* read...

If there's no config file, pacemaker will create an empty one and use
that, so it still goes through the mechanics of validating it and
writing it out.

Debug doesn't give us much -- just one additional message before it dies:

Mar 24 16:59:27 [20266] castor cib:    debug: activateCibXml: Triggering CIB write for start op

You might want to look at the system log around that time to see if
something else is going wrong. If you have SELinux enabled, check the
audit log for denials.
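
For instance (only relevant if SELinux is actually enforcing, and assuming
auditd is running):

  # Is SELinux enforcing?
  getenforce
  # Recent AVC denials mentioning the cib
  ausearch -m avc -ts recent | grep -i cib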

> As to the misleading error message, it gets weirder: I grabbed a copy of
> the source code via apt-get source, and the phrase 'key has expired'
> does not occur anywhere in any file according to find ./ -type f -exec
> grep -il 'key has expired' {} \; so I have absolutely NO idea where it's
> coming from...

Right, it's not part of pacemaker, it's just the standard system error
message for errno 127. But the exit status isn't an errno, so that's not
the right interpretation. I can't find any code path in the cib that
would return 127, so I don't know what the right interpretation would be.

> 
> --
> Rens Houben
> Systemec Internet Services
> 
> From: Ken Gaillot 
> Sent: Friday, 24 March 2017 16:49
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] error: The cib process (17858) exited: Key
> has expired (127)
> 
> On 03/24/2017 08:06 AM, Rens Houben wrote:
>> I recently upgraded a two-node cluster (named 'castor' and 'pollux'
>> because I should not be allowed to think up computer names before I've
>> had my morning caffeine) from Debian wheezy to Jessie after the
>> backports for corosync and pacemaker finally made it in. However, one of
>> the two servers failed to start correctly for no really obvious reason.
>>
>> Given that it'd been years since I last set them up and I had forgotten
>> pretty much everything about them in the interim, I decided to purge
>> corosync and pacemaker on both systems and run with clean installs instead.
>>
>> This worked on pollux, but not on castor. Even after going back,
>> re-purging, removing everything legacy in /var/lib/heartbeat and
>> emptying both directories, castor still refuses to bring up pacemaker.
>>
>>
>> I put the full log of a start attempt up at
>> http://proteus.systemec.nl/~shadur/pacemaker/pacemaker.log.txt
> 
>> , but
>> this is the excerpt that I /think/ is causing the failure:
>>
>> Mar 24 13:59:05 [25495] castor pacemakerd:    error: pcmk_child_exit: The
>> cib process (25502) exited: Key has expired (127)
>> Mar 24 13:59:05 [25495] castor pacemakerd:   notice:
>> pcmk_process_exit: Respawning failed child process: cib
>>
>> I don't see any entries from cib in the log that suggest anything's
>> going wrong, though, and I'm running out of ideas on where to look next.
> 
> The "Key has expired" message is misleading. (Pacemaker really needs an
> overhaul of the exit codes it can return, so these messages can be
> reliable, but there are always more important things to take care of ...)
> 
> Pacemaker is getting 127 as the exit status of cib, and interpreting
> that as a standard system error number, but it probably isn't one. I
> don't actually see any way that the cib can return 127, so I'm not sure
> what that might indicate.
> 
> In any case, the cib is mysteriously dying whenever it tries to start,
> apparently without logging why or dumping core. (Do you have cores
> disabled at the OS level?)
> 
>> Does anyone have any suggestions as to how to coax more information out
>> of the processes and into the log files so I'll have a clue to work with?
> 
> Try it again with PCMK_debug=cib in /etc/default/pacemaker. That should
> give more log messages.

Re: [ClusterLabs] error: The cib process (17858) exited: Key has expired (127)

2017-03-24 Thread Rens Houben
I activated debug=cib, and retried.

New log file up at 
http://proteus.systemec.nl/~shadur/pacemaker/pacemaker_2.log.txt ; 
unfortunately, while that *is* more information, I'm not seeing anything that 
looks like it could be the cause, although it shouldn't be reading any config 
files yet because there shouldn't be any *to* read...

As to the misleading error message, it gets weirder: I grabbed a copy of the 
source code via apt-get source, and the phrase 'key has expired' does not occur 
anywhere in any file according to find ./ -type f -exec grep -il 'key has 
expired' {} \; so I have absolutely NO idea where it's coming from...

--
Rens Houben
Systemec Internet Services


From: Ken Gaillot 
Sent: Friday, 24 March 2017 16:49
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] error: The cib process (17858) exited: Key has 
expired (127)

On 03/24/2017 08:06 AM, Rens Houben wrote:
> I recently upgraded a two-node cluster (named 'castor' and 'pollux'
> because I should not be allowed to think up computer names before I've
> had my morning caffeine) from Debian wheezy to Jessie after the
> backports for corosync and pacemaker finally made it in. However, one of
> the two servers failed to start correctly for no really obvious reason.
>
> Given that it'd been years since I last set them up and I had forgotten
> pretty much everything about them in the interim, I decided to purge
> corosync and pacemaker on both systems and run with clean installs instead.
>
> This worked on pollux, but not on castor. Even after going back,
> re-purging, removing everything legacy in /var/lib/heartbeat and
> emptying both directories, castor still refuses to bring up pacemaker.
>
>
> I put the full log of a start attempt up at
> http://proteus.systemec.nl/~shadur/pacemaker/pacemaker.log.txt
> , but
> this is the excerpt that I /think/ is causing the failure:
>
> Mar 24 13:59:05 [25495] castor pacemakerd:    error: pcmk_child_exit: The
> cib process (25502) exited: Key has expired (127)
> Mar 24 13:59:05 [25495] castor pacemakerd:   notice:
> pcmk_process_exit: Respawning failed child process: cib
>
> I don't see any entries from cib in the log that suggest anything's
> going wrong, though, and I'm running out of ideas on where to look next.

The "Key has expired" message is misleading. (Pacemaker really needs an
overhaul of the exit codes it can return, so these messages can be
reliable, but there are always more important things to take care of ...)

Pacemaker is getting 127 as the exit status of cib, and interpreting
that as a standard system error number, but it probably isn't one. I
don't actually see any way that the cib can return 127, so I'm not sure
what that might indicate.

In any case, the cib is mysteriously dying whenever it tries to start,
apparently without logging why or dumping core. (Do you have cores
disabled at the OS level?)

> Does anyone have any suggestions as to how to coax more information out
> of the processes and into the log files so I'll have a clue to work with?

Try it again with PCMK_debug=cib in /etc/default/pacemaker. That should
give more log messages.

>
> Regards,
>
> --
> Rens Houben
> Systemec Internet Services


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-24 Thread Ken Gaillot
On 03/22/2017 09:42 AM, Alexander Markov wrote:
> 
>> Please share your config along with the logs from the nodes that were
>> effected.
> 
> I'm starting to think it's not about how to define stonith resources. If
> the whole box is down with all the logical partitions defined, then the HMC
> cannot determine whether an LPAR (partition) is really dead or just
> inaccessible. This leads to an UNCLEAN OFFLINE node status and pacemaker
> refusing to do anything until it's resolved. Am I right? Anyway, the
> simplest pacemaker config from my partitions is below.

Yes, it looks like you are correct. The fence agent is returning an
error when pacemaker tries to use it to reboot crmapp02. From the stderr
in the logs, the message is "ssh: connect to host 10.1.2.9 port 22: No
route to host".

The first thing I'd try is making sure you can fence each node from the
command line by manually running the fence agent. I'm not sure how to do
that for the "stonith:" type agents.

Once that's working, make sure the cluster can do the same, by manually
running "stonith_admin -B $NODE" for each $NODE.

> 
> primitive sap_ASCS SAPInstance \
> params InstanceName=CAP_ASCS01_crmapp \
> op monitor timeout=60 interval=120 depth=0
> primitive sap_D00 SAPInstance \
> params InstanceName=CAP_D00_crmapp \
> op monitor timeout=60 interval=120 depth=0
> primitive sap_ip IPaddr2 \
> params ip=10.1.12.2 nic=eth0 cidr_netmask=24

> primitive st_ch_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.9 \
> op start interval=0 timeout=300
> primitive st_hq_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.8 \
> op start interval=0 timeout=300

I see you have two stonith devices defined, but they don't specify which
nodes they can fence -- pacemaker will assume that either device can be
used to fence either node.

> group g_sap sap_ip sap_ASCS sap_D00 \
> meta target-role=Started

> location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
> location l_st_hq_hmc st_hq_hmc -inf: crmapp02

These constraints restrict which node monitors which device, not which
node the device can fence.

Assuming st_ch_hmc is intended to fence crmapp01, this will make sure
that crmapp02 monitors that device -- but you also want something like
pcmk_host_list=crmapp01 in the device configuration.
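
Roughly, in the crm shell syntax you're already using (a sketch; I'm
assuming st_ch_hmc is meant to fence crmapp01 and st_hq_hmc to fence
crmapp02):

  primitive st_ch_hmc stonith:ibmhmc \
      params ipaddr=10.1.2.9 pcmk_host_list=crmapp01 \
      op start interval=0 timeout=300
  primitive st_hq_hmc stonith:ibmhmc \
      params ipaddr=10.1.2.8 pcmk_host_list=crmapp02 \
      op start interval=0 timeout=300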

> location prefer_node_1 g_sap 100: crmapp01
> property cib-bootstrap-options: \
> stonith-enabled=true \
> no-quorum-policy=ignore \
> placement-strategy=balanced \
> expected-quorum-votes=2 \
> dc-version=1.1.12-f47ea56 \
> cluster-infrastructure="classic openais (with plugin)" \
> last-lrm-refresh=1490009096 \
> maintenance-mode=false
> rsc_defaults rsc-options: \
> resource-stickiness=200 \
> migration-threshold=3
> op_defaults op-options: \
> timeout=600 \
> record-pending=true
> 
> Logs are pretty much going in circles: stonith cannot reset the logical
> partition via the HMC, the node stays unclean offline, and resources are
> shown as still running on the node that is down.
> 
> 
> stonith-ng:error: log_operation:Operation 'reboot' [6942] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'
> Trying: st_ch_hmc:0
> stonith-ng:  warning: log_operation:st_ch_hmc:0:6942 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:st_ch_hmc:0:6942 [ failed:
> crmapp02 3 ]
> stonith-ng: info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (reboot). remaining timeout is 59
> stonith-ng: info: update_remaining_timeout: Attempted to
> execute agent fence_legacy (reboot) the maximum number of times (2)
> 
> stonith-ng:error: log_operation:Operation 'reboot' [6955] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re
> Trying: st_hq_hmc
> stonith-ng:  warning: log_operation:st_hq_hmc:6955 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:st_hq_hmc:6955 [ failed:
> crmapp02 8 ]
> stonith-ng: info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (reboot). remaining timeout is 60
> stonith-ng: info: update_remaining_timeout: Attempted to
> execute agent fence_legacy (reboot) the maximum number of times (2)
> 
> stonith-ng:error: log_operation:Operation 'reboot' [6976] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'
> 
> stonith-ng:  warning: log_operation:st_hq_hmc:0:6976 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:st_hq_hmc:0:6976 [ failed:
> crmapp02 8 ]
> stonith-ng:   notice: stonith_choose_peer:  Couldn't find anyone to
> fence crmapp02 with 
> stonith-ng: info: call_remote_stonith:  None of the 1 peers are
> capable of terminating crmapp02 for crmd.4568 (1)
> stonith-ng:error: remote_op_done:   Operation reboot of crmapp02 by
>  for crmd.4568@crmapp01.6bf66b9c: No route to host
> crmd:   notice: tengine_stonith_callback: Stonith ope

Re: [ClusterLabs] error: The cib process (17858) exited: Key has expired (127)

2017-03-24 Thread Ken Gaillot
On 03/24/2017 08:06 AM, Rens Houben wrote:
> I recently upgraded a two-node cluster (named 'castor' and 'pollux'
> because I should not be allowed to think up computer names before I've
> had my morning caffeine) from Debian wheezy to Jessie after the
> backports for corosync and pacemaker finally made it in. However, one of
> the two servers failed to start correctly for no really obvious reason.
> 
> Given that it'd been years since I last set them up and I had forgotten
> pretty much everything about them in the interim, I decided to purge
> corosync and pacemaker on both systems and run with clean installs instead.
> 
> This worked on pollux, but not on castor. Even after going back,
> re-purging, removing everything legacy in /var/lib/heartbeat and
> emptying both directories, castor still refuses to bring up pacemaker.
> 
> 
> I put the full log of a start attempt up at
> http://proteus.systemec.nl/~shadur/pacemaker/pacemaker.log.txt
> , but
> this is the excerpt that I /think/ is causing the failure:
> 
> Mar 24 13:59:05 [25495] castor pacemakerd:    error: pcmk_child_exit: The
> cib process (25502) exited: Key has expired (127)
> Mar 24 13:59:05 [25495] castor pacemakerd:   notice:
> pcmk_process_exit: Respawning failed child process: cib
> 
> I don't see any entries from cib in the log that suggest anything's
> going wrong, though, and I'm running out of ideas on where to look next.

The "Key has expired" message is misleading. (Pacemaker really needs an
overhaul of the exit codes it can return, so these messages can be
reliable, but there are always more important things to take care of ...)

Pacemaker is getting 127 as the exit status of cib, and interpreting
that as a standard system error number, but it probably isn't one. I
don't actually see any way that the cib can return 127, so I'm not sure
what that might indicate.

In any case, the cib is mysteriously dying whenever it tries to start,
apparently without logging why or dumping core. (Do you have cores
disabled at the OS level?)
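
A quick way to check (sketch; systemd-based Debian Jessie assumed):

  # Per-process core size limit in your shell
  ulimit -c
  # Where the kernel would write core files
  cat /proc/sys/kernel/core_pattern
  # Limit that systemd applies to the pacemaker unit's children
  systemctl show pacemaker -p LimitCORE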

> Does anyone have any suggestions as to how to coax more information out
> of the processes and into the log files so I'll have a clue to work with?

Try it again with PCMK_debug=cib in /etc/default/pacemaker. That should
give more log messages.

> 
> Regards,
> 
> --
> Rens Houben
> Systemec Internet Services



[ClusterLabs] error: The cib process (17858) exited: Key has expired (127)

2017-03-24 Thread Rens Houben
I recently upgraded a two-node cluster (named 'castor' and 'pollux' because I 
should not be allowed to think up computer names before I've had my morning 
caffeine) from Debian wheezy to Jessie after the backports for corosync and 
pacemaker finally made it in. However, one of the two servers failed to start 
correctly for no really obvious reason.

Given that it'd been years since I last set them up and I had forgotten pretty 
much everything about them in the interim, I decided to purge corosync and 
pacemaker on both systems and run with clean installs instead.

This worked on pollux, but not on castor. Even after going back, re-purging, 
removing everything legacy in /var/lib/heartbeat and emptying both directories, 
castor still refuses to bring up pacemaker.


I put the full log of a start attempt up at 
http://proteus.systemec.nl/~shadur/pacemaker/pacemaker.log.txt, but this is the 
excerpt that I /think/ is causing the failure:

Mar 24 13:59:05 [25495] castor pacemakerd:    error: pcmk_child_exit: The cib 
process (25502) exited: Key has expired (127)
Mar 24 13:59:05 [25495] castor pacemakerd:   notice: 
pcmk_process_exit: Respawning failed child process: cib

I don't see any entries from cib in the log that suggest anything's going 
wrong, though, and I'm running out of ideas on where to look next.

Does anyone have any suggestions as to how to coax more information out of the 
processes and into the log files so I'll have a clue to work with?

Regards,

--
Rens Houben
Systemec Internet Services

