[ClusterLabs] start service after filesystem resource

2015-11-20 Thread haseningo
Hi,

I want to start several services after the DRBD resource and the filesystem are available. This is my current configuration:

 

node $id="184548773" host-1 \
    attributes standby="on"
node $id="184548774" host-2 \
    attributes standby="on"
primitive collectd lsb:collectd \
    op monitor interval="10" timeout="30" \
    op start interval="0" timeout="120" \
    op stop interval="0" timeout="120"
primitive failover-ip1 ocf:heartbeat:IPaddr \
    params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
    op monitor interval="10s"
primitive failover-ip2 ocf:heartbeat:IPaddr \
    params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
    op monitor interval="10s"
primitive failover-ip3 ocf:heartbeat:IPaddr \
    params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
    op monitor interval="10s"
primitive res_drbd_export ocf:linbit:drbd \
    params drbd_resource="hermes"
primitive res_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/mnt" fstype="ext4"
group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
ms ms_drbd_export res_drbd_export \
    meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
location cli-prefer-collectd collectd inf: host-1
location cli-prefer-failover-ip1 failover-ip1 inf: host-1
location cli-prefer-failover-ip2 failover-ip2 inf: host-1
location cli-prefer-failover-ip3 failover-ip3 inf: host-1
location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
location cli-prefer-res_fs res_fs inf: host-1
colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1447686090"
#vim:set syntax=pcmk

 

I haven't found the right way to order the startup of new services (for example collectd) after /mnt is mounted. Can you help me?


 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] start service after filesystem resource

2015-11-20 Thread Ken Gaillot
On 11/20/2015 07:38 AM, haseni...@gmx.de wrote:
> Hi,
> I want to start several services after the DRBD resource and the filesystem
> are available. This is my current configuration:
> node $id="184548773" host-1 \
>  attributes standby="on"
> node $id="184548774" host-2 \
>  attributes standby="on"
> primitive collectd lsb:collectd \
>  op monitor interval="10" timeout="30" \
>  op start interval="0" timeout="120" \
>  op stop interval="0" timeout="120"
> primitive failover-ip1 ocf:heartbeat:IPaddr \
>  params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>  op monitor interval="10s"
> primitive failover-ip2 ocf:heartbeat:IPaddr \
>  params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>  op monitor interval="10s"
> primitive failover-ip3 ocf:heartbeat:IPaddr \
>  params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>  op monitor interval="10s"
> primitive res_drbd_export ocf:linbit:drbd \
>  params drbd_resource="hermes"
> primitive res_fs ocf:heartbeat:Filesystem \
>  params device="/dev/drbd0" directory="/mnt" fstype="ext4"
> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
> ms ms_drbd_export res_drbd_export \
>  meta notify="true" master-max="1" master-node-max="1" clone-max="2" 
> clone-node-max="1"
> location cli-prefer-collectd collectd inf: host-1
> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
> location cli-prefer-res_fs res_fs inf: host-1

A word of warning: these "cli-" constraints were added automatically
when you ran CLI commands to move resources to specific hosts. You have
to clear them when you're done with whatever the move was for;
otherwise the resources will only run on those nodes from now on.

If you're using pcs, "pcs resource clear <resource>" will do it.
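
With crmsh (which the configuration above appears to use), deleting the
constraints by their ids should also work, for example:

  crm configure delete cli-prefer-collectd cli-prefer-failover-ip1 \
      cli-prefer-failover-ip2 cli-prefer-failover-ip3 \
      cli-prefer-res_drbd_export cli-prefer-res_fs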

> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
> property $id="cib-bootstrap-options" \
>  dc-version="1.1.10-42f2063" \
>  cluster-infrastructure="corosync" \
>  stonith-enabled="false" \
>  no-quorum-policy="ignore" \
>  last-lrm-refresh="1447686090"
> #vim:set syntax=pcmk
> I haven't found the right way to order the startup of new services (for
> example collectd) after /mnt is mounted. Can you help me?

As other posters mentioned, order constraints and/or groups will do
that. The exact syntax depends on which CLI tools you use; check their
man pages for details.
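
For example, with crmsh and the configuration above (where collectd is part
of mygroup), one more order constraint along these lines should do it; the
constraint id o_fs_before_services is just a placeholder:

  order o_fs_before_services inf: res_fs:start mygroup:start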



Re: [ClusterLabs] start service after filesystem resource

2015-11-20 Thread Andrei Borzenkov

20.11.2015 16:38, haseni...@gmx.de writes:

Hi,
I want to start several services after the DRBD resource and the filesystem
are available. This is my current configuration:
node $id="184548773" host-1 \
  attributes standby="on"
node $id="184548774" host-2 \
  attributes standby="on"
primitive collectd lsb:collectd \
  op monitor interval="10" timeout="30" \
  op start interval="0" timeout="120" \
  op stop interval="0" timeout="120"
primitive failover-ip1 ocf:heartbeat:IPaddr \
  params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
  op monitor interval="10s"
primitive failover-ip2 ocf:heartbeat:IPaddr \
  params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
  op monitor interval="10s"
primitive failover-ip3 ocf:heartbeat:IPaddr \
  params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
  op monitor interval="10s"
primitive res_drbd_export ocf:linbit:drbd \
  params drbd_resource="hermes"
primitive res_fs ocf:heartbeat:Filesystem \
  params device="/dev/drbd0" directory="/mnt" fstype="ext4"
group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
ms ms_drbd_export res_drbd_export \
  meta notify="true" master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1"
location cli-prefer-collectd collectd inf: host-1
location cli-prefer-failover-ip1 failover-ip1 inf: host-1
location cli-prefer-failover-ip2 failover-ip2 inf: host-1
location cli-prefer-failover-ip3 failover-ip3 inf: host-1
location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
location cli-prefer-res_fs res_fs inf: host-1
colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
property $id="cib-bootstrap-options" \
  dc-version="1.1.10-42f2063" \
  cluster-infrastructure="corosync" \
  stonith-enabled="false" \
  no-quorum-policy="ignore" \
  last-lrm-refresh="1447686090"
#vim:set syntax=pcmk
I haven't found the right way to order the startup of new services (for
example collectd) after /mnt is mounted.


Just order them after res_fs, the same way you order res_fs after
ms_drbd_export. Or maybe I misunderstand your question?
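
Applied to the posted configuration, that could look something like this
(the constraint id is arbitrary):

  order o_fs_before_collectd inf: res_fs:start collectd:start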





Re: [ClusterLabs] start service after filesystem resource

2015-11-20 Thread emmanuel segura
Using a group is simpler.

example:

group mygroup resource1 resource2 resource3
order o_drbd_before_services inf: ms_drbd_export:promote mygroup:start
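
Applied to the posted configuration, one possible sketch is to put res_fs at
the front of the group, so everything else in the group only starts once
/mnt is mounted (the order constraint above then stays as it is):

  group mygroup res_fs failover-ip1 failover-ip2 failover-ip3 collectd

The existing colocation would then only need mygroup and ms_drbd_export:Master.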

2015-11-20 15:45 GMT+01:00 Andrei Borzenkov:
> 20.11.2015 16:38, haseni...@gmx.de writes:
>
>> Hi,
>> I want to start several services after the DRBD resource and the
>> filesystem are available. This is my current configuration:
>> node $id="184548773" host-1 \
>>   attributes standby="on"
>> node $id="184548774" host-2 \
>>   attributes standby="on"
>> primitive collectd lsb:collectd \
>>   op monitor interval="10" timeout="30" \
>>   op start interval="0" timeout="120" \
>>   op stop interval="0" timeout="120"
>> primitive failover-ip1 ocf:heartbeat:IPaddr \
>>   params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>>   op monitor interval="10s"
>> primitive failover-ip2 ocf:heartbeat:IPaddr \
>>   params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>>   op monitor interval="10s"
>> primitive failover-ip3 ocf:heartbeat:IPaddr \
>>   params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>>   op monitor interval="10s"
>> primitive res_drbd_export ocf:linbit:drbd \
>>   params drbd_resource="hermes"
>> primitive res_fs ocf:heartbeat:Filesystem \
>>   params device="/dev/drbd0" directory="/mnt" fstype="ext4"
>> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
>> ms ms_drbd_export res_drbd_export \
>>   meta notify="true" master-max="1" master-node-max="1"
>> clone-max="2"
>> clone-node-max="1"
>> location cli-prefer-collectd collectd inf: host-1
>> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
>> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
>> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
>> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
>> location cli-prefer-res_fs res_fs inf: host-1
>> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
>> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
>> property $id="cib-bootstrap-options" \
>>   dc-version="1.1.10-42f2063" \
>>   cluster-infrastructure="corosync" \
>>   stonith-enabled="false" \
>>   no-quorum-policy="ignore" \
>>   last-lrm-refresh="1447686090"
>> #vim:set syntax=pcmk
>> I haven't found the right way to order the startup of new services (for
>> example collectd) after /mnt is mounted.
>
>
> Just order them after res_fs, the same way you order res_fs after ms_drbd_export.
> Or maybe I misunderstand your question?



-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^



[ClusterLabs] Pacemaker crash and fencing failure

2015-11-20 Thread Brian Campbell
I've been trying to debug and do a root-cause analysis for a cascading
series of failures that a customer hit a couple of days ago, which
caused their filesystem to be unavailable for a couple of hours.

The original failure was in our own distributed filesystem backend, a
fork of LizardFS, which is in turn a fork of MooseFS. This history is
mostly only important for reading the logs, where "efs", "lizardfs",
and "mfs" all generally refer to the same services, just different
generations of naming, as not all daemons, scripts, and packages have
been renamed.

There are two master servers that handle metadata operations, running
Pacemaker to elect which one is the current primary and which one is a
replica, and a number of chunkservers that store file chunks and
simply connect to the currently running master via a virtual IP. A bug
in the checksum scan on the chunkservers caused them to leak file
descriptors and become unresponsive, so while the master server was up
and healthy, no actual filesystem operations could occur. (This bug is
now fixed, by the way, and the fix has been deployed to the customer,
but we want to understand why the later failures occurred and caused
them to continue to have downtime.)

The customer saw that things were unresponsive and tried the simplest
thing they could to resolve it: migrating the services to the other
master. This succeeded, as the checksum scan had been initiated by the
first master, so switching over to the replica caused all of the extra
file descriptors to be closed and the chunkservers to become
responsive again.
However, due to one backup service that is not yet managed via
Pacemaker and thus is only running on the first master, they decided
to migrate back to the first master. This was when they ran into a
Pacemaker problem.

At the time of the problem, es-efs-master1 is the server that was
originally the master when the first problem happened, and which they
are trying to migrate the services back to. es-efs-master2 is the one
actively running the services, and it also happens to be the DC at the
time, so that's where to look for pengine messages.

On master2, you can see the point when the user tried to migrate back
to master1 based on the pengine decisions:

(By the way, apologies for the long message with large log excerpts; I
was trying to balance enough detail against not overwhelming, and it
can be hard to keep it short when explaining these kinds of complicated
failures across a number of machines.)

Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-
46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after
a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Moveeditshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote 
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Demote  
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
-> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
process_pe_message: Calculated Transition 1481601:
/var/lib/pacemaker/pengine/pe-input-1355.bz2
Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not
applied to 0.2754083.1: Failed application of an update diff
Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
(call=400, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice: run_graph:
Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15,
Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2):
Stopped
Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op demote for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
on es-efs-master2: unknown error (1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing 
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Moveeditshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote 
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Demote  
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
-> Slave