Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Valentin Vidic
On Mon, Mar 12, 2018 at 04:31:46PM +0100, Klaus Wenninger wrote:
> Nope. Whenever the cluster is completely down...
> Otherwise nodes would come up - if not seeing each other -
> happily with both starting all services because they don't
> know what already had been running on the other node.
> Technically it wouldn't even be possible to remember that
> they've seen each other once, as Corosync doesn't have
> "non-volatile storage" apart from the config file.

Interesting, I have the following config in a test cluster:

nodelist {
    node {
        ring0_addr: sid1
        nodeid: 1
    }

    node {
        ring0_addr: sid2
        nodeid: 2
    }
}

quorum {
    # Enable and configure quorum subsystem (default: off)
    # see also corosync.conf.5 and votequorum.5
    provider: corosync_votequorum
    expected_votes: 1
    two_node: 1
}
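
For reference, the effective quorum flags can be checked with
corosync-quorumtool; with two_node set I would expect the output to
include something roughly like:

# corosync-quorumtool -s | grep Flags
Flags:            2Node Quorate WaitForAll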

And the behaviour when both nodes are down seems to be:

1. One node up
2. Fence other node
3. Start services

Mar 12 18:15:01 sid1 crmd[555]:   notice: Connecting to cluster infrastructure: 
corosync
Mar 12 18:15:01 sid1 crmd[555]:   notice: Quorum acquired
Mar 12 18:15:01 sid1 crmd[555]:   notice: Node sid1 state is now member
Mar 12 18:15:01 sid1 crmd[555]:   notice: State transition S_STARTING -> 
S_PENDING
Mar 12 18:15:23 sid1 crmd[555]:  warning: Input I_DC_TIMEOUT received in state 
S_PENDING from crm_timer_popped
Mar 12 18:15:23 sid1 crmd[555]:   notice: State transition S_ELECTION -> 
S_INTEGRATION
Mar 12 18:15:23 sid1 crmd[555]:  warning: Input I_ELECTION_DC received in state 
S_INTEGRATION from do_election_check
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for 
stonith-sbd on sid1: 7 (not running)
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for dlm on 
sid1: 7 (not running)
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for 
admin-ip on sid1: 7 (not running)
Mar 12 18:15:23 sid1 crmd[555]:   notice: Result of probe operation for 
clusterfs on sid1: 7 (not running)
Mar 12 18:15:57 sid1 stonith-ng[551]:   notice: Operation 'reboot' [1454] (call 
2 from crmd.555) for host 'sid2' with device 'stonith-sbd' returned: 0 (OK)
Mar 12 18:15:57 sid1 stonith-ng[551]:   notice: Operation reboot of sid2 by 
sid1 for crmd.555@sid1.ece4f9c5: OK
Mar 12 18:15:57 sid1 crmd[555]:   notice: Node sid2 state is now lost
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for dlm on 
sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for 
admin-ip on sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for 
stonith-sbd on sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Result of start operation for 
clusterfs on sid1: 0 (ok)
Mar 12 18:15:58 sid1 crmd[555]:   notice: Transition 0 (Complete=18, Pending=0, 
Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-32.bz2): Complete
Mar 12 18:15:58 sid1 crmd[555]:   notice: State transition S_TRANSITION_ENGINE 
-> S_IDLE

-- 
Valentin
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Klaus Wenninger
On 03/12/2018 04:17 PM, Valentin Vidic wrote:
> On Mon, Mar 12, 2018 at 01:58:21PM +0100, Klaus Wenninger wrote:
>> But isn't dlm directly interfacing with corosync so
>> that it would get the quorum state from there?
>> As you probably have two_node set on a 2-node cluster,
>> this would - after both nodes have been down - wait for
>> all nodes to come up first.
> Isn't wait_for_all only used during cluster installation?

Nope. Whenever the cluster is completely down...
Otherwise nodes would come up - if not seeing each other -
happily with both starting all services because they don't
know what already had been running on the other node.
Technically it wouldn't even be possible to remember that
they've seen each other once, as Corosync doesn't have
"non-volatile storage" apart from the config file.
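
In case someone really wants a lone node to start services after a
complete outage, votequorum(5) documents an explicit override - a
sketch only, with the obvious caveat that it reintroduces exactly the
dual-start problem described above:

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 0
}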

Regards,
Klaus

>
> votequorum(5):
>
> "When WFA is enabled, the cluster will be quorate for the first time
> only after all nodes have been visible at least once at the same time."
>



Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Valentin Vidic
On Mon, Mar 12, 2018 at 01:58:21PM +0100, Klaus Wenninger wrote:
> But isn't dlm directly interfacing with corosync so
> that it would get the quorum state from there?
> As you probably have two_node set on a 2-node cluster,
> this would - after both nodes have been down - wait for
> all nodes to come up first.

Isn't wait_for_all only used during cluster installation?

votequorum(5):

"When WFA is enabled, the cluster will be quorate for the first time
only after all nodes have been visible at least once at the same time."

-- 
Valentin


Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Klaus Wenninger
On 03/12/2018 01:44 PM, Muhammad Sharfuddin wrote:
> Hi Klaus,
>
> primitive sbd-stonith stonith:external/sbd \
>     op monitor interval=3000 timeout=20 \
>     op start interval=0 timeout=240 \
>     op stop interval=0 timeout=100 \
>     params sbd_device="/dev/mapper/sbd" \
>     meta target-role=Started

Makes more sense now.
Using pcmk_delay_max would probably be useful here
to prevent a fence-race.
That stonith-resource was not in your resource-list below ...
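E.g. roughly, just adding it to the params line of that primitive
(the 30s is only an example value):

    params sbd_device="/dev/mapper/sbd" pcmk_delay_max=30s \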

>
> property cib-bootstrap-options: \
>     have-watchdog=true \
>     stonith-enabled=true \
>     no-quorum-policy=ignore \
>     stonith-timeout=90 \
>     startup-fencing=true

You've set no-quorum-policy=ignore for pacemaker.
Whether this is a good idea or not in your setup is
another matter.
But isn't dlm directly interfacing with corosync so
that it would get the quorum state from there?
As you probably have two_node set on a 2-node cluster,
this would - after both nodes have been down - wait for
all nodes to come up first.
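
A quick way to see what dlm_controld itself thinks about quorum -
independent of no-quorum-policy - should be something like:

# dlm_tool status

where the quorate flag in the output reflects what corosync reports.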

Regards,
Klaus

>
> # ps -eaf |grep sbd
> root  6129 1  0 17:35 ?    00:00:00 sbd: inquisitor
> root  6133  6129  0 17:35 ?    00:00:00 sbd: watcher:
> /dev/mapper/sbd - slot: 1 - uuid: 6e80a337-95db-4608-bd62-d59517f39103
> root  6134  6129  0 17:35 ?    00:00:00 sbd: watcher: Pacemaker
> root  6135  6129  0 17:35 ?    00:00:00 sbd: watcher: Cluster
>
> This cluster does not start the ocfs2 resources when I first intentionally
> crash (reboot) both nodes and then try to start the ocfs2 resource while
> one node is offline.
>
> To fix the issue, I have only one reliable solution: bring the other
> (offline) node online, and things get fixed automatically, i.e. the ocfs2
> resources mount.
>
> -- 
> Regards,
> Muhammad Sharfuddin
>
> On 3/12/2018 5:25 PM, Klaus Wenninger wrote:
>> Hi Muhammad!
>>
>> Could you be a little bit more elaborate on your fencing-setup!
>> I read about you using SBD but I don't see any sbd-fencing-resource.
>> For the case you wanted to use watchdog-fencing with SBD this
>> would require stonith-watchdog-timeout property to be set.
>> But watchdog-fencing relies on quorum (without 2-node trickery)
>> and thus wouldn't work on a 2-node-cluster anyway.
>>
>> Didn't read through the whole thread - so I might be missing
>> something ...
>>
>> Regards,
>> Klaus
>>
>> On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
>>> Hello Gang,
>>>
>>> as informed, previously cluster was fixed to start the ocfs2
>>> resources by
>>>
>>> a) crm resource start dlm
>>>
>>> b) mount/umount  the ocfs2 file system manually. (this step was the
>>> fix)
>>>
>>> and then starting the clone group(which include dlm, ocfs2 file
>>> systems) worked fine:
>>>
>>> c) crm resource start base-clone.
>>>
>>> Now I crash the nodes intentionally and then keep only one node
>>> online, again cluster stopped starting the ocfs2 resources. I again
>>> tried to follow your instructions i.e
>>>
>>> i) crm resource start dlm
>>>
>>> then try to mount the ocfs2 file system manually which got hanged this
>>> time(previously manually mounting helped me):
>>>
>>> # cat /proc/3966/stack
>>> [] do_uevent+0x7e/0x200 [dlm]
>>> [] new_lockspace+0x80a/0xa70 [dlm]
>>> [] dlm_new_lockspace+0x69/0x160 [dlm]
>>> [] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
>>> [] ocfs2_cluster_connect+0x192/0x240
>>> [ocfs2_stackglue]
>>> [] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>> [] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>> [] mount_bdev+0x1a0/0x1e0
>>> [] mount_fs+0x3a/0x170
>>> [] vfs_kern_mount+0x62/0x110
>>> [] do_mount+0x213/0xcd0
>>> [] SyS_mount+0x85/0xd0
>>> [] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>> [] 0x
>>>
>>> I killed the mount.ocfs2 process stop(crm resource stop dlm) the dlm
>>> process, and then try to start(crm resource start dlm) the dlm(which
>>> previously always get started successfully), this time dlm didn't
>>> start and I checked the dlm_controld process
>>>
>>> cat /proc/3754/stack
>>> [] poll_schedule_timeout+0x45/0x60
>>> [] do_sys_poll+0x38c/0x4f0
>>> [] SyS_poll+0x5d/0xe0
>>> [] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>> [] 0x
>>>
>>> Nutshell:
>>>
>>> 1 - this cluster is configured to run when single node is online
>>>
>>> 2 - this cluster does not start the ocfs2 resources after a crash when
>>> only one node is online.
>>>
>>> -- 
>>> Regards,
>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>
>>> On 3/12/2018 12:41 PM, Gang He wrote:

> Hello Gang,
>
> to follow your instructions, I started the dlm resource via:
>
>    crm resource start dlm
>
> then mount/unmount the ocfs2 file system manually..(which seems to be
> the fix of the situation).
>
> Now resources are getting started properly on a single node.. I am
> happy
> as the issue is fixed, but at the same time I am lost because I have
> no idea
>
> how things get fixed here(merely by mounting/unmounting the ocfs2
> 

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Muhammad Sharfuddin

Hi Klaus,

primitive sbd-stonith stonith:external/sbd \
    op monitor interval=3000 timeout=20 \
    op start interval=0 timeout=240 \
    op stop interval=0 timeout=100 \
    params sbd_device="/dev/mapper/sbd" \
    meta target-role=Started

property cib-bootstrap-options: \
    have-watchdog=true \
    stonith-enabled=true \
    no-quorum-policy=ignore \
    stonith-timeout=90 \
    startup-fencing=true

# ps -eaf |grep sbd
root  6129 1  0 17:35 ?    00:00:00 sbd: inquisitor
root  6133  6129  0 17:35 ?    00:00:00 sbd: watcher: 
/dev/mapper/sbd - slot: 1 - uuid: 6e80a337-95db-4608-bd62-d59517f39103

root  6134  6129  0 17:35 ?    00:00:00 sbd: watcher: Pacemaker
root  6135  6129  0 17:35 ?    00:00:00 sbd: watcher: Cluster

This cluster does not start the ocfs2 resources when I first intentionally
crash (reboot) both nodes and then try to start the ocfs2 resource while
one node is offline.


To fix the issue, I have only one reliable solution: bring the other
(offline) node online, and things get fixed automatically, i.e. the ocfs2
resources mount.


--
Regards,
Muhammad Sharfuddin

On 3/12/2018 5:25 PM, Klaus Wenninger wrote:

Hi Muhammad!

Could you be a little bit more elaborate on your fencing-setup!
I read about you using SBD but I don't see any sbd-fencing-resource.
For the case you wanted to use watchdog-fencing with SBD this
would require stonith-watchdog-timeout property to be set.
But watchdog-fencing relies on quorum (without 2-node trickery)
and thus wouldn't work on a 2-node-cluster anyway.

Didn't read through the whole thread - so I might be missing something ...

Regards,
Klaus

On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:

Hello Gang,

as informed, previously cluster was fixed to start the ocfs2 resources by

a) crm resource start dlm

b) mount/umount  the ocfs2 file system manually. (this step was the fix)

and then starting the clone group(which include dlm, ocfs2 file
systems) worked fine:

c) crm resource start base-clone.

Now I crash the nodes intentionally and then keep only one node
online, again cluster stopped starting the ocfs2 resources. I again
tried to follow your instructions i.e

i) crm resource start dlm

then try to mount the ocfs2 file system manually which got hanged this
time(previously manually mounting helped me):

# cat /proc/3966/stack
[] do_uevent+0x7e/0x200 [dlm]
[] new_lockspace+0x80a/0xa70 [dlm]
[] dlm_new_lockspace+0x69/0x160 [dlm]
[] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
[] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
[] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
[] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
[] mount_bdev+0x1a0/0x1e0
[] mount_fs+0x3a/0x170
[] vfs_kern_mount+0x62/0x110
[] do_mount+0x213/0xcd0
[] SyS_mount+0x85/0xd0
[] entry_SYSCALL_64_fastpath+0x1e/0xb6
[] 0x

I killed the mount.ocfs2 process stop(crm resource stop dlm) the dlm
process, and then try to start(crm resource start dlm) the dlm(which
previously always get started successfully), this time dlm didn't
start and I checked the dlm_controld process

cat /proc/3754/stack
[] poll_schedule_timeout+0x45/0x60
[] do_sys_poll+0x38c/0x4f0
[] SyS_poll+0x5d/0xe0
[] entry_SYSCALL_64_fastpath+0x1e/0xb6
[] 0x

Nutshell:

1 - this cluster is configured to run when single node is online

2 - this cluster does not start the ocfs2 resources after a crash when
only one node is online.

--
Regards,
Muhammad Sharfuddin | +923332144823 | nds.com.pk

On 3/12/2018 12:41 PM, Gang He wrote:



Hello Gang,

to follow your instructions, I started the dlm resource via:

   crm resource start dlm

then mount/unmount the ocfs2 file system manually..(which seems to be
the fix of the situation).

Now resources are getting started properly on a single node.. I am
happy
as the issue is fixed, but at the same time I am lost because I have
no idea

how things get fixed here(merely by mounting/unmounting the ocfs2 file
systems)

>From your description.
I just wonder  the DLM resource does not work normally under that
situation.
Yan/Bin, do you have any comments about two-node cluster? which
configuration settings will affect corosync quorum/DLM ?


Thanks
Gang



--
Regards,
Muhammad Sharfuddin

On 3/12/2018 10:59 AM, Gang He wrote:

Hello Muhammad,

Usually, ocfs2 resource startup failure is caused by mount command
timeout

(or hanged).

The sample debugging method is,
remove ocfs2 resource from crm first,
then mount this file system manually, see if the mount command will be

timeout or hanged.

If this command is hanged, please watch where is mount.ocfs2
process hanged

via "cat /proc/xxx/stack" command.

If the back trace is stopped at DLM kernel module, usually the root
cause is

cluster configuration problem.

Thanks
Gang



On 3/12/2018 7:32 AM, Gang He wrote:

Hello Muhammad,

I think this problem is not in ocfs2, the cause looks like the
cluster

quorum is missed.

For two-node cluster (does not 

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Klaus Wenninger
Hi Muhammad!

Could you be a little bit more elaborate about your fencing setup?
I read about you using SBD, but I don't see any sbd fencing resource.
In case you wanted to use watchdog-fencing with SBD, this
would require the stonith-watchdog-timeout property to be set.
But watchdog-fencing relies on quorum (without the 2-node trickery)
and thus wouldn't work on a 2-node cluster anyway.
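For reference, enabling it would be roughly (assuming the default
SBD_WATCHDOG_TIMEOUT of 5s; the property should be about twice that):

crm configure property stonith-watchdog-timeout=10s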

Didn't read through the whole thread - so I might be missing something ...

Regards,
Klaus

On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
> Hello Gang,
>
> as informed, previously cluster was fixed to start the ocfs2 resources by
>
> a) crm resource start dlm
>
> b) mount/umount  the ocfs2 file system manually. (this step was the fix)
>
> and then starting the clone group(which include dlm, ocfs2 file
> systems) worked fine:
>
> c) crm resource start base-clone.
>
> Now I crash the nodes intentionally and then keep only one node
> online, again cluster stopped starting the ocfs2 resources. I again
> tried to follow your instructions i.e
>
> i) crm resource start dlm
>
> then try to mount the ocfs2 file system manually which got hanged this
> time(previously manually mounting helped me):
>
> # cat /proc/3966/stack
> [] do_uevent+0x7e/0x200 [dlm]
> [] new_lockspace+0x80a/0xa70 [dlm]
> [] dlm_new_lockspace+0x69/0x160 [dlm]
> [] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
> [] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
> [] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
> [] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
> [] mount_bdev+0x1a0/0x1e0
> [] mount_fs+0x3a/0x170
> [] vfs_kern_mount+0x62/0x110
> [] do_mount+0x213/0xcd0
> [] SyS_mount+0x85/0xd0
> [] entry_SYSCALL_64_fastpath+0x1e/0xb6
> [] 0x
>
> I killed the mount.ocfs2 process stop(crm resource stop dlm) the dlm
> process, and then try to start(crm resource start dlm) the dlm(which
> previously always get started successfully), this time dlm didn't
> start and I checked the dlm_controld process
>
> cat /proc/3754/stack
> [] poll_schedule_timeout+0x45/0x60
> [] do_sys_poll+0x38c/0x4f0
> [] SyS_poll+0x5d/0xe0
> [] entry_SYSCALL_64_fastpath+0x1e/0xb6
> [] 0x
>
> Nutshell:
>
> 1 - this cluster is configured to run when single node is online
>
> 2 - this cluster does not start the ocfs2 resources after a crash when
> only one node is online.
>
> -- 
> Regards,
> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>
> On 3/12/2018 12:41 PM, Gang He wrote:
>>
>>
>>> Hello Gang,
>>>
>>> to follow your instructions, I started the dlm resource via:
>>>
>>>   crm resource start dlm
>>>
>>> then mount/unmount the ocfs2 file system manually..(which seems to be
>>> the fix of the situation).
>>>
>>> Now resources are getting started properly on a single node.. I am
>>> happy
>>> as the issue is fixed, but at the same time I am lost because I have
>>> no idea
>>>
>>> how things get fixed here(merely by mounting/unmounting the ocfs2 file
>>> systems)
>> >From your description.
>> I just wonder  the DLM resource does not work normally under that
>> situation.
>> Yan/Bin, do you have any comments about two-node cluster? which
>> configuration settings will affect corosync quorum/DLM ?
>>
>>
>> Thanks
>> Gang
>>
>>
>>>
>>> -- 
>>> Regards,
>>> Muhammad Sharfuddin
>>>
>>> On 3/12/2018 10:59 AM, Gang He wrote:
 Hello Muhammad,

 Usually, ocfs2 resource startup failure is caused by mount command
 timeout
>>> (or hanged).
 The sample debugging method is,
 remove ocfs2 resource from crm first,
 then mount this file system manually, see if the mount command will be
>>> timeout or hanged.
 If this command is hanged, please watch where is mount.ocfs2
 process hanged
>>> via "cat /proc/xxx/stack" command.
 If the back trace is stopped at DLM kernel module, usually the root
 cause is
>>> cluster configuration problem.

 Thanks
 Gang


> On 3/12/2018 7:32 AM, Gang He wrote:
>> Hello Muhammad,
>>
>> I think this problem is not in ocfs2, the cause looks like the
>> cluster
> quorum is missed.
>> For two-node cluster (does not three-node cluster), if one node
>> is offline,
> the quorum will be missed by default.
>> So, you should configure two-node related quorum setting
>> according to the
> pacemaker manual.
>> Then, DLM can work normal, and ocfs2 resource can start up.
> Yes its configured accordingly, no-quorum is set to "ignore".
>
> property cib-bootstrap-options: \
>     have-watchdog=true \
>     stonith-enabled=true \
>     stonith-timeout=80 \
>     startup-fencing=true \
>     no-quorum-policy=ignore
>
>> Thanks
>> Gang
>>
>>
>>> Hi,
>>>
>>> This two node cluster starts resources when both nodes are
>>> online but
>>> does not start the ocfs2 resources
>>>
>>> when one node is offline. e.g if I gracefully stop the cluster

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Muhammad Sharfuddin

@Ulrich,

the issue I am facing is that when both nodes crash and I then keep
one node offline, the online node doesn't start the ocfs2 resources.


--
Regards,
Muhammad Sharfuddin

On 3/12/2018 4:51 PM, Muhammad Sharfuddin wrote:

Hello Gang,

as informed, previously cluster was fixed to start the ocfs2 resources by

a) crm resource start dlm

b) mount/umount  the ocfs2 file system manually. (this step was the fix)

and then starting the clone group(which include dlm, ocfs2 file 
systems) worked fine:


c) crm resource start base-clone.

Now I crash the nodes intentionally and then keep only one node 
online, again cluster stopped starting the ocfs2 resources. I again 
tried to follow your instructions i.e


i) crm resource start dlm

then try to mount the ocfs2 file system manually which got hanged this 
time(previously manually mounting helped me):


# cat /proc/3966/stack
[] do_uevent+0x7e/0x200 [dlm]
[] new_lockspace+0x80a/0xa70 [dlm]
[] dlm_new_lockspace+0x69/0x160 [dlm]
[] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
[] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
[] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
[] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
[] mount_bdev+0x1a0/0x1e0
[] mount_fs+0x3a/0x170
[] vfs_kern_mount+0x62/0x110
[] do_mount+0x213/0xcd0
[] SyS_mount+0x85/0xd0
[] entry_SYSCALL_64_fastpath+0x1e/0xb6
[] 0x

I killed the mount.ocfs2 process stop(crm resource stop dlm) the dlm 
process, and then try to start(crm resource start dlm) the dlm(which 
previously always get started successfully), this time dlm didn't 
start and I checked the dlm_controld process


cat /proc/3754/stack
[] poll_schedule_timeout+0x45/0x60
[] do_sys_poll+0x38c/0x4f0
[] SyS_poll+0x5d/0xe0
[] entry_SYSCALL_64_fastpath+0x1e/0xb6
[] 0x

Nutshell:

1 - this cluster is configured to run when single node is online

2 - this cluster does not start the ocfs2 resources after a crash when 
only one node is online.


--
Regards,
Muhammad Sharfuddin | +923332144823 | nds.com.pk

On 3/12/2018 12:41 PM, Gang He wrote:




Hello Gang,

to follow your instructions, I started the dlm resource via:

  crm resource start dlm

then mount/unmount the ocfs2 file system manually..(which seems to be
the fix of the situation).

Now resources are getting started properly on a single node.. I am 
happy
as the issue is fixed, but at the same time I am lost because I have 
no idea


how things get fixed here(merely by mounting/unmounting the ocfs2 file
systems)

>From your description.
I just wonder  the DLM resource does not work normally under that 
situation.
Yan/Bin, do you have any comments about two-node cluster? which 
configuration settings will affect corosync quorum/DLM ?



Thanks
Gang




--
Regards,
Muhammad Sharfuddin

On 3/12/2018 10:59 AM, Gang He wrote:

Hello Muhammad,

Usually, ocfs2 resource startup failure is caused by mount command 
timeout

(or hanged).

The sample debugging method is,
remove ocfs2 resource from crm first,
then mount this file system manually, see if the mount command will be

timeout or hanged.
If this command is hanged, please watch where is mount.ocfs2 
process hanged

via "cat /proc/xxx/stack" command.
If the back trace is stopped at DLM kernel module, usually the root 
cause is

cluster configuration problem.


Thanks
Gang



On 3/12/2018 7:32 AM, Gang He wrote:

Hello Muhammad,

I think this problem is not in ocfs2, the cause looks like the 
cluster

quorum is missed.
For two-node cluster (does not three-node cluster), if one node 
is offline,

the quorum will be missed by default.
So, you should configure two-node related quorum setting 
according to the

pacemaker manual.

Then, DLM can work normal, and ocfs2 resource can start up.

Yes its configured accordingly, no-quorum is set to "ignore".

property cib-bootstrap-options: \
    have-watchdog=true \
    stonith-enabled=true \
    stonith-timeout=80 \
    startup-fencing=true \
    no-quorum-policy=ignore


Thanks
Gang



Hi,

This two node cluster starts resources when both nodes are 
online but

does not start the ocfs2 resources

when one node is offline. e.g if I gracefully stop the cluster 
resources

then stop the pacemaker service on

either node, and try to start the ocfs2 resource on the online 
node, it

fails.

logs:

pipci001 pengine[17732]:   notice: Start dlm:0#011(pipci001)
pengine[17732]:   notice: Start p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 
locally on

pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Muhammad Sharfuddin

Hello Gang,

As mentioned previously, the cluster was fixed to start the ocfs2 resources by:

a) crm resource start dlm

b) mount/umount the ocfs2 file system manually (this step was the fix)

and then starting the clone group (which includes dlm and the ocfs2 file
systems) worked fine:


c) crm resource start base-clone.
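
For the record, the exact commands were roughly as follows (using the
sapmnt filesystem as the example):

# crm resource start dlm
# mount -t ocfs2 /dev/mapper/sapmnt /sapmnt
# umount /sapmnt
# crm resource start base-clone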

Now I crashed the nodes intentionally and then kept only one node online;
again the cluster stopped starting the ocfs2 resources. I again tried to
follow your instructions, i.e.


i) crm resource start dlm

then tried to mount the ocfs2 file system manually, which hung this
time (previously, mounting manually helped me):


# cat /proc/3966/stack
[] do_uevent+0x7e/0x200 [dlm]
[] new_lockspace+0x80a/0xa70 [dlm]
[] dlm_new_lockspace+0x69/0x160 [dlm]
[] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
[] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
[] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
[] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
[] mount_bdev+0x1a0/0x1e0
[] mount_fs+0x3a/0x170
[] vfs_kern_mount+0x62/0x110
[] do_mount+0x213/0xcd0
[] SyS_mount+0x85/0xd0
[] entry_SYSCALL_64_fastpath+0x1e/0xb6
[] 0x

I killed the mount.ocfs2 process, stopped the dlm resource (crm resource
stop dlm), and then tried to start it again (crm resource start dlm), which
previously always started successfully; this time dlm didn't start, and I
checked the dlm_controld process:


cat /proc/3754/stack
[] poll_schedule_timeout+0x45/0x60
[] do_sys_poll+0x38c/0x4f0
[] SyS_poll+0x5d/0xe0
[] entry_SYSCALL_64_fastpath+0x1e/0xb6
[] 0x

In a nutshell:

1 - this cluster is configured to run when a single node is online

2 - this cluster does not start the ocfs2 resources after a crash when
only one node is online.


--
Regards,
Muhammad Sharfuddin | +923332144823 | nds.com.pk

On 3/12/2018 12:41 PM, Gang He wrote:




Hello Gang,

to follow your instructions, I started the dlm resource via:

  crm resource start dlm

then mount/unmount the ocfs2 file system manually..(which seems to be
the fix of the situation).

Now resources are getting started properly on a single node.. I am happy
as the issue is fixed, but at the same time I am lost because I have no idea

how things get fixed here(merely by mounting/unmounting the ocfs2 file
systems)

>From your description.
I just wonder  the DLM resource does not work normally under that situation.
Yan/Bin, do you have any comments about two-node cluster? which configuration 
settings will affect corosync quorum/DLM ?


Thanks
Gang




--
Regards,
Muhammad Sharfuddin

On 3/12/2018 10:59 AM, Gang He wrote:

Hello Muhammad,

Usually, ocfs2 resource startup failure is caused by mount command timeout

(or hanged).

The sample debugging method is,
remove ocfs2 resource from crm first,
then mount this file system manually, see if the mount command will be

timeout or hanged.

If this command is hanged, please watch where is mount.ocfs2 process hanged

via "cat /proc/xxx/stack" command.

If the back trace is stopped at DLM kernel module, usually the root cause is

cluster configuration problem.


Thanks
Gang



On 3/12/2018 7:32 AM, Gang He wrote:

Hello Muhammad,

I think this problem is not in ocfs2, the cause looks like the cluster

quorum is missed.

For two-node cluster (does not three-node cluster), if one node is offline,

the quorum will be missed by default.

So, you should configure two-node related quorum setting according to the

pacemaker manual.

Then, DLM can work normal, and ocfs2 resource can start up.

Yes its configured accordingly, no-quorum is set to "ignore".

property cib-bootstrap-options: \
have-watchdog=true \
stonith-enabled=true \
stonith-timeout=80 \
startup-fencing=true \
no-quorum-policy=ignore


Thanks
Gang



Hi,

This two node cluster starts resources when both nodes are online but
does not start the ocfs2 resources

when one node is offline. e.g if I gracefully stop the cluster resources
then stop the pacemaker service on

either node, and try to start the ocfs2 resource on the online node, it
fails.

logs:

pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 (ok)
crmd[17733]:   notice: Initiating monitor operation 

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Gang He



>>> 
> Hello Gang,
> 
> to follow your instructions, I started the dlm resource via:
> 
>  crm resource start dlm
> 
> then mount/unmount the ocfs2 file system manually..(which seems to be 
> the fix of the situation).
> 
> Now resources are getting started properly on a single node.. I am happy 
> as the issue is fixed, but at the same time I am lost because I have no idea
> 
> how things get fixed here(merely by mounting/unmounting the ocfs2 file 
> systems)

From your description, I suspect the DLM resource does not work normally in
that situation.
Yan/Bin, do you have any comments about two-node clusters? Which configuration
settings will affect corosync quorum/DLM?


Thanks
Gang


> 
> 
> --
> Regards,
> Muhammad Sharfuddin
> 
> On 3/12/2018 10:59 AM, Gang He wrote:
>> Hello Muhammad,
>>
>> Usually, ocfs2 resource startup failure is caused by mount command timeout 
> (or hanged).
>> The sample debugging method is,
>> remove ocfs2 resource from crm first,
>> then mount this file system manually, see if the mount command will be 
> timeout or hanged.
>> If this command is hanged, please watch where is mount.ocfs2 process hanged 
> via "cat /proc/xxx/stack" command.
>> If the back trace is stopped at DLM kernel module, usually the root cause is 
> cluster configuration problem.
>>
>>
>> Thanks
>> Gang
>>
>>
>>> On 3/12/2018 7:32 AM, Gang He wrote:
 Hello Muhammad,

 I think this problem is not in ocfs2, the cause looks like the cluster
>>> quorum is missed.
 For two-node cluster (does not three-node cluster), if one node is offline,
>>> the quorum will be missed by default.
 So, you should configure two-node related quorum setting according to the
>>> pacemaker manual.
 Then, DLM can work normal, and ocfs2 resource can start up.
>>> Yes its configured accordingly, no-quorum is set to "ignore".
>>>
>>> property cib-bootstrap-options: \
>>>have-watchdog=true \
>>>stonith-enabled=true \
>>>stonith-timeout=80 \
>>>startup-fencing=true \
>>>no-quorum-policy=ignore
>>>
 Thanks
 Gang


> Hi,
>
> This two node cluster starts resources when both nodes are online but
> does not start the ocfs2 resources
>
> when one node is offline. e.g if I gracefully stop the cluster resources
> then stop the pacemaker service on
>
> either node, and try to start the ocfs2 resource on the online node, it
> fails.
>
> logs:
>
> pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
> pipci001 pengine[17732]:   notice: Calculated transition 2, saving
> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
> pipci001 crmd[17733]:   notice: Processing graph 2
> (ref=pe_calc-dc-1520613202-31) derived from
> /var/lib/pacemaker/pengine/pe-input-339.bz2
> crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
> pipci001
> lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
> lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
> crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 
> (ok)
> crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
> locally on pipci001
> crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
> locally on pipci001
> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
> /dev/mapper/sapmnt on /sapmnt
> kernel: [ 4576.529938] dlm: Using TCP for communications
> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
> the lockspace group.
> dlm_controld[19019]: 4629 fence work wait for quorum
> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
> event done -512 0
> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
> failed -512 0
> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
> lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
> crmd[17733]:error: Result of start operation for p-fssapmnt on
> pipci001: Timed Out
> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
> (target: 0 vs. rc: 1): Error
> crmd[17733]:   notice: Transition aborted by operation
> 

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Muhammad Sharfuddin

Hello Gang,

to follow your instructions, I started the dlm resource via:

    crm resource start dlm

then mounted/unmounted the ocfs2 file system manually (which seems to be
the fix for the situation).


Now resources are getting started properly on a single node. I am happy
that the issue is fixed, but at the same time I am lost because I have no
idea how things got fixed here (merely by mounting/unmounting the ocfs2
file systems).



--
Regards,
Muhammad Sharfuddin

On 3/12/2018 10:59 AM, Gang He wrote:

Hello Muhammad,

Usually, ocfs2 resource startup failure is caused by mount command timeout (or 
hanged).
The sample debugging method is,
remove ocfs2 resource from crm first,
then mount this file system manually, see if the mount command will be timeout 
or hanged.
If this command is hanged, please watch where is mount.ocfs2 process hanged via "cat 
/proc/xxx/stack" command.
If the back trace is stopped at DLM kernel module, usually the root cause is 
cluster configuration problem.


Thanks
Gang



On 3/12/2018 7:32 AM, Gang He wrote:

Hello Muhammad,

I think this problem is not in ocfs2, the cause looks like the cluster

quorum is missed.

For two-node cluster (does not three-node cluster), if one node is offline,

the quorum will be missed by default.

So, you should configure two-node related quorum setting according to the

pacemaker manual.

Then, DLM can work normal, and ocfs2 resource can start up.

Yes its configured accordingly, no-quorum is set to "ignore".

property cib-bootstrap-options: \
   have-watchdog=true \
   stonith-enabled=true \
   stonith-timeout=80 \
   startup-fencing=true \
   no-quorum-policy=ignore


Thanks
Gang



Hi,

This two node cluster starts resources when both nodes are online but
does not start the ocfs2 resources

when one node is offline. e.g if I gracefully stop the cluster resources
then stop the pacemaker service on

either node, and try to start the ocfs2 resource on the online node, it
fails.

logs:

pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 (ok)
crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
locally on pipci001
crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
Filesystem(p-fssapmnt)[19052]: INFO: Running start for
/dev/mapper/sapmnt on /sapmnt
kernel: [ 4576.529938] dlm: Using TCP for communications
kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
the lockspace group.
dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
event done -512 0
kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
failed -512 0
lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
crmd[17733]:error: Result of start operation for p-fssapmnt on
pipci001: Timed Out
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition aborted by operation
p-fssapmnt_start_0 'modify' on pipci001: Event failed
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
Skipped=0, Incomplete=6,
Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
100 failures (max=2)

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-12 Thread Gang He
Hello Muhammad,

Usually, an ocfs2 resource startup failure is caused by the mount command
timing out (or hanging).
A simple debugging method is:
remove the ocfs2 resource from crm first,
then mount this file system manually and see if the mount command times out
or hangs.
If this command hangs, please check where the mount.ocfs2 process is hung via
the "cat /proc/xxx/stack" command.
If the back trace stops in the DLM kernel module, usually the root cause is
a cluster configuration problem.
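
In other words, roughly:

# crm resource stop <ocfs2-filesystem-resource>    (or remove it from the configuration)
# mount -t ocfs2 /dev/mapper/<device> /mnt         (see if this times out or hangs)
# cat /proc/$(pidof mount.ocfs2)/stack             (if it hangs, check the back trace)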


Thanks
Gang


>>> 
> On 3/12/2018 7:32 AM, Gang He wrote:
>> Hello Muhammad,
>>
>> I think this problem is not in ocfs2, the cause looks like the cluster 
> quorum is missed.
>> For two-node cluster (does not three-node cluster), if one node is offline, 
> the quorum will be missed by default.
>> So, you should configure two-node related quorum setting according to the 
> pacemaker manual.
>> Then, DLM can work normal, and ocfs2 resource can start up.
> Yes its configured accordingly, no-quorum is set to "ignore".
> 
> property cib-bootstrap-options: \
>   have-watchdog=true \
>   stonith-enabled=true \
>   stonith-timeout=80 \
>   startup-fencing=true \
>   no-quorum-policy=ignore
> 
>>
>> Thanks
>> Gang
>>
>>
>>> Hi,
>>>
>>> This two node cluster starts resources when both nodes are online but
>>> does not start the ocfs2 resources
>>>
>>> when one node is offline. e.g if I gracefully stop the cluster resources
>>> then stop the pacemaker service on
>>>
>>> either node, and try to start the ocfs2 resource on the online node, it
>>> fails.
>>>
>>> logs:
>>>
>>> pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
>>> pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
>>> pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
>>> pipci001 pengine[17732]:   notice: Calculated transition 2, saving
>>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>> pipci001 crmd[17733]:   notice: Processing graph 2
>>> (ref=pe_calc-dc-1520613202-31) derived from
>>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>>> crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
>>> pipci001
>>> lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>> lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
>>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>> crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 (ok)
>>> crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
>>> locally on pipci001
>>> crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
>>> locally on pipci001
>>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
>>> /dev/mapper/sapmnt on /sapmnt
>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>>> the lockspace group.
>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>> event done -512 0
>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
>>> failed -512 0
>>> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
>>> lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
>>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>>> crmd[17733]:error: Result of start operation for p-fssapmnt on
>>> pipci001: Timed Out
>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>> (target: 0 vs. rc: 1): Error
>>> crmd[17733]:   notice: Transition aborted by operation
>>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>> (target: 0 vs. rc: 1): Error
>>> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
>>> Skipped=0, Incomplete=6,
>>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
>>> required
>>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>>> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
>>> pipci001: unknown error (1)
>>> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
>>> pipci001: unknown error (1)
>>> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
>>> 100 failures (max=2)
>>> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
>>> 100 failures (max=2)
>>> pengine[17732]:   notice: Stopdlm:0#011(pipci001)
>>> pengine[17732]:   notice: Stop

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-11 Thread Muhammad Sharfuddin

On 3/12/2018 7:32 AM, Gang He wrote:

Hello Muhammad,

I think this problem is not in ocfs2, the cause looks like the cluster quorum 
is missed.
For two-node cluster (does not three-node cluster), if one node is offline, the 
quorum will be missed by default.
So, you should configure two-node related quorum setting according to the 
pacemaker manual.
Then, DLM can work normal, and ocfs2 resource can start up.

Yes, it's configured accordingly; no-quorum-policy is set to "ignore".

property cib-bootstrap-options: \
 have-watchdog=true \
 stonith-enabled=true \
 stonith-timeout=80 \
 startup-fencing=true \
         no-quorum-policy=ignore



Thanks
Gang



Hi,

This two node cluster starts resources when both nodes are online but
does not start the ocfs2 resources

when one node is offline. e.g if I gracefully stop the cluster resources
then stop the pacemaker service on

either node, and try to start the ocfs2 resource on the online node, it
fails.

logs:

pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 (ok)
crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
locally on pipci001
crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
Filesystem(p-fssapmnt)[19052]: INFO: Running start for
/dev/mapper/sapmnt on /sapmnt
kernel: [ 4576.529938] dlm: Using TCP for communications
kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
the lockspace group.
dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
event done -512 0
kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
failed -512 0
lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
crmd[17733]:error: Result of start operation for p-fssapmnt on
pipci001: Timed Out
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition aborted by operation
p-fssapmnt_start_0 'modify' on pipci001: Event failed
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
Skipped=0, Incomplete=6,
Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
100 failures (max=2)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
100 failures (max=2)
pengine[17732]:   notice: Stopdlm:0#011(pipci001)
pengine[17732]:   notice: Stopp-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 3, saving inputs in
/var/lib/pacemaker/pengine/pe-input-340.bz2
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
100 failures (max=2)
pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001
after 100 failures (max=2)
pengine[17732]:   notice: Stopdlm:0#011(pipci001)
pengine[17732]:   notice: Stopp-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 4, saving inputs in

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-11 Thread Gang He
Hello Muhammad,

I think this problem is not in ocfs2; the cause looks like the cluster quorum
is lost.
For a two-node cluster (unlike a three-node cluster), if one node is offline,
quorum will be lost by default.
So, you should configure the two-node related quorum settings according to
the pacemaker manual.
Then DLM can work normally, and the ocfs2 resource can start up.
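
For example, the corosync.conf quorum section for a two-node cluster
would typically look roughly like this (see votequorum(5); note that
two_node also enables wait_for_all by default):

quorum {
    provider: corosync_votequorum
    two_node: 1
}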


Thanks
Gang 


>>> 
> Hi,
> 
> This two node cluster starts resources when both nodes are online but 
> does not start the ocfs2 resources
> 
> when one node is offline. e.g if I gracefully stop the cluster resources 
> then stop the pacemaker service on
> 
> either node, and try to start the ocfs2 resource on the online node, it 
> fails.
> 
> logs:
> 
> pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
> pipci001 pengine[17732]:   notice: Calculated transition 2, saving 
> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
> pipci001 crmd[17733]:   notice: Processing graph 2 
> (ref=pe_calc-dc-1520613202-31) derived from 
> /var/lib/pacemaker/pengine/pe-input-339.bz2
> crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on 
> pipci001
> lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
> lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69 
> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
> crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 (ok)
> crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6 
> locally on pipci001
> crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0 
> locally on pipci001
> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
> Filesystem(p-fssapmnt)[19052]: INFO: Running start for 
> /dev/mapper/sapmnt on /sapmnt
> kernel: [ 4576.529938] dlm: Using TCP for communications
> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining 
> the lockspace group.
> dlm_controld[19019]: 4629 fence work wait for quorum
> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group 
> event done -512 0
> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join 
> failed -512 0
> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
> lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71 
> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
> crmd[17733]:error: Result of start operation for p-fssapmnt on 
> pipci001: Timed Out
> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed 
> (target: 0 vs. rc: 1): Error
> crmd[17733]:   notice: Transition aborted by operation 
> p-fssapmnt_start_0 'modify' on pipci001: Event failed
> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed 
> (target: 0 vs. rc: 1): Error
> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0, 
> Skipped=0, Incomplete=6, 
> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is 
> required
> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
> pipci001: unknown error (1)
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
> pipci001: unknown error (1)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after 
> 100 failures (max=2)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after 
> 100 failures (max=2)
> pengine[17732]:   notice: Stopdlm:0#011(pipci001)
> pengine[17732]:   notice: Stopp-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Calculated transition 3, saving inputs in 
> /var/lib/pacemaker/pengine/pe-input-340.bz2
> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is 
> required
> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
> pipci001: unknown error (1)
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on 
> pipci001: unknown error (1)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after 
> 100 failures (max=2)
> pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001 
> after 100 failures (max=2)
> pengine[17732]:   notice: Stopdlm:0#011(pipci001)
> pengine[17732]:   notice: Stopp-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Calculated transition 4, saving inputs in 
> /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]:   notice: Processing 

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-09 Thread Muhammad Sharfuddin

On 3/10/2018 10:00 AM, Andrei Borzenkov wrote:

09.03.2018 19:55, Muhammad Sharfuddin wrote:

Hi,

This two node cluster starts resources when both nodes are online but
does not start the ocfs2 resources

when one node is offline. e.g if I gracefully stop the cluster resources
then stop the pacemaker service on

either node, and try to start the ocfs2 resource on the online node, it
fails.

logs:

pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0
(ok)
crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
locally on pipci001
crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
Filesystem(p-fssapmnt)[19052]: INFO: Running start for
/dev/mapper/sapmnt on /sapmnt
kernel: [ 4576.529938] dlm: Using TCP for communications
kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
the lockspace group.
dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out

That sounds like the problem. It attempts to fence the other node, but
you do not have any fencing resources configured, so it cannot work. You
need to ensure you have a working fencing agent in your configuration.
sbd is configured and working in this cluster, and after multiple failed
attempts to start the ocfs2 resource, this standalone online node gets
fenced as well.

logs:
pengine[17732]:  warning: Scheduling Node pipci001 for STONITH
pengine[17732]:   notice: Stop of failed resource dlm:0 is implicit 
after pipci001 is fenced

pengine[17732]:   notice:  * Fence pipci001
pengine[17732]:   notice: Stop    sbd-stonith#011(pipci001)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:  warning: Calculated transition 6 (with warnings), 
saving inputs in /var/lib/pacemaker/pengine/pe-warn-15.bz2
2018-03-09T21:03:30.588865+05:00 pipci002 crmd[13030]:   notice: 
Processing graph 6 (ref=pe_calc-dc-1520611410-34) derived from 
/var/lib/pacemaker/pengine/pe-warn-15.bz2

crmd[17733]:   notice: Requesting fencing (reboot) of node pipci001
stonith-ng[13026]:   notice: Client crmd.13030.f5570444 wants to fence 
(reboot) 'pipci001' with device '(any)'

stonith-ng[13026]:   notice: Requesting peer fencing (reboot) of pipci001
stonith-ng[13026]:   notice: sbd-stonith can fence (rebo

Also, as mentioned earlier, this cluster starts resources when both nodes
are online, and stonith is enabled and works too.
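
(A quick way to sanity-check that claim from the command line is sketched
below, assuming reasonably current pacemaker/crmsh tooling; pipci002 is the
peer node seen in the logs, and the reboot command really does fence it.)

    # list the fencing devices stonith-ng currently has registered
    stonith_admin --list-registered
    # manually fence the peer to prove the sbd path works end to end
    stonith_admin --reboot pipci002
    # equivalent via crmsh
    crm node fence pipci002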

cluster properties:
property cib-bootstrap-options: \
    have-watchdog=true \
    stonith-enabled=true \
    stonith-timeout=80 \
    startup-fencing=true \



kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
event done -512 0
kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
failed -512 0
lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
crmd[17733]:    error: Result of start operation for p-fssapmnt on
pipci001: Timed Out
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition aborted by operation
p-fssapmnt_start_0 'modify' on pipci001: Event failed
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
Skipped=0, Incomplete=6,
Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
100 failures (max=2)
pengine[17732]:  warning: Forcing 
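
(A side note on the dlm_controld lines above, "fence work wait for quorum"
and "wait for quorum": dlm_controld appears to block until corosync reports
the node quorate, independently of the cluster's no-quorum-policy, which the
pengine logs here show as "On loss of CCM Quorum: Ignore". Whether the lone
surviving node actually has quorum can be checked with the command below;
the exact output fields vary between corosync versions.)

    # show membership and quorum state as corosync sees it
    corosync-quorumtool -s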

Re: [ClusterLabs] single node fails to start the ocfs2 resource

2018-03-09 Thread Andrei Borzenkov
09.03.2018 19:55, Muhammad Sharfuddin wrote:
> Hi,
> 
> This two-node cluster starts resources when both nodes are online, but it
> does not start the ocfs2 resources when one node is offline. For example,
> if I gracefully stop the cluster resources, then stop the pacemaker service
> on either node and try to start the ocfs2 resource on the remaining online
> node, it fails.
> 
> logs:
> 
> pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
> pipci001 pengine[17732]:   notice: Calculated transition 2, saving
> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
> pipci001 crmd[17733]:   notice: Processing graph 2
> (ref=pe_calc-dc-1520613202-31) derived from
> /var/lib/pacemaker/pengine/pe-input-339.bz2
> crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
> pipci001
> lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
> lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
> crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0
> (ok)
> crmd[17733]:   notice: Initiating monitor operation dlm_monitor_6
> locally on pipci001
> crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
> locally on pipci001
> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
> /dev/mapper/sapmnt on /sapmnt
> kernel: [ 4576.529938] dlm: Using TCP for communications
> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
> the lockspace group.
> dlm_controld[19019]: 4629 fence work wait for quorum
> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out

That sounds like the problem. It attempts to fence the other node, but
you do not have any fencing resources configured, so it cannot work. You
need to ensure you have a working fencing agent in your configuration.
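
(A minimal sketch of such a fencing resource in crm shell syntax follows.
The name sbd-stonith matches the resource that appears in the poster's own
logs, and the parameter values are illustrative rather than a recommendation.)

    # sbd/watchdog based fencing device plus the matching cluster property
    primitive sbd-stonith stonith:external/sbd \
            params pcmk_delay_max=30 \
            op monitor interval=600 timeout=60
    property stonith-enabled=true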

> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
> event done -512 0
> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
> failed -512 0
> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 6ms
> lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
> crmd[17733]:    error: Result of start operation for p-fssapmnt on
> pipci001: Timed Out
> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
> (target: 0 vs. rc: 1): Error
> crmd[17733]:   notice: Transition aborted by operation
> p-fssapmnt_start_0 'modify' on pipci001: Event failed
> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
> (target: 0 vs. rc: 1): Error
> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
> Skipped=0, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
> required
> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
> 100 failures (max=2)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
> 100 failures (max=2)
> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Calculated transition 3, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-340.bz2
> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
> required
> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
> 100 failures (max=2)
> pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001
> after 100 failures (max=2)
> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
> pengine[17732]:   notice: Calculated transition 4, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]:   notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]:   notice: Initiating stop operation p-fssapmnt_stop_0
>