On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
> Hi,
>
> 1 - if I put a node (node2) offline, the ocfs2 resources keep running
> on the online node (node1).
>
> 2 - while node2 was offline, via the cluster I stopped/started the
> ocfs2 resource group successfully many times in a row.
>
> 3 - while node2 was offline, I restarted the pacemaker service on
> node1 and then tried to start the ocfs2 resource group; dlm started
> but the ocfs2 file system resource did not start.
>
> Nutshell:
>
> a - both nodes must be online to start the ocfs2 resource.
>
> b - if one node crashes or goes offline (gracefully), the ocfs2
> resource keeps running on the other/surviving node.
>
> c - while one node was offline, we could stop/start the ocfs2
> resource group on the surviving node, but if we stop the pacemaker
> service, the ocfs2 file system resource does not start, with the
> following info in the logs:
From the logs I would say the startup of dlm_controld times out because it is waiting for quorum - which doesn't happen because of wait-for-all. The question is whether you really just stopped pacemaker, or stopped corosync as well. In the latter case I would say it is the expected behavior.

Regards,
Klaus

> lrmd[4317]: notice: executing - rsc:p-fssapmnt action:start call_id:53
> Filesystem(p-fssapmnt)[5139]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
> kernel: [ 706.162676] dlm: Using TCP for communications
> kernel: [ 706.162916] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group...
> dlm_controld[5105]: 759 fence work wait for quorum
> dlm_controld[5105]: 764 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[4317]: warning: p-fssapmnt_start_0 process (PID 5139) timed out
> lrmd[4317]: warning: p-fssapmnt_start_0:5139 - timed out after 60000ms
> lrmd[4317]: notice: finished - rsc:p-fssapmnt action:start call_id:53 pid:5139 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 766.056514] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
> kernel: [ 766.056528] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
> crmd[4320]: notice: Result of stop operation for p-fssapmnt on pipci001: 0 (ok)
> crmd[4320]: notice: Initiating stop operation dlm_stop_0 locally on pipci001
> lrmd[4317]: notice: executing - rsc:dlm action:stop call_id:56
> dlm_controld[5105]: 766 shutdown ignored, active lockspaces
> lrmd[4317]: warning: dlm_stop_0 process (PID 5326) timed out
> lrmd[4317]: warning: dlm_stop_0:5326 - timed out after 100000ms
> lrmd[4317]: notice: finished - rsc:dlm action:stop call_id:56 pid:5326 exit-code:1 exec-time:100003ms queue-time:0ms
> crmd[4320]: error: Result of stop operation for dlm on pipci001: Timed Out
> crmd[4320]: warning: Action 15 (dlm_stop_0) on pipci001 failed (target: 0 vs. rc: 1): Error
> crmd[4320]: notice: Transition aborted by operation dlm_stop_0 'modify' on pipci001: Event failed
> crmd[4320]: warning: Action 15 (dlm_stop_0) on pipci001 failed (target: 0 vs. rc: 1): Error
> pengine[4319]: notice: Watchdog will be used via SBD if fencing is required
> pengine[4319]: notice: On loss of CCM Quorum: Ignore
> pengine[4319]: warning: Processing failed op stop for dlm:0 on pipci001: unknown error (1)
> pengine[4319]: warning: Processing failed op stop for dlm:0 on pipci001: unknown error (1)
> pengine[4319]: warning: Cluster node pipci001 will be fenced: dlm:0 failed there
> pengine[4319]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
> pengine[4319]: notice: Stop of failed resource dlm:0 is implicit after pipci001 is fenced
> pengine[4319]: notice: * Fence pipci001
> pengine[4319]: notice: Stop sbd-stonith#011(pipci001)
> pengine[4319]: notice: Stop dlm:0#011(pipci001)
> crmd[4320]: notice: Requesting fencing (reboot) of node pipci001
> stonith-ng[4316]: notice: Client crmd.4320.4c2f757b wants to fence (reboot) 'pipci001' with device '(any)'
> stonith-ng[4316]: notice: Requesting peer fencing (reboot) of pipci001
> stonith-ng[4316]: notice: sbd-stonith can fence (reboot) pipci001: dynamic-list
>
> --
> Regards,
> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>
> On 3/13/2018 1:04 PM, Ulrich Windl wrote:
>> Hi!
>>
>> I'd recommend this:
>> Cleanly boot your nodes, avoiding any manual operation with cluster
>> resources. Keep the logs.
>> Then start your tests, keeping the logs for each.
>> Try to fix issues by reading the logs and adjusting the cluster
>> configuration, not by starting commands that the cluster should
>> start.
>>
>> We had a 2-node OCFS2 cluster running for quite some time with
>> SLES11, but now the cluster is three nodes. To me the output of
>> "crm_mon -1Arfj" combined with having set record-pending=true was
>> very valuable for finding problems.
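Klaus's wait-for-all point can be made concrete with a toy model of corosync's votequorum behavior (a simplified sketch, not the actual corosync code; the function and flag names here are ours):

```python
def has_quorum(votes_present, expected_votes, two_node=False,
               wait_for_all=False, all_seen_once=False):
    """Return True if a partition with `votes_present` votes is quorate."""
    if two_node and expected_votes == 2:
        # two_node lets a single surviving node keep quorum ...
        if wait_for_all and not all_seen_once:
            # ... but wait_for_all withholds quorum after a restart until
            # both nodes have been seen together at least once.
            return False
        return votes_present >= 1
    # plain majority rule for the general case
    return votes_present > expected_votes // 2

# The scenario from the thread: the surviving node restarts its cluster
# stack while the peer is still down, so wait-for-all blocks quorum:
print(has_quorum(1, 2, two_node=True, wait_for_all=True,
                 all_seen_once=False))  # → False: dlm_controld waits
```

This matches the logs: dlm_controld prints "wait for quorum" until the second node appears (or corosync was never restarted, so the all-seen condition still holds).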
>>
>> Regards,
>> Ulrich
>>
>>>>> Muhammad Sharfuddin <m.sharfud...@nds.com.pk> wrote on 13.03.2018 at 08:43 in
>> message <7b773ae9-4209-d246-b5c0-2c8b67e62...@nds.com.pk>:
>>> Dear Klaus,
>>>
>>> If I understand you properly, then it's a fencing issue, and whatever
>>> I am facing is "natural" or "by-design" in a two-node cluster where
>>> quorum is incomplete.
>>>
>>> I am quite convinced that you have pointed out the right thing
>>> because, when I start the dlm resource via the cluster and then try
>>> to start the ocfs2 file system manually from the command line, the
>>> mount command remains hung and the following events are reported in
>>> the logs:
>>>
>>> kernel: [62622.864828] ocfs2: Registered cluster interface user
>>> kernel: [62622.884427] dlm: Using TCP for communications
>>> kernel: [62622.884750] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group...
>>> dlm_controld[17655]: 62627 fence work wait for quorum
>>> dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>
>>> and then the following messages keep being reported every 5-10
>>> minutes, until I kill the mount.ocfs2 process:
>>>
>>> dlm_controld[17655]: 62627 fence work wait for quorum
>>> dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>
>>> I am also very much confused, because yesterday I did the same and
>>> was able to mount the ocfs2 file system manually from the command
>>> line (at least once), then unmount the file system manually, stop the
>>> dlm resource from the cluster, and then start/stop the complete ocfs2
>>> resource stack (dlm, file systems) successfully via the cluster even
>>> when only one machine was online.
>>>
>>> In a two-node cluster with ocfs2 resources, can't we run the ocfs2
>>> resources when quorum is incomplete (one node is offline)?
>>>
>>> --
>>> Regards,
>>> Muhammad Sharfuddin
>>>
>>> On 3/12/2018 5:58 PM, Klaus Wenninger wrote:
>>>> On 03/12/2018 01:44 PM, Muhammad Sharfuddin wrote:
>>>>> Hi Klaus,
>>>>>
>>>>> primitive sbd-stonith stonith:external/sbd \
>>>>>     op monitor interval=3000 timeout=20 \
>>>>>     op start interval=0 timeout=240 \
>>>>>     op stop interval=0 timeout=100 \
>>>>>     params sbd_device="/dev/mapper/sbd" \
>>>>>     meta target-role=Started
>>>> Makes more sense now.
>>>> Using pcmk_delay_max would probably be useful here
>>>> to prevent a fence race.
>>>> That stonith resource was not in your resource list below ...
>>>>
>>>>> property cib-bootstrap-options: \
>>>>>     have-watchdog=true \
>>>>>     stonith-enabled=true \
>>>>>     no-quorum-policy=ignore \
>>>>>     stonith-timeout=90 \
>>>>>     startup-fencing=true
>>>> You've set no-quorum-policy=ignore for pacemaker.
>>>> Whether this is a good idea in your setup or not is
>>>> written on another page.
>>>> But isn't dlm directly interfacing with corosync, so
>>>> that it would get the quorum state from there?
>>>> As you probably have two_node set on a 2-node cluster,
>>>> this would - after both nodes are down - wait for all
>>>> nodes to come up first.
>>>>
>>>> Regards,
>>>> Klaus
>>>>
>>>>> # ps -eaf | grep sbd
>>>>> root 6129 1 0 17:35 ? 00:00:00 sbd: inquisitor
>>>>> root 6133 6129 0 17:35 ? 00:00:00 sbd: watcher: /dev/mapper/sbd - slot: 1 - uuid: 6e80a337-95db-4608-bd62-d59517f39103
>>>>> root 6134 6129 0 17:35 ? 00:00:00 sbd: watcher: Pacemaker
>>>>> root 6135 6129 0 17:35 ? 00:00:00 sbd: watcher: Cluster
>>>>>
>>>>> This cluster does not start the ocfs2 resources when I first
>>>>> intentionally crash (reboot) both nodes and then try to start the
>>>>> ocfs2 resources while one node is offline.
>>>>>
>>>>> To fix the issue I have one permanent solution: bring the other
>>>>> (offline) node online, and things get fixed automatically, i.e. the
>>>>> ocfs2 resources mount.
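Klaus's pcmk_delay_max suggestion would, as a sketch, look like this in the crm shell on top of the sbd-stonith primitive quoted above (the 30s value is an assumption, not from the thread):

```
primitive sbd-stonith stonith:external/sbd \
    op monitor interval=3000 timeout=20 \
    op start interval=0 timeout=240 \
    op stop interval=0 timeout=100 \
    params sbd_device="/dev/mapper/sbd" pcmk_delay_max=30s \
    meta target-role=Started
```

The random delay of up to pcmk_delay_max before executing a fence action makes it unlikely that, in a split-brain, both nodes shoot each other at the same instant.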
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Muhammad Sharfuddin
>>>>>
>>>>> On 3/12/2018 5:25 PM, Klaus Wenninger wrote:
>>>>>> Hi Muhammad!
>>>>>>
>>>>>> Could you be a little more elaborate about your fencing setup?
>>>>>> I read about you using SBD, but I don't see any sbd fencing
>>>>>> resource. In case you wanted to use watchdog fencing with SBD,
>>>>>> this would require the stonith-watchdog-timeout property to be
>>>>>> set. But watchdog fencing relies on quorum (without 2-node
>>>>>> trickery) and thus wouldn't work on a 2-node cluster anyway.
>>>>>>
>>>>>> I didn't read through the whole thread - so I might be missing
>>>>>> something ...
>>>>>>
>>>>>> Regards,
>>>>>> Klaus
>>>>>>
>>>>>> On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
>>>>>>> Hello Gang,
>>>>>>>
>>>>>>> as informed previously, the cluster was fixed to start the ocfs2
>>>>>>> resources by:
>>>>>>>
>>>>>>> a) crm resource start dlm
>>>>>>>
>>>>>>> b) mounting/unmounting the ocfs2 file system manually (this step
>>>>>>> was the fix)
>>>>>>>
>>>>>>> and then starting the clone group (which includes dlm and the
>>>>>>> ocfs2 file systems) worked fine:
>>>>>>>
>>>>>>> c) crm resource start base-clone
>>>>>>>
>>>>>>> Now I crashed the nodes intentionally and then kept only one node
>>>>>>> online; again the cluster stopped starting the ocfs2 resources. I
>>>>>>> again tried to follow your instructions, i.e.
>>>>>>>
>>>>>>> i) crm resource start dlm
>>>>>>>
>>>>>>> then tried to mount the ocfs2 file system manually, which hung
>>>>>>> this time (previously, mounting manually helped me):
>>>>>>>
>>>>>>> # cat /proc/3966/stack
>>>>>>> [<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
>>>>>>> [<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
>>>>>>> [<ffffffffa03a02d9>] dlm_new_lockspace+0x69/0x160 [dlm]
>>>>>>> [<ffffffffa038e758>] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
>>>>>>> [<ffffffffa03c2872>] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
>>>>>>> [<ffffffffa045eefc>] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>>>>>> [<ffffffffa04a9983>] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>>>>>> [<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0
>>>>>>> [<ffffffff8120ea1a>] mount_fs+0x3a/0x170
>>>>>>> [<ffffffff81228bf2>] vfs_kern_mount+0x62/0x110
>>>>>>> [<ffffffff8122b123>] do_mount+0x213/0xcd0
>>>>>>> [<ffffffff8122bed5>] SyS_mount+0x85/0xd0
>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>
>>>>>>> I killed the mount.ocfs2 process, stopped (crm resource stop dlm)
>>>>>>> the dlm process, and then tried to start (crm resource start dlm)
>>>>>>> the dlm (which previously always started successfully); this time
>>>>>>> dlm didn't start, and I checked the dlm_controld process:
>>>>>>>
>>>>>>> # cat /proc/3754/stack
>>>>>>> [<ffffffff8121dc55>] poll_schedule_timeout+0x45/0x60
>>>>>>> [<ffffffff8121f0bc>] do_sys_poll+0x38c/0x4f0
>>>>>>> [<ffffffff8121f2dd>] SyS_poll+0x5d/0xe0
>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>
>>>>>>> Nutshell:
>>>>>>>
>>>>>>> 1 - this cluster is configured to run when a single node is
>>>>>>> online
>>>>>>>
>>>>>>> 2 - this cluster does not start the ocfs2 resources after a crash
>>>>>>> when only one node is online.
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>>>>
>>>>>>> On 3/12/2018 12:41 PM, Gang He wrote:
>>>>>>>>> Hello Gang,
>>>>>>>>>
>>>>>>>>> to follow your instructions, I started the dlm resource via:
>>>>>>>>>
>>>>>>>>> crm resource start dlm
>>>>>>>>>
>>>>>>>>> then mounted/unmounted the ocfs2 file system manually (which
>>>>>>>>> seems to be the fix for the situation).
>>>>>>>>>
>>>>>>>>> Now the resources are getting started properly on a single
>>>>>>>>> node. I am happy that the issue is fixed, but at the same time
>>>>>>>>> I am lost because I have no idea how things got fixed here
>>>>>>>>> (merely by mounting/unmounting the ocfs2 file systems).
>>>>>>>> From your description,
>>>>>>>> I just wonder whether the DLM resource works normally under that
>>>>>>>> situation.
>>>>>>>> Yan/Bin, do you have any comments about two-node clusters? Which
>>>>>>>> configuration settings will affect corosync quorum/DLM?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Gang
>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>
>>>>>>>>> On 3/12/2018 10:59 AM, Gang He wrote:
>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>
>>>>>>>>>> Usually, an ocfs2 resource startup failure is caused by the
>>>>>>>>>> mount command timing out (or hanging).
>>>>>>>>>> A simple debugging method is:
>>>>>>>>>> remove the ocfs2 resource from crm first,
>>>>>>>>>> then mount this file system manually and see if the mount
>>>>>>>>>> command times out or hangs.
>>>>>>>>>> If the command hangs, please watch where the mount.ocfs2
>>>>>>>>>> process is hung via the "cat /proc/xxx/stack" command.
>>>>>>>>>> If the back trace stops in the DLM kernel module, usually the
>>>>>>>>>> root cause is a cluster configuration problem.
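Gang's rule of thumb above (a back trace that stops in the DLM module usually points at cluster configuration) can be sketched as a tiny checker over pasted /proc/<pid>/stack output. This is a hypothetical helper, not part of any ocfs2/dlm tooling:

```python
def blocked_in_dlm(stack_text):
    """True if any stack frame belongs to the dlm or ocfs2 stack modules."""
    suffixes = ("[dlm]", "[ocfs2_stack_user]")
    return any(line.strip().endswith(suffixes)
               for line in stack_text.splitlines())

# The hung mount.ocfs2 trace from earlier in the thread (abridged):
hung_trace = """\
[<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
[<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
[<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0"""

print(blocked_in_dlm(hung_trace))  # → True: suspect cluster config/quorum
```

By contrast, the dlm_controld trace in the thread (poll_schedule_timeout, do_sys_poll) has no module suffix and would report False: that daemon is just idling in poll, not stuck in DLM.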
>>>>>>>>>> Thanks
>>>>>>>>>> Gang
>>>>>>>>>>
>>>>>>>>>>> On 3/12/2018 7:32 AM, Gang He wrote:
>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>
>>>>>>>>>>>> I think this problem is not in ocfs2; the cause looks like
>>>>>>>>>>>> the cluster quorum is missing.
>>>>>>>>>>>> For a two-node cluster (unlike a three-node cluster), if one
>>>>>>>>>>>> node is offline, quorum will be lost by default.
>>>>>>>>>>>> So you should configure the two-node related quorum settings
>>>>>>>>>>>> according to the pacemaker manual.
>>>>>>>>>>>> Then DLM can work normally, and the ocfs2 resource can start
>>>>>>>>>>>> up.
>>>>>>>>>>> Yes, it's configured accordingly; no-quorum-policy is set to
>>>>>>>>>>> "ignore".
>>>>>>>>>>>
>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>     have-watchdog=true \
>>>>>>>>>>>     stonith-enabled=true \
>>>>>>>>>>>     stonith-timeout=80 \
>>>>>>>>>>>     startup-fencing=true \
>>>>>>>>>>>     no-quorum-policy=ignore
>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Gang
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This two-node cluster starts resources when both nodes are
>>>>>>>>>>>>> online, but does not start the ocfs2 resources when one
>>>>>>>>>>>>> node is offline. E.g., if I gracefully stop the cluster
>>>>>>>>>>>>> resources, then stop the pacemaker service on either node,
>>>>>>>>>>>>> and try to start the ocfs2 resource on the online node, it
>>>>>>>>>>>>> fails.
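Gang's advice to configure the two-node quorum settings maps to corosync's votequorum section. A minimal sketch, assuming corosync 2.x (note that pacemaker's no-quorum-policy=ignore, as set above, does not change what corosync reports to dlm_controld):

```
# /etc/corosync/corosync.conf (fragment)
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    # two_node implicitly enables wait_for_all: after a full cluster
    # restart, quorum is withheld until both nodes have been seen
    # together at least once.
    # wait_for_all: 0   # would disable that safeguard, at the cost of
    #                   # split-brain risk on startup
}
```

This is consistent with the observed behavior: with both nodes down and only one restarted, dlm_controld logs "wait for quorum" until the peer joins.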
>>>>>>>>>>>>>
>>>>>>>>>>>>> logs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
>>>>>>>>>>>>> pipci001 pengine[17732]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>> pipci001 crmd[17733]: notice: Processing graph 2 (ref=pe_calc-dc-1520613202-31) derived from /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on pipci001
>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
>>>>>>>>>>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>>>>>>>>>>> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69 pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>>>>>>>>>>> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
>>>>>>>>>>>>> crmd[17733]: notice: Initiating monitor operation dlm_monitor_60000 locally on pipci001
>>>>>>>>>>>>> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0 locally on pipci001
>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>>>>>>>>>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group...
>>>>>>>>>>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>>>>>>>>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>>>>>>>>>>> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
>>>>>>>>>>>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
>>>>>>>>>>>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
>>>>>>>>>>>>> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
>>>>>>>>>>>>> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71 pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>>>>>>>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>>>>>>>>>>>>> crmd[17733]: error: Result of start operation for p-fssapmnt on pipci001: Timed Out
>>>>>>>>>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>> crmd[17733]: notice: Transition aborted by operation p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>>>>>>>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>>>>>>>>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>>>>>>>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>>>>>>>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>>>>>>>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>> crmd[17733]: notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36) derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>> crmd[17733]: notice: Initiating stop operation p-fssapmnt_stop_0 locally on pipci001
>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:stop call_id:72
>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>> pipci001 lrmd[17730]: notice: finished - rsc:p-fssapmnt action:stop call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
>>>>>>>>>>>>> pipci001 crmd[17733]: notice: Result of stop operation for p-fssapmnt on pipci001: 0 (ok)
>>>>>>>>>>>>> crmd[17733]: notice: Initiating stop operation dlm_stop_0 locally on pipci001
>>>>>>>>>>>>> pipci001 lrmd[17730]: notice: executing - rsc:dlm action:stop call_id:74
>>>>>>>>>>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
>>>>>>>>>>>>>
>>>>>>>>>>>>> resource configuration:
>>>>>>>>>>>>>
>>>>>>>>>>>>> primitive p-fssapmnt Filesystem \
>>>>>>>>>>>>>     params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \
>>>>>>>>>>>>>     op monitor interval=20 timeout=40 \
>>>>>>>>>>>>>     op start timeout=60 interval=0 \
>>>>>>>>>>>>>     op stop timeout=60 interval=0
>>>>>>>>>>>>> primitive dlm ocf:pacemaker:controld \
>>>>>>>>>>>>>     op monitor interval=60 timeout=60 \
>>>>>>>>>>>>>     op start interval=0 timeout=90 \
>>>>>>>>>>>>>     op stop interval=0 timeout=100
>>>>>>>>>>>>> clone base-clone base-group \
>>>>>>>>>>>>>     meta interleave=true target-role=Started
>>>>>>>>>>>>>
>>>>>>>>>>>>> cluster properties:
>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>     have-watchdog=true \
>>>>>>>>>>>>>     stonith-enabled=true \
>>>>>>>>>>>>>     stonith-timeout=80 \
>>>>>>>>>>>>>     startup-fencing=true \
>>>>>>>>>>>>>
>>>>>>>>>>>>> Software versions:
>>>>>>>>>>>>>
>>>>>>>>>>>>> kernel version: 4.4.114-94.11-default
>>>>>>>>>>>>> pacemaker-1.1.16-4.8.x86_64
>>>>>>>>>>>>> corosync-2.3.6-9.5.1.x86_64
>>>>>>>>>>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>>>>>>>>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>> libdlm3-4.0.7-1.28.x86_64
>>>>>>>>>>>>> libdlm-4.0.7-1.28.x86_64
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Muhammad Sharfuddin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org