On 03/13/2018 03:43 PM, Muhammad Sharfuddin wrote:
> Thanks a lot for the explanation. But other than the ocfs2 resource
> group, this cluster starts all other resources on a single node
> without any issue, just because of the use of the
> "no-quorum-policy=ignore" option.
Yes I know. And what I tried to point out is that
"no-quorum-policy=ignore" is dangerous for services that do require a
resource-manager. If you don't have any of those, go with a systemd
startup.

Regards,
Klaus

> --
> Regards,
> Muhammad Sharfuddin
>
> On 3/13/2018 7:32 PM, Klaus Wenninger wrote:
>> On 03/13/2018 02:30 PM, Muhammad Sharfuddin wrote:
>>> Yes, by saying pacemaker, I meant to say corosync as well.
>>>
>>> Is there any fix? Or can a two-node cluster not run ocfs2 resources
>>> when one node is offline?
>> Actually there can't be a "fix", as 2 nodes are just not enough
>> for a partial cluster to be quorate in the classical sense
>> (more votes than half of the cluster nodes).
>>
>> So to still be able to use it, we have this 2-node config that
>> permanently sets quorum. But not to run into issues on
>> startup, we need it to require both nodes seeing each
>> other once.
>>
>> So this is definitely nothing that is specific to ocfs2.
>> It just looks specific to ocfs2 because you've disabled
>> quorum for pacemaker.
>> To be honest, doing this you wouldn't need a resource-manager
>> at all and could just start up your services using systemd.
>>
>> If you don't want a full 3rd node, and still want to handle cases
>> where one node doesn't come up after a full shutdown of
>> all nodes, you could probably go for a setup with qdevice.
>>
>> Regards,
>> Klaus
>>
>>> --
>>> Regards,
>>> Muhammad Sharfuddin
>>>
>>> On 3/13/2018 6:16 PM, Klaus Wenninger wrote:
>>>> On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
>>>>> Hi,
>>>>>
>>>>> 1 - if I put a node (node2) offline, ocfs2 resources keep running
>>>>> on the online node (node1)
>>>>>
>>>>> 2 - while node2 was offline, via the cluster I stopped/started the
>>>>> ocfs2 resource group successfully many times in a row.
>>>>>
>>>>> 3 - while node2 was offline, I restarted the pacemaker service on
>>>>> node1 and then tried to start the ocfs2 resource group; dlm
>>>>> started but the ocfs2 file system resource does not start.
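Klaus's point about classical quorum can be made concrete: a partition is quorate only with strictly more than half of the total votes, so with 2 nodes the single surviving vote is never enough. A minimal sketch of the arithmetic (the helper function is illustrative, not part of any cluster tool):

```shell
# Minimum votes needed for classical quorum with n total votes: floor(n/2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

quorum_needed 2   # prints 2: both nodes must be up, a lone survivor is never quorate
quorum_needed 3   # prints 2: a three-node cluster tolerates one node down
```

This is exactly why corosync's two-node mode fakes permanent quorum instead, and compensates with the wait-for-all behavior described above.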
>>>>>
>>>>> Nutshell:
>>>>>
>>>>> a - both nodes must be online to start the ocfs2 resource.
>>>>>
>>>>> b - if one crashes or goes offline (gracefully), the ocfs2 resource
>>>>> keeps running on the other/surviving node.
>>>>>
>>>>> c - while one node was offline, we can stop/start the ocfs2 resource
>>>>> group on the surviving node, but if we stop the pacemaker service,
>>>>> then the ocfs2 file system resource does not start, with the
>>>>> following info in the logs:
>>>> From the logs I would say startup of dlm_controld times out because
>>>> it is waiting for quorum - which doesn't happen because of
>>>> wait-for-all.
>>>> Question is if you really just stopped pacemaker or if you stopped
>>>> corosync as well.
>>>> In the latter case I would say it is the expected behavior.
>>>>
>>>> Regards,
>>>> Klaus
>>>>
>>>>> lrmd[4317]: notice: executing - rsc:p-fssapmnt action:start call_id:53
>>>>> Filesystem(p-fssapmnt)[5139]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
>>>>> kernel: [ 706.162676] dlm: Using TCP for communications
>>>>> kernel: [ 706.162916] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>>>>> the lockspace group...
>>>>> dlm_controld[5105]: 759 fence work wait for quorum
>>>>> dlm_controld[5105]: 764 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>>> lrmd[4317]: warning: p-fssapmnt_start_0 process (PID 5139) timed out
>>>>> lrmd[4317]: warning: p-fssapmnt_start_0:5139 - timed out after 60000ms
>>>>> lrmd[4317]: notice: finished - rsc:p-fssapmnt action:start call_id:53
>>>>> pid:5139 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>> kernel: [ 766.056514] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
>>>>> kernel: [ 766.056528] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
>>>>> crmd[4320]: notice: Result of stop operation for p-fssapmnt on pipci001: 0 (ok)
>>>>> crmd[4320]: notice: Initiating stop operation dlm_stop_0 locally on pipci001
>>>>> lrmd[4317]: notice: executing - rsc:dlm action:stop call_id:56
>>>>> dlm_controld[5105]: 766 shutdown ignored, active lockspaces
>>>>> lrmd[4317]: warning: dlm_stop_0 process (PID 5326) timed out
>>>>> lrmd[4317]: warning: dlm_stop_0:5326 - timed out after 100000ms
>>>>> lrmd[4317]: notice: finished - rsc:dlm action:stop call_id:56
>>>>> pid:5326 exit-code:1 exec-time:100003ms queue-time:0ms
>>>>> crmd[4320]: error: Result of stop operation for dlm on pipci001: Timed Out
>>>>> crmd[4320]: warning: Action 15 (dlm_stop_0) on pipci001 failed
>>>>> (target: 0 vs. rc: 1): Error
>>>>> crmd[4320]: notice: Transition aborted by operation dlm_stop_0
>>>>> 'modify' on pipci001: Event failed
>>>>> crmd[4320]: warning: Action 15 (dlm_stop_0) on pipci001 failed
>>>>> (target: 0 vs.
rc: 1): Error
>>>>> pengine[4319]: notice: Watchdog will be used via SBD if fencing is required
>>>>> pengine[4319]: notice: On loss of CCM Quorum: Ignore
>>>>> pengine[4319]: warning: Processing failed op stop for dlm:0 on pipci001: unknown error (1)
>>>>> pengine[4319]: warning: Processing failed op stop for dlm:0 on pipci001: unknown error (1)
>>>>> pengine[4319]: warning: Cluster node pipci001 will be fenced: dlm:0 failed there
>>>>> pengine[4319]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>> pengine[4319]: notice: Stop of failed resource dlm:0 is implicit after pipci001 is fenced
>>>>> pengine[4319]: notice: * Fence pipci001
>>>>> pengine[4319]: notice: Stop sbd-stonith#011(pipci001)
>>>>> pengine[4319]: notice: Stop dlm:0#011(pipci001)
>>>>> crmd[4320]: notice: Requesting fencing (reboot) of node pipci001
>>>>> stonith-ng[4316]: notice: Client crmd.4320.4c2f757b wants to fence (reboot) 'pipci001' with device '(any)'
>>>>> stonith-ng[4316]: notice: Requesting peer fencing (reboot) of pipci001
>>>>> stonith-ng[4316]: notice: sbd-stonith can fence (reboot) pipci001: dynamic-list
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>>
>>>>> On 3/13/2018 1:04 PM, Ulrich Windl wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I'd recommend this:
>>>>>> Cleanly boot your nodes, avoiding any manual operation with cluster
>>>>>> resources. Keep the logs.
>>>>>> Then start your tests, keeping the logs for each.
>>>>>> Try to fix issues by reading the logs and adjusting the cluster
>>>>>> configuration, and not by starting commands that the cluster should
>>>>>> start.
>>>>>>
>>>>>> We had a 2-node OCFS2 cluster running for quite some time with
>>>>>> SLES11, but now the cluster is three nodes. To me the output of
>>>>>> "crm_mon -1Arfj" combined with having set record-pending=true was
>>>>>> very valuable for finding problems.
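Ulrich's monitoring suggestion could be put in place with something like the following crmsh sketch; the op_defaults placement is an assumption, so verify it against your crmsh version before relying on it:

```shell
# Record pending operations so that in-flight actions show up in status
# output (crmsh syntax sketch; check 'crm configure help op_defaults').
crm configure op_defaults record-pending=true

# One-shot cluster status as suggested by Ulrich.
crm_mon -1Arfj
```

With record-pending enabled, a resource stuck in a long start (like the timing-out p-fssapmnt start in the logs above) is visible as a pending operation instead of silently hanging.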
>>>>>>
>>>>>> Regards,
>>>>>> Ulrich
>>>>>>
>>>>>>
>>>>>> Muhammad Sharfuddin <m.sharfud...@nds.com.pk> wrote on 13.03.2018 at 08:43 in
>>>>>> message <7b773ae9-4209-d246-b5c0-2c8b67e62...@nds.com.pk>:
>>>>>>> Dear Klaus,
>>>>>>>
>>>>>>> If I understand you properly, then it's a fencing issue, and
>>>>>>> whatever I am facing is "natural" or "by design" in a two-node
>>>>>>> cluster where quorum is incomplete.
>>>>>>>
>>>>>>> I am quite convinced that you have pointed this out correctly
>>>>>>> because, when I start the dlm resource via the cluster and then try
>>>>>>> to mount the ocfs2 file system manually from the command line, the
>>>>>>> mount command remains hung and the following events are reported in
>>>>>>> the logs:
>>>>>>>
>>>>>>> kernel: [62622.864828] ocfs2: Registered cluster interface user
>>>>>>> kernel: [62622.884427] dlm: Using TCP for communications
>>>>>>> kernel: [62622.884750] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group...
>>>>>>> dlm_controld[17655]: 62627 fence work wait for quorum
>>>>>>> dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>>>>>
>>>>>>> and then the following messages keep being reported every 5-10
>>>>>>> minutes, until I kill the mount.ocfs2 process:
>>>>>>>
>>>>>>> dlm_controld[17655]: 62627 fence work wait for quorum
>>>>>>> dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>>>>>
>>>>>>> I am also very much confused because yesterday I did the same and
>>>>>>> was able to mount the ocfs2 file system manually from the command
>>>>>>> line (at least once), then unmount the file system manually, stop
>>>>>>> the dlm resource from the cluster, and then the complete ocfs2
>>>>>>> resource stack (dlm, file systems) started/stopped successfully via
>>>>>>> the cluster even when only one machine was online.
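When dlm_controld logs "wait for quorum" like this, the quorum state dlm sees can be inspected directly at the corosync layer. A hedged diagnostic sketch, to be run on the surviving node (the exact output format varies by corosync version):

```shell
# Show votequorum state; with two-node mode configured, the flags line
# should include "2Node" and "WaitForAll", and indicate whether this
# partition is "Quorate".
corosync-quorumtool -s

# dlm's own view: list lockspaces and dump recent daemon debug output.
dlm_tool ls
dlm_tool dump | tail -n 20
```

If corosync-quorumtool reports the partition as not quorate after a full restart with one node down, that matches the wait-for-all behavior Klaus describes: dlm blocks lockspace joins (and hence ocfs2 mounts) until quorum is attained.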
>>>>>>>
>>>>>>> In a two-node cluster which has ocfs2 resources, can we not run the
>>>>>>> ocfs2 resources when quorum is incomplete (one node is offline)?
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Muhammad Sharfuddin
>>>>>>>
>>>>>>> On 3/12/2018 5:58 PM, Klaus Wenninger wrote:
>>>>>>>> On 03/12/2018 01:44 PM, Muhammad Sharfuddin wrote:
>>>>>>>>> Hi Klaus,
>>>>>>>>>
>>>>>>>>> primitive sbd-stonith stonith:external/sbd \
>>>>>>>>>         op monitor interval=3000 timeout=20 \
>>>>>>>>>         op start interval=0 timeout=240 \
>>>>>>>>>         op stop interval=0 timeout=100 \
>>>>>>>>>         params sbd_device="/dev/mapper/sbd" \
>>>>>>>>>         meta target-role=Started
>>>>>>>> Makes more sense now.
>>>>>>>> Using pcmk_delay_max would probably be useful here
>>>>>>>> to prevent a fence-race.
>>>>>>>> That stonith-resource was not in your resource-list below ...
>>>>>>>>
>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>         have-watchdog=true \
>>>>>>>>>         stonith-enabled=true \
>>>>>>>>>         no-quorum-policy=ignore \
>>>>>>>>>         stonith-timeout=90 \
>>>>>>>>>         startup-fencing=true
>>>>>>>> You've set no-quorum-policy=ignore for pacemaker.
>>>>>>>> Whether this is a good idea or not in your setup is
>>>>>>>> written on another page.
>>>>>>>> But isn't dlm directly interfacing with corosync, so
>>>>>>>> that it would get the quorum state from there?
>>>>>>>> As you have two_node set, probably on a 2-node cluster,
>>>>>>>> this would - after both nodes were down - wait for all
>>>>>>>> nodes to be up first.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Klaus
>>>>>>>>
>>>>>>>>> # ps -eaf | grep sbd
>>>>>>>>> root 6129    1  0 17:35 ?  00:00:00 sbd: inquisitor
>>>>>>>>> root 6133 6129  0 17:35 ?  00:00:00 sbd: watcher: /dev/mapper/sbd - slot: 1 - uuid: 6e80a337-95db-4608-bd62-d59517f39103
>>>>>>>>> root 6134 6129  0 17:35 ?  00:00:00 sbd: watcher: Pacemaker
>>>>>>>>> root 6135 6129  0 17:35 ?
00:00:00 sbd: watcher: Cluster
>>>>>>>>>
>>>>>>>>> This cluster does not start ocfs2 resources when I first
>>>>>>>>> intentionally crash (reboot) both nodes and then try to start the
>>>>>>>>> ocfs2 resource while one node is offline.
>>>>>>>>>
>>>>>>>>> To fix the issue, I have one permanent solution: bring the other
>>>>>>>>> (offline) node online and things get fixed automatically, i.e.
>>>>>>>>> the ocfs2 resources mount.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>
>>>>>>>>> On 3/12/2018 5:25 PM, Klaus Wenninger wrote:
>>>>>>>>>> Hi Muhammad!
>>>>>>>>>>
>>>>>>>>>> Could you be a little bit more elaborate on your fencing setup?
>>>>>>>>>> I read about you using SBD but I don't see any
>>>>>>>>>> sbd-fencing-resource.
>>>>>>>>>> In case you wanted to use watchdog-fencing with SBD, this
>>>>>>>>>> would require the stonith-watchdog-timeout property to be set.
>>>>>>>>>> But watchdog-fencing relies on quorum (without 2-node trickery)
>>>>>>>>>> and thus wouldn't work on a 2-node cluster anyway.
>>>>>>>>>>
>>>>>>>>>> Didn't read through the whole thread - so I might be missing
>>>>>>>>>> something ...
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Klaus
>>>>>>>>>>
>>>>>>>>>> On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
>>>>>>>>>>> Hello Gang,
>>>>>>>>>>>
>>>>>>>>>>> as informed previously, the cluster was fixed to start the
>>>>>>>>>>> ocfs2 resources by
>>>>>>>>>>>
>>>>>>>>>>> a) crm resource start dlm
>>>>>>>>>>>
>>>>>>>>>>> b) mount/umount the ocfs2 file system manually (this step was
>>>>>>>>>>> the fix)
>>>>>>>>>>>
>>>>>>>>>>> and then starting the clone group (which includes dlm and the
>>>>>>>>>>> ocfs2 file systems) worked fine:
>>>>>>>>>>>
>>>>>>>>>>> c) crm resource start base-clone.
>>>>>>>>>>>
>>>>>>>>>>> Now I crashed the nodes intentionally and then kept only one
>>>>>>>>>>> node online; again the cluster stopped starting the ocfs2
>>>>>>>>>>> resources.
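Klaus's pcmk_delay_max suggestion would look roughly like this applied to the sbd-stonith primitive quoted earlier; this is a sketch, and the 30-second value is an arbitrary example to be tuned against your stonith-timeout:

```shell
# crmsh configuration sketch: add a random fencing delay (0..30s) so that
# in a split brain the two nodes are unlikely to fence each other at the
# same instant (a "fence race" where both nodes die).
primitive sbd-stonith stonith:external/sbd \
        params sbd_device="/dev/mapper/sbd" pcmk_delay_max=30 \
        op monitor interval=3000 timeout=20 \
        op start interval=0 timeout=240 \
        op stop interval=0 timeout=100 \
        meta target-role=Started
```

The delay only matters when both nodes simultaneously decide to fence the peer; in normal single-failure fencing it merely adds up to pcmk_delay_max seconds of latency.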
I again tried to follow your instructions, i.e.
>>>>>>>>>>>
>>>>>>>>>>> i) crm resource start dlm
>>>>>>>>>>>
>>>>>>>>>>> then tried to mount the ocfs2 file system manually, which hung
>>>>>>>>>>> this time (previously, mounting manually helped me):
>>>>>>>>>>>
>>>>>>>>>>> # cat /proc/3966/stack
>>>>>>>>>>> [<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
>>>>>>>>>>> [<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
>>>>>>>>>>> [<ffffffffa03a02d9>] dlm_new_lockspace+0x69/0x160 [dlm]
>>>>>>>>>>> [<ffffffffa038e758>] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
>>>>>>>>>>> [<ffffffffa03c2872>] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
>>>>>>>>>>> [<ffffffffa045eefc>] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>>>>>>>>>> [<ffffffffa04a9983>] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>>>>>>>>>> [<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0
>>>>>>>>>>> [<ffffffff8120ea1a>] mount_fs+0x3a/0x170
>>>>>>>>>>> [<ffffffff81228bf2>] vfs_kern_mount+0x62/0x110
>>>>>>>>>>> [<ffffffff8122b123>] do_mount+0x213/0xcd0
>>>>>>>>>>> [<ffffffff8122bed5>] SyS_mount+0x85/0xd0
>>>>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>>>
>>>>>>>>>>> I killed the mount.ocfs2 process, stopped the dlm resource
>>>>>>>>>>> (crm resource stop dlm), and then tried to start the dlm again
>>>>>>>>>>> (crm resource start dlm), which previously always started
>>>>>>>>>>> successfully; this time dlm didn't start, and I checked the
>>>>>>>>>>> dlm_controld process:
>>>>>>>>>>>
>>>>>>>>>>> cat /proc/3754/stack
>>>>>>>>>>> [<ffffffff8121dc55>] poll_schedule_timeout+0x45/0x60
>>>>>>>>>>> [<ffffffff8121f0bc>] do_sys_poll+0x38c/0x4f0
>>>>>>>>>>> [<ffffffff8121f2dd>] SyS_poll+0x5d/0xe0
>>>>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>>>
>>>>>>>>>>> Nutshell:
>>>>>>>>>>>
>>>>>>>>>>> 1 - this cluster is configured to run when a single node is
>>>>>>>>>>> online
>>>>>>>>>>>
>>>>>>>>>>> 2 - this cluster does not start the ocfs2 resources after a
>>>>>>>>>>> crash when only one node is online.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>>>>>>>>
>>>>>>>>>>> On 3/12/2018 12:41 PM, Gang He wrote:
>>>>>>>>>>>>> Hello Gang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> to follow your instructions, I started the dlm resource via:
>>>>>>>>>>>>>
>>>>>>>>>>>>> crm resource start dlm
>>>>>>>>>>>>>
>>>>>>>>>>>>> then mounted/unmounted the ocfs2 file system manually (which
>>>>>>>>>>>>> seems to be the fix for the situation).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now resources are getting started properly on a single node.
>>>>>>>>>>>>> I am happy as the issue is fixed, but at the same time I am
>>>>>>>>>>>>> lost because I have no idea how things got fixed here (merely
>>>>>>>>>>>>> by mounting/unmounting the ocfs2 file systems).
>>>>>>>>>>>> From your description,
>>>>>>>>>>>> I just wonder whether the DLM resource works normally under
>>>>>>>>>>>> that situation.
>>>>>>>>>>>> Yan/Bin, do you have any comments about two-node clusters?
>>>>>>>>>>>> Which configuration settings will affect corosync quorum/DLM?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Gang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/12/2018 10:59 AM, Gang He wrote:
>>>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Usually, an ocfs2 resource startup failure is caused by the
>>>>>>>>>>>>>> mount command timing out (or hanging).
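On Gang's question about which settings affect corosync quorum and DLM: for two nodes, the usual approach is corosync's votequorum two-node mode, which keeps the cluster quorate with one node up but (via the implied wait-for-all) requires both nodes to have been seen once after a full start - matching the behavior reported in this thread. A sketch of the corosync.conf fragment:

```shell
# corosync.conf fragment (sketch) for a two-node votequorum setup
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    # two_node implies wait_for_all: 1. Overriding it with wait_for_all: 0
    # removes the both-nodes-seen-once requirement, but risks both nodes
    # starting independently (split brain) after a full outage.
}
```

Alternatively, as Klaus suggested earlier, corosync-qdevice with a qnetd server on a third machine provides a real tiebreaker vote without running a full third cluster node.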
>>>>>>>>>>>>>> A simple debugging method is:
>>>>>>>>>>>>>> remove the ocfs2 resource from crm first,
>>>>>>>>>>>>>> then mount this file system manually and see if the mount
>>>>>>>>>>>>>> command times out or hangs.
>>>>>>>>>>>>>> If this command hangs, please watch where the mount.ocfs2
>>>>>>>>>>>>>> process is hung via the "cat /proc/xxx/stack" command.
>>>>>>>>>>>>>> If the back trace stops in the DLM kernel module, usually
>>>>>>>>>>>>>> the root cause is a cluster configuration problem.
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 3/12/2018 7:32 AM, Gang He wrote:
>>>>>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think this problem is not in ocfs2; the cause looks
>>>>>>>>>>>>>>>> like the cluster quorum is missing.
>>>>>>>>>>>>>>>> For a two-node cluster (unlike a three-node cluster), if
>>>>>>>>>>>>>>>> one node is offline, quorum will be lost by default.
>>>>>>>>>>>>>>>> So, you should configure the two-node related quorum
>>>>>>>>>>>>>>>> settings according to the pacemaker manual.
>>>>>>>>>>>>>>>> Then DLM can work normally, and the ocfs2 resource can
>>>>>>>>>>>>>>>> start up.
>>>>>>>>>>>>>>> Yes, it's configured accordingly; no-quorum-policy is set
>>>>>>>>>>>>>>> to "ignore".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>>>         have-watchdog=true \
>>>>>>>>>>>>>>>         stonith-enabled=true \
>>>>>>>>>>>>>>>         stonith-timeout=80 \
>>>>>>>>>>>>>>>         startup-fencing=true \
>>>>>>>>>>>>>>>         no-quorum-policy=ignore
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This two-node cluster starts resources when both nodes
>>>>>>>>>>>>>>>>> are online but does not start the ocfs2 resources when
>>>>>>>>>>>>>>>>> one node is offline. E.g. if I gracefully stop the
>>>>>>>>>>>>>>>>> cluster resources, then stop the pacemaker service on
>>>>>>>>>>>>>>>>> either node, and try to start the ocfs2 resource on the
>>>>>>>>>>>>>>>>> online node, it fails.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> logs:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
>>>>>>>>>>>>>>>>> pipci001 pengine[17732]: notice: Calculated transition 2, saving
>>>>>>>>>>>>>>>>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>>>>> pipci001 crmd[17733]: notice: Processing graph 2
>>>>>>>>>>>>>>>>> (ref=pe_calc-dc-1520613202-31) derived from
>>>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on pipci001
>>>>>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
>>>>>>>>>>>>>>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>>>>>>>>>>>>>>> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69
>>>>>>>>>>>>>>>>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Initiating monitor operation dlm_monitor_60000 locally on pipci001
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0 locally on pipci001
>>>>>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
>>>>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>>>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>>>>>>>>>>>>>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>>>>>>>>>>>>>>>>> the lockspace group.
>>>>>>>>>>>>>>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>>>>>>>>>>>>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait
>>>>>>>>>>>>>>>>> for quorum
>>>>>>>>>>>>>>>>> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
>>>>>>>>>>>>>>>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>>>>>>>>>>>>>>>> event done -512 0
>>>>>>>>>>>>>>>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
>>>>>>>>>>>>>>>>> failed -512 0
>>>>>>>>>>>>>>>>> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
>>>>>>>>>>>>>>>>> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71
>>>>>>>>>>>>>>>>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>>>>>>>>>>>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>>>>>>>>>>>>>>>>> crmd[17733]: error: Result of start operation for p-fssapmnt on
>>>>>>>>>>>>>>>>> pipci001: Timed Out
>>>>>>>>>>>>>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>>>>>>>>>>>>>>>> (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Transition aborted by operation
>>>>>>>>>>>>>>>>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>>>>>>>>>>>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>>>>>>>>>>>>>>>> (target: 0 vs.
rc: 1): Error
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0,
>>>>>>>>>>>>>>>>> Skipped=0, Incomplete=6,
>>>>>>>>>>>>>>>>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>>>>>>>>>>>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Calculated transition 3, saving inputs in
>>>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>>>>>>>>>>>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>>>> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001
>>>>>>>>>>>>>>>>> after 1000000 failures (max=2)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>>>> pengine[17732]: notice: Calculated transition 4, saving inputs in
>>>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
>>>>>>>>>>>>>>>>> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Initiating stop operation p-fssapmnt_stop_0
>>>>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:stop call_id:72
>>>>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt
>>>>>>>>>>>>>>>>> on /sapmnt
>>>>>>>>>>>>>>>>> pipci001 lrmd[17730]: notice: finished - rsc:p-fssapmnt action:stop
>>>>>>>>>>>>>>>>> call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
>>>>>>>>>>>>>>>>> pipci001 crmd[17733]: notice: Result of stop operation for p-fssapmnt
>>>>>>>>>>>>>>>>> on pipci001: 0 (ok)
>>>>>>>>>>>>>>>>> crmd[17733]: notice: Initiating stop operation dlm_stop_0 locally on
>>>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>>>> pipci001 lrmd[17730]: notice: executing - rsc:dlm action:stop call_id:74
>>>>>>>>>>>>>>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored,
active lockspaces
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> resource configuration:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> primitive p-fssapmnt Filesystem \
>>>>>>>>>>>>>>>>>         params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \
>>>>>>>>>>>>>>>>>         op monitor interval=20 timeout=40 \
>>>>>>>>>>>>>>>>>         op start timeout=60 interval=0 \
>>>>>>>>>>>>>>>>>         op stop timeout=60 interval=0
>>>>>>>>>>>>>>>>> primitive dlm ocf:pacemaker:controld \
>>>>>>>>>>>>>>>>>         op monitor interval=60 timeout=60 \
>>>>>>>>>>>>>>>>>         op start interval=0 timeout=90 \
>>>>>>>>>>>>>>>>>         op stop interval=0 timeout=100
>>>>>>>>>>>>>>>>> clone base-clone base-group \
>>>>>>>>>>>>>>>>>         meta interleave=true target-role=Started
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> cluster properties:
>>>>>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>>>>>         have-watchdog=true \
>>>>>>>>>>>>>>>>>         stonith-enabled=true \
>>>>>>>>>>>>>>>>>         stonith-timeout=80 \
>>>>>>>>>>>>>>>>>         startup-fencing=true \
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Software versions:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> kernel version: 4.4.114-94.11-default
>>>>>>>>>>>>>>>>> pacemaker-1.1.16-4.8.x86_64
>>>>>>>>>>>>>>>>> corosync-2.3.6-9.5.1.x86_64
>>>>>>>>>>>>>>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>>>>>>>>>>>>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>>>>> libdlm3-4.0.7-1.28.x86_64
>>>>>>>>>>>>>>>>> libdlm-4.0.7-1.28.x86_64
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>> This email has been checked for viruses by Avast antivirus software.
>>>>>>>>>>>>>>>>> https://www.avast.com/antivirus
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> Users mailing list: Users@clusterlabs.org
>>>>>>>>>>>>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Muhammad Sharfuddin

--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Base Operating Systems
Red Hat
kwenn...@redhat.com