Hi! I didn't read the logs carefully, but I remember one pitfall (SLES 11): if I formatted the filesystem while the OCFS2 services were not running, I was unable to mount it; I had to reformat the filesystem with the OCFS2 services running. Maybe that helps.
Regards,
Ulrich

>>> "Gang He" <g...@suse.com> wrote on 12.03.2018 at 06:59 in message <5aa687c8020000f9000ae...@prv-mh.provo.novell.com>:
> Hello Muhammad,
>
> Usually, an ocfs2 resource startup failure is caused by the mount command timing out (or hanging).
> A simple way to debug it is:
> first remove the ocfs2 resource from crm,
> then mount the file system manually and see whether the mount command times out or hangs.
> If the command hangs, check where the mount.ocfs2 process is stuck via the "cat /proc/xxx/stack" command.
> If the backtrace stops in the DLM kernel module, the root cause is usually a cluster configuration problem.
>
> Thanks
> Gang
>
>> On 3/12/2018 7:32 AM, Gang He wrote:
>>> Hello Muhammad,
>>>
>>> I think this problem is not in ocfs2; the cause looks like missing cluster quorum.
>>> In a two-node cluster (unlike a three-node cluster), if one node is offline, quorum is lost by default.
>>> So you should configure the two-node quorum settings according to the Pacemaker manual.
>>> Then DLM can work normally, and the ocfs2 resource can start up.
>> Yes, it is configured accordingly; no-quorum-policy is set to "ignore".
>>
>> property cib-bootstrap-options: \
>>     have-watchdog=true \
>>     stonith-enabled=true \
>>     stonith-timeout=80 \
>>     startup-fencing=true \
>>     no-quorum-policy=ignore
>>
>>> Thanks
>>> Gang
>>>
>>>> Hi,
>>>>
>>>> This two-node cluster starts resources when both nodes are online, but
>>>> does not start the ocfs2 resources when one node is offline. E.g. if I
>>>> gracefully stop the cluster resources, then stop the pacemaker service
>>>> on either node, and try to start the ocfs2 resource on the online node,
>>>> it fails.
>>>>
>>>> logs:
>>>>
>>>> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
>>>> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
>>>> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
>>>> pipci001 pengine[17732]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>> pipci001 crmd[17733]: notice: Processing graph 2 (ref=pe_calc-dc-1520613202-31) derived from /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on pipci001
>>>> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69 pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
>>>> crmd[17733]: notice: Initiating monitor operation dlm_monitor_60000 locally on pipci001
>>>> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0 locally on pipci001
>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group.
>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
>>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
>>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
>>>> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
>>>> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71 pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>>>> crmd[17733]: error: Result of start operation for p-fssapmnt on pipci001: Timed Out
>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>>>> crmd[17733]: notice: Transition aborted by operation p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>>>> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>> pengine[17732]: notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>> pengine[17732]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>> crmd[17733]: notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36) derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>> crmd[17733]: notice: Initiating stop operation p-fssapmnt_stop_0 locally on pipci001
>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:stop call_id:72
>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt on /sapmnt
>>>> pipci001 lrmd[17730]: notice: finished - rsc:p-fssapmnt action:stop call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
>>>> pipci001 crmd[17733]: notice: Result of stop operation for p-fssapmnt on pipci001: 0 (ok)
>>>> crmd[17733]: notice: Initiating stop operation dlm_stop_0 locally on pipci001
>>>> pipci001 lrmd[17730]: notice: executing - rsc:dlm action:stop call_id:74
>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
>>>>
>>>> resource configuration:
>>>>
>>>> primitive p-fssapmnt Filesystem \
>>>>     params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \
>>>>     op monitor interval=20 timeout=40 \
>>>>     op start timeout=60 interval=0 \
>>>>     op stop timeout=60 interval=0
>>>> primitive dlm ocf:pacemaker:controld \
>>>>     op monitor interval=60 timeout=60 \
>>>>     op start interval=0 timeout=90 \
>>>>     op stop interval=0 timeout=100
>>>> clone base-clone base-group \
>>>>     meta interleave=true target-role=Started
>>>>
>>>> cluster properties:
>>>>
>>>> property cib-bootstrap-options: \
>>>>     have-watchdog=true \
>>>>     stonith-enabled=true \
>>>>     stonith-timeout=80 \
>>>>     startup-fencing=true \
>>>>
>>>> Software versions:
>>>>
>>>> kernel version: 4.4.114-94.11-default
>>>> pacemaker-1.1.16-4.8.x86_64
>>>> corosync-2.3.6-9.5.1.x86_64
>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>> libdlm3-4.0.7-1.28.x86_64
>>>> libdlm-4.0.7-1.28.x86_64
>>>>
>>>> --
>>>> Regards,
>>>> Muhammad Sharfuddin
>>>>
>>>> ---
>>>> This email has been checked for viruses by Avast antivirus software.
>>>> https://www.avast.com/antivirus
>>>>
>>>> _______________________________________________
>>>> Users mailing list: Users@clusterlabs.org
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>
>> --
>> Regards,
>> Muhammad Sharfuddin
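[Editorial note on the "wait for quorum" messages above: dlm_controld takes its quorum state from corosync's votequorum service, not from Pacemaker, so no-quorum-policy=ignore in the CIB does not unblock DLM. On corosync 2.x, a two-node cluster normally needs the two_node votequorum option so a single surviving node stays quorate. A minimal sketch of the relevant corosync.conf section follows; the totem and nodelist sections are omitted and must match the existing configuration:]

```
quorum {
    provider: corosync_votequorum
    # two_node implies wait_for_all: both nodes must be seen once at
    # startup, after which either node alone remains quorate.
    two_node: 1
}
```

After changing corosync.conf on both nodes and restarting corosync, `corosync-quorumtool -s` should report the cluster as quorate even with one node down, and dlm_controld should then proceed instead of blocking in "wait for quorum".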