Re: [ClusterLabs] single node fails to start the ocfs2 resource
Hello Muhammad,

Usually, an ocfs2 resource startup failure is caused by the mount command timing out (or hanging).
A simple way to debug this is to remove the ocfs2 resource from crm first, then mount the file system manually and see whether the mount command times out or hangs.
If the command hangs, check where the mount.ocfs2 process is stuck with the "cat /proc/xxx/stack" command (xxx being the PID of the mount.ocfs2 process).
If the back trace stops in the DLM kernel module, the root cause is usually a cluster configuration problem.

Thanks
Gang

>>>
> On 3/12/2018 7:32 AM, Gang He wrote:
>> Hello Muhammad,
>>
>> I think this problem is not in ocfs2, the cause looks like the cluster quorum is missed.
>> For two-node cluster (does not three-node cluster), if one node is offline, the quorum will be missed by default.
>> So, you should configure two-node related quorum setting according to the pacemaker manual.
>> Then, DLM can work normal, and ocfs2 resource can start up.
> Yes its configured accordingly, no-quorum is set to "ignore".
>
> property cib-bootstrap-options: \
>     have-watchdog=true \
>     stonith-enabled=true \
>     stonith-timeout=80 \
>     startup-fencing=true \
>     no-quorum-policy=ignore
>
>>
>> Thanks
>> Gang
>>
>>
>>> Hi,
>>>
>>> This two node cluster starts resources when both nodes are online but does not start the ocfs2 resources when one node is offline. e.g if I gracefully stop the cluster resources then stop the pacemaker service on either node, and try to start the ocfs2 resource on the online node, it fails.
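For anyone following along, the stack check from the reply above can be scripted. This is only a minimal sketch: the function names `read_kernel_stack` and `dlm_frames` are made up for illustration, the substring match on "dlm" is a rough heuristic and not a real diagnosis, and reading /proc/<pid>/stack normally requires root.

```python
from pathlib import Path
from typing import List


def read_kernel_stack(pid: int) -> str:
    """Return the kernel stack of a process, i.e. what
    `cat /proc/<pid>/stack` prints. Usually needs root."""
    return Path(f"/proc/{pid}/stack").read_text()


def dlm_frames(stack_text: str) -> List[str]:
    """Return the stack frames that mention DLM.

    Heuristic only: any frame containing "dlm" (case-insensitive)
    is reported, whether the match is in the symbol or module name.
    """
    return [
        line.strip()
        for line in stack_text.splitlines()
        if "dlm" in line.lower()
    ]


# Typical use, with the PID of the hanging mount.ocfs2 process:
#   frames = dlm_frames(read_kernel_stack(pid))
#   if frames: the mount is probably blocked inside DLM
```

If `dlm_frames` returns anything for the hanging mount.ocfs2 process, that points toward the cluster configuration problem described above and away from ocfs2 itself.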
Re: [ClusterLabs] single node fails to start the ocfs2 resource
On 3/12/2018 7:32 AM, Gang He wrote:
> Hello Muhammad,
>
> I think this problem is not in ocfs2, the cause looks like the cluster quorum is missed.
> For two-node cluster (does not three-node cluster), if one node is offline, the quorum will be missed by default.
> So, you should configure two-node related quorum setting according to the pacemaker manual.
> Then, DLM can work normal, and ocfs2 resource can start up.
Yes, it is configured accordingly; no-quorum-policy is set to "ignore".

property cib-bootstrap-options: \
    have-watchdog=true \
    stonith-enabled=true \
    stonith-timeout=80 \
    startup-fencing=true \
    no-quorum-policy=ignore

> Thanks
> Gang
>
>> Hi,
>>
>> This two node cluster starts resources when both nodes are online but does not start the ocfs2 resources when one node is offline. e.g if I gracefully stop the cluster resources then stop the pacemaker service on either node, and try to start the ocfs2 resource on the online node, it fails.
>>
>> logs:
>>
>> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
>> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
>> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
>> pipci001 pengine[17732]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>> pipci001 crmd[17733]: notice: Processing graph 2 (ref=pe_calc-dc-1520613202-31) derived from /var/lib/pacemaker/pengine/pe-input-339.bz2
>> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on pipci001
>> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69 pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
>> crmd[17733]: notice: Initiating monitor operation dlm_monitor_6 locally on pipci001
>> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0 locally on pipci001
>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
>> kernel: [ 4576.529938] dlm: Using TCP for communications
>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group.
>> dlm_controld[19019]: 4629 fence work wait for quorum
>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
>> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 6ms
>> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71 pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>> crmd[17733]: error: Result of start operation for p-fssapmnt on pipci001: Timed Out
>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>> crmd[17733]: notice: Transition aborted by operation p-fssapmnt_start_0 'modify' on pipci001: Event failed
>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>> pengine[17732]: notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-340.bz2
>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
>> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>> pengine[17732]: notice: Calculated transition 4, saving inputs in /va
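In the log above, dlm_controld reports "wait for quorum" twice shortly before the p-fssapmnt start operation times out, even though Pacemaker logs "On loss of CCM Quorum: Ignore". A small filter like the following can pull those lines out of a longer syslog. It is only a sketch: the function names are hypothetical, and the patterns are derived solely from the messages quoted in this thread, so other dlm_controld or pacemaker versions may word them differently.

```python
import re
from typing import Iterable, List

# Patterns modelled on the log lines quoted in this thread only.
QUORUM_WAIT = re.compile(r"dlm_controld\[\d+\]: \d+ .*wait for quorum")
OP_TIMEOUT = re.compile(
    r"lrmd\[\d+\]:\s+warning:\s+(\S+) process \(PID \d+\) timed out"
)


def quorum_waits(lines: Iterable[str]) -> List[str]:
    """Return the dlm_controld lines that say it is waiting for quorum."""
    return [line.strip() for line in lines if QUORUM_WAIT.search(line)]


def timed_out_operations(lines: Iterable[str]) -> List[str]:
    """Return the names of operations that lrmd reports as timed out."""
    names = []
    for line in lines:
        match = OP_TIMEOUT.search(line)
        if match:
            names.append(match.group(1))
    return names
```

Seeing both a quorum wait and a timed-out start operation in the same window would be consistent with the mount being blocked on DLM and not failing inside ocfs2.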
Re: [ClusterLabs] single node fails to start the ocfs2 resource
Hello Muhammad,

I think this problem is not in ocfs2; it looks like the cluster has lost quorum.
In a two-node cluster (unlike a three-node cluster), quorum is lost by default as soon as one node goes offline.
So you should configure the two-node quorum settings as described in the pacemaker manual.
Then DLM can work normally and the ocfs2 resource can start up.

Thanks
Gang

>>>
> Hi,
>
> This two node cluster starts resources when both nodes are online but does not start the ocfs2 resources when one node is offline. e.g if I gracefully stop the cluster resources then stop the pacemaker service on either node, and try to start the ocfs2 resource on the online node, it fails.
>
> logs:
>
> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
> pipci001 pengine[17732]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
> pipci001 crmd[17733]: notice: Processing graph 2 (ref=pe_calc-dc-1520613202-31) derived from /var/lib/pacemaker/pengine/pe-input-339.bz2
> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on pipci001
> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69 pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
> crmd[17733]: notice: Initiating monitor operation dlm_monitor_6 locally on pipci001
> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0 locally on pipci001
> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
> Filesystem(p-fssapmnt)[19052]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
> kernel: [ 4576.529938] dlm: Using TCP for communications
> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group.
> dlm_controld[19019]: 4629 fence work wait for quorum
> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 6ms
> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71 pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
> crmd[17733]: error: Result of start operation for p-fssapmnt on pipci001: Timed Out
> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
> crmd[17733]: notice: Transition aborted by operation p-fssapmnt_start_0 'modify' on pipci001: Event failed
> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
> pengine[17732]: notice: On loss of CCM Quorum: Ignore
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
> pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
> pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
> pengine[17732]: notice: Stop dlm:0#011(pipci001)
> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
> pengine[17732]: notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-340.bz2
> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
> pengine[17732]: notice: On loss of CCM Quorum: Ignore
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
> pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001 after 100 failures (max=2)
> pengine[17732]: notice: Stop dlm:0#011(pipci001)
> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
> pengine[17732]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]: notice: Processing
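As far as I know, the two-node quorum setting mentioned in the reply above corresponds to the `two_node: 1` option of corosync's votequorum, set in the `quorum` section of corosync.conf. The thread does not show this cluster's corosync.conf, so whether the option is set here is unknown. The following is a rough sketch for checking it, assuming the usual `section { key: value }` layout; the helper names and the sample text are illustrative and not taken from this cluster.

```python
from typing import Dict


def quorum_options(conf_text: str) -> Dict[str, str]:
    """Collect key: value pairs found directly inside the
    top-level `quorum { ... }` section of a corosync.conf text."""
    options: Dict[str, str] = {}
    sections = []  # stack of currently open section names
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.endswith("{"):
            sections.append(line[:-1].strip())
        elif line == "}":
            if sections:
                sections.pop()
        elif ":" in line and sections == ["quorum"]:
            key, value = line.split(":", 1)
            options[key.strip()] = value.strip()
    return options


def two_node_enabled(conf_text: str) -> bool:
    """True if the quorum section sets two_node to 1."""
    return quorum_options(conf_text).get("two_node") == "1"


# Illustrative sample only, not this cluster's configuration.
SAMPLE = """
quorum {
    provider: corosync_votequorum
    two_node: 1
}
"""
```

On a real node one would pass in the contents of /etc/corosync/corosync.conf. If `two_node_enabled` returns False, that would fit the "wait for quorum" messages from dlm_controld, since `no-quorum-policy=ignore` is a Pacemaker property and does not obviously change what corosync reports as quorate.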