@Ulrich,

The issue I am facing is that when both nodes have crashed and I then keep one node offline, the online node does not start the ocfs2 resources.

--
Regards,
Muhammad Sharfuddin

On 3/12/2018 4:51 PM, Muhammad Sharfuddin wrote:
Hello Gang,

as reported previously, the cluster was fixed to start the ocfs2 resources by:

a) crm resource start dlm

b) mount/umount the ocfs2 file system manually (this step was the fix; a command sketch follows this list)

and then starting the clone group (which includes dlm and the ocfs2 file systems) worked fine:

c) crm resource start base-clone.
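
For reference, the manual mount/umount in step (b) amounted to roughly the following, using the device and mount point from the p-fssapmnt resource configuration quoted further down (adjust to your own layout):

      crm resource start dlm                       # step (a): DLM must be up before a clustered mount
      mount -t ocfs2 /dev/mapper/sapmnt /sapmnt    # step (b): mount by hand...
      umount /sapmnt                               # ...and unmount again
      crm resource start base-clone                # step (c): start the whole clone group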

Now I crashed the nodes intentionally and then kept only one node online; again the cluster failed to start the ocfs2 resources. I again tried to follow your instructions, i.e.

i) crm resource start dlm

ii) then tried to mount the ocfs2 file system manually, which hung this time (previously mounting manually helped me):

# cat /proc/3966/stack
[<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
[<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
[<ffffffffa03a02d9>] dlm_new_lockspace+0x69/0x160 [dlm]
[<ffffffffa038e758>] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
[<ffffffffa03c2872>] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
[<ffffffffa045eefc>] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
[<ffffffffa04a9983>] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
[<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0
[<ffffffff8120ea1a>] mount_fs+0x3a/0x170
[<ffffffff81228bf2>] vfs_kern_mount+0x62/0x110
[<ffffffff8122b123>] do_mount+0x213/0xcd0
[<ffffffff8122bed5>] SyS_mount+0x85/0xd0
[<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
[<ffffffffffffffff>] 0xffffffffffffffff

I killed the mount.ocfs2 process, stopped the dlm resource (crm resource stop dlm), and then tried to start it again (crm resource start dlm); previously dlm always started successfully, but this time it did not start, so I checked the dlm_controld process:

# cat /proc/3754/stack
[<ffffffff8121dc55>] poll_schedule_timeout+0x45/0x60
[<ffffffff8121f0bc>] do_sys_poll+0x38c/0x4f0
[<ffffffff8121f2dd>] SyS_poll+0x5d/0xe0
[<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
[<ffffffffffffffff>] 0xffffffffffffffff

In a nutshell:

1 - this cluster is configured to run when only a single node is online

2 - this cluster does not start the ocfs2 resources after a crash when only one node is online.

--
Regards,
Muhammad Sharfuddin | +923332144823 | nds.com.pk

On 3/12/2018 12:41 PM, Gang He wrote:


Hello Gang,

to follow your instructions, I started the dlm resource via:

      crm resource start dlm

then mounted/unmounted the ocfs2 file system manually (which seems to be
the fix for the situation).

Now resources are getting started properly on a single node. I am happy that the issue is fixed, but at the same time I am lost, because I have no idea
how things got fixed here (merely by mounting/unmounting the ocfs2 file
systems).
From your description, I just wonder whether the DLM resource does not work normally in that situation. Yan/Bin, do you have any comments about two-node clusters? Which configuration settings will affect corosync quorum/DLM?
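
For anyone debugging this, the quorum state dlm_controld acts on can be inspected directly; a minimal sketch, assuming the standard corosync and dlm userspace tools are installed on the node:

      corosync-quorumtool -s    # corosync's view of membership, expected votes and quorate state
      dlm_tool status           # dlm_controld's view, including whether it is waiting for fencing/quorum
      dlm_tool ls               # lockspaces and their join state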


Thanks
Gang



--
Regards,
Muhammad Sharfuddin

On 3/12/2018 10:59 AM, Gang He wrote:
Hello Muhammad,

Usually, an ocfs2 resource startup failure is caused by the mount command
timing out (or hanging).
A simple debugging method is:
remove the ocfs2 resource from crm first,
then mount the file system manually and see whether the mount command
times out or hangs.
If the command hangs, please check where the mount.ocfs2 process is hanging
via the "cat /proc/xxx/stack" command.
If the back trace stops in the DLM kernel module, the root cause is usually
a cluster configuration problem.
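
A minimal sketch of that procedure, using the device/mount point from the resource configuration quoted later in this thread and a placeholder PID (substitute your own values):

      crm resource stop p-fssapmnt                 # take the ocfs2 filesystem resource out of cluster control
      crm resource start dlm                       # the clustered mount still needs DLM running
      mount -t ocfs2 /dev/mapper/sapmnt /sapmnt    # try the mount by hand
      # if it hangs, locate the mount.ocfs2 process and dump its kernel stack
      ps -ef | grep [m]ount.ocfs2
      cat /proc/<pid>/stack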

Thanks
Gang


On 3/12/2018 7:32 AM, Gang He wrote:
Hello Muhammad,

I think this problem is not in ocfs2; the cause looks like lost cluster
quorum.
For a two-node cluster (unlike a three-node cluster), if one node is offline,
quorum is lost by default.
So you should configure the two-node related quorum settings according to the
pacemaker manual.
Then DLM can work normally, and the ocfs2 resource can start up.
Yes, it is configured accordingly; no-quorum-policy is set to "ignore":

property cib-bootstrap-options: \
            have-watchdog=true \
            stonith-enabled=true \
            stonith-timeout=80 \
            startup-fencing=true \
            no-quorum-policy=ignore
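
One point worth noting: no-quorum-policy=ignore only tells Pacemaker to keep managing resources without quorum; dlm_controld takes its quorum state from corosync itself, which matches the "wait for quorum" messages in the logs below. The corosync-level two-node setting lives in the votequorum section; a minimal sketch of /etc/corosync/corosync.conf, with illustrative values rather than ones taken from this cluster:

      quorum {
              provider: corosync_votequorum
              expected_votes: 2
              two_node: 1   # grant quorum with only one of the two nodes up
              # note: two_node enables wait_for_all by default, so after a full
              # outage corosync waits until both nodes have been seen at least once
      }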

Thanks
Gang


Hi,

This two-node cluster starts resources when both nodes are online, but it
does not start the ocfs2 resources when one node is offline.

E.g., if I gracefully stop the cluster resources, then stop the pacemaker
service on either node, and try to start the ocfs2 resource on the online
node, it fails.

logs:

pipci001 pengine[17732]:   notice: Start dlm:0#011(pipci001)
pengine[17732]:   notice: Start p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0 (ok)
crmd[17733]:   notice: Initiating monitor operation dlm_monitor_60000 locally on pipci001
crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
Filesystem(p-fssapmnt)[19052]: INFO: Running start for
/dev/mapper/sapmnt on /sapmnt
kernel: [ 4576.529938] dlm: Using TCP for communications
kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
the lockspace group.
dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
event done -512 0
kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
failed -512 0
lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
crmd[17733]:    error: Result of start operation for p-fssapmnt on
pipci001: Timed Out
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition aborted by operation
p-fssapmnt_start_0 'modify' on pipci001: Event failed
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
Skipped=0, Incomplete=6,
Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
1000000 failures (max=2)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
1000000 failures (max=2)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:   notice: Stop p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 3, saving inputs in
/var/lib/pacemaker/pengine/pe-input-340.bz2
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
1000000 failures (max=2)
pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001
after 1000000 failures (max=2)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:   notice: Stop p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 4, saving inputs in
/var/lib/pacemaker/pengine/pe-input-341.bz2
crmd[17733]:   notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
crmd[17733]:   notice: Initiating stop operation p-fssapmnt_stop_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:stop call_id:72
Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt
on /sapmnt
pipci001 lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:stop
call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
pipci001 crmd[17733]:   notice: Result of stop operation for p-fssapmnt
on pipci001: 0 (ok)
crmd[17733]:   notice: Initiating stop operation dlm_stop_0 locally on
pipci001
pipci001 lrmd[17730]:   notice: executing - rsc:dlm action:stop call_id:74
pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces


resource configuration:

primitive p-fssapmnt Filesystem \
            params device="/dev/mapper/sapmnt" directory="/sapmnt"
fstype=ocfs2 \
            op monitor interval=20 timeout=40 \
            op start timeout=60 interval=0 \
            op stop timeout=60 interval=0
primitive dlm ocf:pacemaker:controld \
            op monitor interval=60 timeout=60 \
            op start interval=0 timeout=90 \
            op stop interval=0 timeout=100
clone base-clone base-group \
            meta interleave=true target-role=Started
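
(The base-group that base-clone refers to is not quoted above; judging from the resources in the logs it presumably looks something like the following, with p-fsusrsap defined analogously to p-fssapmnt:)

group base-group dlm p-fssapmnt p-fsusrsap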

cluster properties:
property cib-bootstrap-options: \
            have-watchdog=true \
            stonith-enabled=true \
            stonith-timeout=80 \
            startup-fencing=true \


Software versions:

kernel version: 4.4.114-94.11-default
pacemaker-1.1.16-4.8.x86_64
corosync-2.3.6-9.5.1.x86_64
ocfs2-kmp-default-4.4.114-94.11.3.x86_64
ocfs2-tools-1.8.5-1.35.x86_64
dlm-kmp-default-4.4.114-94.11.3.x86_64
libdlm3-4.0.7-1.28.x86_64
libdlm-4.0.7-1.28.x86_64


--
Regards,
Muhammad Sharfuddin



--
Regards,
Muhammad Sharfuddin



_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
