Re: [ClusterLabs] single node fails to start the ocfs2 resource

Muhammad Sharfuddin Fri, 09 Mar 2018 21:49:28 -0800

On 3/10/2018 10:00 AM, Andrei Borzenkov wrote:

09.03.2018 19:55, Muhammad Sharfuddin wrote:

Hi,


This two-node cluster starts resources when both nodes are online, but
does not start the ocfs2 resources when one node is offline. E.g., if I
gracefully stop the cluster resources, then stop the pacemaker service
on either node, and try to start the ocfs2 resource on the online node,
it fails.

logs:

pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
pipci001 pengine[17732]:   notice: Calculated transition 2, saving
inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
pipci001 crmd[17733]:   notice: Processing graph 2
(ref=pe_calc-dc-1520613202-31) derived from
/var/lib/pacemaker/pengine/pe-input-339.bz2
crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
pipci001
lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0
(ok)
crmd[17733]:   notice: Initiating monitor operation dlm_monitor_60000
locally on pipci001
crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
Filesystem(p-fssapmnt)[19052]: INFO: Running start for
/dev/mapper/sapmnt on /sapmnt
kernel: [ 4576.529938] dlm: Using TCP for communications
kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
the lockspace group.
dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out

That sounds like the problem. It attempts to fence the other node, but
you do not have any fencing resources configured so it cannot work. You
need to ensure you have working fencing agent in your configuration.

sbd is in use and working properly in this cluster, and after multiple
failed attempts to start the ocfs2 resource, this standalone online node
gets fenced too.

logs:
pengine[17732]:  warning: Scheduling Node pipci001 for STONITH

pengine[17732]:   notice: Stop of failed resource dlm:0 is implicit after pipci001 is fenced

pengine[17732]:   notice:  * Fence pipci001
pengine[17732]:   notice: Stop    sbd-stonith#011(pipci001)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)

pengine[17732]:  warning: Calculated transition 6 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-15.bz2
2018-03-09T21:03:30.588865+05:00 pipci002 crmd[13030]:   notice: Processing graph 6 (ref=pe_calc-dc-1520611410-34) derived from /var/lib/pacemaker/pengine/pe-warn-15.bz2

crmd[17733]:   notice: Requesting fencing (reboot) of node pipci001

stonith-ng[13026]:   notice: Client crmd.13030.f5570444 wants to fence (reboot) 'pipci001' with device '(any)'

stonith-ng[13026]:   notice: Requesting peer fencing (reboot) of pipci001
stonith-ng[13026]:   notice: sbd-stonith can fence (rebo

Also, as mentioned, this cluster starts resources when both nodes are
online; stonith is enabled and works too.

cluster properties:
property cib-bootstrap-options: \
        have-watchdog=true \
        stonith-enabled=true \
        stonith-timeout=80 \
        startup-fencing=true \

kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
event done -512 0
kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
failed -512 0
lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
crmd[17733]:    error: Result of start operation for p-fssapmnt on
pipci001: Timed Out
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition aborted by operation
p-fssapmnt_start_0 'modify' on pipci001: Event failed
crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
(target: 0 vs. rc: 1): Error
crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
Skipped=0, Incomplete=6,
Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
1000000 failures (max=2)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
1000000 failures (max=2)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 3, saving inputs in
/var/lib/pacemaker/pengine/pe-input-340.bz2
pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
required
pengine[17732]:   notice: On loss of CCM Quorum: Ignore
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
pipci001: unknown error (1)
pengine[17732]:  warning: Forcing base-clone away from pipci001 after
1000000 failures (max=2)
pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001
after 1000000 failures (max=2)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
pengine[17732]:   notice: Calculated transition 4, saving inputs in
/var/lib/pacemaker/pengine/pe-input-341.bz2
crmd[17733]:   notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
crmd[17733]:   notice: Initiating stop operation p-fssapmnt_stop_0
locally on pipci001
lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:stop call_id:72
Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt
on /sapmnt
pipci001 lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:stop
call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
pipci001 crmd[17733]:   notice: Result of stop operation for p-fssapmnt
on pipci001: 0 (ok)
crmd[17733]:   notice: Initiating stop operation dlm_stop_0 locally on
pipci001
pipci001 lrmd[17730]:   notice: executing - rsc:dlm action:stop call_id:74
pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
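
For anyone reading long logs like the ones above, the two symptoms worth pulling out are the dlm_controld "wait for quorum" messages and the lrmd start timeout. Note that the same logs show pengine reporting "On loss of CCM Quorum: Ignore" while dlm_controld is still waiting for quorum. Below is a minimal Python sketch that counts those lines; the function name `summarize` and the regular expressions are illustrative assumptions, not part of any Pacemaker or DLM tool, and the sample text is copied from the excerpt above.

```python
import re

# Sample lines copied from the log excerpt in this message.
LOG = """\
dlm_controld[19019]: 4629 fence work wait for quorum
dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
"""

# Hypothetical patterns, written only against the line shapes seen above.
QUORUM_WAIT = re.compile(r"dlm_controld\[\d+\]: \d+ .*wait for quorum")
TIMED_OUT = re.compile(r"lrmd\[\d+\]:\s+warning: (\S+?)(?::\d+)? .*timed out")


def summarize(log_text):
    """Count dlm quorum waits and collect the operations that timed out."""
    quorum_waits = 0
    timed_out_ops = set()
    for line in log_text.splitlines():
        if QUORUM_WAIT.search(line):
            quorum_waits += 1
        match = TIMED_OUT.search(line)
        if match:
            timed_out_ops.add(match.group(1))
    return {"quorum_waits": quorum_waits, "timed_out_ops": sorted(timed_out_ops)}


if __name__ == "__main__":
    print(summarize(LOG))
```

If the quorum-wait count is non-zero and the timed-out operation is the Filesystem start, that suggests the mount is blocked on the lockspace join rather than failing on the device itself.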


resource configuration:

primitive p-fssapmnt Filesystem \
         params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \
         op monitor interval=20 timeout=40 \
         op start timeout=60 interval=0 \
         op stop timeout=60 interval=0
primitive dlm ocf:pacemaker:controld \
         op monitor interval=60 timeout=60 \
         op start interval=0 timeout=90 \
         op stop interval=0 timeout=100
clone base-clone base-group \
         meta interleave=true target-role=Started
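
The start timeout configured above (60) lines up with the "timed out after 60000ms" message in the logs, i.e. the mount used its whole start window instead of failing quickly. A small Python sketch of that comparison follows; the helper `op_timeouts_ms` is hypothetical, and it assumes bare timeout values are seconds (Pacemaker's default unit for time values).

```python
import re

# Resource definition as posted in this message (wrapped line rejoined).
CONFIG = """\
primitive p-fssapmnt Filesystem \\
    params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \\
    op monitor interval=20 timeout=40 \\
    op start timeout=60 interval=0 \\
    op stop timeout=60 interval=0
"""


def op_timeouts_ms(config_text):
    """Map each op name to its timeout in ms (assumes bare numbers are seconds)."""
    timeouts = {}
    for name, attrs in re.findall(r"\bop (\w+)((?: \w+=\w+)+)", config_text):
        fields = dict(pair.split("=") for pair in attrs.split())
        if "timeout" in fields:
            timeouts[name] = int(fields["timeout"]) * 1000
    return timeouts


if __name__ == "__main__":
    observed_ms = 60000  # from "p-fssapmnt_start_0:19052 - timed out after 60000ms"
    configured = op_timeouts_ms(CONFIG)
    print(configured, configured["start"] == observed_ms)
```

Raising the start timeout alone would likely only delay the failure, since the logs show the lockspace join never completing while dlm_controld waits for quorum.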

cluster properties:
property cib-bootstrap-options: \
         have-watchdog=true \
         stonith-enabled=true \
         stonith-timeout=80 \
         startup-fencing=true \


Software versions:

kernel version: 4.4.114-94.11-default
pacemaker-1.1.16-4.8.x86_64
corosync-2.3.6-9.5.1.x86_64
ocfs2-kmp-default-4.4.114-94.11.3.x86_64
ocfs2-tools-1.8.5-1.35.x86_64
dlm-kmp-default-4.4.114-94.11.3.x86_64
libdlm3-4.0.7-1.28.x86_64
libdlm-4.0.7-1.28.x86_64


--
Regards,
Muhammad Sharfuddin



