On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
> Hi,
>
> 1 - if I put a node (node2) offline, the ocfs2 resources keep running
> on the online node (node1).
>
> 2 - while node2 was offline, via the cluster I stopped/started the
> ocfs2 resource group successfully many times in a row.
>
> 3 - while node2 was offline, I restarted the pacemaker service on
> node1 and then tried to start the ocfs2 resource group; dlm started
> but the ocfs2 file system resource did not start.
>
> Nutshell:
>
> a - both nodes must be online to start the ocfs2 resource.
>
> b - if one node crashes or goes offline (gracefully), the ocfs2
> resource keeps running on the other/surviving node.
>
> c - while one node was offline, we could stop/start the ocfs2
> resource group on the surviving node, but if we stop the pacemaker
> service, the ocfs2 file system resource does not start, with the
> following info in the logs:
From the logs I would say the startup of dlm_controld times out because it is waiting for quorum - which doesn't happen because of wait-for-all. The question is whether you really just stopped pacemaker, or stopped corosync as well. In the latter case I would say it is the expected behavior.

Regards,
Klaus

> lrmd[4317]: notice: executing - rsc:p-fssapmnt action:start call_id:53
> Filesystem(p-fssapmnt)[5139]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
> kernel: [ 706.162676] dlm: Using TCP for communications
> kernel: [ 706.162916] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group...
> dlm_controld[5105]: 759 fence work wait for quorum
> dlm_controld[5105]: 764 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[4317]: warning: p-fssapmnt_start_0 process (PID 5139) timed out
> lrmd[4317]: warning: p-fssapmnt_start_0:5139 - timed out after 60000ms
> lrmd[4317]: notice: finished - rsc:p-fssapmnt action:start call_id:53 pid:5139 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 766.056514] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
> kernel: [ 766.056528] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
> crmd[4320]: notice: Result of stop operation for p-fssapmnt on pipci001: 0 (ok)
> crmd[4320]: notice: Initiating stop operation dlm_stop_0 locally on pipci001
> lrmd[4317]: notice: executing - rsc:dlm action:stop call_id:56
> dlm_controld[5105]: 766 shutdown ignored, active lockspaces
> lrmd[4317]: warning: dlm_stop_0 process (PID 5326) timed out
> lrmd[4317]: warning: dlm_stop_0:5326 - timed out after 100000ms
> lrmd[4317]: notice: finished - rsc:dlm action:stop call_id:56 pid:5326 exit-code:1 exec-time:100003ms queue-time:0ms
> crmd[4320]: error: Result of stop operation for dlm on pipci001: Timed Out
> crmd[4320]: warning: Action 15 (dlm_stop_0) on pipci001 failed (target: 0 vs. rc: 1): Error
> crmd[4320]: notice: Transition aborted by operation dlm_stop_0 'modify' on pipci001: Event failed
> crmd[4320]: warning: Action 15 (dlm_stop_0) on pipci001 failed (target: 0 vs. rc: 1): Error
> pengine[4319]: notice: Watchdog will be used via SBD if fencing is required
> pengine[4319]: notice: On loss of CCM Quorum: Ignore
> pengine[4319]: warning: Processing failed op stop for dlm:0 on pipci001: unknown error (1)
> pengine[4319]: warning: Processing failed op stop for dlm:0 on pipci001: unknown error (1)
> pengine[4319]: warning: Cluster node pipci001 will be fenced: dlm:0 failed there
> pengine[4319]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
> pengine[4319]: notice: Stop of failed resource dlm:0 is implicit after pipci001 is fenced
> pengine[4319]: notice: * Fence pipci001
> pengine[4319]: notice: Stop sbd-stonith#011(pipci001)
> pengine[4319]: notice: Stop dlm:0#011(pipci001)
> crmd[4320]: notice: Requesting fencing (reboot) of node pipci001
> stonith-ng[4316]: notice: Client crmd.4320.4c2f757b wants to fence (reboot) 'pipci001' with device '(any)'
> stonith-ng[4316]: notice: Requesting peer fencing (reboot) of pipci001
> stonith-ng[4316]: notice: sbd-stonith can fence (reboot) pipci001: dynamic-list
>
> --
> Regards,
> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>
> On 3/13/2018 1:04 PM, Ulrich Windl wrote:
>> Hi!
>>
>> I'd recommend this:
>> Cleanly boot your nodes, avoiding any manual operation with cluster
>> resources. Keep the logs.
>> Then start your tests, keeping the logs for each.
>> Try to fix issues by reading the logs and adjusting the cluster
>> configuration, not by starting commands that the cluster should
>> start.
>>
>> We had a 2-node OCFS2 cluster running for quite some time with
>> SLES11, but now the cluster is three nodes. To me the output of
>> "crm_mon -1Arfj" combined with having set record-pending=true was
>> very valuable for finding problems.
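Klaus's wait-for-all point can be made concrete with a toy model of corosync's votequorum behavior (a simplified sketch, not the actual corosync code; the function and flag names here are ours):

```python
def has_quorum(votes_present, expected_votes, two_node=False,
               wait_for_all=False, all_seen_once=False):
    """Return True if a partition with `votes_present` votes is quorate."""
    if two_node and expected_votes == 2:
        # two_node lets a single surviving node keep quorum ...
        if wait_for_all and not all_seen_once:
            # ... but wait_for_all withholds quorum after a restart until
            # both nodes have been seen together at least once.
            return False
        return votes_present >= 1
    # plain majority rule for the general case
    return votes_present > expected_votes // 2

# The scenario from the thread: the surviving node restarts its cluster
# stack while the peer is still down, so wait-for-all blocks quorum:
print(has_quorum(1, 2, two_node=True, wait_for_all=True,
                 all_seen_once=False))  # → False: dlm_controld waits
```

This matches the logs: dlm_controld prints "wait for quorum" until the second node appears (or corosync was never restarted, so the all-seen condition still holds).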
>>
>> Regards,
>> Ulrich
>>
>>>>> Muhammad Sharfuddin <m.sharfud...@nds.com.pk> wrote on 13.03.2018 at 08:43 in
>> message <7b773ae9-4209-d246-b5c0-2c8b67e62...@nds.com.pk>:
>>> Dear Klaus,
>>>
>>> If I understand you properly, then it's a fencing issue, and whatever
>>> I am facing is "natural" or "by-design" in a two-node cluster where
>>> quorum is incomplete.
>>>
>>> I am quite convinced that you have pointed out the right thing
>>> because, when I start the dlm resource via the cluster and then try
>>> to start the ocfs2 file system manually from the command line, the
>>> mount command remains hung and the following events are reported in
>>> the logs:
>>>
>>> kernel: [62622.864828] ocfs2: Registered cluster interface user
>>> kernel: [62622.884427] dlm: Using TCP for communications
>>> kernel: [62622.884750] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group...
>>> dlm_controld[17655]: 62627 fence work wait for quorum
>>> dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>
>>> and then the following messages keep being reported every 5-10
>>> minutes, until I kill the mount.ocfs2 process:
>>>
>>> dlm_controld[17655]: 62627 fence work wait for quorum
>>> dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>
>>> I am also very much confused, because yesterday I did the same and
>>> was able to mount the ocfs2 file system manually from the command
>>> line (at least once), then unmount the file system manually, stop the
>>> dlm resource from the cluster, and then start/stop the complete ocfs2
>>> resource stack (dlm, file systems) successfully via the cluster even
>>> when only one machine was online.
>>>
>>> In a two-node cluster with ocfs2 resources, can't we run the ocfs2
>>> resources when quorum is incomplete (one node is offline)?
>>>
>>> --
>>> Regards,
>>> Muhammad Sharfuddin
>>>
>>> On 3/12/2018 5:58 PM, Klaus Wenninger wrote:
>>>> On 03/12/2018 01:44 PM, Muhammad Sharfuddin wrote:
>>>>> Hi Klaus,
>>>>>
>>>>> primitive sbd-stonith stonith:external/sbd \
>>>>>     op monitor interval=3000 timeout=20 \
>>>>>     op start interval=0 timeout=240 \
>>>>>     op stop interval=0 timeout=100 \
>>>>>     params sbd_device="/dev/mapper/sbd" \
>>>>>     meta target-role=Started
>>>> Makes more sense now.
>>>> Using pcmk_delay_max would probably be useful here
>>>> to prevent a fence race.
>>>> That stonith resource was not in your resource list below ...
>>>>
>>>>> property cib-bootstrap-options: \
>>>>>     have-watchdog=true \
>>>>>     stonith-enabled=true \
>>>>>     no-quorum-policy=ignore \
>>>>>     stonith-timeout=90 \
>>>>>     startup-fencing=true
>>>> You've set no-quorum-policy=ignore for pacemaker.
>>>> Whether this is a good idea in your setup or not is
>>>> written on another page.
>>>> But isn't dlm directly interfacing with corosync, so
>>>> that it would get the quorum state from there?
>>>> As you probably have two_node set on a 2-node cluster,
>>>> this would - after both nodes are down - wait for all
>>>> nodes to come up first.
>>>>
>>>> Regards,
>>>> Klaus
>>>>
>>>>> # ps -eaf | grep sbd
>>>>> root 6129 1 0 17:35 ? 00:00:00 sbd: inquisitor
>>>>> root 6133 6129 0 17:35 ? 00:00:00 sbd: watcher: /dev/mapper/sbd - slot: 1 - uuid: 6e80a337-95db-4608-bd62-d59517f39103
>>>>> root 6134 6129 0 17:35 ? 00:00:00 sbd: watcher: Pacemaker
>>>>> root 6135 6129 0 17:35 ? 00:00:00 sbd: watcher: Cluster
>>>>>
>>>>> This cluster does not start the ocfs2 resources when I first
>>>>> intentionally crash (reboot) both nodes and then try to start the
>>>>> ocfs2 resources while one node is offline.
>>>>>
>>>>> To fix the issue I have one permanent solution: bring the other
>>>>> (offline) node online, and things get fixed automatically, i.e. the
>>>>> ocfs2 resources mount.
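Klaus's pcmk_delay_max suggestion would, as a sketch, look like this in the crm shell on top of the sbd-stonith primitive quoted above (the 30s value is an assumption, not from the thread):

```
primitive sbd-stonith stonith:external/sbd \
    op monitor interval=3000 timeout=20 \
    op start interval=0 timeout=240 \
    op stop interval=0 timeout=100 \
    params sbd_device="/dev/mapper/sbd" pcmk_delay_max=30s \
    meta target-role=Started
```

The random delay of up to pcmk_delay_max before executing a fence action makes it unlikely that, in a split-brain, both nodes shoot each other at the same instant.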
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Muhammad Sharfuddin
>>>>>
>>>>> On 3/12/2018 5:25 PM, Klaus Wenninger wrote:
>>>>>> Hi Muhammad!
>>>>>>
>>>>>> Could you be a little more elaborate about your fencing setup?
>>>>>> I read about you using SBD, but I don't see any sbd fencing
>>>>>> resource. In case you wanted to use watchdog fencing with SBD,
>>>>>> this would require the stonith-watchdog-timeout property to be
>>>>>> set. But watchdog fencing relies on quorum (without 2-node
>>>>>> trickery) and thus wouldn't work on a 2-node cluster anyway.
>>>>>>
>>>>>> I didn't read through the whole thread - so I might be missing
>>>>>> something ...
>>>>>>
>>>>>> Regards,
>>>>>> Klaus
>>>>>>
>>>>>> On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
>>>>>>> Hello Gang,
>>>>>>>
>>>>>>> as informed previously, the cluster was fixed to start the ocfs2
>>>>>>> resources by:
>>>>>>>
>>>>>>> a) crm resource start dlm
>>>>>>>
>>>>>>> b) mounting/unmounting the ocfs2 file system manually (this step
>>>>>>> was the fix)
>>>>>>>
>>>>>>> and then starting the clone group (which includes dlm and the
>>>>>>> ocfs2 file systems) worked fine:
>>>>>>>
>>>>>>> c) crm resource start base-clone
>>>>>>>
>>>>>>> Now I crashed the nodes intentionally and then kept only one node
>>>>>>> online; again the cluster stopped starting the ocfs2 resources. I
>>>>>>> again tried to follow your instructions, i.e.
>>>>>>>
>>>>>>> i) crm resource start dlm
>>>>>>>
>>>>>>> then tried to mount the ocfs2 file system manually, which hung
>>>>>>> this time (previously, mounting manually helped me):
>>>>>>>
>>>>>>> # cat /proc/3966/stack
>>>>>>> [<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
>>>>>>> [<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
>>>>>>> [<ffffffffa03a02d9>] dlm_new_lockspace+0x69/0x160 [dlm]
>>>>>>> [<ffffffffa038e758>] user_cluster_connect+0xc8/0x350 [ocfs2_stack_user]
>>>>>>> [<ffffffffa03c2872>] ocfs2_cluster_connect+0x192/0x240 [ocfs2_stackglue]
>>>>>>> [<ffffffffa045eefc>] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>>>>>> [<ffffffffa04a9983>] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>>>>>> [<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0
>>>>>>> [<ffffffff8120ea1a>] mount_fs+0x3a/0x170
>>>>>>> [<ffffffff81228bf2>] vfs_kern_mount+0x62/0x110
>>>>>>> [<ffffffff8122b123>] do_mount+0x213/0xcd0
>>>>>>> [<ffffffff8122bed5>] SyS_mount+0x85/0xd0
>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>
>>>>>>> I killed the mount.ocfs2 process, stopped (crm resource stop dlm)
>>>>>>> the dlm process, and then tried to start (crm resource start dlm)
>>>>>>> the dlm (which previously always started successfully); this time
>>>>>>> dlm didn't start, and I checked the dlm_controld process:
>>>>>>>
>>>>>>> # cat /proc/3754/stack
>>>>>>> [<ffffffff8121dc55>] poll_schedule_timeout+0x45/0x60
>>>>>>> [<ffffffff8121f0bc>] do_sys_poll+0x38c/0x4f0
>>>>>>> [<ffffffff8121f2dd>] SyS_poll+0x5d/0xe0
>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>
>>>>>>> Nutshell:
>>>>>>>
>>>>>>> 1 - this cluster is configured to run when a single node is
>>>>>>> online
>>>>>>>
>>>>>>> 2 - this cluster does not start the ocfs2 resources after a crash
>>>>>>> when only one node is online.
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>>>>
>>>>>>> On 3/12/2018 12:41 PM, Gang He wrote:
>>>>>>>>> Hello Gang,
>>>>>>>>>
>>>>>>>>> to follow your instructions, I started the dlm resource via:
>>>>>>>>>
>>>>>>>>> crm resource start dlm
>>>>>>>>>
>>>>>>>>> then mounted/unmounted the ocfs2 file system manually (which
>>>>>>>>> seems to be the fix for the situation).
>>>>>>>>>
>>>>>>>>> Now the resources are getting started properly on a single
>>>>>>>>> node. I am happy that the issue is fixed, but at the same time
>>>>>>>>> I am lost because I have no idea how things got fixed here
>>>>>>>>> (merely by mounting/unmounting the ocfs2 file systems).
>>>>>>>> From your description,
>>>>>>>> I just wonder whether the DLM resource works normally under that
>>>>>>>> situation.
>>>>>>>> Yan/Bin, do you have any comments about two-node clusters? Which
>>>>>>>> configuration settings will affect corosync quorum/DLM?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Gang
>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>
>>>>>>>>> On 3/12/2018 10:59 AM, Gang He wrote:
>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>
>>>>>>>>>> Usually, an ocfs2 resource startup failure is caused by the
>>>>>>>>>> mount command timing out (or hanging).
>>>>>>>>>> A simple debugging method is:
>>>>>>>>>> remove the ocfs2 resource from crm first,
>>>>>>>>>> then mount this file system manually and see if the mount
>>>>>>>>>> command times out or hangs.
>>>>>>>>>> If the command hangs, please watch where the mount.ocfs2
>>>>>>>>>> process is hung via the "cat /proc/xxx/stack" command.
>>>>>>>>>> If the back trace stops in the DLM kernel module, usually the
>>>>>>>>>> root cause is a cluster configuration problem.
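Gang's rule of thumb above (a back trace that stops in the DLM module usually points at cluster configuration) can be sketched as a tiny checker over pasted /proc/<pid>/stack output. This is a hypothetical helper, not part of any ocfs2/dlm tooling:

```python
def blocked_in_dlm(stack_text):
    """True if any stack frame belongs to the dlm or ocfs2 stack modules."""
    suffixes = ("[dlm]", "[ocfs2_stack_user]")
    return any(line.strip().endswith(suffixes)
               for line in stack_text.splitlines())

# The hung mount.ocfs2 trace from earlier in the thread (abridged):
hung_trace = """\
[<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
[<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
[<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0"""

print(blocked_in_dlm(hung_trace))  # → True: suspect cluster config/quorum
```

By contrast, the dlm_controld trace in the thread (poll_schedule_timeout, do_sys_poll) has no module suffix and would report False: that daemon is just idling in poll, not stuck in DLM.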
>>>>>>>>>> Thanks
>>>>>>>>>> Gang
>>>>>>>>>>
>>>>>>>>>>> On 3/12/2018 7:32 AM, Gang He wrote:
>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>
>>>>>>>>>>>> I think this problem is not in ocfs2; the cause looks like
>>>>>>>>>>>> the cluster quorum is missing.
>>>>>>>>>>>> For a two-node cluster (unlike a three-node cluster), if one
>>>>>>>>>>>> node is offline, quorum will be lost by default.
>>>>>>>>>>>> So you should configure the two-node related quorum settings
>>>>>>>>>>>> according to the pacemaker manual.
>>>>>>>>>>>> Then DLM can work normally, and the ocfs2 resource can start
>>>>>>>>>>>> up.
>>>>>>>>>>> Yes, it's configured accordingly; no-quorum-policy is set to
>>>>>>>>>>> "ignore".
>>>>>>>>>>>
>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>     have-watchdog=true \
>>>>>>>>>>>     stonith-enabled=true \
>>>>>>>>>>>     stonith-timeout=80 \
>>>>>>>>>>>     startup-fencing=true \
>>>>>>>>>>>     no-quorum-policy=ignore
>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Gang
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This two-node cluster starts resources when both nodes are
>>>>>>>>>>>>> online, but does not start the ocfs2 resources when one
>>>>>>>>>>>>> node is offline. E.g., if I gracefully stop the cluster
>>>>>>>>>>>>> resources, then stop the pacemaker service on either node,
>>>>>>>>>>>>> and try to start the ocfs2 resource on the online node, it
>>>>>>>>>>>>> fails.
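Gang's advice to configure the two-node quorum settings maps to corosync's votequorum section. A minimal sketch, assuming corosync 2.x (note that pacemaker's no-quorum-policy=ignore, as set above, does not change what corosync reports to dlm_controld):

```
# /etc/corosync/corosync.conf (fragment)
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    # two_node implicitly enables wait_for_all: after a full cluster
    # restart, quorum is withheld until both nodes have been seen
    # together at least once.
    # wait_for_all: 0   # would disable that safeguard, at the cost of
    #                   # split-brain risk on startup
}
```

This is consistent with the observed behavior: with both nodes down and only one restarted, dlm_controld logs "wait for quorum" until the peer joins.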
>>>>>>>>>>>>>
>>>>>>>>>>>>> logs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
>>>>>>>>>>>>> pipci001 pengine[17732]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>> pipci001 crmd[17733]: notice: Processing graph 2 (ref=pe_calc-dc-1520613202-31) derived from /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on pipci001
>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
>>>>>>>>>>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>>>>>>>>>>> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69 pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>>>>>>>>>>> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
>>>>>>>>>>>>> crmd[17733]: notice: Initiating monitor operation dlm_monitor_60000 locally on pipci001
>>>>>>>>>>>>> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0 locally on pipci001
>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>>>>>>>>>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining the lockspace group...
>>>>>>>>>>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>>>>>>>>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>>>>>>>>>>> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
>>>>>>>>>>>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group event done -512 0
>>>>>>>>>>>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join failed -512 0
>>>>>>>>>>>>> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
>>>>>>>>>>>>> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71 pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>>>>>>>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>>>>>>>>>>>>> crmd[17733]: error: Result of start operation for p-fssapmnt on pipci001: Timed Out
>>>>>>>>>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>> crmd[17733]: notice: Transition aborted by operation p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>>>>>>>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>>>>>>>>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>>>>>>>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>>>>>>>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is required
>>>>>>>>>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on pipci001: unknown error (1)
>>>>>>>>>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001 after 1000000 failures (max=2)
>>>>>>>>>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>> pengine[17732]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>> crmd[17733]: notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36) derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>> crmd[17733]: notice: Initiating stop operation p-fssapmnt_stop_0 locally on pipci001
>>>>>>>>>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:stop call_id:72
>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>> pipci001 lrmd[17730]: notice: finished - rsc:p-fssapmnt action:stop call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
>>>>>>>>>>>>> pipci001 crmd[17733]: notice: Result of stop operation for p-fssapmnt on pipci001: 0 (ok)
>>>>>>>>>>>>> crmd[17733]: notice: Initiating stop operation dlm_stop_0 locally on pipci001
>>>>>>>>>>>>> pipci001 lrmd[17730]: notice: executing - rsc:dlm action:stop call_id:74
>>>>>>>>>>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
>>>>>>>>>>>>>
>>>>>>>>>>>>> resource configuration:
>>>>>>>>>>>>>
>>>>>>>>>>>>> primitive p-fssapmnt Filesystem \
>>>>>>>>>>>>>     params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \
>>>>>>>>>>>>>     op monitor interval=20 timeout=40 \
>>>>>>>>>>>>>     op start timeout=60 interval=0 \
>>>>>>>>>>>>>     op stop timeout=60 interval=0
>>>>>>>>>>>>> primitive dlm ocf:pacemaker:controld \
>>>>>>>>>>>>>     op monitor interval=60 timeout=60 \
>>>>>>>>>>>>>     op start interval=0 timeout=90 \
>>>>>>>>>>>>>     op stop interval=0 timeout=100
>>>>>>>>>>>>> clone base-clone base-group \
>>>>>>>>>>>>>     meta interleave=true target-role=Started
>>>>>>>>>>>>>
>>>>>>>>>>>>> cluster properties:
>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>     have-watchdog=true \
>>>>>>>>>>>>>     stonith-enabled=true \
>>>>>>>>>>>>>     stonith-timeout=80 \
>>>>>>>>>>>>>     startup-fencing=true \
>>>>>>>>>>>>>
>>>>>>>>>>>>> Software versions:
>>>>>>>>>>>>>
>>>>>>>>>>>>> kernel version: 4.4.114-94.11-default
>>>>>>>>>>>>> pacemaker-1.1.16-4.8.x86_64
>>>>>>>>>>>>> corosync-2.3.6-9.5.1.x86_64
>>>>>>>>>>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>>>>>>>>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>> libdlm3-4.0.7-1.28.x86_64
>>>>>>>>>>>>> libdlm-4.0.7-1.28.x86_64
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Muhammad Sharfuddin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org