On Thu, Dec 10, 2020 at 1:08 AM Reid Wahl <nw...@redhat.com> wrote:
>
> Thanks. I see it's only reproducible with stonith-enabled=false.
> That's the step I was skipping previously, as I always have stonith
> enabled in my clusters.
>
> I'm not sure whether that's expected behavior for some reason when
> stonith is disabled. Maybe someone else (e.g., Ken) can weigh in.
Never mind. This was a mistake on my part: I didn't re-add the stonith
**device** configuration when I re-enabled stonith. So the behavior is the
same regardless of whether stonith is enabled or not. I attribute it to the
OCF_ERR_CONFIGURED error. Why exactly is this behavior unexpected, from
your point of view?

Ref:
- https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#_how_are_ocf_return_codes_interpreted

> I also noticed that the state4.xml file has a return code of 6 for the
> resource's start operation. That's an OCF_ERR_CONFIGURED, which is a
> fatal error. At least for primitive resources, this type of error
> prevents the resource from starting anywhere. So I'm somewhat
> surprised that the clone instances don't stop on all nodes even when
> fencing **is** enabled.
>
>
> Without stonith:
>
> Allocation scores:
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
>
> Transition Summary:
> * Stop vg.bv_sanlock:0 ( node2 ) due to node availability
> * Stop vg.bv_sanlock:1 ( node1 ) due to node availability
>
> Executing cluster transition:
> * Pseudo action: vg.bv_sanlock-clone_stop_0
> * Resource action: vg.bv_sanlock stop on node2
> * Resource action: vg.bv_sanlock stop on node1
> * Pseudo action: vg.bv_sanlock-clone_stopped_0
>
>
> With stonith:
>
> Allocation scores:
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock-clone allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__clone_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:0 allocation score on node2: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node1: -INFINITY
> pcmk__native_allocate: vg.bv_sanlock:1 allocation score on node2: -INFINITY
>
> Transition Summary:
>
> Executing cluster transition:
>
> On Wed, Dec 9, 2020 at 10:33 PM Pavel Levshin <l...@581.spb.su> wrote:
> >
> > See the file attached. This one has been produced and tested with
> > pacemaker 1.1.16 (RHEL 7).
> >
> > --
> > Pavel
> >
> > 08.12.2020 10:14, Reid Wahl:
> > > Can you provide the state4.xml file that you're using? I'm unable to
> > > reproduce this issue by causing the clone instance to fail on one node.
> > >
> > > Might need some logs as well.
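For reference, a CIB snapshot suitable for this kind of crm_simulate replay
can be captured from the live cluster. A minimal sketch, assuming the
standard Pacemaker CLI tools are installed (the filename is just an example):

    # dump the live CIB, including the status section, to a file
    pcs cluster cib state4.xml

    # or with the lower-level tool
    cibadmin --query > state4.xml

    # replay the saved state offline to see what the scheduler would do
    crm_simulate -x state4.xml -S

crm_report can be used when logs need to be collected as well.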
> > >
> > > On Mon, Dec 7, 2020 at 10:40 PM Pavel Levshin <l...@581.spb.su> wrote:
> > >> Hello.
> > >>
> > >> Despite many years of Pacemaker use, it never stops fooling me...
> > >>
> > >> This time, I have faced a trivial problem. In my new setup, the cluster
> > >> consists of several identical nodes. A clone resource (vg.sanlock) is
> > >> started on every node, ensuring it has access to SAN storage. Almost all
> > >> other resources are colocated with and ordered after vg.sanlock.
> > >>
> > >> Today I started a node, and vg.sanlock failed to start. The cluster then
> > >> decided to stop all the clone instances "due to node availability",
> > >> taking down all other resources through their dependencies. This seems
> > >> illogical to me. When a clone instance fails, I would prefer to see it
> > >> stopped on that one node only. How do I do this properly?
> > >>
> > >> I've tried this config with Pacemaker 2.0.3 and 1.1.16; the behaviour
> > >> is the same in both.
> > >>
> > >> Reduced test config:
> > >>
> > >> pcs cluster auth test-pcmk0 test-pcmk1 <>/dev/tty
> > >>
> > >> pcs cluster setup --name test-pcmk test-pcmk0 test-pcmk1 --transport udpu \
> > >>     --auto_tie_breaker 1
> > >>
> > >> pcs cluster start --all --wait=60
> > >>
> > >> pcs cluster cib tmp-cib.xml
> > >>
> > >> cp tmp-cib.xml tmp-cib.xml.deltasrc
> > >>
> > >> pcs -f tmp-cib.xml property set stonith-enabled=false
> > >>
> > >> pcs -f tmp-cib.xml resource defaults resource-stickiness=100
> > >>
> > >> pcs -f tmp-cib.xml resource create vg.sanlock ocf:pacemaker:Dummy \
> > >>     op monitor interval=10 timeout=20 start interval=0s \
> > >>     stop interval=0s timeout=20
> > >>
> > >> pcs -f tmp-cib.xml resource clone vg.sanlock interleave=true
> > >>
> > >> pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc
> > >>
> > >>
> > >> And here is the cluster's reaction to the failure:
> > >>
> > >> # crm_simulate -x state4.xml -S
> > >>
> > >> Current cluster status:
> > >> Online: [ test-pcmk0 test-pcmk1 ]
> > >>
> > >>  Clone Set: vg.sanlock-clone [vg.sanlock]
> > >>      vg.sanlock (ocf::pacemaker:Dummy): FAILED test-pcmk0
> > >>      Started: [ test-pcmk1 ]
> > >>
> > >> Transition Summary:
> > >>  * Stop vg.sanlock:0 ( test-pcmk1 ) due to node availability
> > >>  * Stop vg.sanlock:1 ( test-pcmk0 ) due to node availability
> > >>
> > >> Executing cluster transition:
> > >>  * Pseudo action: vg.sanlock-clone_stop_0
> > >>  * Resource action: vg.sanlock stop on test-pcmk1
> > >>  * Resource action: vg.sanlock stop on test-pcmk0
> > >>  * Pseudo action: vg.sanlock-clone_stopped_0
> > >>  * Pseudo action: all_stopped
> > >>
> > >> Revised cluster status:
> > >> Online: [ test-pcmk0 test-pcmk1 ]
> > >>
> > >>  Clone Set: vg.sanlock-clone [vg.sanlock]
> > >>      Stopped: [ test-pcmk0 test-pcmk1 ]
> > >>
> > >> As a side note, if I make these clones globally-unique, they seem to
> > >> behave properly. But I have found no reference to this as a solution
> > >> anywhere. In general, globally-unique clones are only mentioned for
> > >> cases where the resource agent distinguishes between clone instances,
> > >> which is not the case here.
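For reference, the clone meta attribute in question can be set either when
the clone is created or on an existing clone. A minimal sketch along the
lines of the test config above (untested here; adjust to your pcs version):

    # set it at clone-creation time ...
    pcs -f tmp-cib.xml resource clone vg.sanlock interleave=true globally-unique=true

    # ... or add it to the existing clone
    pcs resource meta vg.sanlock-clone globally-unique=true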
> > >> > > >> > > >> -- > > >> > > >> Thanks, > > >> > > >> Pavel > > >> > > >> > > >> > > >> _______________________________________________ > > >> Manage your subscription: > > >> https://lists.clusterlabs.org/mailman/listinfo/users > > >> > > >> ClusterLabs home: https://www.clusterlabs.org/ > > > > > > > > _______________________________________________ > > Manage your subscription: > > https://lists.clusterlabs.org/mailman/listinfo/users > > > > ClusterLabs home: https://www.clusterlabs.org/ > > > > -- > Regards, > > Reid Wahl, RHCA > Senior Software Maintenance Engineer, Red Hat > CEE - Platform Support Delivery - ClusterHA -- Regards, Reid Wahl, RHCA Senior Software Maintenance Engineer, Red Hat CEE - Platform Support Delivery - ClusterHA _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/