Re: [Linux-HA] the stop sequence for group resource

Andrew Beekhof Mon, 02 Jun 2008 01:51:43 -0700

On Mon, Jun 2, 2008 at 8:51 AM, Junko IKEDA <[EMAIL PROTECTED]> wrote:
> Hi,
>
> This is a question about the stop sequence for a group resource.
> We have two nodes and seven resources in one group.
>
> # crm_mon -1
> Node: node-b (db8f2da4-a7fb-40bf-bf14-befe4af11db7): online
> Node: node-a (8029f8c4-1f03-4695-a78a-29c02fdd399c): online
>
> Resource Group: group
>    dummy-1     (heartbeat::ocf:Dummy): Started node-a
>    dummy-2     (heartbeat::ocf:Dummy): Started node-a
>    dummy-3     (heartbeat::ocf:Dummy): Started node-a
>    dummy-4     (heartbeat::ocf:Dummy): Started node-a
>    dummy-5     (heartbeat::ocf:Dummy): Started node-a
>    dummy-6     (heartbeat::ocf:Dummy-stop-ng): Started node-a
>    dummy-7     (heartbeat::ocf:Dummy): Started node-a
>
> Our test is as follows.
> Induce a stop error for dummy-6 by editing the RA:
> # vim /usr/lib/ocf/resource.d/heartbeat/Dummy-stop-ng
> dummy_stop() {
>    return $OCF_ERR_GENERIC
>    dummy_monitor
>    if [ $? =  $OCF_SUCCESS ]; then
>        rm ${OCF_RESKEY_state}
>    fi
>    return $OCF_SUCCESS
> }
>
>
> # crm_standby -U node-a -v on
> # crm_mon -1
> Node: node-b (db8f2da4-a7fb-40bf-bf14-befe4af11db7): online
> Node: node-a (8029f8c4-1f03-4695-a78a-29c02fdd399c): standby
>
> Resource Group: group
>    dummy-1     (heartbeat::ocf:Dummy): Started node-a
>    dummy-2     (heartbeat::ocf:Dummy): Started node-a
>    dummy-3     (heartbeat::ocf:Dummy): Started node-a
>    dummy-4     (heartbeat::ocf:Dummy): Started node-a
>    dummy-5     (heartbeat::ocf:Dummy): Started node-a
>    dummy-6     (heartbeat::ocf:Dummy-stop-ng): Started node-a (unmanaged) FAILED
>    dummy-7     (heartbeat::ocf:Dummy): Stopped
>
> Failed actions:
>    dummy-6_stop_0 (node=node-a, call=24, rc=1): Error
>
> The stop of dummy-6 failed.
> After that, we restored the RA.
> # vim /usr/lib/ocf/resource.d/heartbeat/Dummy-stop-ng
> dummy_stop() {
> #   return $OCF_ERR_GENERIC
>    dummy_monitor
>    if [ $? =  $OCF_SUCCESS ]; then
>        rm ${OCF_RESKEY_state}
>    fi
>    return $OCF_SUCCESS
> }
>
> Then we deleted the fail-count and cleaned up the resource.
> # crm_failcount -r dummy-6 -U node-a -D
> # crm_resource -r dummy-6 -H node-a -C
>
> The resource group failed over from node-a to node-b successfully,
> but part of the stop sequence is strange.
> See the attached hb_report.
> According to pe-input-6.bz2,
> it seems that dummy-5 stops before dummy-6.
> I expected the stop actions to proceed in reverse order: dummy-7 -> 6 -> 5 and so on.
> Is this a special case?
> Or did we make a mistake in our operations?


Looks like it's your test.

We don't stop dummy-5, 4, 3... until you run crm_resource -C (since you
have on_fail=block for stop failures).  The cleanup removes our knowledge
that dummy-6 is running, so we assume it's not and go on stopping the
rest of the group.

Then (because of the probe) we find out it _is_ running after all, and
we end up in the situation contained in pe-input-6.bz2.

We only guarantee that the probe for rscX completes before we start rscX.
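
The sequence described above can be illustrated with a small toy model.
This is only a sketch, not Heartbeat or policy engine code: the function
plan_stops and its two arguments are hypothetical names made up for the
illustration, and the "known running" lists are assumptions that mirror
the crm_mon output in this thread.

```shell
#!/usr/bin/env bash
# Toy model only: NOT Heartbeat/CRM code. plan_stops is a hypothetical helper.
# It walks the group in reverse order and stops every member believed to be
# running, but never goes past a member whose stop is blocked.

group="dummy-1 dummy-2 dummy-3 dummy-4 dummy-5 dummy-6 dummy-7"

# plan_stops BLOCKED KNOWN_RUNNING
#   BLOCKED:       member with a failed stop (on_fail=block), or "" for none
#   KNOWN_RUNNING: space-separated members the cluster believes are running
# Prints the members that would be stopped, in order, on one line.
plan_stops() {
    local blocked="$1" known="$2" reversed="" out="" r
    for r in $group; do reversed="$r $reversed"; done
    for r in $reversed; do
        case " $known " in
            *" $r "*) ;;        # believed running: candidate for a stop
            *) continue ;;      # believed stopped: nothing to schedule
        esac
        if [ "$r" = "$blocked" ]; then
            break               # blocked member holds up the rest of the group
        fi
        out="${out:+$out }$r"
    done
    echo "$out"
}

# 1. dummy-7 is stopped, the stop of dummy-6 failed and is blocked.
before_cleanup=$(plan_stops "dummy-6" "dummy-1 dummy-2 dummy-3 dummy-4 dummy-5 dummy-6")

# 2. After the cleanup, the failure and the status of dummy-6 are forgotten.
after_cleanup=$(plan_stops "" "dummy-1 dummy-2 dummy-3 dummy-4 dummy-5")

# 3. The probe reports dummy-6 as running (assumption: the stops of the
#    other members have already been scheduled by then).
after_probe=$(plan_stops "" "dummy-6")

echo "before cleanup: $before_cleanup"
echo "after cleanup:  $after_cleanup"
echo "after probe:    $after_probe"
```

In this model nothing is stopped before the cleanup, dummy-5 down to
dummy-1 are stopped right after it, and the stop for dummy-6 only shows up
once the probe has reported it as running, which matches the order seen
in pe-input-6.bz2.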
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
