On 05/26/2017 03:21 AM, 井上 和徳 wrote:
> Hi Ken,
> 
> I found the cause.
> 
> When stonith is executed, stonithd sends results and notifications to crmd.
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/fencing/remote.c#L402-L406
> 
> - When the "result" is sent (calling do_local_reply()), whether there have 
> been too many stonith failures is checked in too_many_st_failures().
>   
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L638-L669
> - When the "notification" is sent (calling do_stonith_notify()), the number 
> of failures is incremented in st_fail_count_increment().
>   
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L704-L726
> Therefore, since the check is done before the increment, the number of 
> failures in the "Too many failures (10) to fence" log does not match the 
> actual number of failures.

Thanks for this analysis!

We do want the result to be sent before the notifications, so the
solution will be slightly more complicated. The DC will have to call
st_fail_count_increment() when receiving the result, while non-DC nodes
will continue to call it when receiving the notification.

I'll put together a fix before the 1.1.17 release.

> 
> I confirmed that the following change produces the expected result.
> 
> # git diff
> diff --git a/fencing/remote.c b/fencing/remote.c
> index 4a47d49..3ff324e 100644
> --- a/fencing/remote.c
> +++ b/fencing/remote.c
> @@ -399,12 +399,12 @@ handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc)
>      reply = stonith_construct_reply(op->request, NULL, data, rc);
>      crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate);
> 
> -    /* Send fencing OP reply to local client that initiated fencing */
> -    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> -
>      /* bcast to all local clients that the fencing operation happend */
>      do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data);
> 
> +    /* Send fencing OP reply to local client that initiated fencing */
> +    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> +
>      /* mark this op as having notify's already sent */
>      op->notify_sent = TRUE;
>      free_xml(reply);
> 
> Regards,
> Kazunori INOUE
> 
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Wednesday, May 17, 2017 11:09 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is 
>> not accurate
>>
>> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
>>> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>>>> Hi,
>>>> I'm testing Pacemaker-1.1.17-rc1.
>>>> The number of failures in the "Too many failures (10) to fence" log does 
>>>> not match the actual number of failures.
>>>
>>> Well, it kind of does: after 10 failures it doesn't try fencing again,
>>> so that is where the failure count stays ;-)
>>> Of course it still sees the need to fence, but doesn't actually try.
>>>
>>> Regards,
>>> Klaus
>>
>> This feature can be a little confusing: it doesn't prevent all further
>> fence attempts of the target, just *immediate* fence attempts. Whenever
>> the next transition is started for some other reason (a configuration or
>> state change, cluster-recheck-interval, node failure, etc.), it will try
>> to fence again.
>>
>> Also, it only checks this threshold if it's aborting a transition
>> *because* of this fence failure. If it's aborting the transition for
>> some other reason, the number can go higher than the threshold. That's
>> what I'm guessing happened here.
>>
>>>> After the 11th fence failure, "Too many failures (10) to fence" is 
>>>> output.
>>>> Incidentally, stonith-max-attempts has not been set, so it is 10 by 
>>>> default.
>>>>
>>>> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
>>>> ## Requesting fencing: 1st time
>>>> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
>>>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 2nd time
>>>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
>>>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 3rd time
>>>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
>>>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 4th time
>>>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
>>>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 5th time
>>>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
>>>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 6th time
>>>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
>>>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 7th time
>>>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
>>>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 8th time
>>>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
>>>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 9th time
>>>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
>>>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 10th time
>>>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
>>>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 11th time
>>>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
>>>> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
>>>> May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>>
>>>> Regards,
>>>> Kazunori INOUE

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
