On 05/26/2017 03:21 AM, 井上 和徳 wrote:
> Hi Ken,
>
> I found the cause.
>
> When stonith is executed, stonithd sends a result and a notification to crmd:
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/fencing/remote.c#L402-L406
>
> - When the "result" is sent (calling do_local_reply()), the number of stonith
>   failures is checked in too_many_st_failures():
>   https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L638-L669
>
> - When the "notification" is sent (calling do_stonith_notify()), the number of
>   failures is incremented in st_fail_count_increment():
>   https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L704-L726
>
> Since the check therefore happens before the increment, the number of failures
> in the "Too many failures (10) to fence" log does not match the actual number
> of failures.
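A minimal standalone sketch of that ordering (simplified, hypothetical code, not Pacemaker's actual implementation; `max_attempts` stands in for the stonith-max-attempts option) shows why the threshold message first appears one failure late:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified model of the ordering described above: the
 * threshold is checked when the fencing *result* is handled, but the
 * failure counter is only incremented when the *notification* arrives
 * afterwards. */

static int st_fail_count = 0;       /* failures recorded so far     */
static const int max_attempts = 10; /* stonith-max-attempts default */

/* One failed fencing attempt: check first (result handling), increment
 * second (notification handling).  Returns true once the "too many
 * failures" message would be logged. */
static bool handle_failed_fence(void)
{
    bool too_many = (st_fail_count >= max_attempts);
    st_fail_count++;
    return too_many;
}

/* Count how many failed attempts occur before the message first fires. */
static int attempts_until_threshold(void)
{
    int attempts = 0;

    st_fail_count = 0;
    while (!handle_failed_fence()) {
        attempts++;
    }
    return attempts + 1; /* include the attempt that tripped the check */
}
```

Because the check sees the counter *before* the current failure is recorded, `attempts_until_threshold()` returns 11: the message is first logged on the 11th failure even though the threshold is 10, matching the logs quoted later in the thread.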
Thanks for this analysis! We do want the result to be sent before the
notifications, so the solution will be slightly more complicated. The DC will
have to call st_fail_count_increment() when receiving the result, while non-DC
nodes will continue to call it when receiving the notification. I'll put
together a fix before the 1.1.17 release.

> I confirmed that the expected result is obtained with the following change:
>
> # git diff
> diff --git a/fencing/remote.c b/fencing/remote.c
> index 4a47d49..3ff324e 100644
> --- a/fencing/remote.c
> +++ b/fencing/remote.c
> @@ -399,12 +399,12 @@ handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc)
>      reply = stonith_construct_reply(op->request, NULL, data, rc);
>      crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate);
>
> -    /* Send fencing OP reply to local client that initiated fencing */
> -    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> -
>      /* bcast to all local clients that the fencing operation happend */
>      do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data);
>
> +    /* Send fencing OP reply to local client that initiated fencing */
> +    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> +
>      /* mark this op as having notify's already sent */
>      op->notify_sent = TRUE;
>      free_xml(reply);
>
> Regards,
> Kazunori INOUE
>
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgail...@redhat.com]
>> Sent: Wednesday, May 17, 2017 11:09 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is
>> not accurate
>>
>> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
>>> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>>>> Hi,
>>>> I'm testing Pacemaker-1.1.17-rc1.
>>>> The number of failures in the "Too many failures (10) to fence" log
>>>> does not match the number of actual failures.
>>>
>>> Well, it kind of does: after 10 failures it doesn't try fencing again,
>>> so that is what the failure count stays at ;-)
>>> Of course it still sees the need to fence, but doesn't actually try.
>>>
>>> Regards,
>>> Klaus
>>
>> This feature can be a little confusing: it doesn't prevent all further
>> fence attempts of the target, just *immediate* fence attempts. Whenever
>> the next transition is started for some other reason (a configuration or
>> state change, cluster-recheck-interval, node failure, etc.), it will try
>> to fence again.
>>
>> Also, it only checks this threshold if it's aborting a transition
>> *because* of this fence failure. If it's aborting the transition for
>> some other reason, the number can go higher than the threshold. That's
>> what I'm guessing happened here.
>>
>>>> After the 11th fence failure, "Too many failures (10) to fence" is
>>>> output.
>>>> Incidentally, stonith-max-attempts has not been set, so it is 10 by
>>>> default.
>>>>
>>>> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
>>>> ## Requesting fencing : 1st time
>>>> May 12 05:51:47 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:52:52 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
>>>> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 2nd time
>>>> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:53:56 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
>>>> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 3rd time
>>>> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:55:01 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
>>>> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 4th time
>>>> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:56:05 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
>>>> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 5th time
>>>> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:57:10 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
>>>> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 6th time
>>>> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:58:14 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
>>>> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 7th time
>>>> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:59:19 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
>>>> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 8th time
>>>> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:00:24 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
>>>> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 9th time
>>>> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:01:28 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
>>>> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 10th time
>>>> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:02:33 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
>>>> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>> ## 11th time
>>>> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:03:37 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
>>>> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
>>>> May 12 06:03:37 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
>>>>
>>>> Regards,
>>>> Kazunori INOUE

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org