On 02/09/2017 05:46 AM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 9 Feb 2017 19:24:22 +0900 (JST)
> renayama19661...@ybb.ne.jp wrote:
>
>> Hi Ken,
>>
>>> 1. Return a "hard" error such as OCF_ERR_ARGS or OCF_ERR_PERM. When
>>> Pacemaker gets one of these errors from an agent, it will ban the
>>> resource from that node (until the failure is cleared).
>>
>> The first suggestion does not work well.
>>
>> Even if the RA returns OCF_ERR_ARGS or OCF_ERR_PERM, it does so from its
>> pre_promote (notify) handling, and Pacemaker does not record a notify
>> (pre-promote) error in the CIB.
>>
>> * https://github.com/ClusterLabs/pacemaker/blob/master/crmd/lrm.c#L2411
>>
>> Because it is not recorded in the CIB, pengine cannot treat it as a
>> "hard error".

Ah, I didn't think of that.

> Indeed. That's why PAF uses private attributes to pass information between
> actions. We detect the failure during the notify as well, but raise the error
> during the promotion itself. See how I dealt with this in PAF:
>
> https://github.com/ioguix/PAF/commit/6123025ff7cd9929b56c9af2faaefdf392886e68

That's a nice use of private attributes.

> As private attributes do not work on older stacks, you could rely on a local
> temp file in $HA_RSCTMP as well.
>
>>> 2. Use crm_resource --ban instead. This would ban the resource from that
>>> node until the user removes the ban with crm_resource --clear (or by
>>> deleting the ban constraint from the configuration).
>>
>> The second suggestion works well.
>> I intend to adopt the second suggestion.
>>
>> As another method, crm_resource -F also seems to be available; what do you
>> think? I think last-failure would not be a problem either, since
>> crm_resource -F handles it as a simulated failure.
>>
>> I wondered whether crm_resource -F could be used, but I will adopt
>> crm_resource -B because the RA wants to stop the pgsql resource completely.
>>
>> ``` @pgsql RA
>>
>> pgsql_pre_promote() {
>> (snip)
>>     if [ "$cmp_location" != "$my_master_baseline" ]; then
>>         ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>         exec_with_retry 0 $CRM_RESOURCE -B -r $OCF_RESOURCE_INSTANCE -N $NODENAME -Q
>>         return $OCF_ERR_GENERIC
>>     fi
>> (snip)
>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>> CRM_RESOURCE="${HA_SBIN_DIR}/crm_resource"
>> ```
>>
>> I will test the behavior a little more and then send a patch.
>
> I suppose crm_resource -F will just raise the failcount, break the current
> transition and the CRM will recompute another transition paying attention to
> your "failed" resource (will it try to recover it? retry the previous
> transition again?).
>
> I would bet on crm_resource -B.

Correct, crm_resource -F only simulates OCF_ERR_GENERIC, which is a soft error.
It might be a nice extension to be able to specify the error code, but in this
case, I think crm_resource -B (or the private attribute approach, if you're OK
with limiting it to corosync 2 and pacemaker 1.1.13+) is better.

>> ----- Original Message -----
>>> From: Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>
>>> To: users@clusterlabs.org; kgail...@redhat.com
>>> Cc:
>>> Date: 2017/2/6, Mon 17:44
>>> Subject: [ClusterLabs] Antw: Re: [Question] About a change of crm_failcount.
>>>
>>>>>> Ken Gaillot <kgail...@redhat.com> wrote on 02.02.2017 at 19:33 in message
>>> <91a83571-9930-94fd-e635-962830671...@redhat.com>:
>>>> On 02/02/2017 12:23 PM, renayama19661...@ybb.ne.jp wrote:
>>>>> Hi All,
>>>>>
>>>>> With the following fix, users can no longer set any value other than
>>>>> zero with crm_failcount.
>>>>>
>>>>> - [Fix: tools: implement crm_failcount command-line options correctly]
>>>>> - https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4
>>>>>
>>>>> However, the pgsql RA sets INFINITY in its script.
>>>>>
>>>>> ```
>>>>> (snip)
>>>>> CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
>>>>> (snip)
>>>>>     ocf_exit_reason "My data is newer than new master's one. New master's location : $master_baseline"
>>>>>     exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v INFINITY
>>>>>     return $OCF_ERR_GENERIC
>>>>> (snip)
>>>>> ```
>>>>>
>>>>> This seems to affect only pgsql.
>>>>>
>>>>> Could you revise it so that a value other than zero can be set with
>>>>> crm_failcount? If it cannot be revised, we will modify the pgsql RA to
>>>>> use crm_attribute.
>>>>>
>>>>> Best Regards,
>>>>> Hideo Yamauchi.
>>>>
>>>> Hmm, I didn't realize that was used. I changed it because it's not a
>>>> good idea to set fail-count without also changing last-failure and
>>>> having a failed op in the LRM history.
>>>> I'll have to think about what the
>>>> best alternative is.
>>>
>>> The question also is whether the RA can achieve the same effect otherwise. I
>>> thought CRM sets the failcount, not the RA...
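
For reference, the "detect during notify, raise during promote" idea
discussed above, in its $HA_RSCTMP temp file variant, might look roughly
like the sketch below. This is not code from PAF or from the pgsql RA. The
function names, the marker file name and the fallback values are all
hypothetical, and OCF_ERR_PERM is just one of the two hard errors mentioned
in the thread. It also only helps when the node that detects the problem is
the node that later runs the promote action.

```shell
#!/bin/sh
# Sketch only: defer a failure found in the pre-promote notify to promote.
# HA_RSCTMP and OCF_RESOURCE_INSTANCE are normally provided by the cluster;
# the fallbacks below are assumptions so the sketch runs outside a cluster.
: "${HA_RSCTMP:=$(mktemp -d)}"
: "${OCF_RESOURCE_INSTANCE:=pgsql-demo}"

# Standard OCF return codes used here.
OCF_SUCCESS=0
OCF_ERR_PERM=4

# Hypothetical marker file that carries the decision between two actions.
FLAG_FILE="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.promote_refused"

# Called from the notify action (pre-promote). The notify result is not
# recorded in the CIB, so it only leaves a marker and reports success.
# $1: location of the local data, $2: baseline of the node to be promoted
demo_pre_promote_notify() {
    if [ "$1" != "$2" ]; then
        echo "local data differs from baseline $2" > "$FLAG_FILE"
    else
        rm -f "$FLAG_FILE"
    fi
    return $OCF_SUCCESS
}

# Called from the promote action. The result of promote is recorded, so a
# hard error returned here can be acted upon by the cluster.
demo_promote() {
    if [ -e "$FLAG_FILE" ]; then
        rm -f "$FLAG_FILE"
        return $OCF_ERR_PERM
    fi
    return $OCF_SUCCESS
}
```

Calling demo_pre_promote_notify with two different locations and then
demo_promote returns OCF_ERR_PERM once. With private attributes, the marker
file would be replaced by an attribute set during notify and read during
promote.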