Re: [devel] [PATCH 1 of 1] amfd: try again if ImplementerClear times out

Anders Björnerstedt Thu, 28 May 2015 05:45:41 -0700

I mentioned "aware of the timing difference" but forgot to elaborate.


The typical om-client has a retrye lopp for TRY_AGAIN with a short wait time
between each try and a max number of retries.

This works well/predicatably if the timing for each TRY_AGAIN is stable 
and "short".
By "short" i mean realtime-ish like milliseconds.

But if that loop gets repeated ERR_TIMEOUT where each such timeout is 10 
seconds (default
then you could be looking at applications apparently hanging for minutes.

So just treating ERR_TIMEOUT the same as TRY_AGAIN even for idempotent 
operations has to
be done with care. The retry loop timing/number-of-retries needs to be 
sensitive to TIMEOUT
or TRY_AGAIN.

/AndersBj

On 05/28/2015 02:36 PM, Anders Björnerstedt wrote:
> Note that you have to be careful about generalizing the handling of
> ERR_TIMEOUT.
> That error code is the only error code that literally means you dont
> know what happened.
> The operation may have been executed or it may be still on going or it
> may have failed.
> You dont know which.
>
> For all other error codes (except SA_AIS_OK) including of course
> TRY_AGAIN and NO_RESOURCES,
> the user knows that the oepration was not executed.
>
> So handling NO_RESOURCES the same way as TRY_AGAIN is possible.
> But handling ERR_TIMEOUT the same way as TRY_AGAIN is only possible if
> hte operation is idempotent
> and you are aware of the timing difference.
>
> Idempotent operations are operations where you an invoke a request and
> get ack iof the request was already
> fulfilled when it was received. saImmOiImplementerClear is one such
> idempotent operation.
>
> saImmOiImplementerSet, classImplementerSet, objectImplementerSet,
> classImplementerRelease, objectImplementerRelease
> are all idempotent.
>
> But saImmOmCcbObjectCreate is not.
>
> Most (probably all) finalize operations are idempotent.
>
> /AndersBj
>
> On 05/28/2015 01:19 PM, Nagendra Kumar wrote:
>> Hi Hans N,
>>              Thanks for pointing it out.
>> Do you mean, we should add handling of errors like BAD_HANDLE, TIMEOUT in 
>> immutil function and remove handling from Amf code ? If yes, then I can 
>> raise an enhancement ticket on imm util and another on 'Amf/other services 
>> in general' to clean the code. Is that ok?
>>
>> Thanks
>> -Nagu
>>> -----Original Message-----
>>> From: Hans Nordebäck [mailto:[email protected]]
>>> Sent: 28 May 2015 16:32
>>> To: Anders Björnerstedt; Nagendra Kumar; Praveen Malviya
>>> Cc: [email protected]
>>> Subject: RE: [devel] [PATCH 1 of 1] amfd: try again if ImplementerClear 
>>> times
>>> out
>>>
>>> Hi, in ticket #1290 BAD_HANDLE, one comment was that preferably should
>>> immutil be updated to handle BAD_HANDLE. Probably immutil should also
>>> handle other errors, e.g. ERR_TIMEOUT, as for this ticket #1360.
>>>
>>> /Thanks HansN
>>>
>>> -----Original Message-----
>>> From: Anders Björnerstedt
>>> Sent: den 27 maj 2015 09:52
>>> To: Nagendra Kumar; Hans Nordebäck; Praveen Malviya
>>> Cc: [email protected]
>>> Subject: RE: [devel] [PATCH 1 of 1] amfd: try again if ImplementerClear 
>>> times
>>> out
>>>
>>> Hi again,
>>>
>>> I understand that this case is apparently due to message loss IMMND->IMMD.
>>>
>>> Since change-set:
>>>
>>>      changeset:   3523:5228e50660b4
>>>      user:        Anders Bjornerstedt <[email protected]>
>>>      date:        Fri Apr 27 15:02:01 2012 +0200
>>>      summary:     Immsv: IMMND subscribes (mds) in normal mode on IMMD
>>> (#2446)
>>>
>>> IMMND subscribes to IMMD sv-id in NORMAL mode, i.e. not REDUNDANT
>>> mode.
>>> MDS will then supposedly take care of buffering messages that can not be
>>> delivered due to absence of IMMD until the IMMD is back. The IMMD itself
>>> MBCPs fevs mesasages (which ImplementerCLear is) to standby before
>>> sending.
>>>
>>> So this message loss could only happen if the IMMD goes down after having
>>> received the request, but before having MBCP'ed the request.  So this should
>>> be very rare indeed.
>>>
>>> The fix in itself is safe though.
>>>
>>> I worry about the followingthough.
>>> Assume that instead of ImplementerClear, the AMf simply closes the OI
>>> handle and re-opens it.
>>> This should have exactly the same effect.
>>> A close of handle will be registgered and acceptet by the local IMMND and 
>>> the
>>> local IMMND will close the implementer by generating that message internally
>>> to the IMMND. If that message gets lost, then the AMFD would here actually
>>> get Ok, but we would possibly have an "orphaned" OI.
>>>
>>> I have not seen one such case though.
>>>
>>> /AndersBj
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Anders Björnerstedt [mailto:[email protected]]
>>> Sent: den 27 maj 2015 09:23
>>> To: Nagendra Kumar; Hans Nordebäck; Praveen Malviya
>>> Cc: [email protected]
>>> Subject: Re: [devel] [PATCH 1 of 1] amfd: try again if ImplementerClear 
>>> times
>>> out
>>>
>>> Ok, that should be a very rare error scenario, but in this case for this
>>> particular operation (ImplementerClear) which is an idempotent operation,
>>> then repeating the request should work and be safe.
>>>
>>> /AndersBj
>>>
>>> -----Original Message-----
>>> From: Nagendra Kumar [mailto:[email protected]]
>>> Sent: den 27 maj 2015 08:35
>>> To: Hans Nordebäck; Praveen Malviya
>>> Cc: [email protected]
>>> Subject: Re: [devel] [PATCH 1 of 1] amfd: try again if ImplementerClear 
>>> times
>>> out
>>>
>>> Hi Hans N/Anders Bj,
>>>             Thanks for your review.
>>>>>> A log record can be written before the "goto try_again".
>>> I would like to get logs record for BAD_HANDLE and other types of errors 
>>> also.
>>> That's why the log record is on top common for all the errors.
>>>
>>>>>> An alternative to try_again can be to change IMMA_SYNCR_TIMEOUT
>>> from the default 10 seconds to a higher value.
>>> In the scenario of the ticket(specific), Immd has gone down and the packet
>>> has been lost along with Immd and hence increase of wait time will increase
>>> the HA unavailability.) So, my thought was it is ok for default wait time 
>>> for 10
>>> seconds, and try again so that local Immd will respond, hence quicker
>>> recovery.
>>>
>>> Thanks
>>> -Nagu
>>>
>>>> -----Original Message-----
>>>> From: Hans Nordebäck [mailto:[email protected]]
>>>> Sent: 26 May 2015 18:15
>>>> To: Nagendra Kumar; Praveen Malviya
>>>> Cc: [email protected]
>>>> Subject: Re: [PATCH 1 of 1] amfd: try again if ImplementerClear times
>>>> out
>>>>
>>>> ack, code review only. Some comments below discussed with AndersBj.
>>>>
>>>> A log record can be written before the "goto try_again". An
>>>> alternative to try_again can be to change IMMA_SYNCR_TIMEOUT from the
>>>> default 10 seconds to a higher value.
>>>>
>>>> /Thanks HansN
>>>>
>>>> On 05/12/2015 10:44 AM, [email protected] wrote:
>>>>>     osaf/services/saf/amf/amfd/role.cc |  9 ++++++---
>>>>>     1 files changed, 6 insertions(+), 3 deletions(-)
>>>>>
>>>>>
>>>>> ImplementerClear returns SA_AIS_ERR_TIMEOUT because the message
>>> sent
>>>>> from immnd to immd got lost because of node down.
>>>>> So, Amf need to handle it as try_again.
>>>>> After Amf does try_again, Imm returns SA_AIS_ERR_BAD_HANDLE.
>>>>> Amf handles SA_AIS_ERR_BAD_HANDLE and reinitializes its interface
>>>>> with
>>>> Imm.
>>>>> In handling SA_AIS_ERR_BAD_HANDLE, Amf sets the implementer in
>>>> avd_imm_reinit_bg_thread, so
>>>>> when Amf proceeds and calls avd_imm_applier_set, Imm returns
>>>>> SA_AIS_ERR_INVALID_PARAM. Since, it is expected error from Imm, Amf
>>>>> ignores it.
>>>>>
>>>>> diff --git a/osaf/services/saf/amf/amfd/role.cc
>>>> b/osaf/services/saf/amf/amfd/role.cc
>>>>> --- a/osaf/services/saf/amf/amfd/role.cc
>>>>> +++ b/osaf/services/saf/amf/amfd/role.cc
>>>>> @@ -659,15 +659,17 @@ void avd_mds_qsd_role_evh(AVD_CL_CB *cb,
>>>>>                   _exit(EXIT_FAILURE); // should never get here...
>>>>>           }
>>>>>
>>>>> +try_again:
>>>>>           /* Take mutex here to sync with imm reinit thread.*/
>>>>>           osaf_mutex_lock_ordie(&imm_reinit_mutex);
>>>>> -
>>>>>           /* Give up IMM OI implementer role */
>>>>>           if ((rc = immutil_saImmOiImplementerClear(cb->immOiHandle)) !=
>>>> SA_AIS_OK) {
>>>>>                   osaf_mutex_unlock_ordie(&imm_reinit_mutex);
>>>>>                   LOG_ER("FAILOVER Active --> Quiesced FAILED,
>>>> ImplementerClear failed %u", rc);
>>>>>                   if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>>>                           avd_imm_reinit_bg();
>>>>> +         } else if (rc == SA_AIS_ERR_TIMEOUT) {
>>>>> +                 goto try_again;
>>>>>                   } else
>>>>>                           osafassert(0);
>>>>>           } else
>>>>> @@ -685,10 +687,11 @@ void avd_mds_qsd_role_evh(AVD_CL_CB *cb,
>>>>>                   LOG_ER("avd_imm_applier_set FAILED, %u", rc);
>>>>>                   if (rc == SA_AIS_ERR_BAD_HANDLE) {
>>>>>                           avd_imm_reinit_bg();
>>>>> -         } else if (rc == SA_AIS_ERR_EXIST) {
>>>>> +         } else if ((rc == SA_AIS_ERR_EXIST) || (rc ==
>>>> SA_AIS_ERR_INVALID_PARAM)) {
>>>>>                           /* This may arise if
>>>> immutil_saImmOiImplementerClear
>>>>>                              failed and amf reinitializes imm interface 
>>>>> and
>>>>> -                    set applier in avd_imm_reinit_bg_thread.*/
>>>>> +                    set applier in avd_imm_reinit_bg_thread. Imm may
>>>>> +                    return ERR_EXIST or INVALID_PARAM. */
>>>>>                   } else
>>>>>                           osafassert(0);
>>>>>           } else
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Opensaf-devel mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Opensaf-devel mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Opensaf-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel



------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1 of 1] amfd: try again if ImplementerClear times out

Reply via email to