The concrete reason that the sync pauses is that it has reached the
fevs-max-pending limit.
This means it has sent its allowed quota of messages without having received
replies.
The counter is incremented when a message is sent to the IMMD and decremented
when a fevs message from the IMMD is received back at this IMMND.

So the congestion could be at any of three points:
A) The outgoing send queue (MDS or TIPC) at this IMMND.
B) The IMMD itself, which is the bottleneck.
C) The incoming receive queue (MDS or TIPC) at this IMMND.

The TIPC link tolerance could be affecting (B).
But this is a general mechanism, not one that exists primarily to cope with
node departures. Its primary purpose is to prevent all the IMMNDs from
overloading the IMMD, which is the central bottleneck.

So the fix is to ensure that we don't give up sooner than the TIPC link
tolerance, while still retrying at high frequency, at least initially, to
avoid slowing down the sync.

/AndersBj

-----Original Message-----
From: Hans Feldt 
Sent: den 24 oktober 2014 13:29
To: A V Mahesh; opensaf-devel@lists.sourceforge.net; Anders Björnerstedt
Subject: Re: [devel] [PATCH 1 of 1] IMM: Sync-retry reverted back to 
milliseconds with increase on retries [#1188]


On 10/24/2014 06:45 AM, A V Mahesh wrote:
> Hi Anders Bj,
>
> The change to 2 seconds is required to address the following issue,
> observed with TIPC multicast:
>
> With TIPC multicast enabled, when a node sync is in progress the sync
> messages are sent at a very high rate for a large IMM DB. If at that
> moment any other node in the cluster restarts (TIPC link down), the
> send buffers fill up to the maximum within the TIPC link tolerance of
> 1.5 seconds, so sendto() at the IMMD blocks and we hit the limit of 16
> pending FEVS replies.
> Just increasing the SA_AIS_ERR_TRY_AGAIN sleep time of
> saImmOmSearchNext_2() in imm_loader above the TIPC link tolerance of
> 1.5 seconds resolved the issue.
I don't understand. Didn't the sync just pause (because IMMD was blocked in
send) for the duration of the TIPC link tolerance and then continue?
Or did it fail?
/Hans
>
> So to address the above issue I changed the 150 ms to 2 seconds, which
> is above the TIPC link tolerance of 1.5 seconds.
>
> -AVM
>
> On 10/23/2014 2:42 PM, Anders Bjornerstedt wrote:
>>    osaf/services/saf/immsv/immloadd/imm_loader.cc |  21 ++++++++++++++-------
>>    1 files changed, 14 insertions(+), 7 deletions(-)
>>
>>
>> The sync retry time was changed from 150 msec to the huge value of 2
>> seconds in ticket #851 (OpenSAF 4.5). This large value may have been
>> appropriate for optimizing some variant of sync (?) but it is not
>> optimal in general.
>>
>> The retry time is reverted back to the milliseconds level, but with the
>> value increasing on each retry up to a max retry time of 0.5 seconds.
>>
>> diff --git a/osaf/services/saf/immsv/immloadd/imm_loader.cc 
>> b/osaf/services/saf/immsv/immloadd/imm_loader.cc
>> --- a/osaf/services/saf/immsv/immloadd/imm_loader.cc
>> +++ b/osaf/services/saf/immsv/immloadd/imm_loader.cc
>> @@ -2516,17 +2516,24 @@ int syncObjectsOfClass(std::string class
>>        while (err == SA_AIS_OK)
>>        {
>>            int retries = 0;
>> +    useconds_t usec = 10000;
>>    
>>            do
>>            {
>>                if(retries) {
>> -                /* If we receive TRY_AGAIN while sync is in progress, it means
>> -                   IMMD might have reached IMMSV_DEFAULT_FEVS_MAX_PENDING fevs_replies_pending.
>> -                   In general fevs_replies_pending is hit when messages have accumulated in the sender queue
>> -                   (most likely because the receiver disconnected but the sender link is still within the TIPC link tolerance of 1.5 sec).
>> -                   So give enough time to recover, since sync messages are not priority messages and this case is likely with multicast messaging.
>> -                */
>> -                sleep(2);
>> +                /* TRY_AGAIN while sync is in progress means *this* IMMND most likely has reached IMMSV_DEFAULT_FEVS_MAX_PENDING.
>> +                   This means that *this* IMMND has sent its quota of fevs messages to IMMD without having received them back via
>> +                   broadcast from IMMD.
>> +
>> +                   Thus fevs_replies_pending will be hit when messages accumulate either in local send queues
>> +                   at the MDS/TIPC/TCP level; OR at the IMMD's receiving queue; OR at this IMMND's MDS/TIPC/TCP receiving
>> +                   queue (when the fevs message comes back). The most likely case is the IMMD receive buffers.
>> +
>> +                   IMMD is in general the fevs bottleneck in the system, since it is one process serving an open-ended number
>> +                   of IMMND clients. The larger the cluster, the higher the risk of hitting fevs_max_pending at IMMNDs.
>> +                */
>> +                usleep(usec);
>> +                if(usec < 500000) { usec = usec*2; } /* Increase wait time exponentially up to 0.5 seconds */
>>                }
>>          /* Synchronous for throttling sync */
>>          err = saImmOmSearchNext_2(searchHandle, &objectName, 
>> &attributes);
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Opensaf-devel mailing list
>> Opensaf-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>

