Hi,

This problem was fixed after 5.0 GA in the changesets listed below.
If AMFD sees that the second controller is about to join, it returns
TRY_AGAIN; otherwise it returns BAD_OPERATION. Logging has also been
improved.
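A minimal sketch of the decision described above, written in C++ like the ImmModel.cc excerpt later in this thread. The enum, struct and function names are hypothetical, and this is not the code from changeset 8500. It assumes only what the "SWAP failed - only one assignment" log implies: a swap needs both an active and a standby assignment.

```cpp
#include <cassert>

// Hypothetical stand-ins for the SA_AIS_* return codes discussed here;
// the real definitions live in the SAF/OpenSAF headers.
enum class SwapResult { Ok, TryAgain, BadOperation };

// Hypothetical summary of what AMFD knows when an si-swap request arrives.
struct SwapContext {
  int assignments;               // assumed: active + standby assignments of the SI
  bool peer_controller_joining;  // second controller is about to join
};

// Post-fix behaviour as described in the mail: with only one assignment,
// answer TRY_AGAIN if the peer controller is expected soon, otherwise
// fail fast with BAD_OPERATION.
SwapResult check_si_swap(const SwapContext& ctx) {
  assert(ctx.assignments >= 0);
  if (ctx.assignments >= 2) return SwapResult::Ok;
  return ctx.peer_controller_joining ? SwapResult::TryAgain
                                     : SwapResult::BadOperation;
}
```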

changeset:   8500:bbde06f8e302
parent:      8497:110fe121d8e2
user:        Nagendra Kumar<[email protected]>
date:        Fri Jan 06 15:58:23 2017 +0530
summary:     amfd: return BAD OP in fault cases during si swap [#1294]

changeset:   8499:c3fd1f88bca1
branch:      opensaf-5.1.x
parent:      8493:4008004e93cd
user:        Nagendra Kumar<[email protected]>
date:        Fri Jan 06 15:57:52 2017 +0530
summary:     amfd: return BAD OP in fault cases during si swap [#1294]

changeset:   8498:9f8c22df842e
branch:      opensaf-5.0.x
parent:      8492:fd59da278b9c
user:        Nagendra Kumar<[email protected]>
date:        Fri Jan 06 15:55:35 2017 +0530
summary:     amfd: return BAD OP in fault cases during si swap [#1294]




Thanks
Praveen

On 20-Mar-17 7:13 PM, David Hoyt wrote:
> If I stop the OpenSAF controller on SC-2 and then issue the si-swap command,
> it takes 60 seconds before the request times out:
>
>
> [root@sc-1 ~]# date ; amf-adm si-swap safSi=SC-2N,safApp=OpenSAF; date
> Mon Mar 20 09:37:51 EDT 2017
> error - command timed out (alarm)
> Mon Mar 20 09:38:51 EDT 2017
> [root@sc-1 ~]# cd
> [root@sc-1 ~]#
>
>
> Corresponding logs:
>
> Mar 20 09:37:51 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> Mar 20 09:37:52 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> ...
> Mar 20 09:38:48 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> Mar 20 09:38:49 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> Mar 20 09:38:50 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
>
>
> -David
>
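The once-per-second "SWAP failed" lines that span the full 60 seconds are consistent with something on the client side re-sending the request while it is answered with TRY_AGAIN, then giving up at its deadline. The sketch below is a hypothetical model of that behaviour, not amf-adm's actual code. Time is simulated so the model can be checked without sleeping. Under this model, a BAD_OPERATION answer (the post-fix behaviour in the fault case) ends the loop on the first attempt instead of after the whole timeout.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result codes named after the SA_AIS_* values in this thread.
enum class AisResult { Ok, TryAgain, BadOperation, Timeout };

struct Outcome {
  AisResult result;
  int elapsed_s;  // simulated seconds spent before returning
};

// Re-sends the request once per interval while it is answered with
// TRY_AGAIN, and gives up with Timeout when the deadline is reached.
// Any other answer (success or a hard error) is returned immediately.
Outcome invoke_with_retry(const std::function<AisResult()>& request,
                          int timeout_s, int interval_s) {
  assert(interval_s > 0);
  int elapsed = 0;
  while (elapsed < timeout_s) {
    const AisResult r = request();
    if (r != AisResult::TryAgain) return {r, elapsed};
    elapsed += interval_s;  // stand-in for sleeping before the next attempt
  }
  return {AisResult::Timeout, elapsed};
}
```

With a 60 s timeout and a 1 s interval, a request that always answers TRY_AGAIN is sent 60 times before the call reports a timeout.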
> From: David Hoyt
> Sent: Monday, March 20, 2017 9:37 AM
> To: Alex Jones <[email protected]>; Neelakanta Reddy 
> <[email protected]>; [email protected]
> Subject: RE: [users] si-swap opensaf SUs results in error but the action 
> still completes
>
> We're using 4.6.0.
>
> David
>
> -----Original Message-----
> From: Alex Jones
> Sent: Monday, March 20, 2017 9:33 AM
> To: David Hoyt <[email protected]>; Neelakanta Reddy
> <[email protected]>; [email protected]
> Subject: Re: [users] si-swap opensaf SUs results in error but the action
> still completes
>
> What version are you running?
>
> Alex
>
> On 03/20/2017 09:19 AM, David Hoyt wrote:
>
>> Correction: I believe the default time-out is 60 seconds, not 10.
>>
>> Regards,
>>
>> David
>>
>> From: David Hoyt
>> Sent: Monday, March 20, 2017 9:19 AM
>> To: Alex Jones <[email protected]>; Neelakanta Reddy
>> <[email protected]>; [email protected]
>> Subject: RE: [users] si-swap opensaf SUs results in error but the
>> action still completes
>>
>> Alex, isn't the default time-out 10 seconds?
>>
>> If so, then why did immnd time out ~7 seconds later?
>>
>> Mar 14 11:31:41 sb117vm0 osafamfd[21236]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
>> ...
>> Mar 14 11:31:48 sb117vm0 osafimmnd[21104]: WA Timeout on syncronous admin operation 1
>>
>> Regards,
>>
>> David
>>
>> -----Original Message-----
>> From: Alex Jones
>> Sent: Saturday, March 18, 2017 9:41 AM
>> To: David Hoyt <[email protected]>; Neelakanta Reddy
>> <[email protected]>; [email protected]
>> Subject: RE: [users] si-swap opensaf SUs results in error but the
>> action still completes
>>
>> David,
>>
>> You can pass "-t <timeout in seconds>" to "amf-adm" to set the
>> timeout to whatever you want.
>>
>> Alex
>>
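The SA Forum AIS defines SaTimeT as a 64-bit count of nanoseconds, so a timeout given in seconds has to be scaled before it reaches an IMM admin-operation call. This is a minimal sketch that assumes amf-adm performs such a conversion and that its default is the 60 seconds observed in this thread. The helper name is hypothetical, and SaTimeT is redefined locally so the block stands alone.

```cpp
#include <cassert>
#include <cstdint>

// SaTimeT in the SA Forum AIS counts nanoseconds (SA_TIME_ONE_SECOND = 10^9).
using SaTimeT = std::int64_t;
constexpr SaTimeT kOneSecond = 1000000000LL;

// Hypothetical helper: converts a "-t <timeout in seconds>" value into the
// nanosecond timeout handed to the admin operation. The 60 s default is the
// value observed in this thread, not taken from the amf-adm source.
SaTimeT admin_op_timeout(std::int64_t seconds = 60) {
  assert(seconds > 0);
  return seconds * kOneSecond;
}
```

A larger -t appears only to lengthen the client's wait. The immnd trace further down shows the pending request being forced to a 1-second timeout once the target node is seen as down, and that happens regardless of the original timeout value.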
>> ________________________________________
>> From: David Hoyt [[email protected]]
>> Sent: Friday, March 17, 2017 9:35 AM
>> To: Neelakanta Reddy; [email protected]
>> Subject: Re: [users] si-swap opensaf SUs results in error but the
>> action still completes
>>
>> Hi Neel,
>>
>> The purpose of the test is to see whether our system can continue to run
>> "normally" in a geographically redundant configuration.
>> That is, the two SCs are NOT co-located but reside thousands of km apart.
>> This is simulated in the lab by adding a delay between the two servers
>> that host the SCs.
>>
>> What we're seeing is that when the delay is increased to a certain
>> value, the si-swap command between the two OpenSAF SUs results in an error.
>>
>> [root@sb117vm0 ~]# date ; amf-adm si-swap safSi=SC-2N,safApp=OpenSAF;
>> Tue Mar 14 11:31:41 EDT 2017
>> error - saImmOmAdminOperationInvoke_2 FAILED: SA_AIS_ERR_TIMEOUT (5)
>>
>> However, the logs show that the action actually completes about 2
>> seconds after the timeout.
>>
>> Mar 14 11:31:48 sb117vm0 osafimmnd[21104]: WA Timeout on syncronous admin operation 1
>> Mar 14 11:31:50 sb117vm0 osafimmnd[21104]: NO Implementer disconnected 67 <0, 2020f> (@safAmfService2020f)
>> Mar 14 11:31:50 sb117vm0 osafimmnd[21104]: NO Implementer connected: 72 (safAmfService) <0, 2020f>
>> Mar 14 11:31:50 sb117vm0 osafamfd[21236]: NO Switching Quiesced --> StandBy
>> Mar 14 11:31:50 sb117vm0 osafrded[21057]: NO RDE role set to STANDBY
>> Mar 14 11:31:50 sb117vm0 osafamfd[21236]: NO Controller switch over done
>>
>> I'm trying to determine whether there is some way to delay the immnd
>> timeout so that the si-swap command returns success.
>>
>> Regards,
>>
>> David
>>
>> From: Neelakanta Reddy [mailto:[email protected]]
>> Sent: Friday, March 17, 2017 7:10 AM
>> To: David Hoyt <[email protected]>; [email protected]
>> Subject: Re: [users] si-swap opensaf SUs results in error but the
>> action still completes
>>
>> Hi,
>>
>> Comments inline.
>>
>> On 2017/03/16 07:33 PM, David Hoyt wrote:
>>
>>> Some additional info.
>>>
>>> I found out that the users were testing in a lab that had a delay
>>> between the two SC nodes. The delay was added for geographical
>>> redundancy testing.
>>> Once the delay was reduced, the timeout error for the OpenSAF swap
>>> went away.
>>>
>>> In looking through the osafimmnd log file, I see the following:
>>>
>>> Mar 14 11:31:48.320965 osafimmnd [21104:ImmModel.cc:12042] T5 Forcing Adm Req continuation to expire 609885356033
>>> ...
>>> Mar 14 11:31:48.601903 osafimmnd [21104:ImmModel.cc:12437] T5 Timeout on AdministrativeOp continuation 609885356033 tmout:1
>>> Mar 14 11:31:48.601952 osafimmnd [21104:ImmModel.cc:11311] T5 REQ ADM CONTINUATION 5069295 FOUND FOR 609885356033
>>> Mar 14 11:31:48.601987 osafimmnd [21104:immnd_proc.c:1086] WA Timeout on syncronous admin operation 1
>>>
>>> The code around line 12042 of file ImmModel.cc is as follows:
>>>
>>> 12040 for(ci2=sAdmReqContinuationMap.begin(); ci2!=sAdmReqContinuationMap.end(); ++ci2) {
>>> 12041   if((ci2->second.mTimeout) && (ci2->second.mImplId == implHandle)) {
>>> 12042     TRACE_5("Forcing Adm Req continuation to expire %llu", ci2->first);
>>> 12043     ci2->second.mTimeout = 1; /* one second is minimum timeout. */
>>> 12044   }
>>> 12045 }
>>>
>>> Right after the log at line 12042 is generated, the timeout value is
>>> updated to 1 second (line 12043).
>>
>> The node that the admin operation is targeted at went down from the
>> OpenSAF perspective, so the timeout is set to the minimum of 1 second.
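This is a self-contained C++ model of the loop at lines 12040-12045. The struct and map are simplified, hypothetical stand-ins for the IMM internals and model only the two fields the loop reads. The timeout is assumed to be in seconds, as the "one second is minimum timeout" comment suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-in for an IMM admin-request continuation;
// only the two fields read by the quoted loop are modelled.
struct AdmReqContinuation {
  std::uint32_t mTimeout;  // assumed: remaining timeout in seconds, 0 = none
  std::uint32_t mImplId;   // implementer expected to answer the request
};

// Keyed by continuation id, like sAdmReqContinuationMap in the excerpt.
using ContinuationMap = std::map<std::uint64_t, AdmReqContinuation>;

// Mirrors lines 12040-12045: each pending request that has a timeout and is
// waiting on implHandle is cut to the 1-second minimum. Returns how many
// continuations were forced (the real loop emits a TRACE_5 for each).
std::size_t force_expire_for_implementer(ContinuationMap& pending,
                                         std::uint32_t implHandle) {
  std::size_t forced = 0;
  for (auto& entry : pending) {
    AdmReqContinuation& c = entry.second;
    if (c.mTimeout != 0 && c.mImplId == implHandle) {
      c.mTimeout = 1;  // one second is the minimum timeout
      ++forced;
    }
  }
  return forced;
}
```

In this model, every pending request waiting on the departed implementer is shortened to 1 second regardless of its original timeout. Requests for other implementers, and requests without a timeout, are left alone. This matches the trace above, where the forced expiry is followed within the same second by "Timeout on AdministrativeOp continuation ... tmout:1".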
>>
>>> Can I increase this to 2 seconds?
>>
>> OpenSAF has already noted the other node as down; what additional
>> benefit would increasing this to 2 seconds achieve?
>>
>>> If so, would it cause any badness?
>>
>> Explain what end result you are targeting.
>>
>> Regards,
>>
>> Neel.
>>
>>>
>>> Regards,
>>>
>>> David
>>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Opensaf-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>
