Hi,

This problem was fixed after 5.0 GA in the changesets listed below. If AMFD sees that the second controller is about to join, it returns TRY_AGAIN; otherwise it returns BAD_OPERATION. Logging has also been improved.
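For a caller, the practical difference between the two return codes is whether a retry makes sense. The following is only a minimal sketch of that idea, not OpenSAF code: the numeric values are the standard SaAisErrorT codes, while `si_swap_with_retry`, the `invoke` callable, the attempt count and the delay are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable

# Standard SaAisErrorT values from the SA Forum AIS specification.
SA_AIS_OK = 1
SA_AIS_ERR_TIMEOUT = 5
SA_AIS_ERR_TRY_AGAIN = 6
SA_AIS_ERR_BAD_OPERATION = 20


def si_swap_with_retry(
    invoke: Callable[[], int],
    max_attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `invoke` until it returns something other than TRY_AGAIN.

    `invoke` is a hypothetical stand-in for whatever issues the si-swap
    admin operation and returns an SaAisErrorT value as an int.
    """
    rc = SA_AIS_ERR_TRY_AGAIN
    for attempt in range(1, max_attempts + 1):
        rc = invoke()
        if rc != SA_AIS_ERR_TRY_AGAIN:
            # OK, BAD_OPERATION, TIMEOUT, etc.: retrying will not help here.
            return rc
        if attempt < max_attempts:
            # The peer controller may still be joining; wait and retry.
            sleep(delay_s)
    return rc
```

With this shape, a BAD_OPERATION result (for example, only one assignment exists) is reported on the first attempt instead of being retried until the command times out.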
changeset:   8500:bbde06f8e302
parent:      8497:110fe121d8e2
user:        Nagendra Kumar<[email protected]>
date:        Fri Jan 06 15:58:23 2017 +0530
summary:     amfd: return BAD OP in fault cases during si swap [#1294]

changeset:   8499:c3fd1f88bca1
branch:      opensaf-5.1.x
parent:      8493:4008004e93cd
user:        Nagendra Kumar<[email protected]>
date:        Fri Jan 06 15:57:52 2017 +0530
summary:     amfd: return BAD OP in fault cases during si swap [#1294]

changeset:   8498:9f8c22df842e
branch:      opensaf-5.0.x
parent:      8492:fd59da278b9c
user:        Nagendra Kumar<[email protected]>
date:        Fri Jan 06 15:55:35 2017 +0530
summary:     amfd: return BAD OP in fault cases during si swap [#1294]

Thanks
Praveen

On 20-Mar-17 7:13 PM, David Hoyt wrote:
> If I stop the opensaf controller on SC-2 and then issue the si-swap command,
> it takes 60 seconds before the request times-out:
>
> [root@sc-1 ~]# date ; amf-adm si-swap safSi=SC-2N,safApp=OpenSAF; date
> Mon Mar 20 09:37:51 EDT 2017
> error - command timed out (alarm)
> Mon Mar 20 09:38:51 EDT 2017
> [root@sc-1 ~]# cd
> [root@sc-1 ~]#
>
> Corresponding logs:
>
> Mar 20 09:37:51 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> Mar 20 09:37:52 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> ...
> Mar 20 09:38:48 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> Mar 20 09:38:49 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
> Mar 20 09:38:50 sc-1 osafamfd[14017]: ER safSi=SC-2N,safApp=OpenSAF SWAP failed - only one assignment
>
> -David
>
> From: David Hoyt
> Sent: Monday, March 20, 2017 9:37 AM
> To: Alex Jones <[email protected]>; Neelakanta Reddy <[email protected]>; [email protected]
> Subject: RE: [users] si-swap opensaf SUs results in error but the action still completes
>
> We're using 4.6.0.
>
> David
>
> -----Original Message-----
> From: Alex Jones
> Sent: Monday, March 20, 2017 9:33 AM
> To: David Hoyt <[email protected]>; Neelakanta Reddy <[email protected]>; [email protected]
> Subject: Re: [users] si-swap opensaf SUs results in error but the action still completes
>
> What version are you running?
>
> Alex
>
> On 03/20/2017 09:19 AM, David Hoyt wrote:
> >> Correction, I believe the default time-out is 60 seconds, not 10.
> >>
> >> Regards,
> >> David
> >>
> >> From: David Hoyt
> >> Sent: Monday, March 20, 2017 9:19 AM
> >> To: Alex Jones <[email protected]>; Neelakanta Reddy <[email protected]>; [email protected]
> >> Subject: RE: [users] si-swap opensaf SUs results in error but the action still completes
> >>
> >> Alex, isn't the default time-out 10 seconds?
> >> If so, then why did immnd time-out ~7 seconds later?
> >>
> >> Mar 14 11:31:41 sb117vm0 osafamfd[21236]: NO safSi=SC-2N,safApp=OpenSAF Swap initiated
> >>
> >> ...
> >>
> >> Mar 14 11:31:48 sb117vm0 osafimmnd[21104]: WA Timeout on syncronous admin operation 1
> >>
> >> Regards,
> >> David
> >>
> >> -----Original Message-----
> >> From: Alex Jones
> >> Sent: Saturday, March 18, 2017 9:41 AM
> >> To: David Hoyt <[email protected]>; Neelakanta Reddy <[email protected]>; [email protected]
> >> Subject: RE: [users] si-swap opensaf SUs results in error but the action still completes
> >>
> >> David,
> >>
> >> You can pass "-t <timeout in seconds>" to "amf-adm" to set the timeout to whatever you want.
> >>
> >> Alex
> >>
> >> ________________________________________
> >> From: David Hoyt [[email protected]]
> >> Sent: Friday, March 17, 2017 9:35 AM
> >> To: Neelakanta Reddy; [email protected]
> >> Subject: Re: [users] si-swap opensaf SUs results in error but the action still completes
> >>
> >> Hi Neel,
> >>
> >> The purpose of the test is to see if our system can continue to run "normally" when in a geographical configuration.
> >> That is, both SCs are NOT co-located, but reside thousands of km apart.
> >> This is simulated in the lab by adding a delay between the two servers which host the SCs.
> >>
> >> What we're seeing is that when the delay is increased to a certain value, the si-swap command between the two OpenSAF SUs results in an error.
> >>
> >> [root@sb117vm0 ~]# date ; amf-adm si-swap safSi=SC-2N,safApp=OpenSAF;
> >> Tue Mar 14 11:31:41 EDT 2017
> >> error - saImmOmAdminOperationInvoke_2 FAILED: SA_AIS_ERR_TIMEOUT (5)
> >>
> >> However, the logs show that the action actually completes about 2 seconds after the timeout.
> >>
> >> Mar 14 11:31:48 sb117vm0 osafimmnd[21104]: WA Timeout on syncronous admin operation 1
> >> Mar 14 11:31:50 sb117vm0 osafimmnd[21104]: NO Implementer disconnected 67 <0, 2020f> (@safAmfService2020f)
> >> Mar 14 11:31:50 sb117vm0 osafimmnd[21104]: NO Implementer connected: 72 (safAmfService) <0, 2020f>
> >> Mar 14 11:31:50 sb117vm0 osafamfd[21236]: NO Switching Quiesced --> StandBy
> >> Mar 14 11:31:50 sb117vm0 osafrded[21057]: NO RDE role set to STANDBY
> >> Mar 14 11:31:50 sb117vm0 osafamfd[21236]: NO Controller switch over done
> >>
> >> I'm trying to determine if there's some way to delay the immnd time-out so that the si-swap command returns success.
> >>
> >> Regards,
> >> David
> >>
> >> From: Neelakanta Reddy [mailto:[email protected]]
> >> Sent: Friday, March 17, 2017 7:10 AM
> >> To: David Hoyt <[email protected]>; [email protected]
> >> Subject: Re: [users] si-swap opensaf SUs results in error but the action still completes
> >>
> >> ________________________________
> >> NOTICE: This email was received from an EXTERNAL sender
> >> ________________________________
> >>
> >> Hi,
> >>
> >> comments inline.
> >>
> >> On 2017/03/16 07:33 PM, David Hoyt wrote:
> >>> Some additional info.
> >>>
> >>> I found out that the users were testing in a lab that had a delay between the two SC nodes. The delay was added for geographical redundancy testing.
> >>> Once the time was reduced, the timeout error for the opensaf swap went away.
> >>>
> >>> In looking through the osafimmnd log file, I see the following:
> >>> Mar 14 11:31:48.320965 osafimmnd [21104:ImmModel.cc:12042] T5 Forcing Adm Req continuation to expire 609885356033
> >>> ...
> >>> Mar 14 11:31:48.601903 osafimmnd [21104:ImmModel.cc:12437] T5 Timeout on AdministrativeOp continuation 609885356033 tmout:1
> >>> Mar 14 11:31:48.601952 osafimmnd [21104:ImmModel.cc:11311] T5 REQ ADM CONTINUATION 5069295 FOUND FOR 609885356033
> >>> Mar 14 11:31:48.601987 osafimmnd [21104:immnd_proc.c:1086] WA Timeout on syncronous admin operation 1
> >>>
> >>> The code around line 12042 of file ImmModel.cc is as follows:
> >>>
> >>> 12040 for(ci2=sAdmReqContinuationMap.begin(); ci2!=sAdmReqContinuationMap.end(); ++ci2) {
> >>> 12041   if((ci2->second.mTimeout) && (ci2->second.mImplId == implHandle)) {
> >>> 12042     TRACE_5("Forcing Adm Req continuation to expire %llu", ci2->first);
> >>> 12043     ci2->second.mTimeout = 1; /* one second is minimum timeout. */
> >>> 12044   }
> >>> 12045 }
> >>>
> >>> Right after the log at line 12042 is generated, the timeout value is updated to 1 second (line 12043).
> >> The node where the adminoperation is targeted went down from OpenSAF perspective.
> >> Then the minimum timeout of 1 second is updated.
> >>> Can I increase this to 2 seconds?
> >> OpenSAF, noted the other node as down, increasing to 2 seconds what additional benefit can be achieved?
> >>
> >>> If so, would it cause any badness?
> >> Explain, what is the end result you are targeting.
> >>
> >> Regards,
> >> Neel.
> >>>
> >>> Regards,
> >>> David
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Opensaf-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-users
