Comments inline:

> -----Original Message-----
> From: Anders Bjornerstedt [mailto:ander...@users.sf.net]
> Sent: Tuesday, October 29, 2013 1:29 PM
> To: [opensaf:tickets]
> Subject: [tickets] [opensaf:tickets] Re: #528 order of service not guaranteed
> during failover
> 
> The cause of the problem is that the "discard_node" message from the
> active IMMD to all IMMNDs over FEVS (MDS broadcast) has not reached the
> IMMND at new-active where the CLMD is trying to set implementer.
> This is a timing issue due to the non-homogeneous communication mechanism
> in OpenSAF.
> This can be fixed in various ways.
> 1) The IMMD could postpone its reply to the new-active order from AMF
> until the discard_node has reached the local IMMND and the local IMMND
> has sent a new confirm message back to the IMMD. This is quite a
> complicated solution and will in general slow down failover a bit.

[Mathi]
This would be the best-fit solution; I was just not sure about the complexity,
but see the comment below!

> 2) The AMFD at new active would itself have to set the implementer for AMF,
> and it could postpone invoking new active on the other directors until it has
> itself succeeded. The AMFD would then itself need to cope with getting
> ERR_EXIST and treat it the same way it treats TRY_AGAIN in this particular
> context.
> 3) If for some reason it is preferable for the AMFD implementation to do its
> implementer set later, it could invoke new active on one chosen service
> before the others (e.g. CLMD) and have CLMD set the implementer and cope
> with ERR_EXIST in the new-active context, before replying to the AMFD.

[Mathi]
All pre-AMF OpenSAF directors (except FM and RDE) are already configured to be
dependent on the IMMD, i.e. the AMF CSI dependencies for all directors are
already configured such that they are assigned active only after the IMMD,
i.e. only after the IMMD responds to the AMF active CSI-set callback.

So another way to fix this could be: if we can make any adjustments or
improvements (without added complication) to the way the IMMD responds to the
active callback, then the following should help make things come in the right
order:

- Make the director services perform the implementerSet in the AMF callback
rather than in the RDA callback during failover. The director services receive
the RDA callback first (with the one probable exception of a failover in the
middle of a switchover), so in the RDA callbacks they would already have set
themselves active, i.e. they would be ready and servicing their other core job!
(See the sketch after the note below.)

Note: One catch in this approach is that SMF does not subscribe to the RDA
callback!
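
For illustration only, a minimal sketch of what this could look like in a
CLMD-like director (the handle names, the implementer-name constant and the
back-off value are assumptions, not existing OpenSAF code): the implementer is
claimed inside the AMF CSI-set callback for the ACTIVE assignment, with the
usual TRY_AGAIN retry loop, and AMF gets its response only afterwards.

#include <unistd.h>      /* usleep */
#include <saAmf.h>
#include <saImmOi.h>

/* Hypothetical handles, owned and initialized elsewhere by the director. */
extern SaAmfHandleT   amf_hdl;
extern SaImmOiHandleT imm_oi_hdl;

static const SaImmOiImplementerNameT implementer_name =
    (SaImmOiImplementerNameT)"safClmService";

static void csi_set_callback(SaInvocationT invocation,
                             const SaNameT *comp_name,
                             SaAmfHAStateT ha_state,
                             SaAmfCSIDescriptorT csi_desc)
{
    SaAisErrorT rc = SA_AIS_OK;
    (void)comp_name; (void)csi_desc;

    if (ha_state == SA_AMF_HA_ACTIVE) {
        /* Claim the OI implementer name here, in the AMF active assignment,
         * instead of in the earlier RDA role callback. */
        do {
            rc = saImmOiImplementerSet(imm_oi_hdl, implementer_name);
            if (rc == SA_AIS_ERR_TRY_AGAIN)
                usleep(500 * 1000);   /* 500 ms back-off */
        } while (rc == SA_AIS_ERR_TRY_AGAIN);
    }

    /* Reply to AMF only after the implementer-set attempt has finished. */
    saAmfResponse(amf_hdl, invocation,
                  rc == SA_AIS_OK ? SA_AIS_OK : SA_AIS_ERR_FAILED_OPERATION);
}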

> 4) All services could treat getting ERR_EXIST on implementerSet in the
> context of failover as getting TRY_AGAIN.
> I would recommend (2) or (3) as the (from my perspective) simplest
> solutions.
> In general, it is safe to treat ERR_EXIST (or any error that has the
> semantics of "nothing was done") as TRY_AGAIN.

[Mathi]
One problem with treating ERR_EXIST (in failover) the same as TRY_AGAIN is that
one can't differentiate this case from distributed applications that are
wrongly designed in the way they do their implementerSets! They might just keep
repeating a periodic try-again loop and never succeed, and all that time the
middleware would never have given a hint of what was going wrong!
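
To make the difference concrete, here is a hypothetical bounded-retry sketch
(the helper name, the failover flag and the 10-second budget are assumptions,
not existing OpenSAF code). ERR_EXIST is retried like TRY_AGAIN only when the
caller knows it is the new active in a failover, and only for a limited time;
after that it is surfaced as a real error instead of being hidden in an
endless try-again loop.

#include <stdbool.h>
#include <time.h>
#include <unistd.h>
#include <syslog.h>
#include <saImmOi.h>

/* Hypothetical helper: retry implementerSet, treating ERR_EXIST like
 * TRY_AGAIN only when the caller is the new active in a failover, and only
 * within a bounded time budget. */
static SaAisErrorT implementer_set_with_retry(SaImmOiHandleT oi_hdl,
                                              SaImmOiImplementerNameT name,
                                              bool failover_new_active)
{
    const time_t deadline = time(NULL) + 10;   /* assumed 10 s budget */
    SaAisErrorT rc;

    for (;;) {
        rc = saImmOiImplementerSet(oi_hdl, name);

        bool retry = (rc == SA_AIS_ERR_TRY_AGAIN) ||
                     (rc == SA_AIS_ERR_EXIST && failover_new_active);

        if (!retry || time(NULL) >= deadline)
            break;

        usleep(500 * 1000);   /* 500 ms back-off */
    }

    if (rc == SA_AIS_ERR_EXIST)
        /* Either the old implementer was never discarded, or another OI is
         * (wrongly) using the same name - make it visible instead of hiding
         * it in a silent retry loop. */
        syslog(LOG_WARNING, "implementerSet(%s) still returns ERR_EXIST "
               "after the retry budget", name);

    return rc;
}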

> That is, nothing bad can happen simply because the request is tried again.
> Of course the request may be futile in most other contexts. When there is
> another healthy OI occupying the implementer name, the retry loop would
> run all the way to completion, making the whole retry exercise both
> pointless and a delay to other meaningful tasks. But in the particular case
> where a service knows that "this is a failover and I am the new active",
> that service also knows that treating ERR_EXIST on implementerSet as
> TRY_AGAIN will not be futile unless something is seriously wrong with the
> cluster.
> /AndersBj
> ________________________________________
> From: Mathi Naickan [mailto:mathi-naic...@users.sf.net]
> Sent: 28 October 2013 17:34
> To: [opensaf:tickets]
> Subject: [tickets] [opensaf:tickets] #528 order of service not guaranteed
> during failover
>
> Well, before proceeding any further, there are questions... to identify a
> probably not-yet-uncovered real root cause (and that would also help in
> prioritizing this ticket):
> 1) The log snippet in this ticket indicates that IMM has been notified of a
> failover before the implementerClear reached IMM.
> So, when IMM has already received the failover indication, the IMM
> implementation ought to be able to handle the implementer clear.
> Why has that not happened? Is it because IMM waits for IMMA down? If so, is
> it because IMMA down has not yet reached this IMMND?
> This is not a situation where I would expect the IMM clients to be shown an
> ERR_EXIST.
> 2) Are #528 and #599 really the same scenario? (OR) Is it that in the case of
> #599 IMM has not received the failover indication before the
> implementerSet? In which case we could call this scenario born out of a
> "timing" delay created somewhere in the stack...
> Questions apart, I am also thinking of how the dependencies (instantiation
> and CSI dep) among the middleware components exist or could be changed
> such that IMM is ready before a CSI set is delivered to the IMM clients
> (middleware components)!


---

** [tickets:#528] order of service not guaranteed during failover**

**Status:** unassigned
**Created:** Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
**Last Updated:** Mon Oct 28, 2013 04:34 PM UTC
**Owner:** nobody

The issue is seen on changeset 4325 on SLES 4 node VMs.

SC-1 is active, SC-2 is standby. Failover is triggered by killing FMD on SC-1.

Jul 29 21:28:18 SLES-64BIT-SLOT1 root: killing osaffmd from invoke_failover.sh
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: NO 
'safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: ER 
safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131343, SupervisionTime = 60
Jul 29 21:28:18 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; 
timeout=60

SC-2 tried to become active but failed, since CLMD reported ERR_EXIST on
implementerSet. The reason is that the IMMND had not yet disconnected the old
implementer on node 2010f. The following syslog shows the sequence.

Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [  101.408188] TIPC: Resetting link 
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [  101.408194] TIPC: Lost link 
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [  101.408198] TIPC: Lost contact with 
<1.1.1>
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMND DOWN on active 
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS 
message:283106
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafrded[2332]: NO rde_rde_set_role: role set 
to 1
Jul 29 21:28:31 SLES-64BIT-SLOT2 osaflogd[2369]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafntfd[2379]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfd[2416]: NO FAILOVER StandBy --> Active
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet 
failed rc:14, exiting
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: NO 
'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: ER 
safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131599, SupervisionTime = 60
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; 
timeout=60
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS 
message:283107
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Global discard node 
received for nodeId:2010f pid:2382
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2353 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2350 <0, 2010f(down)> (safEvtService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2349 <0, 2010f(down)> (safLckService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2348 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2347 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2342 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2352 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2346 <0, 2010f(down)> (safSmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2343 <0, 2010f(down)> (safClmService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2344 <0, 2010f(down)> (safAmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2345 <0, 2010f(down)> (safLogService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMD lost contact with peer 
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs 
message 283106 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs 
message 283107 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO ACTIVE request

The implementer is disconnected at the IMMND after CLMD reported ERR_EXIST and
went for recovery. Ideally, the IMMND implementers on the old node should be
disconnected before other OpenSAF processes try to reuse the same implementer
names. This order needs to be guaranteed for the failover to always succeed. In
this test the cluster went for a reboot. The issue is very timing-sensitive and
difficult to reproduce.

AMF and IMM traces are available and can be provided on request. The syslogs
are attached. Note that SC-2's clock is ahead by 7 seconds.

Syslog during successful failover:

Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMND DOWN on active 
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS 
message:279866
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS 
message:279867
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Global discard node 
received for nodeId:2010f pid:2372
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2323 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2320 <0, 2010f(down)> (safEvtService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2319 <0, 2010f(down)> (safLckService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2318 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2317 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2312 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2321 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2316 <0, 2010f(down)> (safSmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2314 <0, 2010f(down)> (safClmService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2313 <0, 2010f(down)> (safAmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2315 <0, 2010f(down)> (safLogService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMD lost contact with peer 
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs 
message 279866 since it has recently been resent.
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs 
message 279867 since it has recently been resent.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: NO Role: STANDBY, Node Down for 
node id: 2010f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Received Node Down for Active peer, OwnNodeId = 
131599, SupervisionTime = 60
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [  102.192084] TIPC: Resetting link 
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [  102.192093] TIPC: Lost link 
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [  102.192100] TIPC: Lost contact with 
<1.1.1>
Jul 29 21:24:46 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafrded[2338]: NO rde_rde_set_role: role set 
to 1
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaflogd[2379]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafntfd[2389]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafclmd[2403]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO New coord elected, resides 
at 2020f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO This IMMND is now the NEW 
Coord
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO STARTING persistent back 
end process.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2326 <11, 2020f> (@safAmfService2020f)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2324 <3, 2020f> (@safLogService)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer connected: 
2328 (safAmfService) <11, 2020f>
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO Node 'SC-1' left the cluster
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active 
DONE!


