The cause of the problem is that the "discard_node" message, sent from the
active IMMD to all IMMNDs over FEVS (MDS broadcast), has not yet reached the
IMMND at the new active, where the CLMD is trying to set implementer.
This is a timing issue due to the non-homogeneous communication mechanisms in
OpenSAF.
This can be fixed in various ways.

1) The IMMD could postpone its reply to the new-active order from AMF until
the discard_node has reached the local IMMND and the local IMMND has sent a
new confirm message back to the IMMD. This is quite a complicated solution and
will in general slow down failover a bit.

2) The AMFD at the new active would itself have to set implementer for the AMF
first, and it could postpone invoking new active on the other directors until
it has itself succeeded. The AMFD would then need to cope with getting
ERR_EXIST and treat it the same way it treats TRY_AGAIN in this particular
context.

3) If for some reason it is preferable for the AMFD implementation to do its
implementer set later, it could invoke new active on one chosen service before
the others (e.g. CLMD) and have CLMD set implementer, coping with ERR_EXIST in
the new-active context, before replying to the AMFD.

4) All services could treat getting ERR_EXIST on implementerSet in the context
of failover the same way as getting TRY_AGAIN.

I would recommend (2) or (3) as the (from my perspective) simplest solutions.

In general, it is safe to treat ERR_EXIST (or any error that has the semantics
of "nothing was done") as TRY_AGAIN.
That is, nothing bad can happen simply because the request is tried again.
Of course the retry may be futile in most other contexts. When there is
another healthy OI occupying the implementer-name, the retry loop would run
all the way to completion, making the whole retry exercise both pointless and
a delay for other meaningful tasks. But in the particular case where a service
knows that "this is a failover and I am the new active", that service also
knows that retrying on ERR_EXIST from implementerSet will not be futile unless
something is seriously wrong with the cluster.
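
The retry argument above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not OpenSAF code: the names ais_rc_t,
implementer_set_fn, toy_immnd_t, toy_implementer_set and
implementer_set_with_retry are hypothetical, the callback stands in for the
real saImmOiImplementerSet call, and the toy IMMND simply keeps the old
implementer registered for a fixed number of attempts to mimic a late
discard_node. The numeric values mirror SaAisErrorT (rc:14 in the syslog is
ERR_EXIST).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins mirroring the SaAisErrorT values (assumption: a real
 * service would use the constants from the SA Forum AIS headers). */
typedef enum {
    AIS_OK = 1,
    AIS_ERR_TRY_AGAIN = 6,
    AIS_ERR_EXIST = 14
} ais_rc_t;

/* Hypothetical callback type wrapping the real implementer-set call. */
typedef ais_rc_t (*implementer_set_fn)(void *ctx);

/* Hypothetical toy model of the local IMMND: the old implementer stays
 * registered until the discard_node message has been processed. */
typedef struct {
    int calls_until_discard; /* attempts that arrive before discard_node */
} toy_immnd_t;

ais_rc_t toy_implementer_set(void *ctx)
{
    toy_immnd_t *immnd = ctx;
    if (immnd->calls_until_discard > 0) {
        immnd->calls_until_discard--;
        return AIS_ERR_EXIST; /* old implementer not yet discarded */
    }
    return AIS_OK;
}

/* Retry implementer-set up to max_attempts times. TRY_AGAIN is always
 * retried; ERR_EXIST is retried only when the caller knows it is the new
 * active in a failover. The number of calls made is stored in
 * *attempts_used when that pointer is non-NULL. */
ais_rc_t implementer_set_with_retry(implementer_set_fn set_fn, void *ctx,
                                    bool in_failover, int max_attempts,
                                    int *attempts_used)
{
    assert(set_fn != NULL && max_attempts > 0);
    ais_rc_t rc = AIS_ERR_TRY_AGAIN;
    int n = 0;
    while (n < max_attempts) {
        rc = set_fn(ctx);
        n++;
        bool retryable = (rc == AIS_ERR_TRY_AGAIN) ||
                         (in_failover && rc == AIS_ERR_EXIST);
        if (!retryable)
            break;
        /* A real service would sleep briefly here before the next attempt. */
    }
    if (attempts_used != NULL)
        *attempts_used = n;
    return rc;
}
```

With in_failover set, the loop rides out the window before discard_node
arrives; without it, ERR_EXIST is returned immediately, so a healthy OI
holding the name elsewhere does not cause a pointless retry loop.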

/AndersBj

________________________________
From: Mathi Naickan [mailto:mathi-naic...@users.sf.net]
Sent: den 28 oktober 2013 17:34
To: [opensaf:tickets]
Subject: [tickets] [opensaf:tickets] #528 order of service not guaranteed 
during failover


Well, before proceeding any further, there are questions... to identify a 
probably not yet uncovered real root cause (and that would also help in 
prioritizing this ticket)

1) The log snippet in this ticket indicates that IMM has been notified of a
failover before the implementerClear reached IMM.
So, when IMM has already received the failover indication, the IMM
implementation ought to be able to handle the implementer clear.
Why has that not happened? Is it because IMM waits for IMMA down? If so, is it
because IMMA down has not yet reached this IMMND?

This is not a situation where I would expect the IMM clients to be shown an
ERR_EXIST.

2) Are #528 and #599 really the same scenario? Or
is it that in the case of #599 the IMM has not received the failover indication
before the implementerSet? In that case we could call this a scenario born out
of a "timing" delay created somewhere in the stack...

Questions apart, I am also thinking about how the dependencies (instantiation
and csidep) among the middleware components exist today, or could be changed,
such that IMM is ready first before a csiset is delivered to IMM clients
(middleware components)!

________________________________

[tickets:#528]<http://sourceforge.net/p/opensaf/tickets/528/> order of service 
not guaranteed during failover

Status: unassigned
Created: Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
Last Updated: Mon Oct 21, 2013 07:33 AM UTC
Owner: nobody

The issue is seen on changeset 4325 on SLES 4 node VMs.

SC-1 is active, SC-2 is standby. Failover is triggered by killing FMD on SC-1.

Jul 29 21:28:18 SLES-64BIT-SLOT1 root: killing osaffmd from invoke_failover.sh
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: NO 
'safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: ER 
safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131343, SupervisionTime = 60
Jul 29 21:28:18 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; 
timeout=60

SC-2 tried becoming Active but failed since CLMD reported ERR_EXIST on 
implementer set. The reason is IMMND has not yet disconnected the old 
implementer on 2010f. The following is the syslog which shows the sequence.

Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408188] TIPC: Resetting link 
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408194] TIPC: Lost link 
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408198] TIPC: Lost contact with 
<1.1.1>
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMND DOWN on active 
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS 
message:283106
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafrded[2332]: NO rde_rde_set_role: role set 
to 1
Jul 29 21:28:31 SLES-64BIT-SLOT2 osaflogd[2369]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafntfd[2379]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfd[2416]: NO FAILOVER StandBy --> Active
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet 
failed rc:14, exiting
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: NO 
'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: ER 
safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131599, SupervisionTime = 60
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; 
timeout=60
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS 
message:283107
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Global discard node 
received for nodeId:2010f pid:2382
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2353 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2350 <0, 2010f(down)> (safEvtService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2349 <0, 2010f(down)> (safLckService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2348 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2347 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2342 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2352 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2346 <0, 2010f(down)> (safSmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2343 <0, 2010f(down)> (safClmService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2344 <0, 2010f(down)> (safAmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 
2345 <0, 2010f(down)> (safLogService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMD lost contact with peer 
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs 
message 283106 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs 
message 283107 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO ACTIVE request

Implementer is disconnected at IMMND after CLMD reported ERR_EXIST and went for 
recovery. Ideally IMMND implementers on the old node should get disconnected 
first, before other OpenSAF processes try to reuse the same implementer name. 
Here the order needs to be guaranteed for the failover to always succeed. In 
this test the cluster went for a reboot. This issue is very timing dependent 
and difficult to reproduce.

AMF and IMM traces are available and can be provided on request. Currently 
attaching the syslogs. SC-2 is ahead by 7 seconds in time.

Syslog during successful failover:

Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMND DOWN on active 
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS 
message:279866
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS 
message:279867
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for 
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Global discard node 
received for nodeId:2010f pid:2372
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2323 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2320 <0, 2010f(down)> (safEvtService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2319 <0, 2010f(down)> (safLckService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2318 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2317 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2312 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2321 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2316 <0, 2010f(down)> (safSmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2314 <0, 2010f(down)> (safClmService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2313 <0, 2010f(down)> (safAmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2315 <0, 2010f(down)> (safLogService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMD lost contact with peer 
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs 
message 279866 since it has recently been resent.
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs 
message 279867 since it has recently been resent.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: NO Role: STANDBY, Node Down for 
node id: 2010f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Received Node Down for Active peer, OwnNodeId = 
131599, SupervisionTime = 60
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192084] TIPC: Resetting link 
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192093] TIPC: Lost link 
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192100] TIPC: Lost contact with 
<1.1.1>
Jul 29 21:24:46 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafrded[2338]: NO rde_rde_set_role: role set 
to 1
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaflogd[2379]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafntfd[2389]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafclmd[2403]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO New coord elected, resides 
at 2020f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO This IMMND is now the NEW 
Coord
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO STARTING persistent back 
end process.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2326 <11, 2020f> (@safAmfService2020f)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 
2324 <3, 2020f> (@safLogService)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer connected: 
2328 (safAmfService) <11, 2020f>
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO Node 'SC-1' left the cluster
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active 
DONE!

________________________________

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a 
mailing list, you can unsubscribe from the mailing list.



---

** [tickets:#528] order of service not guaranteed during failover**

**Status:** unassigned
**Created:** Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
**Last Updated:** Mon Oct 28, 2013 04:34 PM UTC
**Owner:** nobody
