The cause of the problem is that the "discard_node" message, sent from the active
IMMD to all IMMNDs over FEVS (MDS broadcast),
has not yet reached the IMMND at the new active, where the CLMD is trying to set
implementer.
This is a timing issue due to the non-homogeneous communication mechanism in
OpenSAF.
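To make the race concrete, here is a minimal toy model in Python. It is only a sketch: `ToyImmnd` and `failover` are hypothetical names, not OpenSAF code, and the model reduces an IMMND to a single implementer table. The numeric return codes follow SaAisErrorT (the rc:14 that CLMD logs in the ticket is ERR_EXIST).

```python
from dataclasses import dataclass, field

# Numeric values follow SaAisErrorT; rc:14 in the ticket's syslog is ERR_EXIST.
SA_AIS_OK = 1
SA_AIS_ERR_EXIST = 14


@dataclass
class ToyImmnd:
    """Hypothetical stand-in for the implementer table of one IMMND."""

    # implementer name -> node id that currently owns it
    implementers: dict = field(default_factory=dict)

    def implementer_set(self, name: str, node: str) -> int:
        owner = self.implementers.get(name)
        if owner is not None and owner != node:
            return SA_AIS_ERR_EXIST  # name is still held by another node
        self.implementers[name] = node
        return SA_AIS_OK

    def discard_node(self, node: str) -> None:
        # Drop every implementer that lived on the departed node.
        self.implementers = {
            name: owner for name, owner in self.implementers.items() if owner != node
        }


def failover(discard_first: bool) -> int:
    """Return what CLMD on 2020f gets back from its implementer set."""
    immnd = ToyImmnd({"safClmService": "2010f"})
    if discard_first:
        immnd.discard_node("2010f")
    rc = immnd.implementer_set("safClmService", "2020f")
    if not discard_first:
        immnd.discard_node("2010f")  # arrives too late for CLMD
    return rc
```

With `discard_first=True` the set succeeds. With `discard_first=False` the same request gets ERR_EXIST, even though the old implementer is gone a moment later.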
This can be fixed in various ways.
1) The IMMD could postpone its reply to the new-active order from AMF until
the discard_node has reached the local IMMND and
the local IMMND has sent a new confirm message back to the IMMD. This is quite
a complicated solution and will in general slow
down failover a bit.
2) The AMFD at the new active would itself have to set implementer for AMF, and
it could postpone invoking new active on the other directors
until it has itself succeeded. The AMFD would then need to cope with
getting ERR_EXIST and treat it the same way it treats TRY_AGAIN
in this particular context.
3) If for some reason it is preferable for the AMFD implementation to do its
implementer set later, it could invoke new active on one
chosen service before the others (e.g. CLMD) and have CLMD set implementer
and cope with ERR_EXIST in the new-active
context before replying to the AMFD.
4) All services could treat getting ERR_EXIST on implementerSet in the context
of failover as getting TRY_AGAIN.
I would recommend (2) or (3) as the (from my perspective) simplest solutions.
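As a rough illustration of the ordering in option (2), the following Python sketch gates the other directors on the AMFD's own implementer set. All names here (`amfd_new_active`, `set_amf_implementer`, `other_directors`) are hypothetical stand-ins and not the AMFD code. A real loop would also pause between attempts.

```python
# Numeric values follow SaAisErrorT.
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6
SA_AIS_ERR_EXIST = 14


def amfd_new_active(set_amf_implementer, other_directors, max_attempts=20):
    """Claim the AMF implementer name first, then invoke the other directors.

    set_amf_implementer: hypothetical callable returning an SaAisErrorT value.
    other_directors: hypothetical callables that order a director to go active.
    """
    for _ in range(max_attempts):
        rc = set_amf_implementer()
        if rc == SA_AIS_OK:
            # The local IMMND has evidently processed discard_node, so the
            # old node's implementer names are free for the other services.
            return [director() for director in other_directors]
        if rc not in (SA_AIS_ERR_TRY_AGAIN, SA_AIS_ERR_EXIST):
            raise RuntimeError(f"implementer set failed rc:{rc}")
        # A real implementation would sleep briefly here before retrying.
    raise RuntimeError("implementer name still occupied, giving up")
```

The point of the design is that no other director is asked to go active until the AMFD's own set has succeeded, so services such as CLMD would not see ERR_EXIST from the departed node.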
In general, it is safe to treat ERR_EXIST (or any error that has the semantics
of "nothing was done") as TRY_AGAIN.
That is, nothing bad can happen simply because the request is tried again.
Of course the retry may be futile in most other contexts. When there is
another healthy OI occupying the implementer-name,
the retry loop would run all the way to completion, making the whole retry
exercise both pointless and a delay to other meaningful
tasks. But in the particular case where a service knows that "this is a
failover and I am the new active", that service also
knows that retrying on ERR_EXIST from implementerSet will not be futile unless
something is seriously wrong with the cluster.
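A minimal sketch of that context-dependent retry, in Python rather than the C of the actual services. `implementer_set_with_retry`, its parameters and its defaults are assumptions made for illustration. `implementer_set` stands in for the real saImmOiImplementerSet call and returns an SaAisErrorT value.

```python
import time

# Numeric values follow SaAisErrorT.
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6
SA_AIS_ERR_EXIST = 14


def implementer_set_with_retry(implementer_set, in_failover,
                               max_attempts=50, delay=0.1, sleep=time.sleep):
    """Retry a hypothetical implementer-set callable.

    TRY_AGAIN is always retried. ERR_EXIST is retried only when the caller
    knows it is the new active in a failover, where the old implementer is
    expected to disappear shortly.
    """
    retryable = {SA_AIS_ERR_TRY_AGAIN}
    if in_failover:
        retryable.add(SA_AIS_ERR_EXIST)

    rc = implementer_set()
    attempts = 1
    while rc in retryable and attempts < max_attempts:
        sleep(delay)
        rc = implementer_set()
        attempts += 1
    return rc
```

Outside failover the helper returns ERR_EXIST after a single attempt, so a healthy OI holding the name does not cause a long, futile loop. The bound on attempts covers the case where something really is wrong with the cluster.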
/AndersBj
________________________________
From: Mathi Naickan [mailto:mathi-naic...@users.sf.net]
Sent: den 28 oktober 2013 17:34
To: [opensaf:tickets]
Subject: [tickets] [opensaf:tickets] #528 order of service not guaranteed
during failover
Well, before proceeding any further, there are questions to answer, to identify a
real root cause that is probably not yet uncovered (and that would also help in
prioritizing this ticket).
1) The log snippet in this ticket indicates that IMM has been notified of a
failover before the implementerClear reached IMM.
So, when IMM has already received the failover indication, the IMM implementation
ought to be able to handle the implementer clear.
Why has that not happened? Is it because IMM waits for IMMA down? If so, is it
because IMMA down has not yet reached this IMMND?
This is not a situation where I would expect the IMM clients to be shown an
ERR_EXIST.
2) Are #528 and #599 really the same scenario? (OR)
Is it that in the case of #599 the IMM has not received the failover indication
before the implementerSet? In which case we could call this a scenario born out
of a "timing" delay created somewhere in the stack...
Questions apart, I am also thinking about how the dependencies (instantiation and
csidep) among the middleware components currently exist, or can be changed, such
that IMM is ready first, before a csiset is delivered to IMM clients (middleware
components)!
________________________________
[tickets:#528]<http://sourceforge.net/p/opensaf/tickets/528/> order of service
not guaranteed during failover
Status: unassigned
Created: Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
Last Updated: Mon Oct 21, 2013 07:33 AM UTC
Owner: nobody
The issue is seen on changeset 4325 on SLES 4 node VMs.
SC-1 is active, SC-2 is standby. Failover is triggered by killing FMD on SC-1.
Jul 29 21:28:18 SLES-64BIT-SLOT1 root: killing osaffmd from invoke_failover.sh
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: NO
'safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' :
Recovery is 'nodeFailfast'
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: ER
safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery
is:nodeFailfast
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131343, SupervisionTime = 60
Jul 29 21:28:18 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
timeout=60
SC-2 tried becoming Active but failed since CLMD reported ERR_EXIST on
implementer set. The reason is IMMND has not yet disconnected the old
implementer on 2010f. The following is the syslog which shows the sequence.
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408188] TIPC: Resetting link
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408194] TIPC: Lost link
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408198] TIPC: Lost contact with
<1.1.1>
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMND DOWN on active
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS
message:283106
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafrded[2332]: NO rde_rde_set_role: role set
to 1
Jul 29 21:28:31 SLES-64BIT-SLOT2 osaflogd[2369]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafntfd[2379]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: NO ACTIVE request
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfd[2416]: NO FAILOVER StandBy --> Active
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet
failed rc:14, exiting
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: NO
'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' :
Recovery is 'nodeFailfast'
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: ER
safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery
is:nodeFailfast
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafamfnd[2426]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131599, SupervisionTime = 60
Jul 29 21:28:31 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA DISCARD DUPLICATE FEVS
message:283107
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Global discard node
received for nodeId:2010f pid:2382
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2353 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2350 <0, 2010f(down)> (safEvtService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2349 <0, 2010f(down)> (safLckService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2348 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2347 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2342 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2352 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2346 <0, 2010f(down)> (safSmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2343 <0, 2010f(down)> (safClmService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2344 <0, 2010f(down)> (safAmfService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected
2345 <0, 2010f(down)> (safLogService)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs
message 283106 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO Skipping re-send of fevs
message 283107 since it has recently been resent.
Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmd[2350]: NO ACTIVE request
The implementer is disconnected at IMMND after CLMD reported ERR_EXIST and went for
recovery. Ideally, the IMMND implementers on the old node should be disconnected
before other OpenSAF processes try to reuse the same implementer name.
This order needs to be guaranteed for the failover to always succeed. In
this test the cluster went for reboot. This issue is highly timing dependent and
difficult to reproduce.
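The ordering described above can be checked mechanically in a syslog. The following Python sketch is an assumption-laden helper, not an OpenSAF tool: the function name and regular expressions are made up for illustration and only match the message formats quoted in this ticket. It expects one log entry per line (the entries are wrapped in this mail). The sample lines are taken from the two syslogs in this ticket.

```python
import re

# Matches e.g. "osafimmnd[2360]: NO Implementer disconnected 2343 <0, 2010f(down)> (safClmService)"
DISCONNECTED = re.compile(
    r"osafimmnd\[\d+\]: NO Implementer disconnected \d+ <[^>]*> \((?P<name>[^)]+)\)"
)
# Matches e.g. "osafclmd[2393]: ER saImmOiImplementerSet failed rc:14, exiting"
SET_FAILED = re.compile(r"osafclmd\[\d+\]: ER saImmOiImplementerSet failed rc:\d+")


def clm_set_failed_before_disconnect(lines, name="safClmService"):
    """True if CLMD's implementer set failed before IMMND logged the old
    implementer `name` as disconnected."""
    for line in lines:
        match = DISCONNECTED.search(line)
        if match and match.group("name") == name:
            return False  # old implementer was cleared first
        if SET_FAILED.search(line):
            return True  # CLMD gave up before the disconnect was logged
    return False


FAILED_RUN = [
    "Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: NO ACTIVE request",
    "Jul 29 21:28:31 SLES-64BIT-SLOT2 osafclmd[2393]: ER saImmOiImplementerSet failed rc:14, exiting",
    "Jul 29 21:28:31 SLES-64BIT-SLOT2 osafimmnd[2360]: NO Implementer disconnected 2343 <0, 2010f(down)> (safClmService)",
]
GOOD_RUN = [
    "Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected 2314 <0, 2010f(down)> (safClmService)",
    "Jul 29 21:24:46 SLES-64BIT-SLOT2 osafclmd[2403]: NO ACTIVE request",
]
```

Calling `clm_set_failed_before_disconnect(FAILED_RUN)` flags the failing sequence, while `GOOD_RUN` passes because the disconnect is logged six seconds before CLMD gets its ACTIVE request.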
AMF and IMM traces are available and can be provided on request. Currently
attaching the syslogs. SC-2 is ahead by 7 seconds in time.
Syslog during successful failover:
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMND DOWN on active
controller f1 detected at standby immd!! f2. Possible failover
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS
message:279866
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA DISCARD DUPLICATE FEVS
message:279867
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: WA Error code 2 returned for
message type 57 - ignoring
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Global discard node
received for nodeId:2010f pid:2372
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2323 <0, 2010f(down)> (OpenSafImmPBE)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2320 <0, 2010f(down)> (safEvtService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2319 <0, 2010f(down)> (safLckService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2318 <0, 2010f(down)> (safCheckPointService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2317 <0, 2010f(down)> (safMsgGrpService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2312 <0, 2010f(down)> (MsgQueueService131343)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2321 <0, 2010f(down)> (@OpenSafImmReplicatorA)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2316 <0, 2010f(down)> (safSmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2314 <0, 2010f(down)> (safClmService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2313 <0, 2010f(down)> (safAmfService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2315 <0, 2010f(down)> (safLogService)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs
message 279866 since it has recently been resent.
Jul 29 21:24:40 SLES-64BIT-SLOT2 osafimmd[2356]: NO Skipping re-send of fevs
message 279867 since it has recently been resent.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: NO Role: STANDBY, Node Down for
node id: 2010f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaffmd[2347]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for Active peer, OwnNodeId =
131599, SupervisionTime = 60
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192084] TIPC: Resetting link
<1.1.2:eth1-1.1.1:eth0>, peer not responding
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192093] TIPC: Lost link
<1.1.2:eth1-1.1.1:eth0> on network plane A
Jul 29 21:24:46 SLES-64BIT-SLOT2 kernel: [ 102.192100] TIPC: Lost contact with
<1.1.1>
Jul 29 21:24:46 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafrded[2338]: NO rde_rde_set_role: role set
to 1
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osaflogd[2379]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafntfd[2389]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafclmd[2403]: NO ACTIVE request
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmd[2356]: NO New coord elected, resides
at 2020f
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO This IMMND is now the NEW
Coord
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO STARTING persistent back
end process.
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2326 <11, 2020f> (@safAmfService2020f)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer disconnected
2324 <3, 2020f> (@safLogService)
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafimmnd[2366]: NO Implementer connected:
2328 (safAmfService) <11, 2020f>
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO Node 'SC-1' left the cluster
Jul 29 21:24:46 SLES-64BIT-SLOT2 osafamfd[2422]: NO FAILOVER StandBy --> Active
DONE!
________________________________
Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is
subscribed to https://sourceforge.net/p/opensaf/tickets/
To unsubscribe from further messages, a project admin can change settings at
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a
mailing list, you can unsubscribe from the mailing list.
---
** [tickets:#528] order of service not guaranteed during failover**
**Status:** unassigned
**Created:** Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
**Last Updated:** Mon Oct 28, 2013 04:34 PM UTC
**Owner:** nobody
_______________________________________________
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets