[tickets] [opensaf:tickets] Re: #1059 2PBE: cluster reset observed during switchovers

Anders Bjornerstedt Wed, 10 Sep 2014 10:20:01 -0700

Neel correctly points out that 946 may fix the CLM problem.
This is true if the ERR_EXIST on implementer-set is due to the prior OI having
detached *locally* at this node but not yet having been confirmed globally over
fevs. But since this is a switchover, it is more likely that the OI detached on
the other SC, and that the AMF is faster in processing the quiesced-ack from the
old active CLMD and ordering the CLMD at this node to become the new active
(and thus allocate the OI).


So I think getting ERR_EXIST at the new active OI's implementer-set may
unfortunately be a fact of life for the switchover case.
The only fix I see here is this: the new active knows that this is (or could
be) a switchover, since it has just been ordered to become active. In this
context it could interpret ERR_EXIST from implementer-set as effectively
TRY_AGAIN.

Perhaps even simpler: Director services could always interpret ERR_EXIST on
implementer-set as TRY_AGAIN.
As always, a TRY_AGAIN loop must be finite. And implementer-set is not blocked
by imm-sync, so we are not talking 60 seconds here. At MOST we are talking
sub-second: the fevs turnaround latency.
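
To make the idea concrete, here is a minimal sketch of such a finite loop. It is written in Python only for brevity and is not OpenSAF code: `implementer_set` is a hypothetical callable standing in for saImmOiImplementerSet, and the attempt count and delay are assumed values chosen to stay within a sub-second budget. The numeric codes are the standard SaAisErrorT values (14 matches the rc:14 seen in the CLMD log).

```python
import time

# Standard SaAisErrorT values from the SAF AIS headers.
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6
SA_AIS_ERR_EXIST = 14


def implementer_set_with_retry(implementer_set, max_attempts=10,
                               delay_s=0.05, sleep=time.sleep):
    """Call implementer_set() until it returns a non-transient code.

    implementer_set is a hypothetical zero-argument callable standing in
    for saImmOiImplementerSet(); it returns an SaAisErrorT value.
    ERR_EXIST is treated like TRY_AGAIN, but only for a bounded number
    of attempts (assumed budget: 10 * 50 ms = 0.5 s).
    """
    rc = SA_AIS_ERR_TRY_AGAIN
    for attempt in range(max_attempts):
        rc = implementer_set()
        if rc not in (SA_AIS_ERR_TRY_AGAIN, SA_AIS_ERR_EXIST):
            return rc  # success or a hard error: stop retrying
        if attempt + 1 < max_attempts:
            sleep(delay_s)
    return rc  # still transient after the budget: caller decides what to do
```

If the loop runs out of attempts it hands the last return code back to the caller, so a genuinely stale implementer still surfaces as ERR_EXIST rather than being retried forever.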

/AndersBj

________________________________
From: Anders Björnerstedt [mailto:anders.bjornerst...@ericsson.com]
Sent: den 10 september 2014 18:57
To: [opensaf:tickets] ; opensaf-tickets@lists.sourceforge.net
Subject: Re: [tickets] [opensaf:tickets] #1059 2PBE: cluster reset observed 
during switchovers

Good analysis.

We can add that the reason that AMFD got BAD_HANDLE when attempting to do an
RtObjectUpdate is that although the OI handle is valid, it was not associated
with any implementer-name at that time.
So this must be a pure and plain bug in the AMFD. Most likely recently
introduced, since otherwise we should have seen this before.

The AMFD treats this BAD_HANDLE as if it had the only "expected" cause of
BAD_HANDLE: that the handle is invalid because the IMMND has restarted. By
"expected" I don't mean common; I mean the only reason for which the AMFD
programmer has to re-initialize the handle: a restart of the local IMMND. But
that is not what is happening here.

The SAF spec actually explicitly says that BAD_HANDLE is to be used for this
particular case (implementer-name not set when trying to perform an OI
operation). While this is not wrong, it would be better in this case to use
ERR_BAD_OPERATION, since that is an error that is unambiguously a client error
and the spec on that error code for this downcall also fits the case:

ERR_BAD_OPERATION - The targeted object is not implemented by the invoking 
process.

So I think we should write an enhancement on the immsv to change the error code
for this case.
It will also be a backwards compatible change. We are talking about an
interface violation here, and that should be handled by the process aborting.
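
As a rough illustration of why the distinction matters to a client, here is a sketch (Python for brevity, not OpenSAF code) of how an OI client could map return codes to recovery actions if the error code were changed as proposed. The function and the action names are hypothetical; the numeric codes are the standard SaAisErrorT values.

```python
# Standard SaAisErrorT values from the SAF AIS headers.
SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6
SA_AIS_ERR_BAD_HANDLE = 9
SA_AIS_ERR_BAD_OPERATION = 20


def classify_oi_result(rc):
    """Map the return code of an OI downcall to a recovery action.

    The action names are hypothetical labels used only in this sketch.
    """
    if rc == SA_AIS_OK:
        return "done"
    if rc == SA_AIS_ERR_TRY_AGAIN:
        return "retry"
    if rc == SA_AIS_ERR_BAD_HANDLE:
        # Would then only mean a stale handle, e.g. after a local IMMND
        # restart, so re-initializing the handle is the right reaction.
        return "reinitialize-handle"
    if rc == SA_AIS_ERR_BAD_OPERATION:
        # Unambiguous client error (such as no implementer-name set):
        # an interface violation, handled by aborting.
        return "abort"
    return "fail"
```

With the current behaviour both situations arrive as BAD_HANDLE, so a client like the AMFD cannot tell a stale handle from its own premature use of the handle.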

We should also write a defect ticket on AMFD, or use this ticket, to track a
fix for the AMFD bug: premature use of an OI handle. This ticket also points to
an independent CLM bug. So we should probably have two tickets.

/AndersBj



________________________________
From: Neelakanta Reddy [mailto:neelaka...@users.sf.net]
Sent: den 10 september 2014 17:58
To: opensaf-tickets@lists.sourceforge.net
Subject: [tickets] [opensaf:tickets] #1059 2PBE: cluster reset observed during 
switchovers


A. SLOT1 node went down:

  1.  CLM got BAD_HANDLE and finalizes the handle

Sep 10 14:56:51.543332 osafclmd [7511:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 10 14:56:51.543370 osafclmd [7511:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No 
initialized handle exists!

  2.  Discard implementer is called

Sep 10 14:56:51.538179 osafimmnd [7448:immsv_evt.c:5363] T8 Sending: 
IMMD_EVT_ND2D_DISCARD_IMPL to 0
Sep 10 14:56:51.539878 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
Sep 10 14:56:51.539994 osafimmnd [7448:ImmModel.cc:11510] NO Implementer 
locally disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
Sep 10 14:56:51.540181 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer

  3.  But the implementer actually got disconnected at

Sep 10 14:56:51.580449 osafimmnd [7448:immnd_evt.c:8588] T2 Global discard 
implementer for id:190
Sep 10 14:56:51.580462 osafimmnd [7448:ImmModel.cc:11474] >> discardImplementer
Sep 10 14:56:51.580496 osafimmnd [7448:ImmModel.cc:11481] NO Implementer 
disconnected 190 <17, 2010f> (safClmService)
Sep 10 14:56:51.580518 osafimmnd [7448:ImmModel.cc:11534] << discardImplementer

  4.  CLM tries to re-initialize and receives ERR_EXIST

Sep 10 14:56:51.551900 osafclmd [7511:imma_oi_api.c:0440] << 
saImmOiSelectionObjectGet
Sep 10 14:56:51.551942 osafclmd [7511:clms_imm.c:2286] ER saImmOiImplementerSet 
failed rc:14, exiting
Sep 10 14:59:51.245982 osafclmd [2538:clms_main.c:0267] >> clms_init

Sep 10 14:56:51.548981 osafimmnd [7448:immsv_evt.c:5382] T8 Received: 
IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2010f
Sep 10 14:56:51.549023 osafimmnd [7448:immnd_evt.c:2471] T2 SENDRSP FAIL 14

946 fixes the above problem in CLM
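
The race in section A is easier to see with the three IMMND events placed in time order. The small Python sketch below does only that; the timestamps are copied from the osafimmnd trace lines quoted above, while the labels and helper names are mine.

```python
from datetime import datetime, timedelta

# Timestamps copied from the osafimmnd trace lines quoted above.
EVENTS = [
    ("global discard of implementer 190", "14:56:51.580496"),
    ("local discard, implementer 190 marked doomed", "14:56:51.539994"),
    ("IMMND_EVT_A2ND_OI_IMPL_SET received from CLMD", "14:56:51.548981"),
]


def parse(ts):
    """Parse an HH:MM:SS.ffffff trace timestamp."""
    return datetime.strptime(ts, "%H:%M:%S.%f")


def timeline(events):
    """Return (label, microseconds since the first event), ordered by time."""
    ordered = sorted(events, key=lambda e: parse(e[1]))
    start = parse(ordered[0][1])
    return [(label, (parse(ts) - start) // timedelta(microseconds=1))
            for label, ts in ordered]


for label, offset_us in timeline(EVENTS):
    print(f"+{offset_us:>6} us  {label}")
```

The implementer-set request reaches the IMMND about 9 ms after the local discard but about 31.5 ms before the global discard, which is the window in which it is answered with rc 14.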

B. SLOT2 node went down (Quiesced --> Active)

  1.  Sep 10 14:56:57.681152 osafamfd [6896:role.cc:0375] NO FAILOVER Quiesced 
--> Active

  2.  saImmOiRtObjectUpdate_2 got BAD_HANDLE so AMFD tries to re-initialize 
with IMM and calls avd_imm_reinit_bg

Sep 10 14:56:57.701333 osafamfd [6896:imma_oi_api.c:2279] >> 
saImmOiRtObjectUpdate_2
Sep 10 14:56:57.701344 osafamfd [6896:imma_oi_api.c:2345] T2 ERR_BAD_HANDLE: 
The SaImmOiHandleT is not associated with any implementer name
Sep 10 14:56:57.701353 osafamfd [6896:imma_oi_api.c:2554] << 
saImmOiRtObjectUpdate_2
Sep 10 14:56:57.701362 osafamfd [6896:imm.cc:0164] TR BADHANDLE
Sep 10 14:56:57.701370 osafamfd [6896:imm.cc:1660] >> avd_imm_reinit_bg
Sep 10 14:56:57.701406 osafamfd [6896:imm.cc:1662] NO Re-initializing with IMM
Sep 10 14:56:57.701420 osafamfd [6896:imma_oi_api.c:0622] >> saImmOiFinalize

  3.  Before the finalize has completed clearing the OI handle, impl_set is
called by AMFD in the function avd_role_failover_qsd_actv (which calls
avd_imm_impl_set_task_create). Because of this, amfd exited.

Sep 10 14:56:57.701178 osafamfd [6896:role.cc:0498] << 
avd_role_failover_qsd_actv

Sep 10 14:56:57.702256 osafamfd [6896:imm.cc:1215] >> avd_imm_impl_set
Sep 10 14:56:57.702273 osafamfd [6896:imma_oi_api.c:1281] T4 ERR_LIBRARY: 
Overlapping use of IMM OI handle by multiple threads
Sep 10 14:56:57.703683 osafamfd [6896:imm.cc:1218] ER saImmOiImplementerSet 
failed 2
Sep 10 14:56:57.703788 osafamfd [6896:imm.cc:1288] ER exiting since 
avd_imm_impl_set failed

Because the OI handle is shared across multiple threads in AMFD,
saImmOiImplementerSet failed with ERR_LIBRARY.
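
One generic way to avoid overlapping use of a single handle is to route every call on it through one lock. The sketch below shows only that pattern, in Python for brevity; the class and its method are hypothetical and this is not how AMFD is implemented. It removes the overlap that triggers ERR_LIBRARY, but it does not by itself settle the ordering between the finalize and the implementer-set.

```python
import threading


class SerializedOiHandle:
    """Wrap a handle so that only one thread at a time issues a call on it.

    Hypothetical helper for illustration; 'handle' stands in for an
    SaImmOiHandleT and 'fn' for any downcall that takes it first.
    """

    def __init__(self, handle):
        self._handle = handle
        self._lock = threading.Lock()

    def call(self, fn, *args):
        # Every downcall on the shared handle takes the same lock, so a
        # finalize on one thread cannot overlap an implementer-set on another.
        with self._lock:
            return fn(self._handle, *args)
```

A caller would create one wrapper per handle and pass it to every thread that needs the handle, instead of passing the raw handle around.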

________________________________

[tickets:#1059]<http://sourceforge.net/p/opensaf/tickets/1059> 2PBE: cluster 
reset observed during switchovers

Status: unassigned
Milestone: 4.3.3
Created: Wed Sep 10, 2014 09:57 AM UTC by Sirisha Alla
Last Updated: Wed Sep 10, 2014 10:29 AM UTC
Owner: nobody

The issue is seen on SLES X86. OpenSAF is running changeset 5697 with 2PBE and
50k application objects.

Switchovers were in progress, with an IMM application running, when the issue
was observed.

Syslog on SC-1:

Sep 10 14:56:47 SLES-64BIT-SLOT1 osafamfnd[7540]: NO Assigned 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 
182 <0, 2020f> (@OpenSafImmReplicatorB)
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE 
commit for PRTA update Ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on 
attempt to start prepare of ccb:100000063/4294967395
Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE 
commit for Ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer (applier) 
connected: 193 (@OpenSafImmReplicatorB) <0, 2020f>
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) 
replied with transient error on prepare for ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied with OK on 
attempt to start prepare of ccb:6/6
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Starting distributed PBE 
commit for PRTA update Ccb:100000064/4294967396
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Ccb 6 COMMITTED (SetUp_Ccb)
Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) 
replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) 
replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) 
replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:50 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) 
replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) 
replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or Immsv (6) 
replied with transient error on prepare for ccb:100000064/4294967396
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA Start prepare for ccb: 
100000064/4294967396 towards slave PBE returned: '6' from sttandby PBE
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to prepare PRTA 
update Ccb:100000064/4294967396 towards PBE-B
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20) in PRTA update 
(ccbId:100000064)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: WA update of PERSISTENT 
runtime attributes in object 'safNode=PL-4,safCluster=myClmCluster' REVERTED. 
PBE rc:20
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer locally 
disconnected. Marking it as doomed 190 <17, 2010f> (safClmService)
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafclmd[7511]: ER saImmOiImplementerSet 
failed rc:14, exiting
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: NO 
'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: ER 
safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131343, SupervisionTime = 60
Sep 10 14:56:51 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; 
timeout=60
Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 
190 <17, 2010f> (safClmService)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: ER PBE PRTAttrs Update 
continuation missing! invoc:100
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 
16 <0, 2020f> (@OpenSafImmPBE)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer disconnected 
17 <0, 2020f> (OsafImmPbeRt_B)
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: WA Timeout on syncronous 
admin operation 108
Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmpbed: WA Failed to delete class towards 
slave PBE. Library or immsv replied Rc:5 - ignoring

Syslog on SC-2

Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER Failed to stop cluster 
tracking 5
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ClmTrack stop failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Current role: ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 
131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: WA IMMND DOWN on active 
controller f1 detected at standby immd!! f2. Possible failover
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to QUIESCED
.....
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Controller Failover: Setting 
role to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set to ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osaflogd[6850]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfd[6863]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafclmd[6877]: NO ACTIVE request
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ellect_coord invoke from 
rda_callback ACTIVE
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO New coord elected, resides 
at 2020f
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO 2PBE configured, 
IMMSV_PBE_FILE_SUFFIX:.2020f (sync)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO This IMMND is now the NEW 
Coord
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Coord broadcasting 
PBE_PRTO_PURGE_MUTATIONS, epoch:18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Node 'SC-1' left the cluster
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: NO Assigning 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7701]: NO exiting on signal 15
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA Global PURGE PERSISTENT 
RTO mutations received in epoch 18
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA PBE failed in 
persistification of class delete fWiTRoVwpQDAfWNBVtqJ
......
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO FAILOVER Quiesced --> Active
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ncs_mbcsv_svc 
NCS_MBCSV_OP_CHG_ROLE 1 failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer connected: 200 
(MsgQueueService131343) <410, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer locally 
disconnected. Marking it as doomed 200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer (applier) 
connected: 201 (@OpenSafImmReplicatorA) <412, 2020f>
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7722]: NO Started
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 
184 <10, 2020f> (safAmfService)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Re-initializing with IMM
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer disconnected 
200 <410, 2020f> (MsgQueueService131343)
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER saImmOiImplementerSet 
failed 2
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER exiting since 
avd_imm_impl_set failed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: ER AMF director unexpectedly 
crashed
Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) 
received, OwnNodeId = 131599, SupervisionTime = 60
Sep 10 14:56:57 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; 
timeout=60

Syslog and traces are attached.

________________________________

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a 
mailing list, you can unsubscribe from the mailing list.



---

** [tickets:#1059] 2PBE: cluster reset observed during switchovers**

**Status:** unassigned
**Milestone:** 4.3.3
**Created:** Wed Sep 10, 2014 09:57 AM UTC by Sirisha Alla
**Last Updated:** Wed Sep 10, 2014 03:58 PM UTC
**Owner:** nobody


_______________________________________________
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
