Will this work?
diff --git a/src/amf/amfnd/comp.cc b/src/amf/amfnd/comp.cc
--- a/src/amf/amfnd/comp.cc
+++ b/src/amf/amfnd/comp.cc
@@ -2650,6 +2650,9 @@ void avnd_comp_cmplete_all_csi_rec(AVND_
     /* generate csi-remove-done event... csi may be
        deleted */
     (void)avnd_comp_csi_remove_done(cb, comp, curr);
+    if (curr == nullptr)
+      break;
+
     if (0 == m_AVND_COMPDB_REC_CSI_GET(*comp, curr->name.c_str())) {
       curr =
           (prv) ?
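
To me the backtrace hints at a use-after-free rather than a plain null pointer: in frame #2 the key passed to ncs_db_link_list_find is 0x656d6e6769737361, which is ASCII text ("assignme" read little-endian) instead of a valid address, so curr->name.c_str() was most likely read from memory that avnd_comp_csi_remove_done() had already freed and that got reused. If that is the case, the local curr would still hold the old address (it is passed by value), so a nullptr check on it alone might not catch the stale pointer. Below is a minimal, self-contained sketch of the hazard and of the "copy what you need before the call" pattern; CsiRec, csi_list, csi_remove_done and csi_rec_exists are illustrative stand-ins, not OpenSAF code.

~~~
// Minimal, self-contained sketch (NOT OpenSAF code) of the pattern in
// avnd_comp_cmplete_all_csi_rec(): a "remove done" step may delete the very
// record the loop cursor points at, so nothing reachable through the cursor
// may be touched after that call.  All names below are illustrative.
#include <iostream>
#include <iterator>
#include <list>
#include <string>

struct CsiRec { std::string name; };

std::list<CsiRec> csi_list = {{"csi,A"}, {"csi,B"}};

// Stand-in for avnd_comp_csi_remove_done(): it may erase the record, which
// leaves any iterator/pointer still held by the caller dangling.
void csi_remove_done(std::list<CsiRec>::iterator curr) { csi_list.erase(curr); }

// Stand-in for m_AVND_COMPDB_REC_CSI_GET(): look a record up by its key.
bool csi_rec_exists(const std::string &name) {
  for (const auto &r : csi_list)
    if (r.name == name) return true;
  return false;
}

int main() {
  auto curr = csi_list.begin();
  while (curr != csi_list.end()) {
    // Copy the key and remember the next node BEFORE the call that may free
    // the record; dereferencing curr afterwards (e.g. curr->name) would be
    // exactly the use-after-free the backtrace hints at.
    const std::string name = curr->name;
    const auto next = std::next(curr);

    csi_remove_done(curr);  // the record may be gone after this point

    // The real code re-checks the component DB here; doing the lookup with a
    // key copied up front (instead of curr->name) keeps it safe either way.
    if (!csi_rec_exists(name))
      std::cout << name << " was deleted by the callback\n";

    curr = next;  // never touch the possibly-stale cursor again
  }
  return 0;
}
~~~

In the real code the same effect could perhaps be had by copying curr->name into a local std::string before the avnd_comp_csi_remove_done() call and using that copy in the m_AVND_COMPDB_REC_CSI_GET() lookup, but I have not tested that.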
---
** [tickets:#2213] AMFND: Coredump if suFailover while shutting down**
**Status:** assigned
**Milestone:** 5.2.RC1
**Created:** Fri Dec 02, 2016 04:54 AM UTC by Minh Hon Chau
**Last Updated:** Thu Mar 02, 2017 06:09 AM UTC
**Owner:** Nagendra Kumar
**Attachments:**
- [log.tgz](https://sourceforge.net/p/opensaf/tickets/2213/attachment/log.tgz)
(548.6 kB; application/x-compressed)
An amfnd coredump was seen on PL-5, with the backtrace below, while the cluster was shutting down:
~~~
Thread 1 (Thread 0x7f92a8925780 (LWP 411)):
#0 __strcmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1358
No locals.
#1 0x0000000000449cc9 in avsv_dblist_sastring_cmp (key1=<optimized out>,
key2=<optimized out>) at util.c:361
i = 0
str1 = <optimized out>
str2 = <optimized out>
#2 0x00007f92a84b1f95 in ncs_db_link_list_find (list_ptr=0x1ee89f0,
key=0x656d6e6769737361 <error: Cannot access memory
at address 0x656d6e6769737361>) at ncsdlib.c:169
start_ptr = 0x1ee3168
#3 0x0000000000416dc0 in avnd_comp_cmplete_all_csi_rec (cb=0x666940
<_avnd_cb>, comp=0x1ee8200) at comp.cc:2652
curr = 0x1ee8060
prv = 0x1ee3150
__FUNCTION__ = "avnd_comp_cmplete_all_csi_rec"
#4 0x000000000040ca47 in avnd_instfail_su_failover (failed_comp=0x1ee8200,
su=0x1ee74e0, cb=0x666940 <_avnd_cb>) at clc
.cc:3161
rc = <optimized out>
#5 avnd_comp_clc_st_chng_prc (cb=cb@entry=0x666940 <_avnd_cb>,
comp=comp@entry=0x1ee8200, prv_st=prv_st@entry=
SA_AMF_PRESENCE_RESTARTING,
final_st=final_st@entry=SA_AMF_PRESENCE_TERMINATION_FAILED) at clc.cc:967
csi = 0x0
__FUNCTION__ = "avnd_comp_clc_st_chng_prc"
ev = AVND_SU_PRES_FSM_EV_MAX
is_en = <optimized out>
rc = 1
#6 0x000000000040f530 in avnd_comp_clc_fsm_run (cb=cb@entry=0x666940
<_avnd_cb>, comp=comp@entry=0x1ee8200, ev=
AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_FAIL) at clc.cc:906
prv_st = <optimized out>
final_st = <optimized out>
rc = 1
__FUNCTION__ = "avnd_comp_clc_fsm_run"
#7 0x000000000040fdea in avnd_evt_clc_resp_evh (cb=0x666940 <_avnd_cb>,
evt=0x7f92900008c0) at clc.cc:414
__FUNCTION__ = "avnd_evt_clc_resp_evh"
ev = <optimized out>
clc_evt = 0x7f92900008e0
comp = 0x1ee8200
rc = 1
#8 0x000000000042676f in avnd_evt_process (evt=0x7f92900008c0) at main.cc:626
cb = 0x666940 <_avnd_cb>
rc = 1
#9 avnd_main_process () at main.cc:577
ret = <optimized out>
fds = {{fd = 12, events = 1, revents = 1}, {fd = 16, events = 1,
revents = 0}, {fd = 14, events = 1, revents =
0}, {fd = 0, events = 0, revents = 0}}
evt = 0x7f92900008c0
__FUNCTION__ = "avnd_main_process"
result = <optimized out>
rc = <optimized out>
#10 0x00000000004058f3 in main (argc=1, argv=0x7ffe700c5c78) at main.cc:202
error = 0
1358 ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
~~~
In the syslog of PL-5:
~~~
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1'
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of
'safSu=3,safSg=1,safApp=npm_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO
'safComp=A,safSu=3,safSg=1,safApp=npm_1' faulted due to
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1'
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1'
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of
'safSu=3,safSg=1,safApp=nway_1' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO
'safComp=A,safSu=3,safSg=1,safApp=nway_1' faulted due to
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1'
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2'
component restart probation timer started (timeout: 60000000000 ns)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO Restarting a component of
'safSu=4,safSg=1,safApp=npm_2' (comp restart count: 1)
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO
'safComp=A,safSu=4,safSg=1,safApp=npm_2' faulted due to
'csiRemovecallbackTimeout' : Recovery is 'componentRestart'
2016-11-20 22:01:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2'
Presence State INSTANTIATED => RESTARTING
2016-11-20 22:01:21 PL-5 amfclccli[729]: CLEANUP request
'safComp=A,safSu=4,safSg=1,safApp=npm_2'
2016-11-20 22:01:21 PL-5 amfclccli[728]: CLEANUP request
'safComp=A,safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:01:21 PL-5 amfclccli[727]: CLEANUP request
'safComp=A,safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:12 PL-5 osafamfnd[411]: NO Removed 'safSi=2,safApp=nway_1'
from 'safSu=3,safSg=1,safApp=nway_1'
2016-11-20 22:02:12 PL-5 osafimmnd[394]: NO Global discard node received for
nodeId:2040f pid:399
2016-11-20 22:02:12 PL-5 osafdtmd[380]: NO Lost contact with 'PL-4'
2016-11-20 22:02:13 PL-5 opensafd: Stopping OpenSAF Services
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Shutdown initiated
2016-11-20 22:02:13 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1'
(state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=npm_1'
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=3,safSg=1,safApp=nway_1'
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO 'safSu=4,safSg=1,safApp=npm_2'
Component or SU restart probation timer expired
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Cleanup of
'safComp=A,safSu=3,safSg=1,safApp=npm_1' failed
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Reason:'Script did not exit within
time'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: WA
'safComp=A,safSu=3,safSg=1,safApp=npm_1' Presence State RESTARTING =>
TERMINATION_FAILED
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Removed 'safSi=A2,safApp=npm_1'
from 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1'
(state 4)
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigning 'safSi=A1,safApp=npm_1'
ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Assigned 'safSi=A1,safApp=npm_1'
ACTIVE to 'safSu=3,safSg=1,safApp=npm_1'
2016-11-20 22:02:21 PL-5 osafamfnd[411]: NO Waiting for 'safSi=1,safApp=nway_1'
(state 4)
2016-11-20 22:02:21 PL-5 A[722]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 B[671]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[665]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[629]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[557]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 A[593]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 osafckptnd[443]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 A[521]: AL AMF Node Director is down, terminate this
process
2016-11-20 22:02:21 PL-5 osafamfwd[452]: Rebooting OpenSAF NodeId = 0 EE Name =
No EE Mapped, Reason: AMF unexpectedly crashed, OwnNodeId = 132367,
SupervisionTime = 60
2016-11-20 22:02:21 PL-5 osafimmnd[394]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: AL AMF Node Director is down,
terminate this process
2016-11-20 22:02:21 PL-5 osafclmna[403]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafsmfnd[421]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafckptnd[443]: exiting for shutdown
2016-11-20 22:02:21 PL-5 osafimmnd[394]: exiting for shutdown
2016-11-20 22:02:21 PL-5 opensaf_reboot: Rebooting local node; timeout=60
~~~
Observations from the syslog:
- Cluster shutdown order: PL-3, PL-4, PL-5, SCs
- On shutting down PL-5, the component timed out on the csiRemove callback and
then failed its cleanup script. As a result, the comp moved to TERM_FAILED,
but the SU was not seen to move to TERM_FAILED in the syslog.
- A similar thing happened when shutting down PL-3 and PL-4. While PL-5 was
struggling to shut down, the component/SU received a new active assignment
before the SU had moved to TERM_FAILED.

Syslog is attached.