[tickets] [opensaf:tickets] #2007 EVT: Service got hanged for 2 hours after saEvtEventPublish
- **Milestone**: 5.1.RC1 --> future
- **Comment**:

I am able to reproduce the problem. It does not look like a newly introduced issue; it looks like multiple threads concurrently calling saEvtChannelClose() and saEvtEventRetentionTimeClear().

    (gdb) bt
    #0 0x7f757034ea00 in sem_wait () from /lib64/libpthread.so.0
    #1 0x7f756f496a62 in hm_block_me () from /usr/lib64/libopensaf_core.so.0
    #2 0x7f756f496bdd in ncshm_destroy_hdl () from /usr/lib64/libopensaf_core.so.0
    #3 0x7f7570b7ba17 in eda_channel_hdl_rec_del () from /usr/lib64/libSaEvt.so.1
    #4 0x7f7570b76d24 in saEvtChannelClose () at eda_saf_api.c:895
    #5 0x00427c57 in tet_saEvtChannelClose (ptrChannelHandle=0x659710) at src/tet_edsv_wrappers.c:198
    #6 0x0040ce15 in tet_RetentionTimeClear_Thread () at src/tet_eda.c:4790
    #7 0x0040eb3e in tet_invoketp (icnum=300, tpnum=1) at src/tet_eda.c:6279
    #8 0x00429aff in call_1tp (icnum=300, tpnum=1, testnum=300) at tcm_main.c:581
    #9 0x0042a0b5 in call_tps (tpcount=, icnum=) at tcm_main.c:477
    #10 tet_tcm_main (argc=, argv=) at tcm_main.c:432
    #11 0x0042c0fd in main (argc=1082677280, argv=0x80) at main.c:83
    (gdb) generate-core-file
    Saved corefile core.6197
    (gdb) bt full
    #0 0x7f757034ea00 in sem_wait () from /lib64/libpthread.so.0
    No symbol table info available.
    #1 0x7f756f496a62 in hm_block_me () from /usr/lib64/libopensaf_core.so.0
    mbcsv_init_process_req_func = {0x7f756f49b720 , 0x7f756f49d000 , 0x7f756f49bcd0 , 0x7f756f49bbe0 , 0x7f756f49bde0 , 0x7f756f49c140 , 0x7f756f49c2d0 , 0x7f756f49c5b0 , 0x7f756f49c8c0 , 0x7f756f49ca60 , 0x7f756f49ba20 , 0x7f756f49ccc0 }
    #2 0x7f756f496bdd in ncshm_destroy_hdl () from /usr/lib64/libopensaf_core.so.0
    mbcsv_init_process_req_func = {0x7f756f49b720 , 0x7f756f49d000 , 0x7f756f49bcd0 , 0x7f756f49bbe0 , 0x7f756f49bde0 , 0x7f756f49c140 , 0x7f756f49c2d0 , 0x7f756f49c5b0 , 0x7f756f49c8c0 , 0x7f756f49ca60 , 0x7f756f49ba20 , 0x7f756f49ccc0 }
    #3 0x7f7570b7ba17 in eda_channel_hdl_rec_del () from /usr/lib64/libSaEvt.so.1
    s_agent_startup_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' , __align = 0}
    eda_use_count = 1
    gl_eda_hdl = 4290773003
    #4 0x7f7570b76d24 in saEvtChannelClose () at eda_saf_api.c:895
    gl_eda_hdl = 4290773003
    #5 0x00427c57 in tet_saEvtChannelClose (ptrChannelHandle=0x659710) at src/tet_edsv_wrappers.c:198
    try_again_count = 0
    #6 0x0040ce15 in tet_RetentionTimeClear_Thread () at src/tet_eda.c:4790
    No locals.
    #7 0x0040eb3e in tet_invoketp (icnum=300, tpnum=1) at src/tet_eda.c:6279
    No locals.
    #8 0x00429aff in call_1tp (icnum=300, tpnum=1, testnum=300) at tcm_main.c:581
    No locals.
    #9 0x0042a0b5 in call_tps (tpcount=, icnum=) at tcm_main.c:477
    testnum = -512
    tpnum = 1
    #10 tet_tcm_main (argc=, argv=) at tcm_main.c:432
    cp =
    icp = 0x65a5d0
    iccount =
    tpcount = 1
    icnum = 300
    rc =
    nsys = 0
    #11 0x0042c0fd in main (argc=1082677280, argv=0x80) at main.c:83
    No locals.
    (gdb) bt thread apply all
    A syntax error in expression, near `apply all'.
    (gdb) thread apply all bt

    Thread 4 (Thread 0x7f7570f9eb00 (LWP 6198)):
    #0 0x7f756fe224f6 in poll () from /lib64/libc.so.6
    #1 0x7f756f485fd1 in osaf_ppoll () from /usr/lib64/libopensaf_core.so.0
    #2 0x7f756f48d9ef in ncs_tmr_wait () from /usr/lib64/libopensaf_core.so.0
    #3 0x7f75703487b6 in start_thread () from /lib64/libpthread.so.0
    #4 0x7f756fe2b9cd in clone () from /lib64/libc.so.6
    #5 0x in ?? ()

    Thread 3 (Thread 0x7f7570f6bb00 (LWP 6199)):
    #0 0x7f756fe224f6 in poll () from /lib64/libc.so.6
    #1 0x7f756f4c317e in mdtm_process_recv_events () from /usr/lib64/libopensaf_core.so.0
    #2 0x7f75703487b6 in start_thread () from /lib64/libpthread.so.0
    #3 0x7f756fe2b9cd in clone () from /lib64/libc.so.6
    #4 0x in ?? ()

    Thread 2 (Thread 0x7f756ef45700 (LWP 6200)):
    #0 0x7f756fdf9c0d in nanosleep () from /lib64/libc.so.6
    #1 0x7f756fdf9a2c in sleep () from /lib64/libc.so.6
    #2 0x00428e76 in eda_selection_thread () at src/tet_edsv_wrappers.c:643
    #3 0x7f75703487b6 in start_thread () from /lib64/libpthread.so.0
    #4 0x7f756fe2b9cd in clone () from /lib64/libc.so.6
    #5 0x in ?? ()

    Thread 1 (Thread 0x7f7570f6e720 (LWP 6197)):
    #0 0x7f757034ea00 in sem_wait () from /lib64/libpthread.so.0
    #1 0x7f756f496a62 in hm_block_me () from /usr/lib64/libopensaf_core.so.0
    #2 0x7f756f496bdd in ncshm_destroy_hdl () from /usr/lib64/libopensaf_core.so.0
    #3 0x7f7570b7ba17 in eda_channel_hdl_rec_del () from
[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP
>> It appears to me that we are hitting something similar to
>> "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect"

Have you tuned the configuration below in /etc/opensaf/dtmd.conf? In the case above, the disconnection is via the keepalive timer (idle time = 40 sec, 4 probes, probe time = 10 sec).

    ==
    # so_keepalive: Enable sending of keep-alive messages on connection-oriented
    # sockets. Expects an integer boolean flag
    # Note that without this set none of the tcp options will matter
    DTM_SKEEPALIVE=1
    #
    # tcp_keepalive_time: The time (in seconds) the connection needs to remain
    # idle before TCP starts sending keepalive probes
    # Optional
    DTM_TCP_KEEPIDLE_TIME=2
    ==

---

** [tickets:#2014] Rebooted controller not detected in TCP**

**Status:** assigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Fri Sep 09, 2016 04:16 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**

- [logs.tgz](https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz) (84.1 kB; application/x-compressed-tar)

In 20% of the cases, a "reboot -f" on controller2 is not detected and acted on. The mds.log contains:

    Sep 7 6:44:23.918566 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x,1>
    Sep 7 6:44:23.918595 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
    Sep 7 6:44:34.018662 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
    Sep 7 6:44:34.018751 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
    Sep 7 6:44:34.018789 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x,1>
    Sep 7 6:44:34.018818 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
    Sep 7 6:44:44.118832 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
    Sep 7 6:44:44.118919 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
    Sep 7 6:44:44.118955 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x,1>
    Sep 7 6:44:44.118984 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>
    Sep 7 6:44:54.218987 osafamfd[41365] ERR |MDS_SND_RCV: Timeout or Error occured
    Sep 7 6:44:54.219085 osafamfd[41365] ERR |MDS_SND_RCV: Timeout occured on red sndrsp message from svc_id = MBCSV(19), to svc_id = MBCSV(19)
    Sep 7 6:44:54.219139 osafamfd[41365] ERR |MDS_SND_RCV: Adest=<0x,1>
    Sep 7 6:44:54.219168 osafamfd[41365] ERR |MDS_SND_RCV: Anchor=<0x0002020f,1790>

Still, there is nothing in the syslog indicating that controller2 has left the cluster. This is for TCP. When the node comes back online (without OpenSAF being started), controller1 finally notices and fails over the apps. When the reboot is not detected, the TCP keepalives stop and the connection goes into retransmits instead.

I have attached two tshark sessions captured from controller1, capturing traffic between controller1 and controller2. The failed reboot detection is captured in "ctrl2_failed_detection.trc", and for a working detection there is a file "ctrl2_working.trc". I have also attached all logs in /var/log/opensaf and the syslog (all from controller1).

It appears to me that we are hitting something similar to "http://stackoverflow.com/questions/33553410/tcp-retranmission-timer-overrides-kills-tcp-keepalive-timer-delaying-disconnect".

// Jonas

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

--
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
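The keepalive figures quoted in this thread (idle time 40 sec, 4 probes, 10 sec between probes) bound how long an idle connection to a dead peer can go unnoticed. The following is a minimal sketch of that arithmetic and of how such values could be applied to a socket. The function names are hypothetical and this is not how dtmd is implemented. The `TCP_USER_TIMEOUT` part is an assumption about one possible way to bound the retransmission case described in the linked Stack Overflow question; it is not something the ticket confirms.

```python
import socket


def keepalive_detection_seconds(idle_s, probes, interval_s):
    """Approximate time before an idle, silent peer is declared dead.

    Keepalive probing starts after idle_s seconds without traffic; the
    connection is dropped after `probes` unanswered probes sent
    interval_s seconds apart.
    """
    return idle_s + probes * interval_s


def apply_keepalive(sock, idle_s, probes, interval_s, user_timeout_ms=None):
    """Enable keepalive on a TCP socket (hypothetical helper, not dtmd code)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options exist on Linux; guard them so the sketch
    # still runs on platforms where Python does not expose them.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPCNT", probes),
                        ("TCP_KEEPINTVL", interval_s)):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
    # Assumption: keepalive only covers idle connections. When data is
    # unacknowledged, the retransmission timer governs instead, and
    # TCP_USER_TIMEOUT (milliseconds) can bound that wait.
    if user_timeout_ms is not None and hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                        user_timeout_ms)


if __name__ == "__main__":
    # 40 + 4 * 10 = 80 seconds for the values quoted in the thread.
    print(keepalive_detection_seconds(40, 4, 10))
```

With the quoted values, an idle connection would be dropped roughly 80 seconds after the peer disappears; the sketch does not model the case where data is already in flight.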
[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP
- **status**: unassigned --> assigned
- **assigned_to**: A V Mahesh (AVM)
- **Component**: unknown --> dtm
- **Part**: lib --> -
- **Priority**: critical --> major
- **Comment**:

Can you please provide your Cluster environment ( OS / VM /container ) details

---

** [tickets:#2014] Rebooted controller not detected in TCP**

**Status:** assigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Thu Sep 08, 2016 06:20 PM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**

- [logs.tgz](https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz) (84.1 kB; application/x-compressed-tar)
[tickets] [opensaf:tickets] #2008 AMFND: Coredump while shutting down
    changeset: 8031:1f2d5df7d8b7
    branch: opensaf-5.1.x
    parent: 8028:4bd26e7de69c
    user: minh-chau
    date: Fri Sep 09 08:02:39 2016 +1000
    summary: AMFND: Fix amfnd coredump if sc failover while shutting down [#2008]

    changeset: 8030:1412efc8c888
    tag: qparent
    user: minh-chau
    date: Fri Sep 09 07:55:38 2016 +1000
    summary: AMFND: Fix amfnd coredump if sc failover while shutting down [#2008]

---

** [tickets:#2008] AMFND: Coredump while shutting down**

**Status:** fixed
**Milestone:** 5.1.RC1
**Created:** Wed Sep 07, 2016 12:35 PM UTC by Minh Hon Chau
**Last Updated:** Thu Sep 08, 2016 10:09 PM UTC
**Owner:** nobody
**Attachments:**

- [osafamfnd](https://sourceforge.net/p/opensaf/tickets/2008/attachment/osafamfnd) (135.3 kB; application/octet-stream)

During the cluster shutdown phase, if both controllers do not shut down fast enough and the active controller goes down first, an SC failover can happen. In this situation, avnd_last_step_clean() gets called twice and a coredump is generated.

This is most likely because records in nodeid_mdsdest_db and hctypedb are deleted while those containers still own the keys. Thus, the second call of avnd_last_step_clean() causes the coredump.

BT

    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    Core was generated by `/usr/local/lib/opensaf/osafamfnd --tracemask=0x'.
    Program terminated with signal SIGABRT, Aborted.
    0 0x7f56a225bcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
    Traceback (most recent call last):
    File "/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19-gdb.py", line 63, in
    from libstdcxx.v6.printers import register_libstdcxx_printers
    ImportError: No module named 'libstdcxx'
    (gdb) bt
    0 0x7f56a225bcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
    1 0x7f56a225f0d8 in __GI_abort () at abort.c:89
    2 0x7f56a2298394 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7f56a23a6b28 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
    3 0x7f56a22a466e in malloc_printerr (ptr=, str=0x7f56a23a2c19 "free(): invalid pointer", action=1) at malloc.c:4996
    4 _int_free (av=, p=, have_lock=0) at malloc.c:3840
    5 0x0043a616 in _M_dispose (__a=..., this=) at /usr/include/c++/4.8/bits/basic_string.h:249
    6 ~basic_string (this=0x1d5fa70, __in_chrg=) at /usr/include/c++/4.8/bits/basic_string.h:539
    7 ~avnd_hctype_tag (this=0x1d5fa70, __in_chrg=) at ../../../../../osaf/services/saf/amf/amfnd/include/avnd_hc.h:46
    8 avnd_last_step_clean (cb=cb@entry=0x665940 <_avnd_cb>) at term.cc:101
    9 0x00436ee1 in avnd_su_si_oper_done (cb=cb@entry=0x665940 <_avnd_cb>, su=0x1d5d000, si=si@entry=0x0) at susm.cc:1169
    10 0x00416629 in avnd_comp_csi_assign_done (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260, csi=csi@entry=0x0) at comp.cc:1642
    11 0x00416a6e in avnd_comp_cmplete_all_assignment (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260) at comp.cc:2567
    12 0x0040bb9b in avnd_comp_clc_terming_cleansucc_hdler (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260) at clc.cc:2328
    13 0x0040f6ba in avnd_comp_clc_fsm_run (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260, ev=AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_SUCC) at clc.cc:876
    14 0x0040ffca in avnd_evt_clc_resp_evh (cb=0x665940 <_avnd_cb>, evt=0x7f568c0008c0) at clc.cc:414
    15 0x00425f5f in avnd_evt_process (evt=0x7f568c0008c0) at main.cc:625
    16 avnd_main_process () at main.cc:576
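The failure mode described in ticket #2008 (a cleanup routine that runs twice while a container still refers to records it already released) can be illustrated with a small model. This is only a sketch under stated assumptions: `Record` and `last_step_clean` are hypothetical names, the dictionary stands in for containers such as hctypedb, and it is not the change made in changeset 8031. It shows the general pattern of removing an entry from its container in the same step that releases it, so a second pass finds nothing left to release.

```python
class Record:
    """Stand-in for a record owned by a database container (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.released = False

    def release(self):
        # Models the crash: releasing the same record twice is an error.
        if self.released:
            raise RuntimeError("double release of " + self.name)
        self.released = True


def last_step_clean(db):
    """Release every record and drop it from the container in the same step.

    Because popitem() removes the entry before it is released, calling
    this function a second time sees an empty container and does nothing.
    """
    while db:
        _, record = db.popitem()
        record.release()


if __name__ == "__main__":
    hctypedb = {"hc-a": Record("hc-a"), "hc-b": Record("hc-b")}
    last_step_clean(hctypedb)
    last_step_clean(hctypedb)  # second call is a harmless no-op
```

If the loop instead iterated over the container and released records without removing them, the second call would hit the `RuntimeError`, which is the analogue of the `free(): invalid pointer` abort in the backtrace.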
[tickets] [opensaf:tickets] #2008 AMFND: Coredump while shutting down
- **status**: review --> fixed
- **assigned_to**: Minh Hon Chau --> nobody

---

** [tickets:#2008] AMFND: Coredump while shutting down**

**Status:** fixed
**Milestone:** 5.1.RC1
**Created:** Wed Sep 07, 2016 12:35 PM UTC by Minh Hon Chau
**Last Updated:** Thu Sep 08, 2016 12:03 PM UTC
**Owner:** nobody
**Attachments:**

- [osafamfnd](https://sourceforge.net/p/opensaf/tickets/2008/attachment/osafamfnd) (135.3 kB; application/octet-stream)
[tickets] [opensaf:tickets] #2014 Rebooted controller not detected in TCP
---

** [tickets:#2014] Rebooted controller not detected in TCP**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:20 PM UTC by Jonas Arndt
**Last Updated:** Thu Sep 08, 2016 06:20 PM UTC
**Owner:** nobody
**Attachments:**

- [logs.tgz](https://sourceforge.net/p/opensaf/tickets/2014/attachment/logs.tgz) (84.1 kB; application/x-compressed-tar)
[tickets] [opensaf:tickets] #2000 osaf: Cluster reset happend due to msgd crashed on both the controller
- **Component**: osaf --> msg

---

** [tickets:#2000] osaf: Cluster reset happend due to msgd crashed on both the controller**

**Status:** unassigned
**Milestone:** 5.1.RC1
**Created:** Tue Sep 06, 2016 06:04 AM UTC by Ritu Raj
**Last Updated:** Wed Sep 07, 2016 09:38 AM UTC
**Owner:** nobody
**Attachments:**

- [Active_syslog](https://sourceforge.net/p/opensaf/tickets/2000/attachment/Active_syslog) (716.7 kB; application/octet-stream)
- [Standby_syslog](https://sourceforge.net/p/opensaf/tickets/2000/attachment/Standby_syslog) (696.4 kB; application/octet-stream)

Environment details
--
OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 4 nodes (2 controllers and 2 payloads, with headless feature disabled & 1PBE enabled with 30K objects)

Summary
--
A cluster reset happened because the assertion 'length < SA_MAX_UNEXTENDED_NAME_LENGTH' failed in msgd.

Steps followed & observed behaviour
--
1. Invoked failover.
2. After a few successful failovers, the new active controller rebooted because of "Assertion 'length < SA_MAX_UNEXTENDED_NAME_LENGTH' failed" in msgd. While the previous active controller was joining the cluster in the standby role, a cluster reset happened.

[Timeline: Sep 6 00:13:02 sofo-s2]

    Sep 6 00:13:02 sofo-s2 osafimmd[3985]: NO MDS event from svc_id 24 (change:5, dest:13)
    Sep 6 00:13:02 sofo-s2 osafmsgd[4145]: osaf_extended_name.c:139: osaf_extended_name_length: Assertion 'length < SA_MAX_UNEXTENDED_NAME_LENGTH' failed.
    Sep 6 00:13:02 sofo-s2 osafamfnd[4046]: NO 'safComp=MQD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
    Sep 6 00:13:02 sofo-s2 osafamfnd[4046]: ER safComp=MQD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
    Sep 6 00:13:02 sofo-s2 osafamfnd[4046]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
    Sep 6 00:13:02 sofo-s2 opensaf_reboot: Rebooting local node; timeout=60

Notes:
1. Syslog attached
2. msgnd & msgd trace not enabled
[tickets] [opensaf:tickets] #1954 log: assertion failed in log_stream_close
- **status**: review --> fixed
- **assigned_to**: Vu Minh Nguyen --> nobody
- **Milestone**: 4.7.2 --> 5.0.1
- **Comment**:

    changeset: 8029:3417fcd840a3
    tag: tip
    parent: 8026:eed08ce4437e
    user: Vu Minh Nguyen
    date: Thu Sep 08 19:23:52 2016 +0700
    summary: log: assertion failed in log_stream_close [#1954]

    changeset: 8028:4bd26e7de69c
    branch: opensaf-5.1.x
    parent: 8025:5c1dfa0c9bf1
    user: Vu Minh Nguyen
    date: Thu Sep 08 19:20:48 2016 +0700
    summary: log: assertion failed in log_stream_close [#1954]

    changeset: 8027:bc9afc86a424
    branch: opensaf-5.0.x
    parent: 8024:4e2638e8f818
    user: Vu Minh Nguyen
    date: Thu Sep 08 19:18:22 2016 +0700
    summary: log: assertion failed in log_stream_close [#1954]

---

** [tickets:#1954] log: assertion failed in log_stream_close**

**Status:** fixed
**Milestone:** 5.0.1
**Created:** Tue Aug 16, 2016 09:54 AM UTC by Vu Minh Nguyen
**Last Updated:** Thu Aug 18, 2016 01:52 AM UTC
**Owner:** nobody

In `lgs_client_delete()`, `log_stream_close()` is called without a NULL check. If the stream is NULL, the node will be rebooted because the assertion fails.

> Aug 16 13:26:04 SC-1 osaflogd[6016]: lgs_stream.cc:759: log_stream_close:
> Assertion 'stream != NULL' failed.

This ticket is going to add the protection.
[tickets] [opensaf:tickets] #1969 smf: One step upgrade with cluster reboot does not wait for nodes to start
When SMF is started after a reboot and is to continue with a campaign, it checks that all nodes that are part of the campaign are available. In this case the campaign has requested a cluster reboot after the procedure execute state is completed. After the restart, the campaign shall continue with the procedure wrap-up state. The preparation for this includes asking for the node ID of all nodes that are part of the campaign, and when all nodes have answered, the wrap-up will be done.

The problem is that each node is checked for node up with a timeout of 10s (this is hard coded), and if a node is not up within this time the campaign will fail.

- Each node has a timeout of 10s
- Nodes are checked in sequence, meaning that the last node checked may have a longer time to start if there has been any waiting for any of the previous ones
- The check starts when smfd has started on the active SC node; some of the other nodes may already have been started by then and some not

Altogether this means that the behavior is unpredictable, and since the worst case gives a rather short timeout it may also be considered unstable.

For 2) I suggest the following to be done:

1. Create a temporary (quick) fix by just using a longer (hard coded) timeout for reboot upgrade, to be released with 5.1 (defect ticket). Will this create any NBC problem?
2. Define and implement a better handling of this, e.g. by making it possible to configure the timeout via a new attribute in the smf configuration object. Can be released as an enhancement in 5.2.

Any better suggestions?

---

** [tickets:#1969] smf: One step upgrade with cluster reboot does not wait for nodes to start**

**Status:** unassigned
**Milestone:** 5.0.1
**Created:** Wed Aug 24, 2016 01:01 PM UTC by elunlen
**Last Updated:** Thu Sep 01, 2016 09:50 AM UTC
**Owner:** nobody

When using the one step upgrade feature with a cluster reboot, all nodes will restart, including the SC nodes. This is done as the last action in the upgrade step. After the active SC node is up again, SMF will continue with the procedure wrap-up. When collecting information in order to prepare the wrap-up, the node destination for all nodes in the campaign is requested. However, this information can only be collected from nodes that are started and have joined the cluster (unlocked).

The problem is that SMF does not seem to wait in order to give all nodes a chance to join the cluster, and if SMF fails to get the node destination from any of the nodes, the campaign will fail as seen in the log below. When reading the node destination there is a 10 sec "try again" loop waiting for "node up" for each node. It is not unlikely that the active SC node comes up before some of the other nodes and that it will take more than 10 sec after that before some of the other nodes join the cluster. If that is the case, the campaign will fail.
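The unpredictability described in ticket #1969 can be seen in a small simulation. This is a sketch under stated assumptions, not SMF code: the function names are hypothetical, node start times are given in seconds after smfd starts checking, and the second function models one possible reading of the suggested configurable timeout as a single shared deadline.

```python
def sequential_wait(up_times, per_node_timeout):
    """Check nodes one after another, each with its own timeout.

    up_times: seconds (after the check starts) at which each node comes up,
    in the order the nodes are checked.
    Returns (ok, elapsed_seconds).
    """
    now = 0.0
    for up in up_times:
        if up > now + per_node_timeout:
            # This node did not come up within its own window.
            return False, now + per_node_timeout
        # Waiting for this node moves the clock forward, which silently
        # extends the time later nodes have had to start.
        now = max(now, up)
    return True, now


def shared_deadline_wait(up_times, deadline):
    """Alternative: every node must be up before one common deadline."""
    return all(up <= deadline for up in up_times)


if __name__ == "__main__":
    # Same four nodes, same start times, different checking order.
    print(sequential_wait([0, 8, 16, 24], 10))
    print(sequential_wait([24, 0, 8, 16], 10))
    print(shared_deadline_wait([24, 0, 8, 16], 30))
```

In this model the same set of start times passes or fails with the 10s per-node timeout depending only on the order in which nodes are checked, whereas a single deadline gives every node the same budget regardless of order.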
[tickets] [opensaf:tickets] #2013 IMM: Search Handle getting corrupt when saImmOmSearchNext_2() returns ERR_TIMEOUT
---

** [tickets:#2013] IMM: Search Handle getting corrupt when saImmOmSearchNext_2() returns ERR_TIMEOUT**

**Status:** unassigned
**Milestone:** 5.1.RC1
**Created:** Thu Sep 08, 2016 12:10 PM UTC by Chani Srivastava
**Last Updated:** Thu Sep 08, 2016 12:10 PM UTC
**Owner:** nobody
**Attachments:**

- [SearchTmOut.zip](https://sourceforge.net/p/opensaf/tickets/2013/attachment/SearchTmOut.zip) (883.9 kB; application/zip)

OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 4 nodes

Summary: Steps to Reproduce

1. Create a runtime/config object
2. Do SearchInitialize()
3. Delete the object created in Step 1
4. Do SearchNext()
5. Do SearchNext() again

Observed Behavior:

Step 4 returns SA_AIS_ERR_TIMEOUT (expected).
Step 5 returns SA_AIS_ERR_BAD_HANDLE **(SA_AIS_ERR_NOT_EXIST is expected)**

**Note: Test passed in OpenSAF release 5.0**

Agent traces and immnd, immd traces attached
[tickets] [opensaf:tickets] #2008 AMFND: Coredump while shutting down
- **status**: assigned --> review

---

** [tickets:#2008] AMFND: Coredump while shutting down**

**Status:** review
**Milestone:** 5.1.RC1
**Created:** Wed Sep 07, 2016 12:35 PM UTC by Minh Hon Chau
**Last Updated:** Thu Sep 08, 2016 11:30 AM UTC
**Owner:** Minh Hon Chau
**Attachments:**

- [osafamfnd](https://sourceforge.net/p/opensaf/tickets/2008/attachment/osafamfnd) (135.3 kB; application/octet-stream)

During the cluster shutdown phase, if both controllers do not shut down fast enough and the active controller goes down first, an SC failover can happen. In this situation avnd_last_step_clean() gets called twice and a coredump is generated.

This is most likely because the records in nodeid_mdsdest_db and hctypedb are deleted while those containers still own the keys. Thus, the second call of avnd_last_step_clean() causes the coredump.

BT

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/lib/opensaf/osafamfnd --tracemask=0x'.
Program terminated with signal SIGABRT, Aborted.
0 0x7f56a225bcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
Traceback (most recent call last):
File "/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19-gdb.py", line 63, in
from libstdcxx.v6.printers import register_libstdcxx_printers
ImportError: No module named 'libstdcxx'
(gdb) bt
0 0x7f56a225bcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
1 0x7f56a225f0d8 in __GI_abort () at abort.c:89
2 0x7f56a2298394 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7f56a23a6b28 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
3 0x7f56a22a466e in malloc_printerr (ptr=, str=0x7f56a23a2c19 "free(): invalid pointer", action=1) at malloc.c:4996
4 _int_free (av=, p=, have_lock=0) at malloc.c:3840
5 0x0043a616 in _M_dispose (__a=..., this=) at /usr/include/c++/4.8/bits/basic_string.h:249
6 ~basic_string (this=0x1d5fa70, __in_chrg=) at /usr/include/c++/4.8/bits/basic_string.h:539
7 ~avnd_hctype_tag (this=0x1d5fa70, __in_chrg=) at ../../../../../osaf/services/saf/amf/amfnd/include/avnd_hc.h:46
8 avnd_last_step_clean (cb=cb@entry=0x665940 <_avnd_cb>) at term.cc:101
9 0x00436ee1 in avnd_su_si_oper_done (cb=cb@entry=0x665940 <_avnd_cb>, su=0x1d5d000, si=si@entry=0x0) at susm.cc:1169
10 0x00416629 in avnd_comp_csi_assign_done (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260, csi=csi@entry=0x0) at comp.cc:1642
11 0x00416a6e in avnd_comp_cmplete_all_assignment (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260) at comp.cc:2567
12 0x0040bb9b in avnd_comp_clc_terming_cleansucc_hdler (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260) at clc.cc:2328
13 0x0040f6ba in avnd_comp_clc_fsm_run (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260, ev=AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_SUCC) at clc.cc:876
14 0x0040ffca in avnd_evt_clc_resp_evh (cb=0x665940 <_avnd_cb>, evt=0x7f568c0008c0) at clc.cc:414
15 0x00425f5f in avnd_evt_process (evt=0x7f568c0008c0) at main.cc:625
16 avnd_main_process () at main.cc:576
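A cleanup routine can be made safe against a second invocation if every record is unlinked from its container before it is freed. The following is only a sketch in plain C, not the C++ containers amfnd uses; `demo_rec`, `demo_db` and both functions are hypothetical names that do not exist in the OpenSAF sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record and container; not the real amfnd types. */
struct demo_rec {
	char *name;
	struct demo_rec *next;
};

struct demo_db {
	struct demo_rec *head;
};

/* Add a record holding a heap copy of name; returns 0 on success. */
int demo_db_add(struct demo_db *db, const char *name)
{
	assert(db != NULL && name != NULL);
	struct demo_rec *rec = malloc(sizeof(*rec));
	if (rec == NULL)
		return -1;
	size_t len = strlen(name) + 1;
	rec->name = malloc(len);
	if (rec->name == NULL) {
		free(rec);
		return -1;
	}
	memcpy(rec->name, name, len);
	rec->next = db->head;
	db->head = rec;
	return 0;
}

/* Free every record. Each record is unlinked before it is freed, so the
 * container never keeps a dangling pointer and a repeated call is a
 * no-op. Returns the number of records freed by this call. */
int demo_db_clean(struct demo_db *db)
{
	assert(db != NULL);
	int freed = 0;
	while (db->head != NULL) {
		struct demo_rec *rec = db->head;
		db->head = rec->next; /* unlink first */
		free(rec->name);
		free(rec);
		freed++;
	}
	return freed;
}
```

With this shape the first call to `demo_db_clean()` frees everything and a second call finds an empty container, so nothing is freed twice.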
[tickets] [opensaf:tickets] #1985 log: cppcheck version 1.75 find errors in logsv
- **status**: review --> fixed
- **assigned_to**: Canh Truong --> nobody
- **Milestone**: 4.7.2 --> 5.0.1
- **Comment**:

changeset: 8026:eed08ce4437e
tag: tip
parent: 8020:e5f162184bbd
user: Canh Van Truong
date: Thu Sep 08 18:59:51 2016 +0700
summary: log: fix errors reported by cppcheck version 1.75 [#1985]

changeset: 8025:5c1dfa0c9bf1
branch: opensaf-5.1.x
parent: 8021:68b29ac33324
user: Canh Van Truong
date: Thu Sep 08 18:59:51 2016 +0700
summary: log: fix errors reported by cppcheck version 1.75 [#1985]

changeset: 8024:4e2638e8f818
branch: opensaf-5.0.x
parent: 8022:2139f3e6b37b
user: Canh Van Truong
date: Thu Sep 08 18:59:51 2016 +0700
summary: log: fix errors reported by cppcheck version 1.75 [#1985]

---

** [tickets:#1985] log: cppcheck version 1.75 find errors in logsv**

**Status:** fixed
**Milestone:** 5.0.1
**Created:** Tue Aug 30, 2016 08:33 AM UTC by Canh Truong
**Last Updated:** Thu Sep 01, 2016 02:34 AM UTC
**Owner:** nobody

[osaf/services/saf/logsv/lgs/lgs_clm.cc:120]: (error) Uninitialized variable: rc
[osaf/services/saf/logsv/lgs/lgs_evt.cc:892]: (error) Invalid strncmp() argument nr 3. A non-boolean value is required.
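The second cppcheck message is reported when a boolean value ends up as the length argument of strncmp(). The snippet below is a generic illustration of that class of bug, not the code from lgs_evt.cc; both function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Buggy: a boolean (0 or 1) is used as the length, so at most one
 * character is compared regardless of n. */
int prefix_equal_buggy(const char *a, const char *b, size_t n)
{
	assert(a != NULL && b != NULL);
	int has_len = (n > 0); /* boolean, not a length */
	return strncmp(a, b, (size_t)has_len) == 0;
}

/* Fixed: the length itself is passed as the third argument. */
int prefix_equal_fixed(const char *a, const char *b, size_t n)
{
	assert(a != NULL && b != NULL);
	return strncmp(a, b, n) == 0;
}
```

For "abc" and "abd" with n = 3, the buggy variant reports a match because only the first character is compared, while the fixed variant reports a mismatch.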
[tickets] [opensaf:tickets] #2006 NTFSv: Cluster rebooted with ntfd crashed on both controllers
- **status**: accepted --> review

---

** [tickets:#2006] NTFSv: Cluster rebooted with ntfd crashed on both controllers**

**Status:** review
**Milestone:** 5.1.RC1
**Created:** Wed Sep 07, 2016 06:42 AM UTC by Chani Srivastava
**Last Updated:** Thu Sep 08, 2016 06:57 AM UTC
**Owner:** Vu Minh Nguyen
**Attachments:**

- [NtfCrash.zip](https://sourceforge.net/p/opensaf/tickets/2006/attachment/NtfCrash.zip) (165.9 kB; application/zip)

OS : Suse PPC 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 4 nodes ( 2 controllers and 2 payloads with headless feature disabled & no PBE )

Ntfd traces and syslog for both controllers attached

* Ntf Application is running on system
* Will update ticket with core dump

Note: The timings on system are not synced. After every reboot node timings are modified

BT:

0 0x0fffa0848100 in .raise () from /lib64/libc.so.6
1 0x0fffa0849d10 in .abort () from /lib64/libc.so.6
2 0x0fffa0e34234 in osaf_abort (i_cause=7) at osaf_utility.c:27
3 0x1001a2f8 in NtfLogger::logNotification (this=0x100ba768, notif= std::tr1::shared_ptr (count 2, weak 0) 0x100b88f0) at NtfLogger.cc:247
4 0x10019e60 in NtfLogger::checkQueueAndLog (this=0x100ba768, newNotif=std::tr1::shared_ptr (count 2, weak 0) 0x100b88f0) at NtfLogger.cc:181
5 0x10019a74 in NtfLogger::log (this=0x100ba768, notif=std::tr1::shared_ptr (count 2, weak 0) 0x100b88f0, isLocal=true) at NtfLogger.cc:137
6 0x1002b528 in NtfAdmin::processNotification (this=0x100ba760, clientId=62, notificationType=SA_NTF_TYPE_ALARM, sendNotInfo=0x100b8800, mdsCtxt=0x100bb3dc, notificationId=47) at NtfAdmin.cc:203
7 0x1002b938 in NtfAdmin::notificationReceived (this=0x100ba760, clientId=62, notificationType=SA_NTF_TYPE_ALARM, sendNotInfo=0x100b8800, mdsCtxt=0x100bb3dc) at NtfAdmin.cc:257
8 0x1002ec20 in notificationReceived (clientId=62, notificationType=SA_NTF_TYPE_ALARM, sendNotInfo=0x100b8800, mdsCtxt=0x100bb3dc) at NtfAdmin.cc:1012
9 0x10006410 in proc_send_not_msg (cb=0x10073190 <_ntfs_cb>, evt=0x100bb3d0) at ntfs_evt.c:447
10 0x10006b28 in process_api_evt (evt=0x100bb3d0) at ntfs_evt.c:628
11 0x10006c38 in ntfs_process_mbx (mbx=0x10073190 <_ntfs_cb>) at ntfs_evt.c:660
12 0x1000b6f4 in main (argc=2, argv=0xfffc056a7f8) at ntfs_main.c:399

Active Controller:

May 26 19:41:16 linux-pvra osafntfd[24205]: **osaf_abort(7) called from 0x1001a2f8 with errno=11**
May 26 19:41:16 linux-pvra osafamfnd[24243]: NO 'safComp=NTF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
May 26 19:41:16 linux-pvra osafamfnd[24243]: ER safComp=NTF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
May 26 19:41:16 linux-pvra osafamfnd[24243]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
May 26 19:41:16 linux-pvra opensaf_reboot: Rebooting local node; timeout=60
Jun 2 14:11:28 linux-pvra syslog-ng[1639]: syslog-ng starting up; version='2.0.9'

Ntf Trace:

May 26 19:41:16.767426 osafntfd [24205:lga_api.c:1190] TR logBufSize > strlen(logBuf) + 1
May 26 19:41:16.767436 osafntfd [24205:lga_api.c:1320] << saLogWriteLogAsync
Jun 2 14:11:47.153831 osafntfd [2958:ntfs_main.c:0181] >> initialize
Jun 2 14:11:47.175099 osafntfd [2958:ncs_main_pub.c:0220] TR NCS:PROCESS_ID=2958
[tickets] [opensaf:tickets] #2002 CLM : Agent crashed for invalid check in buffer notification parameter
- **status**: unassigned --> assigned
- **assigned_to**: Mathi Naickan

---

** [tickets:#2002] CLM : Agent crashed for invalid check in buffer notification parameter**

**Status:** assigned
**Milestone:** 5.1.RC1
**Created:** Tue Sep 06, 2016 08:15 AM UTC by Srikanth R
**Last Updated:** Tue Sep 06, 2016 08:15 AM UTC
**Owner:** Mathi Naickan

Environment details
--
OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 5 nodes ( 2 controllers and 3 payloads with headless feature disabled & no PBE )
AMF Application : 2N model with SUs mapped on PL-3,PL-4

Steps followed & Observed behaviour
--
-> Call the saClmClusterTrack_4 API with the CURRENT flag and the buffer parameter populated. Here the buffer parameter is populated by allocating sufficient memory for numberOfItems, but the notification array contains garbage values.

The agent crashed with the following back trace when the notification array contains garbage values.

-> #3 0x7f4ccb370c9f in osaf_extended_name_length (name=0x9d5e4e) at osaf_extended_name.c:139
-> #4 0x7f4cca9ff27c in clma_validate_flags_buf_4 (hdl_rec=0x97cbc0, flags=1 '\001', buf=0x97c190) at clma_api.c:183
-> #5 0x7f4ccaa00fe5 in clmaclustertrack (clmHandle=4290772993, flags=1 '\001', buf=0x0, buf_4=0x97c190) at clma_api.c:1032
-> #6 0x7f4ccaa00d40 in saClmClusterTrack_4 (clmHandle=4290772993, flags=1 '\001', buf=0x97c190) at clma_api.c:958

Expected behaviour
--
If the buffer parameter is NULL, CLM shall invoke a callback. If the buffer parameter is not NULL, CLM should check only the value of numberOfItems and evaluate whether the user has allocated sufficient memory. With the #1906 changes, the contents of the notification array are also verified, but only the structure member numberOfItems should be verified.
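The expected behaviour described in #2002 could look roughly like the sketch below. The structs are simplified, hypothetical stand-ins for the CLM notification buffer, and the function name and return codes are invented for the illustration. Treating a NULL `notification` pointer as "the library allocates" is an assumption about the API, not something stated in the ticket.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for one notification entry. */
struct demo_notification {
	char nodeName[256];
};

/* Simplified stand-in for the notification buffer passed by the user. */
struct demo_notification_buffer {
	unsigned int numberOfItems;             /* capacity supplied by the caller */
	struct demo_notification *notification; /* may hold garbage on input */
};

enum demo_rc { DEMO_OK, DEMO_USE_CALLBACK, DEMO_NO_SPACE };

/* Validate the user buffer against the current cluster size. Only
 * numberOfItems and the pointer value are inspected; the entries the
 * pointer refers to are never read. */
enum demo_rc demo_check_track_buffer(const struct demo_notification_buffer *buf,
				     unsigned int cluster_size)
{
	assert(cluster_size > 0);
	if (buf == NULL)
		return DEMO_USE_CALLBACK; /* result is delivered via callback */
	if (buf->notification != NULL && buf->numberOfItems < cluster_size)
		return DEMO_NO_SPACE;     /* caller's array is too small */
	return DEMO_OK;                   /* assumption: NULL array means library allocates */
}
```

Because the array contents are output-only at this point, uninitialised entries cannot cause a crash in the check.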
[tickets] [opensaf:tickets] #2012 clm: inconsistant additionalText and lengthAdditionalText in notification construction
- **status**: accepted --> review

---

** [tickets:#2012] clm: inconsistant additionalText and lengthAdditionalText in notification construction**

**Status:** review
**Milestone:** 5.0.1
**Created:** Thu Sep 08, 2016 11:20 AM UTC by Vu Minh Nguyen
**Last Updated:** Thu Sep 08, 2016 11:20 AM UTC
**Owner:** Vu Minh Nguyen

According to NTF AIS, `additionalText` must be consistent with `lengthAdditionalText`.

In the current code, CLM always sets `lengthAdditionalText` to the hard-coded `ADDITION_TEXT_LENGTH`, regardless of what `additionalText` contains.
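A minimal sketch of keeping the two fields consistent, assuming the length is meant to cover the text plus its terminating NUL. The `demo_ntf_header` struct is a reduced, hypothetical stand-in for the real NTF notification header, and `demo_set_additional_text()` is not an OpenSAF function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced notification header; the real NTF header has
 * more fields. */
struct demo_ntf_header {
	uint16_t lengthAdditionalText;
	char additionalText[256];
};

/* Copy text into the header and derive the length from the text itself
 * (including the terminating NUL) instead of using a fixed constant.
 * Returns 0 on success, -1 if the text does not fit. */
int demo_set_additional_text(struct demo_ntf_header *hdr, const char *text)
{
	assert(hdr != NULL && text != NULL);
	size_t len = strlen(text) + 1;
	if (len > sizeof(hdr->additionalText))
		return -1;
	memcpy(hdr->additionalText, text, len);
	hdr->lengthAdditionalText = (uint16_t)len;
	return 0;
}
```

Setting both fields in one helper means they cannot drift apart when the text changes.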
[tickets] [opensaf:tickets] #2008 AMFND: Coredump while shutting down
I think the deletion of nodeid_mdsdest_db and hctypedb in avnd_last_step_clean() was introduced because of valgrind complaints during "opensafd stop".

If those changes are taken out, valgrind reports memory leaks:

==538== 16 bytes in 1 blocks are definitely lost in loss record 16 of 142
==538==    at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==538==    by 0x42D149: avnd_nodeid_mdsdest_rec_add(avnd_cb_tag*, unsigned long) (proxydb.cc:55)
==538==    by 0x42B793: avnd_evt_mds_avnd_up_evh(avnd_cb_tag*, avnd_evt_tag*) (proxy.cc:52)
==538==    by 0x425F5E: avnd_evt_process (main.cc:625)
==538==    by 0x425F5E: avnd_main_process() (main.cc:576)
==538==    by 0x4058B2: main (main.cc:201)

==538== 1,592 (312 direct, 1,280 indirect) bytes in 13 blocks are definitely lost in loss record 135 of 142
==538==    at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==538==    by 0x423828: hctype_create (hcdb.cc:160)
==538==    by 0x423828: avnd_hctype_config_get(unsigned long long, std::string const&) (hcdb.cc:218)
==538==    by 0x423B45: avnd_hc_config_get(avnd_comp_tag*) (hcdb.cc:119)
==538==    by 0x41989A: avnd_comp_config_get_su(avnd_su_tag*) (compdb.cc:1559)
==538==    by 0x4304FE: avnd_evt_avd_reg_su_evh(avnd_cb_tag*, avnd_evt_tag*) (su.cc:161)
==538==    by 0x425F5E: avnd_evt_process (main.cc:625)
==538==    by 0x425F5E: avnd_main_process() (main.cc:576)
==538==    by 0x4058B2: main (main.cc:201)

---

** [tickets:#2008] AMFND: Coredump while shutting down**

**Status:** assigned
**Milestone:** 5.1.RC1
**Created:** Wed Sep 07, 2016 12:35 PM UTC by Minh Hon Chau
**Last Updated:** Thu Sep 08, 2016 04:32 AM UTC
**Owner:** Minh Hon Chau
**Attachments:**

- [osafamfnd](https://sourceforge.net/p/opensaf/tickets/2008/attachment/osafamfnd) (135.3 kB; application/octet-stream)

During the cluster shutdown phase, if both controllers do not shut down fast enough and the active controller goes down first, an SC failover can happen.
In this situation, avnd_last_step_clean() gets called twice, a coredump is generated It most likely because deleting record in nodeid_mdsdest_db and hctypedb but those container still own the key. Thus, the second call of avnd_last_step_clean() cause coredump BT Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/local/lib/opensaf/osafamfnd --tracemask=0x'. Program terminated with signal SIGABRT, Aborted. 0 0x7f56a225bcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory. Traceback (most recent call last): File "/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19-gdb.py", line 63, in from libstdcxx.v6.printers import register_libstdcxx_printers ImportError: No module named 'libstdcxx' (gdb) bt 0 0x7f56a225bcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 1 0x7f56a225f0d8 in __GI_abort () at abort.c:89 2 0x7f56a2298394 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7f56a23a6b28 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175 3 0x7f56a22a466e in malloc_printerr (ptr=, str=0x7f56a23a2c19 "free(): invalid pointer", action=1) at malloc.c:4996 4 _int_free (av=, p=, have_lock=0) at malloc.c:3840 5 0x0043a616 in _M_dispose (__a=..., this=) at /usr/include/c++/4.8/bits/basic_string.h:249 6 ~basic_string (this=0x1d5fa70, __in_chrg=) at /usr/include/c++/4.8/bits/basic_string.h:539 7 ~avnd_hctype_tag (this=0x1d5fa70, __in_chrg=) at ../../../../../osaf/services/saf/amf/amfnd/include/avnd_hc.h:46 8 avnd_last_step_clean (cb=cb@entry=0x665940 <_avnd_cb>) at term.cc:101 9 0x00436ee1 in avnd_su_si_oper_done (cb=cb@entry=0x665940 <_avnd_cb>, su=0x1d5d000, si=si@entry=0x0) at susm.cc:1169 10 0x00416629 in avnd_comp_csi_assign_done (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260, csi=csi@entry=0x0) at comp.cc:1642 11 0x00416a6e in 
avnd_comp_cmplete_all_assignment (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260) at comp.cc:2567 12 0x0040bb9b in avnd_comp_clc_terming_cleansucc_hdler (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260) at clc.cc:2328 13 0x0040f6ba in avnd_comp_clc_fsm_run (cb=cb@entry=0x665940 <_avnd_cb>, comp=comp@entry=0x1d63260, ev=AVND_COMP_CLC_PRES_FSM_EV_CLEANUP_SUCC) at clc.cc:876 14 0x0040ffca in avnd_evt_clc_resp_evh (cb=0x665940 <_avnd_cb>, evt=0x7f568c0008c0) at clc.cc:414 15 0x00425f5f in avnd_evt_process (evt=0x7f568c0008c0) at main.cc:625 16 avnd_main_process () at main.cc:576 --- Sent from sourceforge.net because
[tickets] [opensaf:tickets] #2012 clm: inconsistant additionalText and lengthAdditionalText in notification construction
- **summary**: clm: inconsistant additionalText and lengthAdditionalText in construct notification --> clm: inconsistant additionalText and lengthAdditionalText in notification construction

---

** [tickets:#2012] clm: inconsistant additionalText and lengthAdditionalText in notification construction**

**Status:** accepted
**Milestone:** 5.0.1
**Created:** Thu Sep 08, 2016 11:20 AM UTC by Vu Minh Nguyen
**Last Updated:** Thu Sep 08, 2016 11:20 AM UTC
**Owner:** Vu Minh Nguyen

According to NTF AIS, `additionalText` must be consistent with `lengthAdditionalText`.

In the current code, CLM always sets `lengthAdditionalText` to the hard-coded `ADDITION_TEXT_LENGTH`, regardless of what `additionalText` contains.
[tickets] [opensaf:tickets] #2012 clm: inconsistant additionalText and lengthAdditionalText in construct notification
---

** [tickets:#2012] clm: inconsistant additionalText and lengthAdditionalText in construct notification **

**Status:** accepted
**Milestone:** 5.0.1
**Created:** Thu Sep 08, 2016 11:20 AM UTC by Vu Minh Nguyen
**Last Updated:** Thu Sep 08, 2016 11:20 AM UTC
**Owner:** Vu Minh Nguyen

According to NTF AIS, `additionalText` must be consistent with `lengthAdditionalText`.

In the current code, CLM always sets `lengthAdditionalText` to the hard-coded `ADDITION_TEXT_LENGTH`, regardless of what `additionalText` contains.
[tickets] [opensaf:tickets] #1995 AMF : amfd crashed while dumping AMF state
- **status**: unassigned --> accepted
- **assigned_to**: Praveen
- **Part**: - --> d

---

** [tickets:#1995] AMF : amfd crashed while dumping AMF state**

**Status:** accepted
**Milestone:** 5.1.RC1
**Created:** Fri Sep 02, 2016 08:42 AM UTC by Srikanth R
**Last Updated:** Fri Sep 02, 2016 08:42 AM UTC
**Owner:** Praveen

Changeset : 7997 5.1 FC

AMFD crashed while dumping the AMF state with the following command:

immadm -a @safAmfService2020f -o 99 @safAmfService2020f

Sep 2 12:51:26 CONTROLLER-2 osafamfd[2691]: NO unknown type: @safAmfService2020f
Sep 2 12:51:26 CONTROLLER-2 osafamfd[2691]: imm.cc:648: object_name_to_class_type: Assertion 'false' failed.
Sep 2 12:51:26 CONTROLLER-2 osafamfnd[2701]: WA AMF director unexpectedly crashed
Sep 2 12:51:26 CONTROLLER-2 osafamfnd[2701]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131599, SupervisionTime = 60
[tickets] [opensaf:tickets] #1970 imm: immoitest testsuite 4 fails when CCB takes more than 2 seconds to commit
- **status**: review --> fixed
- **Comment**:

default(5.2) [staging:e5f162]
changeset: 8020:e5f162184bbd
user: Hung Nguyen
date: Fri Aug 26 13:29:26 2016 +0700
summary: imm: Remove the poll timeout in IMM testcases [#1970]

opensaf-5.1.x [staging:68b29a]
changeset: 8021:68b29ac33324
user: Hung Nguyen
date: Fri Aug 26 13:29:26 2016 +0700
summary: imm: Remove the poll timeout in IMM testcases [#1970]

opensaf-5.0.x [staging:2139f3]
changeset: 8022:2139f3e6b37b
user: Hung Nguyen
date: Fri Aug 26 13:29:26 2016 +0700
summary: imm: Remove the poll timeout in IMM testcases [#1970]

opensaf-4.7.x [staging:eef359]
changeset: 8023:eef3593c3597
user: Hung Nguyen
date: Fri Aug 26 13:29:26 2016 +0700
summary: imm: Remove the poll timeout in IMM testcases [#1970]

---

** [tickets:#1970] imm: immoitest testsuite 4 fails when CCB takes more than 2 seconds to commit**

**Status:** fixed
**Milestone:** 4.7.2
**Created:** Thu Aug 25, 2016 06:09 AM UTC by Hung Nguyen
**Last Updated:** Sun Aug 28, 2016 06:25 AM UTC
**Owner:** Hung Nguyen

In classImplementerThreadMain(), poll() was invoked with a timeout of 2 seconds.
The test uses that timeout to stop the thread (i.e. the thread stops when there is no callback within 2 seconds).

But that also causes:

* The testcase fails if PBE takes more than 2 seconds to dump. The while() loop stops after 2 seconds, but it then fails to release the implementer name because the ccb is still active.
* The testcase is slow because it has to wait 2 seconds to stop the thread.

~~~
2016-08-02 21:45:43 SC-2 osafimmnd[437]: NO Create of class TestClassConfig is PERSISTENT.
2016-08-02 21:45:43 SC-2 osafimmnd[437]: NO Create of class TestClassRuntime is PERSISTENT.
2016-08-02 21:45:43 SC-2 osafimmnd[437]: NO Ccb 4925 COMMITTED (startup)
2016-08-02 21:45:43 SC-2 osafimmnd[437]: NO Ccb 4926 COMMITTED (om_setup)
2016-08-02 21:45:43 SC-2 osafimmnd[437]: NO Ccb 4927 COMMITTED (om_setup)
2016-08-02 21:45:43 SC-2 osafimmnd[437]: NO Implementer connected: 312 (classImplementerThreadMain) <1170, 2020f>
2016-08-02 21:45:43 SC-2 osafimmnd[437]: NO implementer for class 'TestClassConfig' is classImplementerThreadMain => class extent is safe.
2016-08-02 21:45:46 SC-2 osafimmnd[437]: NO ERR_BUSY: ccb 4928 is active on object Obj1,rdn=root of class TestClassConfig. Can not release class implementer
2016-08-02 21:45:46 SC-2 osafimmnd[437]: NO Implementer locally disconnected. Marking it as doomed 312 <1170, 2020f> (classImplementerThreadMain)
2016-08-02 21:45:46 SC-2 osafimmnd[437]: WA CCB 4928 is in critical state, can not abort
2016-08-02 21:45:46 SC-2 osafimmnd[437]: WA Will not terminate ccb 4928 in critical state
~~~
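The changeset summary for #1970 only says that the poll timeout was removed; how the test thread is stopped afterwards is not shown here. One common pattern, given purely as an assumption, is to poll without a timeout on both the callback descriptor and a dedicated "stop" pipe. All names below are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

enum demo_wake { DEMO_WAKE_CALLBACK, DEMO_WAKE_STOP, DEMO_WAKE_ERROR };

/* Block until either the callback fd or the stop fd becomes readable.
 * No timeout is used, so a slow commit cannot end the wait early.
 * A pending callback takes priority over a stop request so that
 * outstanding work is handled before the thread exits. */
enum demo_wake demo_wait(int callback_fd, int stop_fd)
{
	assert(callback_fd >= 0 && stop_fd >= 0);
	struct pollfd fds[2] = {
		{ .fd = callback_fd, .events = POLLIN },
		{ .fd = stop_fd, .events = POLLIN },
	};

	if (poll(fds, 2, -1) < 0)
		return DEMO_WAKE_ERROR; /* includes EINTR; a caller may retry */
	if (fds[0].revents & POLLIN)
		return DEMO_WAKE_CALLBACK;
	if (fds[1].revents & POLLIN)
		return DEMO_WAKE_STOP;
	return DEMO_WAKE_ERROR;
}

/* Ask the waiting thread to stop by writing one byte to the stop pipe. */
int demo_request_stop(int stop_write_fd)
{
	char byte = 0;
	return write(stop_write_fd, &byte, 1) == 1 ? 0 : -1;
}
```

The thread loop would dispatch callbacks while `demo_wait()` returns `DEMO_WAKE_CALLBACK` and leave the loop on `DEMO_WAKE_STOP`, which removes both the 2 second delay and the dependency on commit time.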
[tickets] [opensaf:tickets] #2003 amf: SG unstable when SU moves to TERM_FAILED state during fresh assignments.
- **status**: review --> fixed
- **Comment**:

changeset: 8016:77a9f5df113f
branch: opensaf-4.7.x
user: praveen.malv...@oracle.com
date: Thu Sep 08 14:13:52 2016 +0530
summary: amfnd: send recovery request to amfd for term-failed su [#2003]

changeset: 8017:f15bc3868b81
branch: opensaf-5.0.x
parent: 8014:0b491ef33bb8
user: praveen.malv...@oracle.com
date: Thu Sep 08 14:14:25 2016 +0530
summary: amfnd: send recovery request to amfd for term-failed su [#2003]

changeset: 8018:466142dde156
branch: opensaf-5.1.x
parent: 8013:9acf7c9aecab
user: praveen.malv...@oracle.com
date: Thu Sep 08 14:14:49 2016 +0530
summary: amfnd: send recovery request to amfd for term-failed su [#2003]

changeset: 8019:21bf64e1130a
tag: tip
parent: 8012:46edfce1d524
user: praveen.malv...@oracle.com
date: Thu Sep 08 14:14:59 2016 +0530
summary: amfnd: send recovery request to amfd for term-failed su [#2003]

---

** [tickets:#2003] amf: SG unstable when SU moves to TERM_FAILED state during fresh assignments.**

**Status:** fixed
**Milestone:** 4.7.2
**Created:** Tue Sep 06, 2016 08:31 AM UTC by Praveen
**Last Updated:** Tue Sep 06, 2016 09:30 AM UTC
**Owner:** Praveen
**Attachments:**

- [term_failed.tgz](https://sourceforge.net/p/opensaf/tickets/2003/attachment/term_failed.tgz) (30.1 kB; application/x-compressed)

Conf: 2N model, one NPI comp in NPI SU.

Steps to reproduce:

1) Add the application using the immcfg command.
2) Lock the SG.
3) Unlock-in and unlock the SUs.
4) Make provisions so that the instantiation and cleanup scripts return a non-zero status.
5) Unlock the SG.

When the SG is unlocked, AMFND initiates active assignments by instantiating the only component. After the instantiation failure, AMFND tries to clean up the component. Cleanup fails. AMFND marks the comp and the SU as TERM_FAILED, but it neither responds to AMFD for the completion of the assignment nor sends any recovery request. Because of this the SG remains unstable in REALIGN state. In this state, no admin operation is allowed.

Traces are attached.

Even though the issue seems similar to #538, it is different in one aspect. In #538, the SU moves to TERM_FAILED state and there is a possibility of failover/switchover because standby assignments are present. In the present case, it happened during initial assignments and thus there is no standby to switch over or fail over to.
[tickets] [opensaf:tickets] #1973 imm: IMM test returns zero even when it fails
- **status**: review --> fixed
- **Comment**:

default(5.2) [staging:46edfc]
changeset: 8012:46edfce1d524
user: Hung Nguyen
date: Fri Aug 26 17:43:01 2016 +0700
summary: imm: Remove pthread_exit from IMM test [#1973]

opensaf-5.1.x [staging:ba9a42]
changeset: 8013:9acf7c9aecab
user: Hung Nguyen
date: Fri Aug 26 17:43:01 2016 +0700
summary: imm: Remove pthread_exit from IMM test [#1973]

opensaf-5.0.x [staging:0b491e]
changeset: 8014:0b491ef33bb8
user: Hung Nguyen
date: Fri Aug 26 17:43:01 2016 +0700
summary: imm: Remove pthread_exit from IMM test [#1973]

opensaf-4.7.x [staging:]
changeset: 8015:a2728b93c7c0
user: Hung Nguyen
date: Fri Aug 26 17:43:01 2016 +0700
summary: imm: Remove pthread_exit from IMM test [#1973]

---

** [tickets:#1973] imm: IMM test returns zero even when it fails**

**Status:** fixed
**Milestone:** 4.7.2
**Created:** Fri Aug 26, 2016 08:15 AM UTC by Hung Nguyen
**Last Updated:** Sun Aug 28, 2016 06:25 AM UTC
**Owner:** Hung Nguyen

Snippet from main() in immtest.c

~~~
int main(int argc, char **argv)
{
	...
	/* Added pthread_exit() to remove dlopen@@GLIBC leak from valgrind */
	pthread_exit(NULL);
	return rc;
}
~~~

pthread_exit() should be removed because it makes the test exit before 'return rc'.
I tried running valgrind without pthread_exit(); it did not complain about dlopen.
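The effect described in #1973 can be reproduced in isolation: when the last thread of a process calls pthread_exit(), the process exits with status 0, so a non-zero rc is lost. The sketch below is not the immtest code; `demo_test_main()` is a hypothetical stand-in for the test's main() and is run in a forked child so that its exit status can be observed.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the test's main(): either calls
 * pthread_exit() before returning (old behaviour) or returns rc. */
static int demo_test_main(int rc, int call_pthread_exit)
{
	if (call_pthread_exit)
		pthread_exit(NULL); /* never returns; last thread exits with status 0 */
	return rc;
}

/* Run demo_test_main() in a child process and return the child's exit
 * status, or -1 if the child could not be run or did not exit normally. */
int demo_exit_status(int rc, int call_pthread_exit)
{
	assert(rc >= 0 && rc <= 255);
	int status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(demo_test_main(rc, call_pthread_exit));
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `demo_exit_status(1, 1)` reports 0 even though rc is 1, while `demo_exit_status(1, 0)` reports 1, which matches the reasoning for removing the call.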
[tickets] [opensaf:tickets] #2011 ckptd seg faulted on active controller when trying to create checkpoint
---

** [tickets:#2011] ckptd seg faulted on active controller when trying to create checkpoint**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 07:28 AM UTC by Ritu Raj
**Last Updated:** Thu Sep 08, 2016 07:28 AM UTC
**Owner:** nobody
**Attachments:**

- [ckptd_bt](https://sourceforge.net/p/opensaf/tickets/2011/attachment/ckptd_bt) (2.6 kB; application/octet-stream)
- [messages-20160907.bz2](https://sourceforge.net/p/opensaf/tickets/2011/attachment/messages-20160907.bz2) (380.1 kB; application/x-bzip)
- [syslog2](https://sourceforge.net/p/opensaf/tickets/2011/attachment/syslog2) (1.4 MB; application/octet-stream)

Environment details

OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 4 nodes ( 2 controllers and 2 payloads with headless feature disabled & 1PBE enabled with 30K objects )

Summary : ckptd crashed on the active controller when trying to create a checkpoint during failover

Steps followed & Observed behaviour

1. Initially ran some CKPT test scenarios, along with failovers. After the end of the test scenarios, the following IMM objects & replicas were not deleted:

sofo-s3:/dev/shm # immfind | grep 101
safCkpt=all_replicas_ckpt_name_101
safCkpt=collocated_ckpt_name_101
safReplica=safNode=PL-3\,safCluster=myClmCluster,safCkpt=all_replicas_ckpt_name_101
safReplica=safNode=PL-3\,safCluster=myClmCluster,safCkpt=collocated_ckpt_name_101
safReplica=safNode=SC-1\,safCluster=myClmCluster,safCkpt=all_replicas_ckpt_name_101
safReplica=safNode=SC-2\,safCluster=myClmCluster,safCkpt=all_replicas_ckpt_name_101

2. When a checkpoint was created with the earlier name (all_replicas_ckpt_name_101), the following error (14, i.e. SA_AIS_ERR_EXIST) was observed in syslog. CkptOpen also failed with ERR_LIBRARY.

>> saImmOiRtObjectCreate_2 failed with error = 14
>> Sep 7 17:21:11 sofo-s2 osafimmnd[2137]: NO PBE-OI established on this SC. Dumping incrementally to file imm.db
Sep 7 17:21:12 sofo-s2 osafckptd[2284]: ER create_runtime_ckpt_object - saImmOiRtObjectCreate_2 failed with error = 14
Sep 7 17:21:12 sofo-s2 osafckptd[2284]: ER create runtime ckpt object failed with error: 14
Sep 7 17:21:12 sofo-s2 osafckptd[2284]: ER cpd db add ckpt_node failed for ckpt_id:2

4. After some time ckptd seg faulted on the active controller

>> Sep 7 17:21:43 sofo-s2 osafamfnd[2187]: NO 'safComp=CPD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 7 17:21:43 sofo-s2 osafamfnd[2187]: ER safComp=CPD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 7 17:21:43 sofo-s2 osafamfnd[2187]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Sep 7 17:21:43 sofo-s2 opensaf_reboot: Rebooting local node; timeout=60

5. Below is the bt

0- 0x7fbbd5ffcb20 in memcmp () from /lib64/libc.so.6
1- 0x7fbbd7a10929 in ncs_patricia_tree_get (pTree=0x67b4c8, pKey=0x7d22531c "\017\001\002") at patricia.c:435
2- 0x0040800d in cpd_cpnd_info_node_get (cpnd_tree=0x67b4c8, dest=0x67ec60, cpnd_info_node=0x7d225350) at cpd_db.c:706
3- 0x0040cd56 in cpd_evt_proc_mds_evt (cb=0x67b340, evt=0x67ec50) at cpd_evt.c:1378
4- 0x004091cb in cpd_process_evt (evt=0x67ec40) at cpd_evt.c:107
5- 0x0041185f in cpd_main_process (cb=0x67b340) at cpd_init.c:661
6- 0x00411b89 in main (argc=1, argv=0x7d225578) at cpd_main.c:74

Notes:

1. Syslog attached
2. bt attached
3. ckptd traces not enabled

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

--
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
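The syslog excerpts in ticket #2011 report only numeric return codes (`error = 14`). As a reading aid, the sketch below maps a handful of `SaAisErrorT` values to their symbolic names. The function name `ais_err_name` is hypothetical and not part of OpenSAF; the numeric values are the ones defined by the SA Forum AIS headers, and only a few codes are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenSAF API): translate a few SaAisErrorT
 * values, as defined in the SA Forum AIS headers, to their names. */
const char *ais_err_name(int rc)
{
    switch (rc) {
    case 1:  return "SA_AIS_OK";
    case 2:  return "SA_AIS_ERR_LIBRARY";
    case 6:  return "SA_AIS_ERR_TRY_AGAIN";
    case 7:  return "SA_AIS_ERR_INVALID_PARAM";
    case 12: return "SA_AIS_ERR_NOT_EXIST";
    case 14: return "SA_AIS_ERR_EXIST";
    case 18: return "SA_AIS_ERR_NO_RESOURCES";
    case 21: return "SA_AIS_ERR_FAILED_OPERATION";
    default: return NULL; /* code not covered by this sketch */
    }
}
```

With this mapping, the `saImmOiRtObjectCreate_2 failed with error = 14` line reads as SA_AIS_ERR_EXIST, which fits the stale `all_replicas_ckpt_name_101` objects listed in step 1.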
[tickets] [opensaf:tickets] #2010 IMM: library receives wrong response when a ccb is aborted
- Description has changed:

Diff:

    --- old
    +++ new
    @@ -3,7 +3,7 @@
     In some cases the client is not in a sync call (i.e. not waiting for response) but IMMND still sends that response to the client. One example is when the OI attaches/deattaches.
     That may cause the client to receive unexpected response if the client at that time calls an sync IMM api.
     Details of the problem is explained here
    -[Click me !!!](http://sequencediagram.org/index.html?initialData=A4QwTgLglgxloDsIAICSBZdBBAUKSs8ISamAcgCJ7jRyIobpU5gBGA9gB7LsBuApmFJMAXFmQwYrALQgOkZFADOyAOb8EgkBH4ATADoIA7gAsNyAPKpk2iCBgmoCVQHpd-W-cfOcORhWkAYlUwfg0APkZKEQoAJkoAfSwAIQsAJQAVBIBhbOTkAApAgEYASj9MLCDWABsAV35I8goxeIocvKSABS6AGQBNQsDY8pYObj5BYWioimRgMHYYfiUlFYkpWXkUAFslVWQAM0Wd4QpDYl1kNYRdFVClYHYENeQIdh4wKFUnbScDhDsdyGNb8RQ7HboIH8GoJSSsLDbAqjWZBUK6JrYESUWJYBKMBIAUTSaXSCViyCMUAgJmQADEsKheoT2hYusSsBlUBYyEMAMylQwAKhFGUcKmUbzMEmeawAjg0EMseIdkCVfGwuDwBEJGFgRHrkKFllABCoaWCHk8XmDLlKnABrc0mbSUkDOy0ra1rQyHdhCYYAGmQrDqKHsEDqIBqNQAnooIAByFQAKzqSnDMptCo0yvYqpKhkMhuN-FN6wtlMWziNXtlYNDKF0UF0CETKBgYHdtLMoUMrH4MBA6bBFh2uQRwGAceRNkk-GAEBUOLxBOJpLS5I1421Uz1IjH2SkWCnM9KtcjYBe9MZzNZXUMJ+nsD+zyz0AQDRUVJplh2WF0HYnAsIxNDAOlfhqKAAC9+GRXw9WqepGlmVpEiwCh0AsBI6VQMgsF6VAAC1CSGAAWUZNQmHVphaWZkEBIxrg0O4pU9R56yOf01ViBDmjRPRMX1Fd8UwIkSTJCkf1pBkmRZBI2Q5LkeX5QUEBFIUxUlSVKytTi-QDXixi1SZdUqA1KhsVQQCcWsTTNGxAQtIQjGrA49JtIsEFQDsuyUMxnR0qAdgbQdh1eMcAKAhAQLAiCEGjGC4LU3R2BWNtw3nRdkBEtcJM3XigA)
    +http://sequencediagram.org/index.html?initialData=A4QwTgLglgxloDsIAICSBZdBBAUKSs8ISamAcgCJ7jRyIobpU5gBGA9gB7LsBuApmFJMAXFmQwYrALQgOkZFADOyAOb8EgkBH4ATADoIA7gAsNyAPKpk2iCBgmoCVQHpd-W-cfOcORhWkAYlUwfg0APkZKEQoAJkoAfSwAIQsAJQAVBIBhbOTkAApAgEYASj9MLCDWABsAV35I8goxeIocvKSABS6AGQBNQsDY8pYObj5BYWioimRgMHYYfiUlFYkpWXkUAFslVWQAM0Wd4QpDYl1kNYRdFVClYHYENeQIdh4wKFUnbScDhDsdyGNb8RQ7HboIH8GoJSSsLDbAqjWZBUK6JrYESUWJYBKMBIAUTSaXSCViyCMUAgJmQADEsKheoT2hYusSsBlUBYyEMAMylQwAKhFGUcKmUbzMEmeawAjg0EMseIdkCVfGwuDwBEJGFgRHrkKFllABCoaWCHk8XmDLlKnABrc0mbSUkDOy0ra1rQyHdhCYYAGmQrDqKHsEDqIBqNQAnooIAByFQAKzqSnDMptCo0yvYqpKhkMhuN-FN6wtlMWziNXtlYNDKF0UF0CETKBgYHdtLMoUMrH4MBA6bBFh2uQRwGAceRNkk-GAEBUOLxBOJpLS5I1421Uz1IjH2SkWCnM9KtcjYBe9MZzNZXUMJ+nsD+zyz0AQDRUVJplh2WF0HYnAsIxNDAOlfhqKAAC9+GRXw9WqepGlmVpEiwCh0AsBI6VQMgsF6VAAC1CSGAAWUZNQmHVphaWZkEBIxrg0O4pU9R56yOf01ViBDmjRPRMX1Fd8UwIkSTJCkf1pBkmRZBI2Q5LkeX5QUEBFIUxUlSVKytTi-QDXixi1SZdUqA1KhsVQQCcWsTTNGxAQtIQjGrA49JtIsEFQDsuyUMxnR0qAdgbQdh1eMcAKAhAQLAiCEGjGC4LU3R2BWNtw3nRdkBEtcJM3XigA

     ~~~
     09:45:58 SC-2-2 osafimmnd[3918]: NO ERR_TRY_AGAIN: ccb 1266 is active on object CmwSwMswMId=1 of class CmwSwMSwM. Can not add class implementer

---

** [tickets:#2010] IMM: library receives wrong response when a ccb is aborted**

**Status:** accepted
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 07:10 AM UTC by Hung Nguyen
**Last Updated:** Thu Sep 08, 2016 07:15 AM UTC
**Owner:** Hung Nguyen
**Attachments:**

- [logs.7z](https://sourceforge.net/p/opensaf/tickets/2010/attachment/logs.7z) (6.8 MB; application/octet-stream)
[tickets] [opensaf:tickets] #2010 IMM: library receives wrong response when a ccb is aborted
- Description has changed:

Diff:

    --- old
    +++ new
    @@ -3,7 +3,7 @@
     In some cases the client is not in a sync call (i.e. not waiting for response) but IMMND still sends that response to the client. One example is when the OI attaches/deattaches.
     That may cause the client to receive unexpected response if the client at that time calls an sync IMM api.
     Details of the problem is explained here
    -[Click me !!!](http://sequencediagram.org/index.html?initialData=FABwhgTgLglgxjcA7KACAkgWUwQVJWBZNLTAOQBF9p5EwUNsrgIAjAewA9V2A3AUwiNMFAFw5UcOKwC0YDtFQwAzqgDm-JILBR+AEwA6SAO4ALTagDy6VDqhg4pmEjUB6PfzsOnL4MFIUMgDEahD8mgB8pJSiFABMlAD6OABClgBKACqJAMI5KagAFEEAjACU-tg4wawANgCu-FHYMTgJFLn5yQAK3QAyAJpFQXEVLBzcfILCMdEUqCAQ7HD8ysqrktJyCmgAtspqqABmS7vCFEb0eqjrSHqqYcog7EjrqFDsPBAwas46zockOwPEZ1vwlLtdphgfxaokpKwcDtCmM5sEwnpmrhRJQ4jhEqREgBRdLpDKJOKoYwwKCmVAAMRw6D6RI6lm6JJwmXQljIwwAzGUjAAqUWZJyqFTvcySF7rACOjSQKx4R1QpT8bC4PAEQlIOFE+tQYRWMAEqlp4Mez1e4Ku0ucAGsLaYdFSwC6rasbesjEd2EIRgAaVCsepoBxQepgWq1ACeSigAHJVAArerKCOy22KzQq9hq0pGIxGk38M0bS1UpYuY3euXgsNoPQwPRIJNoOAQD108xhIysfhwMAZ8GWXZ5REgEDxlG2KT8EBQVS4-GEklk9IUzUTHXTfWicc5aQ4aezsp1qMQV4MpkstndIynmfwf4vbOwJCNVTU2lWXY4HouzOJYxhaBA9J-LUMAAF78Cifj6jUDRNHM4jtMkFCYJYiT0ugZA4H06AAFpEsMAAsYxapMuozGIcyoECxg3Jo9zSl6TwNscAbqnEiEtIEQQYliBqrgS2DEqS5KUr+dKMsyrKJOynLcryApCkgorCuKUpSlW1pcf6gZ8eM2pTHqVSGlUthqGAzh1qa5q2EClpCMYNaHAZtrFkg6Cdt2yjmC6ekwLsjZDiObzjoBwFIKB4GQUgMawfBGl6OwqzthGC5LqgYnrlJW58UAA)
    +[Click me !!!](http://sequencediagram.org/index.html?initialData=A4QwTgLglgxloDsIAICSBZdBBAUKSs8ISamAcgCJ7jRyIobpU5gBGA9gB7LsBuApmFJMAXFmQwYrALQgOkZFADOyAOb8EgkBH4ATADoIA7gAsNyAPKpk2iCBgmoCVQHpd-W-cfOcORhWkAYlUwfg0APkZKEQoAJkoAfSwAIQsAJQAVBIBhbOTkAApAgEYASj9MLCDWABsAV35I8goxeIocvKSABS6AGQBNQsDY8pYObj5BYWioimRgMHYYfiUlFYkpWXkUAFslVWQAM0Wd4QpDYl1kNYRdFVClYHYENeQIdh4wKFUnbScDhDsdyGNb8RQ7HboIH8GoJSSsLDbAqjWZBUK6JrYESUWJYBKMBIAUTSaXSCViyCMUAgJmQADEsKheoT2hYusSsBlUBYyEMAMylQwAKhFGUcKmUbzMEmeawAjg0EMseIdkCVfGwuDwBEJGFgRHrkKFllABCoaWCHk8XmDLlKnABrc0mbSUkDOy0ra1rQyHdhCYYAGmQrDqKHsEDqIBqNQAnooIAByFQAKzqSnDMptCo0yvYqpKhkMhuN-FN6wtlMWziNXtlYNDKF0UF0CETKBgYHdtLMoUMrH4MBA6bBFh2uQRwGAceRNkk-GAEBUOLxBOJpLS5I1421Uz1IjH2SkWCnM9KtcjYBe9MZzNZXUMJ+nsD+zyz0AQDRUVJplh2WF0HYnAsIxNDAOlfhqKAAC9+GRXw9WqepGlmVpEiwCh0AsBI6VQMgsF6VAAC1CSGAAWUZNQmHVphaWZkEBIxrg0O4pU9R56yOf01ViBDmjRPRMX1Fd8UwIkSTJCkf1pBkmRZBI2Q5LkeX5QUEBFIUxUlSVKytTi-QDXixi1SZdUqA1KhsVQQCcWsTTNGxAQtIQjGrA49JtIsEFQDsuyUMxnR0qAdgbQdh1eMcAKAhAQLAiCEGjGC4LU3R2BWNtw3nRdkBEtcJM3XigA)

     ~~~
     09:45:58 SC-2-2 osafimmnd[3918]: NO ERR_TRY_AGAIN: ccb 1266 is active on object CmwSwMswMId=1 of class CmwSwMSwM. Can not add class implementer

---

** [tickets:#2010] IMM: library receives wrong response when a ccb is aborted**

**Status:** accepted
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 07:10 AM UTC by Hung Nguyen
**Last Updated:** Thu Sep 08, 2016 07:10 AM UTC
**Owner:** Hung Nguyen
**Attachments:**

- [logs.7z](https://sourceforge.net/p/opensaf/tickets/2010/attachment/logs.7z) (6.8 MB; application/octet-stream)
[tickets] [opensaf:tickets] #2010 IMM: library receives wrong response when a ccb is aborted
---

** [tickets:#2010] IMM: library receives wrong response when a ccb is aborted**

**Status:** accepted
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 07:10 AM UTC by Hung Nguyen
**Last Updated:** Thu Sep 08, 2016 07:10 AM UTC
**Owner:** Hung Nguyen
**Attachments:**

- [logs.7z](https://sourceforge.net/p/opensaf/tickets/2010/attachment/logs.7z) (6.8 MB; application/octet-stream)

When receiving the ccb abort message (D2ND_ABORT_CCB) over fevs, IMMND will abort the message and send a response to the client if it is the originating node. See immnd_evt_proc_ccb_finalize().

In some cases the client is not in a sync call (i.e. not waiting for a response) but IMMND still sends that response to the client. One example is when the OI attaches/detaches.
That may cause the client to receive an unexpected response if it calls a sync IMM API at that time.
Details of the problem are explained here
[Click me !!!](http://sequencediagram.org/index.html?initialData=FABwhgTgLglgxjcA7KACAkgWUwQVJWBZNLTAOQBF9p5EwUNsrgIAjAewA9V2A3AUwiNMFAFw5UcOKwC0YDtFQwAzqgDm-JILBR+AEwA6SAO4ALTagDy6VDqhg4pmEjUB6PfzsOnL4MFIUMgDEahD8mgB8pJSiFABMlAD6OABClgBKACqJAMI5KagAFEEAjACU-tg4wawANgCu-FHYMTgJFLn5yQAK3QAyAJpFQXEVLBzcfILCMdEUqCAQ7HD8ysqrktJyCmgAtspqqABmS7vCFEb0eqjrSHqqYcog7EjrqFDsPBAwas46zockOwPEZ1vwlLtdphgfxaokpKwcDtCmM5sEwnpmrhRJQ4jhEqREgBRdLpDKJOKoYwwKCmVAAMRw6D6RI6lm6JJwmXQljIwwAzGUjAAqUWZJyqFTvcySF7rACOjSQKx4R1QpT8bC4PAEQlIOFE+tQYRWMAEqlp4Mez1e4Ku0ucAGsLaYdFSwC6rasbesjEd2EIRgAaVCsepoBxQepgWq1ACeSigAHJVAArerKCOy22KzQq9hq0pGIxGk38M0bS1UpYuY3euXgsNoPQwPRIJNoOAQD108xhIysfhwMAZ8GWXZ5REgEDxlG2KT8EBQVS4-GEklk9IUzUTHXTfWicc5aQ4aezsp1qMQV4MpkstndIynmfwf4vbOwJCNVTU2lWXY4HouzOJYxhaBA9J-LUMAAF78Cifj6jUDRNHM4jtMkFCYJYiT0ugZA4H06AAFpEsMAAsYxapMuozGIcyoECxg3Jo9zSl6TwNscAbqnEiEtIEQQYliBqrgS2DEqS5KUr+dKMsyrKJOynLcryApCkgorCuKUpSlW1pcf6gZ8eM2pTHqVSGlUthqGAzh1qa5q2EClpCMYNaHAZtrFkg6Cdt2yjmC6ekwLsjZDiObzjoBwFIKB4GQUgMawfBGl6OwqzthGC5LqgYnrlJW58UAA)

~~~
09:45:58 SC-2-2 osafimmnd[3918]: NO ERR_TRY_AGAIN: ccb 1266 is active on object CmwSwMswMId=1 of class CmwSwMSwM. Can not add class implementer
09:45:58 SC-2-2 osafimmnd[3918]: NO Trying to abort ccb 1266 to allow implementer CoreMwSwM to protect class CmwSwMSwM
09:45:58 SC-2-2 osafimmnd[3918]: NO implementer for class 'CmwIspConfig' is CmwIsp => class extent is safe.
09:45:58 SC-2-2 osafimmnd[3918]: NO Implementer disconnected 169 <0, 2010f> (@ClusMonEE)
09:45:58 SC-2-2 osafimmnd[3918]: NO Ccb 1266 ABORTED (CoreMwEcimSwMBackgroundThread)
09:45:58 SC-2-2 ecimswm: ImmUtils::doImmOperations:saImmOmCcbApply failed SaAisErrorT=21
09:45:58 SC-2-2 ecimswm: EcimSwmAsyncImmOperation::main() failed with rc = 21(SA_AIS_ERR_FAILED_OPERATION)
09:45:58 SC-2-2 ecimswm: imma_om_api.c:8769: saImmOmAdminOwnerFinalize: Assertion 'out_evt->info.imma.type == IMMA_EVT_ND2A_IMM_ERROR' failed.
09:45:58 SC-2-2 osafimmnd[3918]: NO Implementer connected: 173 (ClusMonEE) <0, 2010f>
09:45:58 SC-2-2 osafimmnd[3918]: WA >>s_info->to_svc == 0<< reply context destroyed before this reply could be made
09:45:58 SC-2-2 osafimmnd[3918]: WA Failed to send response to agent/client over MDS
~~~

Attached are the syslog and IMM traces.
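The condition described in ticket #2010 (a reply sent to a client that is not blocked in a sync call) can be illustrated with a small guard. This is only a sketch of the idea, not the fix adopted in IMMND: the `client_t` structure, its fields and `should_reply_on_ccb_abort` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-client bookkeeping; real IMMND state differs. */
typedef struct {
    unsigned conn;         /* connection id of the client */
    bool in_sync_call;     /* true while the client blocks on a reply */
    unsigned pending_ccb;  /* ccb the blocked call belongs to */
} client_t;

/* Decide whether a ccb-abort reply may be sent to this client.
 * Replying to a client that is not waiting on this ccb risks the
 * reply being consumed by an unrelated sync call made later. */
bool should_reply_on_ccb_abort(const client_t *cl, unsigned aborted_ccb)
{
    assert(cl != NULL);
    return cl->in_sync_call && cl->pending_ccb == aborted_ccb;
}
```

In the log above, the client's next sync call (`saImmOmAdminOwnerFinalize`) asserted on an unexpected reply type, which is the kind of mismatch a guard like this is meant to prevent.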
[tickets] [opensaf:tickets] #2006 NTFSv: Cluster rebooted with ntfd crashed on both controllers
AIS states that `additionalText` and `lengthAdditionalText` must be consistent. A check for this needs to be added: return INVALID_PARAM if there is a mismatch.

---

** [tickets:#2006] NTFSv: Cluster rebooted with ntfd crashed on both controllers**

**Status:** accepted
**Milestone:** 5.1.RC1
**Created:** Wed Sep 07, 2016 06:42 AM UTC by Chani Srivastava
**Last Updated:** Wed Sep 07, 2016 08:22 AM UTC
**Owner:** Vu Minh Nguyen
**Attachments:**

- [NtfCrash.zip](https://sourceforge.net/p/opensaf/tickets/2006/attachment/NtfCrash.zip) (165.9 kB; application/zip)

OS : Suse PPC 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 4 nodes ( 2 controllers and 2 payloads with headless feature disabled & no PBE )

Ntfd traces and syslog for both controllers attached

* Ntf Application is running on system
* Will update ticket with core dump

Note: The timings on the system are not synced. After every reboot the node timings are modified.

BT:

0 0x0fffa0848100 in .raise () from /lib64/libc.so.6
1 0x0fffa0849d10 in .abort () from /lib64/libc.so.6
2 0x0fffa0e34234 in osaf_abort (i_cause=7) at osaf_utility.c:27
3 0x1001a2f8 in NtfLogger::logNotification (this=0x100ba768, notif= std::tr1::shared_ptr (count 2, weak 0) 0x100b88f0) at NtfLogger.cc:247
4 0x10019e60 in NtfLogger::checkQueueAndLog (this=0x100ba768, newNotif=std::tr1::shared_ptr (count 2, weak 0) 0x100b88f0) at NtfLogger.cc:181
5 0x10019a74 in NtfLogger::log (this=0x100ba768, notif=std::tr1::shared_ptr (count 2, weak 0) 0x100b88f0, isLocal=true) at NtfLogger.cc:137
6 0x1002b528 in NtfAdmin::processNotification (this=0x100ba760, clientId=62, notificationType=SA_NTF_TYPE_ALARM, sendNotInfo=0x100b8800, mdsCtxt=0x100bb3dc, notificationId=47) at NtfAdmin.cc:203
7 0x1002b938 in NtfAdmin::notificationReceived (this=0x100ba760, clientId=62, notificationType=SA_NTF_TYPE_ALARM, sendNotInfo=0x100b8800, mdsCtxt=0x100bb3dc) at NtfAdmin.cc:257
8 0x1002ec20 in notificationReceived (clientId=62, notificationType=SA_NTF_TYPE_ALARM, sendNotInfo=0x100b8800, mdsCtxt=0x100bb3dc) at NtfAdmin.cc:1012
9 0x10006410 in proc_send_not_msg (cb=0x10073190 <_ntfs_cb>, evt=0x100bb3d0) at ntfs_evt.c:447
10 0x10006b28 in process_api_evt (evt=0x100bb3d0) at ntfs_evt.c:628
11 0x10006c38 in ntfs_process_mbx (mbx=0x10073190 <_ntfs_cb>) at ntfs_evt.c:660
12 0x1000b6f4 in main (argc=2, argv=0xfffc056a7f8) at ntfs_main.c:399

Active Controller:

May 26 19:41:16 linux-pvra osafntfd[24205]: **osaf_abort(7) called from 0x1001a2f8 with errno=11**
May 26 19:41:16 linux-pvra osafamfnd[24243]: NO 'safComp=NTF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
May 26 19:41:16 linux-pvra osafamfnd[24243]: ER safComp=NTF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
May 26 19:41:16 linux-pvra osafamfnd[24243]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
May 26 19:41:16 linux-pvra opensaf_reboot: Rebooting local node; timeout=60
Jun 2 14:11:28 linux-pvra syslog-ng[1639]: syslog-ng starting up; version='2.0.9'

Ntf Trace:

May 26 19:41:16.767426 osafntfd [24205:lga_api.c:1190] TR logBufSize > strlen(logBuf) + 1
May 26 19:41:16.767436 osafntfd [24205:lga_api.c:1320] << saLogWriteLogAsync
Jun 2 14:11:47.153831 osafntfd [2958:ntfs_main.c:0181] >> initialize
Jun 2 14:11:47.175099 osafntfd [2958:ncs_main_pub.c:0220] TR NCS:PROCESS_ID=2958
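The consistency check proposed in the #2006 comment could look roughly like the sketch below. It is not the OpenSAF patch: `check_additional_text` and the `AIS_*` constants are hypothetical stand-ins (their values mirror SA_AIS_OK and SA_AIS_ERR_INVALID_PARAM from the SA Forum headers), and the sketch assumes that `lengthAdditionalText` counts the bytes of the buffer including the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the values mirror SA_AIS_OK and
 * SA_AIS_ERR_INVALID_PARAM in the SA Forum AIS headers. */
enum { AIS_OK = 1, AIS_ERR_INVALID_PARAM = 7 };

/* Assumption: length_additional_text is the size of the buffer in
 * bytes, including the terminating NUL. */
int check_additional_text(const char *additional_text,
                          unsigned short length_additional_text)
{
    if (length_additional_text == 0)
        return AIS_OK;              /* no text declared, nothing to check */
    if (additional_text == NULL)
        return AIS_ERR_INVALID_PARAM;
    /* A NUL must appear within the declared length; otherwise a later
     * strlen() on the text would run past the end of the buffer. */
    if (memchr(additional_text, '\0', length_additional_text) == NULL)
        return AIS_ERR_INVALID_PARAM;
    return AIS_OK;
}
```

Rejecting the mismatch up front lets the caller see an INVALID_PARAM error instead of the server aborting later in the logging path.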
[tickets] [opensaf:tickets] #1994 IMMSv: Finalized CCB are counted under Max Ccb Limit
- **status**: unassigned --> accepted
- **assigned_to**: Neelakanta Reddy
- **Part**: - --> nd
- **Milestone**: 4.7.2 --> 5.1.RC1
- **Comment**:

The limit is considered only for active ccbs

---

** [tickets:#1994] IMMSv: Finalized CCB are counted under Max Ccb Limit**

**Status:** accepted
**Milestone:** 5.1.RC1
**Created:** Thu Sep 01, 2016 12:32 PM UTC by Chani Srivastava
**Last Updated:** Thu Sep 01, 2016 12:49 PM UTC
**Owner:** Neelakanta Reddy

setup:
Version - OpenSAF 5.1.FC : changeset - 7997
4-Node cluster
1PBE with 30K objects

- Default maxCcb is configured to 1 as in object opensafImm=opensafImm,safApp=safImmService
- Try creating more than 1 Ccb operations

~~~
for (( i = 1; i <= 2; i++ )); do
    immcfg -c TestClass testClass=$i
done
~~~

The above operation fails with ERR_NO_RESOURCES after the Ccb count for the cluster reached 1.

Even after the max limit is reached, more Ccbs are allowed a few minutes later. See the syslog snippet below.

Sep 1 14:58:35 OSAF-SC1 osafimmnd[27298]: NO Ccb 45008 COMMITTED (chaniTestClass)
Sep 1 14:58:35 OSAF-SC1 osafimmnd[27298]: NO Ccb 45009 COMMITTED (chaniTestClass)
Sep 1 14:58:35 OSAF-SC1 osafimmnd[27298]: NO Ccb 45010 COMMITTED (chaniTestClass)
Sep 1 14:58:35 OSAF-SC1 osafimmnd[27298]: NO Ccb 45011 COMMITTED (chaniTestClass)
Sep 1 14:58:35 OSAF-SC1 osafimmnd[27298]: NO Ccb 45012 COMMITTED (chaniTestClass)
**Sep 1 *14:58:35* OSAF-SC1 osafimmnd[27298]: *NO ERR_NO_RESOURCES: maximum Ccbs limit 2 has been reached for the cluster***
Sep 1 15:00:34 OSAF-SC1 syslog-ng[1194]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=92951', processed='center(received)=47084', processed='destination(messages)=47077', processed='destination(mailinfo)=7', processed='destination(mailwarn)=0', processed='destination(localmessages)=45786', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=42', processed='destination(console)=16', processed='destination(null)=0', processed='destination(mail)=7', processed='destination(xconsole)=16', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=47084'
**Sep 1 *15:10:14 *OSAF-SC1 osafimmnd[27298]: *NO Ccb 45014 COMMITTED (chaniTestClass)***
Sep 1 15:10:14 OSAF-SC1 osafimmnd[27298]: NO Ccb 45015 COMMITTED (chaniTestClass)
Sep 1 15:10:14 OSAF-SC1 osafimmnd[27298]: NO Ccb 45016 COMMITTED (chaniTestClass)
Sep 1 15:10:14 OSAF-SC1 osafimmnd[27298]: NO Ccb 45017 COMMITTED (chaniTestClass)
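The rule stated in the #1994 comment, that the limit applies only to active ccbs, can be sketched as follows. The types and function names (`ccb_t`, `count_active_ccbs`, `ccb_create_allowed`) are hypothetical and only illustrate the counting rule; they are not the IMMND data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ccb states; the real IMMND state machine has more. */
typedef enum { CCB_ACTIVE, CCB_COMMITTED, CCB_ABORTED } ccb_state_t;

typedef struct {
    unsigned id;
    ccb_state_t state;
} ccb_t;

/* Count only ccbs that are still active; committed or aborted ccbs
 * that merely linger in the table are not counted. */
size_t count_active_ccbs(const ccb_t *ccbs, size_t n)
{
    size_t active = 0;
    assert(ccbs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (ccbs[i].state == CCB_ACTIVE)
            active++;
    return active;
}

/* Returns 1 if one more ccb may be created under max_ccbs, else 0. */
int ccb_create_allowed(const ccb_t *ccbs, size_t n, size_t max_ccbs)
{
    return count_active_ccbs(ccbs, n) < max_ccbs;
}
```

Under this rule the committed ccbs 45008 to 45012 from the log would no longer count against the limit once they have terminated.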
[tickets] [opensaf:tickets] #2009 AMF: App Si is moving to UNASSIGNED state after middleware failover
-> In addition to the steps mentioned in the ticket, the following message is printed in syslog for the operations below.

Sep 8 12:06:29 CONTROLLER-1 osafamfd[]: ER exec: create FAILED 12
Sep 8 12:06:35 CONTROLLER-1 osafamfd[]: ER exec: create FAILED 12
Sep 8 12:06:45 CONTROLLER-1 osafamfd[]: ER exec: create FAILED 12
Sep 8 12:06:55 CONTROLLER-1 osafamfd[]: ER exec: create FAILED 12

Below are the steps.

-> Delete all the application objects.
-> Perform the middleware switchover / failover.
-> The new active controller tries to access the application SI object, which was already deleted earlier.

Sep 8 12:08:36.647738 osafamfd [:main.cc:0810] << process_event
Sep 8 12:08:36.647743 osafamfd [:imm.cc:0396] >> execute
Sep 8 12:08:36.647748 osafamfd [:imm.cc:0142] >> exec: Create safCsi=CSI1,safSi=TestApp_SI4,safApp=TestApp_TwoN
Sep 8 12:08:36.647754 osafamfd [:imma_oi_api.c:2786] >> rt_object_create_common
Sep 8 12:08:36.647761 osafamfd [:imma_oi_api.c:2892] TR attr:safCSIComp
Sep 8 12:08:36.647768 osafamfd [:imma_oi_api.c:2892] TR attr:saAmfCSICompHAState
Sep 8 12:08:36.647795 osafamfd [:imma_oi_api.c:2892] TR attr:saAmfCSICompHAReadinessState
Sep 8 12:08:36.650289 osafamfd [:imma_oi_api.c:3063] << rt_object_create_common
Sep 8 12:08:36.650330 osafamfd [:imm.cc:0163] ER exec: create FAILED 12

---

** [tickets:#2009] AMF: App Si is moving to UNASSIGNED state after middleware failover**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Thu Sep 08, 2016 06:07 AM UTC by Srikanth R
**Last Updated:** Thu Sep 08, 2016 06:09 AM UTC
**Owner:** nobody

Environment details
--
OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 5 nodes ( 2 controllers and 3 payloads with headless feature enabled & no PBE )
AMF Application : 2N model with SUs mapped on PL-3,PL-4 ( si-si deps enabled)

Summary :
--
Application SIs are moving to UNASSIGNED state after middleware failover.

Steps followed & Observed behaviour
--
-> Initially brought up the AMF application (2N model) on two payloads.
-> All the SIs are in fully assigned state and the SUs are in INSERVICE state.
-> Performed middleware failover.
-> After the standby became the active controller, the SIs moved to unassigned state, but 'amf-state siass' shows proper output.
-> The application received CSI remove callbacks after locking the SUs.

Expected behaviour
--
-> As no fault happened on the application, SIs should not move to UNASSIGNED state on middleware failover.
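In the #2009 trace above, `create FAILED 12` corresponds to SA_AIS_ERR_NOT_EXIST, which fits the comment that the object being accessed was deleted earlier. One way to avoid issuing such a create is to check that the parent DN is still known before creating a child under it. The sketch below is only an illustration under that assumption: `parent_dn` and `parent_exists` are hypothetical helpers, not amfd functions, and the list of known DNs stands in for whatever model lookup the director uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the parent part of a DN: the text after the first comma that
 * is not escaped with a backslash, or NULL if there is no parent. */
const char *parent_dn(const char *dn)
{
    assert(dn != NULL);
    for (const char *p = dn; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {
            p++;                /* skip the escaped character */
            continue;
        }
        if (*p == ',')
            return p + 1;
    }
    return NULL;
}

/* Hypothetical pre-check: returns 1 if the parent of child is among
 * the known DNs (or child has no parent), else 0. */
int parent_exists(const char *child, const char *const *known, size_t n)
{
    const char *parent = parent_dn(child);
    if (parent == NULL)
        return 1;               /* root-level object, nothing to check */
    for (size_t i = 0; i < n; i++)
        if (strcmp(known[i], parent) == 0)
            return 1;
    return 0;
}
```

With only `safApp=TestApp_TwoN` left in the known set, a create below `safSi=TestApp_SI4,safApp=TestApp_TwoN` would be skipped rather than retried.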
[tickets] [opensaf:tickets] #2009 AMF: App Si is moving to UNASSIGNED state after middleware failover
amfd traces on both the controllers

Attachments:

- [2009.tgz](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/98b72c10/7108/attachment/2009.tgz) (849.1 kB; application/x-compressed-tar)