V1 worked with the same test - no ntfimcnd coredump. Created #2223 for another
ntfd coredump - different issue.
---
** [tickets:#2219] ntfd: circular dependency with osafntfimcnd**
**Status:** assigned
**Milestone:** 5.0.2
**Created:** Thu Dec 08, 2016 05:14 AM UTC by Gary Lee
**Last Updated:** Fri Dec 09, 2016 01:02 PM UTC
**Owner:** Praveen
A circular dependency can be seen when performing a si-swap of
safSi=SC-2N,safApp=OpenSAF:
1. Active NTFD is trying to sync with Standby using MBC
2. Standby NTFD is the process of terminating its local osafntfimcnd. It is
stuck in timedwait_imcn_exit() and cannot reply to the Active.
3. osafntfimcnd [on standby] is trying to send a notfication to Active NTFD
So we have (1) depending on (2) depending on (3) depending on (1)
This results in a temporary deadlock that dramatically slows down NTFD's
ability to process its main dispatch loop. The deadlock only lasts for approx.
1 second, when mbcsv_mds_send_msg() times out. But since there could be lots of
MBC messages to send, sometimes osafntfimcnd is killed with SIGABRT generating
a coredump. The si-swap operation will also timeout.
steps to reproduce
- Run loop of ntftest 32
root@SC-1:~# for i in {1..10}; do ntftest 32; done
- On another terminal, keep swapping 2N Opensaf SI, got coredump after couples
of swaps
root@SC-1:~# amf-adm si-swap safSi=SC-2N,safApp=OpenSAF
...
root@SC-1:~# amf-adm si-swap safSi=SC-2N,safApp=OpenSAF
~~~
SC-2 (active)
There are a lot of send failures. Each taking approx. 1 second to timeout.
During these 1 second timeouts, NTFD cannot process the main dispatch loop.
Dec 7 11:01:37.531772 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:37.531781 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:38.537307 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:38.537758 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:38.537766 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:39.543180 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:39.543695 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:39.543698 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:40.545252 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:40.545719 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:40.545726 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:41.551328 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:41.551971 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:41.551979 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:42.557594 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:42.558171 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:42.558179 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:43.564051 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:43.564874 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:43.564883 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:44.572407 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:44.573262 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
Dec 7 11:01:44.573271 osafntfd [452:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Dec 7 11:01:45.575091 osafntfd [452:mbcsv_mds.c:0247] <<
mbcsv_mds_send_msg: failure
Dec 7 11:01:47.083548 osafntfd [452:mbcsv_mds.c:0185] >>
mbcsv_mds_send_msg: sending to vdest:e
~~~
~~~
SC-1 (standby)
NTFD is trying to terminate osafntfimcnd. While it is doing that, it cannot
reply to NTFD on SC-2. Meanwhile, osafntfimcnd is sending NTF notifications to
NTFD on SC-1.
Dec 7 11:01:35.453151 osafntfd [464:ntfs_imcnutil.c:0316] TR
handle_state_ntfimcn: Terminating osafntfimcnd process
Dec 7 11:01:45.474313 osafntfd [464:ntfs_imcnutil.c:0124] TR Termination
timeout
Dec 7 11:01:45.474375 osafntfd [464:ntfs_imcnutil.c:0130] <<
wait_imcnproc_termination: rc = -1, retry_cnt = 101
Dec 7 11:01:45.474387 osafntfd [464:ntfs_imcnutil.c:0168] TR Normal
termination failed. Escalate to abort
Dec 7 11:01:45.574703 osafntfd [464:ntfs_imcnutil.c:0172] TR Imcn
successfully aborted
Dec 7 11:01:45.574712 osafntfd [464:ntfs_imcnutil.c:0187] <<
timedwait_imcn_exit
~~~
---
Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is
subscribed to https://sourceforge.net/p/opensaf/tickets/
To unsubscribe from further messages, a project admin can change settings at
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a
mailing list, you can unsubscribe from the mailing list.
------------------------------------------------------------------------------
Developer Access Program for Intel Xeon Phi Processors
Access to Intel Xeon Phi processor-based developer platforms.
With one year of Intel Parallel Studio XE.
Training and support from Colfax.
Order your platform today.http://sdm.link/xeonphi
_______________________________________________
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets