Hi Srinivas
If something happens that prevents ntfimcn from sending notifications, ntfimcn
shall be restarted so that a “possibly missed notification” message is sent. If
this can happen in “normal” situations, ntfimcn shall just exit without any
coredump. Possibly a notice (LOG_NO) should be written to the syslog. If the
problem is “not normal”, that is, something that should never happen (an error
that must be analyzed and maybe fixed), an abort is better since it gives us a
back-trace of what has happened. Before any abort, an error message shall be
written to syslog (LOG_ER). This error message should contain information about
what has happened and where (__FUNCTION__, __LINE__), so that we have at least
some information if the coredump is lost or could not be generated, no
back-trace was created, etc.
So the question is: can this happen in a “real” system, and is a core dump of
any interest for analyzing the problem? In any case ntfimcn will recover.
If the answer is that the event triggering the core dump is not a failure that
needs analysis, and that ntfimcn should just recover gracefully from it, then
change to LOG_NO + _Exit.
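The two exit paths described above could be sketched roughly as follows. This is only an illustration, not the ntfimcn code: the helper names (format_fatal_msg, exit_for_restart, abort_with_log, FATAL) are hypothetical, and plain syslog(3) with LOG_NOTICE and LOG_ERR is assumed to stand in for the OpenSAF LOG_NO and LOG_ER macros. The standard __func__ is used in place of __FUNCTION__.

```c
#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>

/* Hypothetical helpers; plain syslog(3) stands in for LOG_NO / LOG_ER. */

/* Build the message logged before an abort: "<function>:<line>: <what>". */
int format_fatal_msg(char *buf, size_t len, const char *func, int line,
                     const char *what)
{
    return snprintf(buf, len, "%s:%d: %s", func, line, what);
}

/* "Normal" failure: log a notice and exit without a core dump, so that the
 * supervising process can restart us. */
void exit_for_restart(const char *what)
{
    syslog(LOG_NOTICE, "%s, exiting to be restarted", what);
    _Exit(EXIT_FAILURE);
}

/* "Should never happen" failure: log where and what first, then abort so
 * that a core dump is produced for analysis. */
void abort_with_log(const char *func, int line, const char *what)
{
    char msg[256];

    format_fatal_msg(msg, sizeof(msg), func, line, what);
    syslog(LOG_ERR, "%s", msg);
    abort();
}

/* Convenience wrapper that records the calling function and line. */
#define FATAL(what) abort_with_log(__func__, __LINE__, (what))
```

With helpers like these, a call site would choose exit_for_restart("class not found") for an expected condition and FATAL("unexpected IMM state") for one that must be analyzed. Because the error text is written to syslog before abort(), some information survives even if no core dump is produced.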
Regards
Lennart
From: Srinivas Mangipudy [mailto:srinivas.mangip...@oracle.com]
Sent: 2 November 2017 16:42
To: Lennart Lund <lennart.l...@ericsson.com>; Minh Hon Chau
<minh.c...@dektech.com.au>
Cc: Ravi Sekhar Reddy Konda <ravisekhar.ko...@oracle.com>
Subject: RE: OSAF : ntfimcnd core dump issue -- 2647.
Hi Lennart,
Thanks a lot for the explanation.
It was really helpful.
Do you think it is better to log the error and then call _exit (we will have
a graceful restart) than to call abort and allow the process to dump core
in this case?
Please let me know your thoughts about this.
Thanks and best regards
Srinivas
From: Lennart Lund [mailto:lennart.l...@ericsson.com]
Sent: Thursday, November 2, 2017 6:26 PM
To: Minh Hon Chau <minh.c...@dektech.com.au>; Srinivas Mangipudy
<srinivas.mangip...@oracle.com>
Cc: Ravi Sekhar Reddy Konda <ravisekhar.ko...@oracle.com>; Lennart Lund
<lennart.l...@ericsson.com>
Subject: RE: OSAF : ntfimcnd core dump issue -- 2647.
Hi
Ntfimcn implements a so-called “special applier”. This is an IMM applier that
receives IMM callbacks in the same way as an object implementer or an ordinary
applier. The difference is that a special applier does not request to become
applier for any specific objects or classes. Instead, a configuration attribute
in an object can be given a flag, ATTR_NOTIFY.
The general error handling in ntfimcn is to exit. This is detected by the
osafntfd process, which then restarts osafntfimcnd. To signal that this has
happened (ntfimcn may have “missed” some IMM modifications while it was down),
ntfimcn always sends a special notification (a “may have lost notifications”
notification) when it is started. This notification is used by com-sa to
request com to do a (re)synchronization.
The “normal” way for ntfimcn to exit is by calling _Exit. When something
happens that should never happen, abort() is used instead (a coredump will be
created).
If something not normal and really bad is done in the system, an abort is
justified. This means that ntfimcn shall not be changed to “defensively” avoid
aborting (coredump) when the cluster system and IMM are abused. Instead, if we
want to test something like this, it shall be accepted that a coredump is
created; this is not a Fail! However, if ntfimcn is not restarted, or if no
“may have lost notifications” notification is sent, that is to be considered a
Fail.
Note: If ntfimcn exits (or aborts), the node or ntf as such is not affected.
Only the ntfimcn process is restarted.
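As a rough illustration of the restart behaviour described above, a supervising process can tell a plain exit from an abort by inspecting the status returned by waitpid(), and restart the child in both cases. This is a minimal sketch under assumptions, not how osafntfd is implemented; all names here (classify_status, run_once, supervise, worker_exit, worker_abort) are hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum child_end { CHILD_EXITED, CHILD_SIGNALED, CHILD_UNKNOWN };

/* Classify a waitpid() status: _Exit() shows up as a normal exit,
 * abort() as termination by a signal (SIGABRT). */
enum child_end classify_status(int status)
{
    if (WIFEXITED(status))
        return CHILD_EXITED;
    if (WIFSIGNALED(status))
        return CHILD_SIGNALED;
    return CHILD_UNKNOWN;
}

/* Run the worker in a child process, wait for it to terminate and report
 * how it ended. Returns 0 on success, -1 if fork() or waitpid() failed. */
int run_once(void (*worker)(void), enum child_end *how)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        worker();
        _Exit(EXIT_SUCCESS);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    *how = classify_status(status);
    return 0;
}

/* Restart the worker every time it terminates, whether by exit or abort,
 * up to max_restarts times. Returns the number of runs performed. */
int supervise(void (*worker)(void), int max_restarts)
{
    enum child_end how;
    int runs = 0;

    while (runs < max_restarts && run_once(worker, &how) == 0)
        runs++;
    return runs;
}

/* Sample workers: one takes the "normal" exit path, one aborts. */
void worker_exit(void) { _Exit(EXIT_FAILURE); }
void worker_abort(void) { abort(); }
```

In both cases only the child process is affected, and the supervisor keeps running. In the real system the restarted process would then send its “may have lost notifications” notification.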
Ticket #2647 should probably not be fixed; instead it should be set to invalid.
/Lennart
From: Minh Hon Chau [mailto:minh.c...@dektech.com.au]
Sent: 2 November 2017 05:45
To: Srinivas Mangipudy <srinivas.mangip...@oracle.com>
Cc: Ravi Sekhar Reddy Konda <ravisekhar.ko...@oracle.com>; Lennart Lund
<lennart.l...@ericsson.com>
Subject: Re: OSAF : ntfimcnd core dump issue -- 2647.
Hi Srinivas,
+ Lennart.
If I understand correctly, the test creates a huge number of objects (are they
RT or Config?), and while the callbacks are coming, the test deletes the class.
The later callbacks can't find the class name, so ntfimcnd aborts.
I think we can defensively avoid the coredump and not send the notification, as
you suggest, but I'm wondering about the integrity of IMM, since an IMM user
receives a callback but the associated class does not exist.
Thanks,
Minh
On 01/11/17 22:04, Srinivas Mangipudy wrote:
Hi Minh,
This is regarding Notification issue
https://sourceforge.net/p/opensaf/tickets/2647/.
This issue occurs because IMM had deleted the objects and ntfimcnd was not able
to fetch the object, so the “SA_AIS_ERR_NOT_EXIST” error was returned.
Since “SA_AIS_ERR_NOT_EXIST” was returned, ntfimcnd aborted, leading to a core
dump.
I have fixed the core dump, but I have a question regarding the notification to
be sent.
I think in this case ntfimcnd should not be sending the notification at all,
since it could not retrieve the class details.
Is that fix fine? Or do you suggest something else? Please let me know your
thoughts about it.
Thank you
Srinivas
---
** [tickets:#2647] ntfd: ntfimcnd crashed on handling for Object creation
callback**
**Status:** assigned
**Milestone:** 5.18.01
**Created:** Thu Oct 19, 2017 12:45 PM UTC by Srinivas Siva Mangipudy
**Last Updated:** Fri Nov 03, 2017 09:50 PM UTC
**Owner:** Srinivas Siva Mangipudy
1. Problem
===========================================================================
NTFimcnd crashed while 200k objects were created and then deleted in IMM; the
classes associated with these objects were also removed.
2. Analysis
===========================================================================
A huge number of objects, about 200k objects, were created in IMM with PBE
disabled. Then these objects were deleted in one CCB (by deleting the root
object). After deleting the objects, the classes were also removed. Then the
crash happened.
../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfimcnd/ntfimcn_imm.c:167:
get_rdn_attr_name: Assertion '0' failed.
When NTFimcnd created notifications, some information had to be looked up in
IMM.
NTFimcnd only asked for class information in case of object creation:
static SaAisErrorT saImmOiCcbObjectCreateCallback(SaImmOiHandleT immOiHandle,
                                                  SaImmOiCcbIdT ccbId,
                                                  const SaImmClassNameT className,
                                                  const SaNameT *parentName,
                                                  const SaImmAttrValuesT_2 **attr)
{
    …
    dn_ptr = get_created_dn(className, parentName, attr);
    …
}

static void saImmOiCcbApplyCallback(SaImmOiHandleT immOiHandle,
                                    SaImmOiCcbIdT ccbId)
{
    …
    switch (ccbUtilOperationData->operationType) {
    case CCBUTIL_CREATE:
        rdn_attr_name = get_rdn_attr_name(
            ccbUtilOperationData->param.create.className);
        internal_rc = ntfimcn_send_object_create_notification(
            ccbUtilOperationData, rdn_attr_name, ccbLast);
        …
    }
In this case, NTF was still handling notifications for a large number of
created objects. Usually, NTFimcnd caches the class information, but because
these objects belonged to many classes, it had to ask IMM for class
information, and the class had already been removed from IMM. IMM was much
faster at creating and deleting the objects and classes (with PBE disabled)
than NTF was at processing the objects, so the information was no longer
available in IMM, and the crash happened.
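The race described above can be illustrated with a small cache-then-fetch lookup that reports a missing class to its caller instead of asserting. This is a sketch under assumptions, not the actual get_rdn_attr_name() in ntfimcn_imm.c: the types and names (class_entry, fetch_fn, lookup_rdn_attr_name, fetch_deleted) are hypothetical, and the fetch callback merely stands in for a query to IMM.

```c
#include <stddef.h>
#include <string.h>

enum lookup_rc { LOOKUP_OK, LOOKUP_NOT_EXIST };

/* One cached class: its name and the name of its RDN attribute. */
struct class_entry {
    const char *class_name;
    const char *rdn_attr_name;
};

/* Hypothetical stand-in for a query to IMM. Returns NULL when the class no
 * longer exists, e.g. because it was deleted before we got to ask. */
typedef const char *(*fetch_fn)(const char *class_name);

/* Look in the cache first and fall back to the fetch callback on a miss.
 * A missing class is reported to the caller, which decides what to do. */
enum lookup_rc lookup_rdn_attr_name(const struct class_entry *cache, size_t n,
                                    fetch_fn fetch, const char *class_name,
                                    const char **out)
{
    const char *name;
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(cache[i].class_name, class_name) == 0) {
            *out = cache[i].rdn_attr_name;
            return LOOKUP_OK;
        }
    }
    name = fetch(class_name);
    if (name == NULL)
        return LOOKUP_NOT_EXIST;
    *out = name;
    return LOOKUP_OK;
}

/* Sample fetcher that behaves as if every class has already been deleted. */
const char *fetch_deleted(const char *class_name)
{
    (void)class_name;
    return NULL;
}
```

A cached class is still resolved after deletion, while an uncached one yields LOOKUP_NOT_EXIST. Whether the caller then logs and exits for a restart or aborts with a core dump is the policy question discussed in this thread; the sketch does not settle it.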
3. Reproduction
===========================================================================
Can be reproduced as follows:
- Disable PBE
1. Create a huge number of objects with one parent object. In my case, I
created 2 root objects of two different classes with about 100k child objects
each.
2. Delete the first root object, then delete the class associated with this
object.
3. Delete the second root object, then delete the class associated with this
object.