[tickets] [opensaf:tickets] #2647 ntfd: ntfimcnd crashed on handling for Object creation callback

Srinivas Siva Mangipudy via Opensaf-tickets Mon, 06 Nov 2017 01:07:57 -0800

Hi Srinivas

If something happens that prevents ntfimcn from sending notifications, ntfimcn 
shall be restarted so that a possible missed notification message is sent. If 
this may happen in “normal” situations, ntfimcn shall just exit without any 
coredump. Possibly a notice (LOG_NO) should be written to the syslog. If the 
problem is “not normal”, i.e. something that should never happen (an error that 
must be analyzed and maybe fixed), an abort is better since it gives us a 
back-trace of what has happened. Before any abort is done, an error message 
shall be written to syslog (LOG_ER). This error message should contain 
information about what has happened and where it has happened (__FUNCTION__, 
__LINE__) so that we have at least some information if the coredump is lost, 
could not be generated, no back-trace was created, etc.
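
The policy above could be sketched roughly as follows. This is only an 
illustration, not the ntfimcn code: it calls syslog(3) directly, assuming that 
LOG_NO and LOG_ER correspond to the LOG_NOTICE and LOG_ERR priorities, and the 
names fail_kind_t, format_fail_msg, fail_priority, fail_and_terminate and FAIL 
are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>

/* Hypothetical classification of a fatal condition. */
typedef enum {
    FAIL_EXPECTED,   /* may happen in "normal" situations */
    FAIL_UNEXPECTED  /* should never happen, must be analyzed */
} fail_kind_t;

/* Build "<function>:<line>: <what>" so the log still says what happened and
 * where, even if no coredump or back-trace is available afterwards. */
int format_fail_msg(char *buf, size_t len, const char *func, int line,
                    const char *what)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s:%d: %s", func, line, what);
}

/* Assumption: LOG_NO ~ LOG_NOTICE and LOG_ER ~ LOG_ERR. */
int fail_priority(fail_kind_t kind)
{
    return kind == FAIL_EXPECTED ? LOG_NOTICE : LOG_ERR;
}

/* Log first, then terminate: _Exit for the expected case (no coredump),
 * abort for the unexpected case (coredump and back-trace). */
void fail_and_terminate(fail_kind_t kind, const char *func, int line,
                        const char *what)
{
    char msg[256];

    format_fail_msg(msg, sizeof msg, func, line, what);
    syslog(fail_priority(kind), "%s", msg);
    if (kind == FAIL_EXPECTED)
        _Exit(EXIT_FAILURE);
    abort();
}

/* Convenience macro that records the calling function and line. */
#define FAIL(kind, what) \
    fail_and_terminate((kind), __func__, __LINE__, (what))
```

A caller would write something like FAIL(FAIL_EXPECTED, "class no longer in 
IMM") or FAIL(FAIL_UNEXPECTED, "unknown operation type"). In both cases the 
process terminates and is expected to be restarted.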


So the question is: is this something that could happen in a “real” system, and 
is it of any interest to get a core dump to analyze the problem? In any case 
ntfimcn will recover.
If the answer is that the event triggering the core dump is not a failure that 
needs analysis, and that it is something ntfimcn should just gracefully recover 
from, then change to LOG_NO + _Exit.

Regards
Lennart

From: Srinivas Mangipudy [mailto:srinivas.mangip...@oracle.com] 
Sent: 2 November 2017 16:42
To: Lennart Lund <lennart.l...@ericsson.com>; Minh Hon Chau 
<minh.c...@dektech.com.au>
Cc: Ravi Sekhar Reddy Konda <ravisekhar.ko...@oracle.com>
Subject: RE: OSAF : ntfimcnd core dump issue -- 2647.

Hi Lennart,

Thanks a lot for the explanation.
It was really helpful.

Do you think it is better to log the error and then call _exit (we will have 
a graceful restart) than to call abort and allow the process to dump core 
in this case?
Please let me know your thoughts about this.

Thanks and best regards
Srinivas


From: Lennart Lund [mailto:lennart.l...@ericsson.com] 
Sent: Thursday, November 2, 2017 6:26 PM
To: Minh Hon Chau <minh.c...@dektech.com.au>; Srinivas Mangipudy 
<srinivas.mangip...@oracle.com>
Cc: Ravi Sekhar Reddy Konda <ravisekhar.ko...@oracle.com>; Lennart Lund 
<lennart.l...@ericsson.com>
Subject: RE: OSAF : ntfimcnd core dump issue -- 2647.

Hi

Ntfimcn implements a so-called “special applier”. This is an IMM applier that 
receives IMM callbacks in the same way as an object implementer or an ordinary 
applier. The difference is that a special applier does not request to become 
applier for any specific objects or classes. Instead, a configuration attribute 
in an object can be given a flag, ATTR_NOTIFY.
The general error handling in ntfimcn is to exit. This is detected by the 
osafntfd process, which then restarts osafntfimcnd. To notify that this has 
happened (ntfimcn may have “missed” some IMM modifications while it was down), 
ntfimcn always sends a special notification (a “may have lost notifications” 
notification) when it is started. This notification is used by com-sa to 
request com to do a (re)synchronization.
The “normal” way for ntfimcn to exit is by calling _Exit. When something 
happens that should never happen, abort() is used instead (a coredump will be 
created).
If something not normal and really bad is done in the system, an abort is 
justified. This means that ntfimcn shall not be made to “defensively” avoid 
aborting (coredump) if the cluster system and IMM are abused. Instead, if we 
want to test something like this, it shall be accepted that a coredump is 
created; this is not a Fail! However, if ntfimcn is not restarted, or if no 
“may have lost notifications” notification is sent, that is to be considered 
a Fail.
Note: If ntfimcn exits (or aborts), neither the node nor ntf as such is 
affected. Only the ntfimcn process is restarted.
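
As a generic illustration of how a parent process can tell a plain exit from an 
abort before restarting a child, here is a small POSIX sketch. How osafntfd 
actually supervises osafntfimcnd is not shown in this thread, so the names 
child_end_t, classify_status, run_and_classify, child_exit and child_abort are 
made up for the example.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { CHILD_EXITED, CHILD_ABORTED, CHILD_OTHER } child_end_t;

/* Interpret a waitpid() status: normal exit versus termination by SIGABRT. */
child_end_t classify_status(int status)
{
    if (WIFEXITED(status))
        return CHILD_EXITED;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT)
        return CHILD_ABORTED;
    return CHILD_OTHER;
}

/* Run fn in a child process and report how the child ended. */
child_end_t run_and_classify(void (*fn)(void))
{
    int status = 0;
    pid_t pid;

    assert(fn != NULL);
    pid = fork();
    if (pid < 0)
        return CHILD_OTHER;
    if (pid == 0) {
        /* Demo only: suppress the core file in the child. */
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);
        fn();
        _Exit(EXIT_SUCCESS);
    }
    if (waitpid(pid, &status, 0) != pid)
        return CHILD_OTHER;
    return classify_status(status);
}

/* Stand-ins for the two ways of terminating discussed above. */
void child_exit(void) { _Exit(EXIT_FAILURE); }
void child_abort(void) { abort(); }
```

In either case the supervising process would restart the child. The 
distinction only matters for whether a coredump is expected.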

Ticket #2647 should probably not be fixed; instead it should be set to invalid.

/Lennart

From: Minh Hon Chau [mailto:minh.c...@dektech.com.au] 
Sent: 2 November 2017 05:45
To: Srinivas Mangipudy <srinivas.mangip...@oracle.com>
Cc: Ravi Sekhar Reddy Konda <ravisekhar.ko...@oracle.com>; Lennart Lund 
<lennart.l...@ericsson.com>
Subject: Re: OSAF : ntfimcnd core dump issue -- 2647.

Hi Srinivas,
+ Lennart.
If I understand correctly, the test creates a huge amount of objects (are they 
RT or Config?), and while the callbacks are coming, the test deletes the class. 
The later callbacks can't find the class name, so ntfimcnd aborts.
I think we can defensively avoid the coredump and not send the notification, as 
you suggest, but I'm wondering about the integrity of IMM, as an IMM user 
receives a callback but the associated class does not exist.
Thanks,
Minh

On 01/11/17 22:04, Srinivas Mangipudy wrote:
Hi Minh,
 
This is regarding Notification issue 
https://sourceforge.net/p/opensaf/tickets/2647/.
 
This issue occurs because IMM deleted the objects and ntfimcnd was not able 
to fetch the object, so the “SA_AIS_ERR_NOT_EXIST” error was returned.
Since “SA_AIS_ERR_NOT_EXIST” was returned, ntfimcnd aborted, leading to a core 
dump.
 
I have fixed the core dump, but I have a question regarding the notification to 
be sent.
I think in this case ntfimcnd should not be sending the notification at all, 
since it could not retrieve the class details.
Is that fix fine? Or do you suggest something else? Please let me know your 
thoughts about it.
 
Thank you
Srinivas
 
 




---

** [tickets:#2647] ntfd: ntfimcnd crashed on handling for Object creation 
callback**

**Status:** assigned
**Milestone:** 5.18.01
**Created:** Thu Oct 19, 2017 12:45 PM UTC by Srinivas Siva Mangipudy
**Last Updated:** Fri Nov 03, 2017 09:50 PM UTC
**Owner:** Srinivas Siva Mangipudy


1. Problem
==========
NTFimcnd crashed while 200k objects were created and then deleted in IMM; the 
classes associated with these objects were also removed.
 
2. Analysis
===========
A huge number of objects, about 200k objects, were created in IMM with PBE 
disabled. Then these objects were deleted in one CCB (by deleting the root 
object). After deleting the objects, the classes were also removed. Then the 
crash happened.

 
../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfimcnd/ntfimcn_imm.c:167:
 get_rdn_attr_name:  Assertion '0' failed. 

When NTFimcnd created notifications, some information had to be looked up in 
IMM.  
NTFimcnd only asked for class information in case of object creation: 

    static SaAisErrorT saImmOiCcbObjectCreateCallback(SaImmOiHandleT immOiHandle,
                                                      SaImmOiCcbIdT ccbId,
                                                      const SaImmClassNameT className,
                                                      const SaNameT *parentName,
                                                      const SaImmAttrValuesT_2 **attr)
    {
            …
            dn_ptr = get_created_dn(className, parentName, attr);
            …
    }

    static void saImmOiCcbApplyCallback(SaImmOiHandleT immOiHandle,
                                        SaImmOiCcbIdT ccbId)
    {
            …
            switch (ccbUtilOperationData->operationType) {
            case CCBUTIL_CREATE:
                    rdn_attr_name = get_rdn_attr_name(
                            ccbUtilOperationData->param.create.className);
                    internal_rc = ntfimcn_send_object_create_notification(
                            ccbUtilOperationData, rdn_attr_name, ccbLast);
                    …
    }

In this case, NTF was still handling notifications for a large number of 
created objects. Usually, NTFimcnd caches the class information. But because 
these objects belonged to many classes, it had to ask IMM for class 
information, and the class had already been removed from IMM. IMM was much 
faster at creating and deleting the objects and classes (with PBE disabled) 
while NTF was still processing the objects, so the information was no longer 
available in IMM and the crash happened.
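
To make the race easier to picture, here is a minimal sketch of a small class 
cache that falls back to a lookup on a miss and reports a missing class to the 
caller. It is not the ntfimcn implementation: the cache layout and the names 
rdn_attr_for_class, class_lookup_fn, demo_lookup and demo_class_present are 
hypothetical, and the lookup callback merely stands in for the real IMM query.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 4
#define NAME_LEN 64

/* Hypothetical lookup: returns 0 and fills rdn_out, or -1 if the class is
 * gone (corresponding to SA_AIS_ERR_NOT_EXIST in the ticket). */
typedef int (*class_lookup_fn)(const char *class_name, char *rdn_out,
                               size_t rdn_len);

typedef struct {
    int used;
    char class_name[NAME_LEN];
    char rdn_attr[NAME_LEN];
} cache_entry_t;

static cache_entry_t cache[CACHE_SLOTS];
static size_t next_slot; /* round-robin replacement */

/* Return the RDN attribute name for a class, or NULL if the class is neither
 * cached nor known to the lookup any more. The returned pointer refers to a
 * cache slot and may be overwritten by a later miss. */
const char *rdn_attr_for_class(const char *class_name, class_lookup_fn lookup)
{
    char rdn[NAME_LEN];
    cache_entry_t *e;
    size_t i;

    assert(class_name != NULL && lookup != NULL);
    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].used && strcmp(cache[i].class_name, class_name) == 0)
            return cache[i].rdn_attr; /* hit: no lookup needed */
    }
    if (lookup(class_name, rdn, sizeof rdn) != 0)
        return NULL; /* miss and class already deleted: the race */
    e = &cache[next_slot];
    next_slot = (next_slot + 1) % CACHE_SLOTS;
    e->used = 1;
    snprintf(e->class_name, sizeof e->class_name, "%s", class_name);
    snprintf(e->rdn_attr, sizeof e->rdn_attr, "%s", rdn);
    return e->rdn_attr;
}

/* Demo stand-in for the class store: classes exist until the flag is cleared. */
int demo_class_present = 1;

int demo_lookup(const char *class_name, char *rdn_out, size_t rdn_len)
{
    if (!demo_class_present)
        return -1;
    snprintf(rdn_out, rdn_len, "%sId", class_name);
    return 0;
}
```

With many classes in play the cache misses often, so the window in which a 
class can disappear between the create callback and the lookup is much wider. 
What the caller does with a NULL result (LOG_NO + _Exit, or abort) is the 
policy question discussed in the thread.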


3. Reproduction
===============
Can be reproduced as below:
- Disable PBE
1. Create a huge amount of objects with one parent object.
2. In my case, I created 2 root objects of two different classes with about 
   100k child objects each. Delete first root object, then delete the class 
   associated with this object.
3. Delete second root object, delete the class associated with this object.

