Additional information as of today...

Correct me if I am wrong here, but it appears that the host pointers are stored 
in the sessions (under the Sessions mutex?).  Most of this debugging is new to 
me, but I am figuring it out as I go along.  Anyhow, in the hash.c file:

  /* Now free the entries */
  for(idx=0; idx<numHosts; idx++) {
#ifdef IDLE_PURGE_DEBUG
    traceEvent(CONST_TRACE_INFO, "IDLE_PURGE_DEBUG: Purging host %d [last seen=%d]... %s",
        idx, theFlaggedHosts[idx]->lastSeen, theFlaggedHosts[idx]->hostResolvedName);
#endif
    freeHostInfo(theFlaggedHosts[idx], actDevice);
    numFreedBuckets++;
    ntop_conditional_sched_yield(); /* Allow other threads to run */
  }
  free(theFlaggedHosts);
  if(myGlobals.runningPref.enableSessionHandling)
    scanTimedoutTCPSessions(actDevice); /* let's check timedout sessions too */


This purges the hosts before running scanTimedoutTCPSessions.

If we look at the scanTimedoutTCPSessions function, we find:

theSession = myGlobals.device[actualDeviceId].tcpSession[idx];

and a little later....

freeSession(theSession, actualDeviceId, 1, 0 /* locked by the purge thread */);

Looking at freeSession, we find:

theHost = sessionToPurge->initiator, theRemHost = sessionToPurge->remotePeer;

(Note: sessionToPurge = theSession passed on)

This host pointer comes from a different location (the session, not the host 
hash), and it is possible, and I have shown, that the memory it points to can 
be re-used before theHost is set.  The stale pointer is not NULL, so this 
check:

if((theHost != NULL) && (theRemHost != NULL) && allocateMemoryIfNeeded) {

evaluates to true, which causes:

incrementUsageCounter(&theHost->secHostPkts->closedEmptyTCPConnSent, theRemHost, actualDeviceId);

to segfault.

So, basically, the host pointers held in the sessions storage are not being 
cleared when the hosts are purged.  I am trying to work around this by having 
scanTimedoutTCPSessions run before the hosts are purged, in the hope that the 
sessions get purged before the hosts do.  Looking over the code, though, I 
think the risk still exists that a non-purged session could refer to a purged 
host.  I am not sure of the best approach to double-check the sessions storage 
(under the Sessions mutex) to ensure the host pointer is set to NULL.  Also, I 
think this is what is causing the other segfaults as well, but I am not 
familiar enough with the code to know where all the host pointers are stored 
and potentially referred to during execution.

Again, I think the best fix would be to have that pointer set to NULL in the 
sessions storage when the host is purged, but how to do that might be very 
difficult.
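
To make the idea concrete, here is a minimal sketch of clearing session 
references to a host just before that host is freed.  It is not ntop code: the 
Host and Session structs, the bucket layout, and detachHostFromSessions are 
hypothetical stand-ins, and I am assuming the caller already holds whatever 
lock protects the session table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for ntop's host and session records. */
typedef struct Host { int id; } Host;

typedef struct Session {
  Host *initiator;
  Host *remotePeer;
  struct Session *next;   /* next session in the same hash bucket */
} Session;

/*
 * Clear every reference to `host` held by the session table.
 * Meant to be called, with the sessions lock held, right before the
 * host is freed.  Returns the number of pointers that were cleared.
 */
static size_t detachHostFromSessions(Session **table, size_t numBuckets,
                                     const Host *host) {
  size_t cleared = 0;
  assert(table != NULL);
  for (size_t idx = 0; idx < numBuckets; idx++) {
    for (Session *s = table[idx]; s != NULL; s = s->next) {
      if (s->initiator == host)  { s->initiator = NULL;  cleared++; }
      if (s->remotePeer == host) { s->remotePeer = NULL; cleared++; }
    }
  }
  return cleared;
}
```

With something like this in place, the existing (theHost != NULL) check in 
freeSession would skip the counter update for a session whose host is already 
gone.  The cost is a full walk of the session table per purged host, which may 
or may not be acceptable on a busy network.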

--Brian

________________________________________
From: Brian Behrens
Sent: Monday, November 07, 2011 9:49 AM
To: [email protected]
Subject: RE: [Ntop-misc] Easily Reproducable Segfaults

Luca,

Here is the code:

  if(sessionToPurge->session_info != NULL)
    free(sessionToPurge->session_info);

According to gdb, session_info holds the address 0xffffffff, which causes a 
segfault when free() is called on it.
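
If the cause is a second free of the same session, one common defensive 
pattern is to reset the pointer after freeing it.  The sketch below shows only 
that pattern; the Session struct and freeSessionInfo are hypothetical, not 
ntop's actual definitions, and this would not catch a garbage value such as 
0xffffffff written into the field by some other code path.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified session record; only the field discussed here. */
typedef struct Session {
  void *session_info;
} Session;

/*
 * Free session_info at most once: after free() the pointer is reset to
 * NULL, so a second call through the same Session becomes a no-op.
 */
static void freeSessionInfo(Session *s) {
  assert(s != NULL);
  if (s->session_info != NULL) {
    free(s->session_info);
    s->session_info = NULL;
  }
}
```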

--Brian

________________________________________
From: [email protected] 
[[email protected]] on behalf of Luca Deri [[email protected]]
Sent: Monday, November 07, 2011 2:29 AM
To: [email protected]
Subject: Re: [Ntop-misc] Easily Reproducable Segfaults

Brian
I agree with you that there's something wrong with sessions. However
sessions.c:343 contains something different from what you reported. Can
you please send me the source code around line 343 so I can see what you
mean?

Thanks Luca

On 11/05/2011 08:09 PM, Brian Behrens wrote:
> No problem,
>
> I did some more work on this and found that line 343 in sessions.c is the 
> culprit.   Basically, here is a breakdown of what's happening.
>
> That line frees the memory at the address stored in 
> sessionToPurge->session_info.  When you dump session_info, it contains 
> 0xffffffff.   Since this is not NULL, the code attempts to free the memory at 
> that address, which is out of bounds and causes a segfault.
>
> So, in perspective, it's most likely trying to free memory that has already 
> been freed.   The question becomes: why does the code think there is still a 
> valid memory address at that pointer?   I think I have an idea why that 
> might be.  I started watching the session counters and even though I have 
> specified an upper limit of 65536 sessions, I can see the count does actually 
> get this high.  When the count gets that high, it clears and starts over.   
> Now, I have not investigated what actually transpires when this reset 
> occurs, but my guess is that it still thinks there are sessions that 
> need to be purged that have already been purged by the clearing.
>
> I have also noticed that once that bound is reached, the count seems to stay 
> around 14k sessions.  The ESX server I am running this on has 98 GB of memory, 
> so memory constraints are not really a concern.  This might just be a matter 
> of tuning the max sessions high enough that the purge cycle that is supposed 
> to purge these idle sessions can do its job effectively.
>
> I would think that this might also be occurring on the lower load networks, 
> as DEFAULT_NTOP_MAX_NUM_SESSIONS is set lower, so the limit might also be 
> reached and trigger the clear routine.  The segfault would follow because 
> 0xffffffff is used in various places and could easily be stored in 
> many memory locations.
>
> So, I might try to work around this by elevating the 
> DEFAULT_NTOP_MAX_NUM_SESSIONS to see if that helps out.   Also, taking a 
> deeper look at what happens when this bound is reached might be productive 
> for me to understand to help eliminate this.
>
> I hope this helps out some as I have seen similar postings to this in the 
> threads.
>
> --Brian
> ________________________________________
> From: [email protected] 
> [[email protected]] on behalf of Luca Deri 
> [[email protected]]
> Sent: Saturday, November 05, 2011 6:22 AM
> To: [email protected]
> Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
>
> Brian
> thanks for your report. I do not have the ability to reproduce the crash you 
> reported using the code in SVN (this is the only version I can support). Can 
> you please crash ntop, generate a core and analyze it a bit so that I can 
> understand where the problem could be? Before doing that, please resync with 
> SVN.
>
> Thanks for your support Luca
>
> On Nov 4, 2011, at 5:09 PM, Brian Behrens wrote:
>
>> Hello,
>>
>> I have been working for days trying to resolve a segfault issue like the 
>> following:
>>
>> Nov  4 10:46:54 NTOP-SC kernel: ntop[25479]: segfault at 645 ip 
>> 00007f95f3cf3395 sp 00007f95e9b75ae8 error 6 in 
>> libntop-4.1.1.so[7f95f3cb9000+56000]
>>
>> The environment is an ESX 5 VM.
>>
>> Guest OS I have tried:
>>
>> 1. CentOS 6
>> 2. Fedora 15
>> 3. Network Security Toolkit (uses 4865 of the current dev tree)
>>
>> Versions I have tried:
>>
>> 1. Current dev tree.
>> 2. Current stable version (4.1.0)
>>
>> The time at which these faults occur varies, but it is related to network 
>> load.
>>
>> My test networks:
>>
>> 1. Simple home network with all packets going to NTOP.
>> 2. High load work network that can see 25 Gig in 15 mins.
>>
>> The most stable setup I have seen is a clean CentOS install: build ntop from 
>> the trunk tree, install and run.
>>
>> The quickest segfault I can obtain is when I implement PF_RING, use an e1000 
>> card in the VM, and use the PF_RING-aware e1000 driver.   I can usually get 
>> a segfault within 30 mins on the busy network.
>>
>> The common theme is the segfaulting.  I did attempt a gdb on the device one 
>> time and saw a malloc issue, but all these VMs have 4GB memory and I have 
>> tried tuning different hash sizes to see how this impacts the issue, but it 
>> really never does.  Use smaller hash values, and I get more messages of low 
>> memory, etc.
>>
>> I am really not sure what else to do, if there is anything I can do to 
>> present more information, please let me know as I would like to stop this 
>> incessant segfaulting.
>>
>>
>> _______________________________________________
>> Ntop-misc mailing list
>> [email protected]
>> http://listgateway.unipi.it/mailman/listinfo/ntop-misc
> ---
> We can't solve problems by using the same kind of thinking we used when we 
> created them - Albert Einstein
>