Brian
I agree with you that there's something wrong with sessions. However,
sessions.c:343 contains something different from what you reported. Can
you please send me the source code around line 343 so I can see what you
mean?
Thanks Luca
On 11/05/2011 08:09 PM, Brian Behrens wrote:
No problem,
I did some more work on this and found that line 343 in sessions.c is the
culprit. Here is a breakdown of what's happening.
That line attempts to free the memory at the address stored in
sessionToPurge->session_info. When you dump that pointer, it contains
0xffffffff. Since this is not NULL, the code attempts to free memory at
that address, which is out of bounds and causes a segfault.
So, in perspective, it is most likely trying to free memory that has already
been freed. The question becomes: why does the code think there is still a
valid memory address at that pointer? I have an idea about why that might be.
I started watching the session counters, and even though I have specified an
upper limit of 65536 sessions, I can see the count actually gets this high.
When the count gets that high, it clears and starts over. I have not yet
investigated what actually transpires when this reset occurs, but my guess
is that the code still thinks there are sessions that need to be purged that
have already been purged by the clearing.
I have also noticed that once that bound is reached, the count seems to stay
around 14k sessions. The ESX server I am running this on has 98 GB of memory,
so memory constraints are not really a concern. Working around this might be
just a matter of tuning the max sessions high enough that the purge cycle,
which is supposed to purge these idle sessions, can do its job effectively.
I would think that this might also be occurring on the lower-load networks,
since DEFAULT_NTOP_MAX_NUM_SESSIONS is set lower there, so the limit might
also be reached, triggering the clear routine and the segfault, as the
0xffffffff value is used in various places and could easily be stored in
many memory locations.
So, I might try to work around this by raising
DEFAULT_NTOP_MAX_NUM_SESSIONS to see if that helps. Taking a deeper look at
what happens when this bound is reached might also help me understand and
eliminate this.
I hope this helps, as I have seen similar postings to this in the threads.
--Brian
________________________________________
From: [email protected]
[[email protected]] on behalf of Luca Deri [[email protected]]
Sent: Saturday, November 05, 2011 6:22 AM
To: [email protected]
Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
Brian
thanks for your report. I do not have the ability to reproduce the crash you
reported using the code in SVN (this is the only version I can support). Can
you please crash ntop, generate a core and analyze it a bit so that I can
understand where the problem could be? Before doing that, please resync with
SVN.
Thanks for your support Luca
On Nov 4, 2011, at 5:09 PM, Brian Behrens wrote:
Hello,
I have been working for days trying to resolve a segfault issue like the
following:
Nov 4 10:46:54 NTOP-SC kernel: ntop[25479]: segfault at 645 ip
00007f95f3cf3395 sp 00007f95e9b75ae8 error 6 in
libntop-4.1.1.so[7f95f3cb9000+56000]
The environment is an ESX 5 VM.
Guest OS I have tried:
1. CentOS 6
2. Fedora 15
3. Network Security Toolkit (uses revision 4865 of the current dev tree)
Versions I have tried:
1. Current dev tree.
2. Current stable version (4.1.0)
The times at which these faults occur vary, but they are related to network
load.
My test networks:
1. Simple home network with all packets going to NTOP.
2. High load work network that can see 25 Gig in 15 mins.
The most stable configuration I have seen is a clean CentOS install with
ntop built from the trunk tree, installed, and run.
The quickest segfault occurs when I implement PF_RING, use an e1000 card in
the VM, and use the PF_RING-aware e1000 driver. I can usually get a segfault
within 30 minutes on the busy network.
The common theme is the segfaulting. I did attempt a gdb session on the
device once and saw a malloc issue, but all these VMs have 4 GB of memory,
and I have tried tuning different hash sizes to see how that impacts the
issue, but it really never does. With smaller hash values I just get more
low-memory messages, etc.
I am really not sure what else to do, if there is anything I can do to present
more information, please let me know as I would like to stop this incessant
segfaulting.
_______________________________________________
Ntop-misc mailing list
[email protected]
http://listgateway.unipi.it/mailman/listinfo/ntop-misc
---
We can't solve problems by using the same kind of thinking we used when we
created them - Albert Einstein