(1) Packet counts. Yes - if NPS is dead/hung, the received count won't rise.
(2) StopCap appears in the standard page footer (in
red):
Packet capture stopped
Plus, a bunch of the threads would have shut down, as there's
no more work for them. Bad thought...
(3) You missed the interesting part of the trace. The
util.c line 71 is just a utility routine - not much info.
Let's skip this for a second. If you follow the
analysis below, I have a rough hypothesis... that a host entry just happens to
become active as it's being purged. I think that makes it a locking
problem. Luca re-wrote that code, and he's the one who commented out the
accessMutex() calls in _getFirstHost, et al, so I'm going to dump this in his
lap...
However, you could try making this change in
hash.c around line 259:
void freeHostInfo(HostTraffic *host, int actualDeviceId) {
  ...
  if(host->magic != CONST_MAGIC_NUMBER) {
    traceEvent(CONST_TRACE_WARNING, "Error: bad magic number (expected=%d/real=%d) freeHostInfo()",
               CONST_MAGIC_NUMBER, host->magic);
  }
to
void freeHostInfo(HostTraffic *host, int actualDeviceId) {
  ...
  if(host->magic != CONST_MAGIC_NUMBER) {
    traceEvent(CONST_TRACE_WARNING, "Error: bad magic number (expected=%d/real=%d) freeHostInfo()",
               CONST_MAGIC_NUMBER, host->magic);
  } else
    host->magic = 999;
That way if somebody tries to update an entry as it's
being freed, s/he should (might) trip one of the magic number tests
elsewhere in the code (I *swear* that I had that in the code once upon a time,
for just this reason). That would then tell us WHICH process is getting
stepped on.
OK, continue the analysis of (3),
above...
Trace the code back further and it
shows where it was called from - you find yourself in
resolveAddress(). And then further back all the way to processIpPkt
etc.
#7 0x400c296c in lookupHost (hostIpAddress=0x452fe4f8, ether_addr=0x452fe8c8 "", vlanId=-1, checkForMultihoming=1 '\001', forceUsingIPaddress=0 '\0',
    actualDeviceId=0) at hash.c:1101
#8 0x400ce65f in processIpPkt (bp=0x452fe926 "E", h=0x453009a8, length=522, ether_src=0x452fe8ce "", ether_dst=0x452fe8c8 "", actualDeviceId=0, vlanId=-1)
    at pbuf.c:1088
static void processIpPkt(const u_char *bp,
                         const struct pcap_pkthdr *h,
                         u_int length,
                         u_char *ether_src,
                         u_char *ether_dst,
                         int actualDeviceId,
                         int vlanId) {
This is what handles a packet that has a TCP/IP
payload. You can see the packet buffers (bp and h) and the length of the
packet, 522. That's all normal and doesn't look odd.
#6 0x400bb858 in ipaddr2str (hostIpAddress=
    {hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, updateHost=1) at address.c:871
At this level we've figured out it's an IPv4 packet (hostFamily 2) with address
0x84d292c3, i.e. 132.210.146.195.
symbolic=0x452fe148 "132.210.146.195",
type=19
This means we've so far been unable to resolve the name, so we're marking it as an IP
address. Again, all that's normal. Here's the code in
updateDeviceHostNameInfo():
for(el=getFirstHost(actualDeviceId); el != NULL; el = getNextHost(actualDeviceId, el)) {
  if((el->hostNumIpAddress != NULL) && (addrcmp(&el->hostIpAddress, &addr) == 0)) {
    accessAddrResMutex("updateHostNameInfo");

    if(el != NULL) {
      unsigned short i;

      if(strlen(symbolic) >= (MAX_LEN_SYM_HOST_NAME-1))
        symbolic[MAX_LEN_SYM_HOST_NAME-2] = '\0';

      for(i=0; i<strlen(symbolic); i++)
        if(isupper(symbolic[i])) tolower(symbolic[i]);

      setResolvedName(el, symbolic, type);
    }

    releaseAddrResMutex();
  }
}
If you read the ntop code, that's walking through the entire HostTraffic structure, looking
for the entry which matches the given IP address to set the 'resolved
name'. And it's during this scan that it finds the bad
magic - not on this entry, but on one of the entries it checks along the
walk.
So
what we've learned is that it's not the NEW host that's the problem, it's just
one of them in memory. That's a worse problem, as it could be ANYTHING,
except the thing we can actually see in the code. All we know is
that ntop somehow corrupted a HostTraffic entry.
We can pretty well assume that the problem really did
happen around 0'dark&scary, since every new host goes through
this process.
In
your original log, it SEEMS to happen during idlePurge.
Aug 9 01:49:24 gbp ntop[30628]: [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1385 [out of 10360] hosts selected
Aug 9 01:49:24 gbp ntop[30628]: [MSGID9123622] **WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]
Aug 9 01:49:24 gbp last message repeated 11 times
Aug 9 01:49:24 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1385/10358 hosts deleted, elapsed time is 0.044406 seconds (0.000032 per host)
Or in
this code:
traceEvent(CONST_TRACE_NOISY,
           "IDLE_PURGE: Device %d [%s] FINISHED selection, %d [out of %d] hosts selected",
           actDevice, myGlobals.device[actDevice].name, numHosts, scannedHosts);

/* Now free the entries */
for(idx=0; idx<numHosts; idx++) {
#ifdef IDLE_PURGE_DEBUG
  traceEvent(CONST_TRACE_INFO, "IDLE_PURGE_DEBUG: Purging host %d [last seen=%d]... %s",
             idx, theFlaggedHosts[idx]->lastSeen, theFlaggedHosts[idx]->hostResolvedName);
#endif
  freeHostInfo(theFlaggedHosts[idx], actDevice);
  numFreedBuckets++;
  ntop_conditional_sched_yield(); /* Allow other threads to run */
}

free(theFlaggedHosts);

if(myGlobals.runningPref.enableSessionHandling)
  scanTimedoutTCPSessions(actDevice); /* let's check timedout sessions too */

gettimeofday(&hiresTimeEnd, NULL);
hiresDeltaTime = timeval_subtract(hiresTimeEnd, hiresTimeStart);

if(numFreedBuckets > 0)
  traceEvent(CONST_TRACE_NOISY,
             "IDLE_PURGE: Device %d [%s]: %d/%d hosts deleted, elapsed time is %.6f seconds (%.6f per host)",
             actDevice, myGlobals.device[actDevice].name,
             numFreedBuckets, maxHosts,
             hiresDeltaTime,
             hiresDeltaTime / numFreedBuckets);
else
  traceEvent(CONST_TRACE_NOISY, "IDLE_PURGE: Device %s: no hosts [out of %d] deleted",
             myGlobals.device[actDevice].name, maxHosts);
If I had to guess, it
would be that some path in the code isn't locking HostTraffic properly and so it
stepped on the free routine. Or maybe freeHostInfo() needs to do the lock,
or ????
But finding it is going to be a b*tch...
-----Burton
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc Mazuhelli
Sent: Tuesday, August 09, 2005 1:59 PM
To: [email protected]
Subject: Re: [Ntop-dev] A lot of "**WARNING** Error: bad magic number[expected=1968/real=0][deviceId=0]" messages...
Thanks for the rapid and detailed response!
I didn’t restart my ntop instance yet in case it could be used for some tests, so it’s still available for debugging with gdb.
Regarding the packet counts:
After a few more hours, the counts are exactly the same:
I think you are right - the NPA (Network Packet Analyzer) or NPS (Network Packet Sniffing) thread(s) has/have died and that's why purge gradually ages every host out of the structures - it's not seeing any more traffic.
I'd bet NPA because of this:
Packets
Received.....310,400,744
Processed Immediately.....199,990,101 (64.4 %)
Queued.....0 (0.0 %)
(I got these from a different browser to make sure it didn’t come from the browser cache). It looks to me like the NPS thread (at least) is dead.
I didn’t see any mention of this in any web page this morning (and I still don’t now). Where would it appear? In the footer with the “Report created on ...” text?
So, I wonder if we were in stopcap? If so, it should have been indicated at the bottom of every web page. (I'm going to commit a patch to add the run state to the info.html / textinfo.html page.)
I did that – see below.
You can use gdb to connect to the running ntop. For example:
<snip>
Check the run mode:
I don’t know how to interpret the “$1 = 4” after the command: print myGlobals.ntopRunState.
I pasted (in dark red) the corresponding lines from the log on each thread. Interesting: it seems the NPA and NPS threads are not dead after all, but the NPS thread (#4) seems to be in a weird state.
It seems the “Bad magic number” message comes from the call of traceEvent at line 71 of util.c:
<snip>
Loaded symbols for /usr/local/ntop3cvs/lib/ntop/plugins/snmpPlugin.so
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0xffffe002 in ?? ()
(gdb) info stack
#0 0xffffe002 in ?? ()
#1 0x400eac32 in _ntopSleepWhileSameState (file=0x8052114 "main.c", line=588, uSeconds=4294967292) at util.c:3997
#2 0x0804ad2b in main (argc=4, argv=0xbfffe4e4) at main.c:588
#3 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
(gdb) frame 1
#1 0x400eac32 in _ntopSleepWhileSameState (file=0x8052114 "main.c", line=588, uSeconds=4294967292) at util.c:3997
3997 rc = _ntopSleepMSWhileSameState(file, line, 1000L*uSeconds) / 1000L;
(gdb) print myGlobals.ntopRunState
$1 = 4
(gdb) info threads
9 Thread 1117223728 (LWP 30631) 0xffffe002 in ?? () <== THREADMGMT[t1117223728]: NPA: network packet analyzer (packet processor) thread running [p30628]
8 Thread 1125616432 (LWP 30632) 0xffffe002 in ?? () <== THREADMGMT[t1125616432]: SFP: Fingerprint scan thread starting [p30628]
7 Thread 1134009136 (LWP 30633) 0xffffe002 in ?? () <== THREADMGMT[t1134009136]: SIH: Idle host scan thread running [p30628]
6 Thread 1142401840 (LWP 30634) 0xffffe002 in ?? () <== THREADMGMT[t1142401840]: DNSAR(1): Address resolution thread running [p30628]
5 Thread 1150794544 (LWP 30636) 0xffffe002 in ?? () <== THREADMGMT[t1150794544]: WEB: Server connection thread starting [p30628]
4 Thread 1160776496 (LWP 30637) 0xffffe002 in ?? () <== THREADMGMT[t1160776496]: NPS(1): Started thread for network packet sniffing
3 Thread 1195379504 (LWP 31115) 0xffffe002 in ?? () <== THREADMGMT[t1195379504]: RRD: Data collection thread starting [p30628]
2 Thread 1205865264 (LWP 31116) 0xffffe002 in ?? () <== THREADMGMT[t1205865264]: RRD: Throughput data collection: Thread starting [p30628]
1 Thread 1102172480 (LWP 30628) 0xffffe002 in ?? () <== THREADMGMT[t1102172480]: Now running as a daemon
(gdb) thread 9
[Switching to thread 9 (Thread 1117223728 (LWP 30631))]#0 0xffffe002 in ?? ()
(gdb) info stack
#0 0xffffe002 in ?? ()
#1 0x400d2692 in dequeuePacket (notUsed=0x0) at pbuf.c:2407
#2 0x41819484 in start_thread () from /lib/tls/libpthread.so.0
(gdb) thread 4
[Switching to thread 4 (Thread 1160776496 (LWP 30637))]#0 0xffffe002 in ?? ()
(gdb) info stack
#0 0xffffe002 in ?? ()
#1 0x420daf2f in syslog () from /lib/tls/libc.so.6
#2 0x400e87d9 in traceEvent (eventTraceLevel=2, file=0x400fa00e "util.c", line=71,
format=0x400fa820 "Error: bad magic number [expected=%d/real=%d][deviceId=%d]") at util.c:2665
#3 0x400e4a21 in _getFirstHost (actualDeviceId=0, beginIdx=1) at util.c:71
#4 0x400ba79f in updateDeviceHostNameInfo (addr=
{hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, symbolic=0x452fe148 "132.210.146.195", actualDeviceId=0, type=19)
at address.c:54
#5 0x400ba832 in updateHostNameInfo (addr=
{hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, symbolic=0x452fe148 "132.210.146.195", type=19) at address.c:81
#6 0x400bb858 in ipaddr2str (hostIpAddress=
{hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, updateHost=1) at address.c:871
#7 0x400c296c in lookupHost (hostIpAddress=0x452fe4f8, ether_addr=0x452fe8c8 "", vlanId=-1, checkForMultihoming=1 '\001', forceUsingIPaddress=0 '\0',
actualDeviceId=0) at hash.c:1101
#8 0x400ce65f in processIpPkt (bp=0x452fe926 "E", h=0x453009a8, length=522, ether_src=0x452fe8ce "", ether_dst=0x452fe8c8 "", actualDeviceId=0, vlanId=-1)
at pbuf.c:1088
#9 0x400d59e0 in processPacket (_deviceId=0x0, h=0x453009a8, p=0x452fe918 "") at pbuf.c:3528
#10 0x400d2356 in queuePacket (_deviceId=0x0, h=0x453009a8, p=0x80c7c72 "") at pbuf.c:2317
#11 0x419a15aa in pcap_read_packet (handle=0x80c7ad0, callback=0x400d21a8 <queuePacket>, userdata=0x42f8f110 "") at pcap-linux.c:638
#12 0x419a2b9a in pcap_loop (p=0x80c7ad0, cnt=-1, callback=0x400d21a8 <queuePacket>, user=0x0) at pcap.c:81
#13 0x400c9406 in pcapDispatch (_i=0x0) at ntop.c:97
#14 0x41819484 in start_thread () from /lib/tls/libpthread.so.0
(gdb)
Back to what's wrong...
* There are 10K hosts - so if it's something really, really generic, we would see it much more often and much sooner.
How much memory on the host?
Are you running out?
Please scan back in the log for allocation failure messages.
The host has 4 gigs of memory (I wrote 1.5 in the original message but that was from memory - I just checked and it has 4 gigs of real memory). I haven’t seen it run out recently and no messages to that effect appear in the logs.
I don’t see anything that could have happened at 1:44 that seems obviously wrong. As you know 1:44 is in the middle of the night (OK maybe the start of the night depending on who you ask!) but I was asleep at home and dreaming of other things than ntop... :-)
* When you started to get the bad magic number messages, there were 11 instances.
What changed around 1:44? It looks like there are 18 hosts that were bad and yet most of them were idle at 1:49.
Does 11 ring any bells? You know, back to Mr. Rogers - some of these things are not like the others? Which of these things do not belong??
We are a university and we have students who have Internet access in the residence halls so there is always some traffic even in the middle of the night, but there is a lot more (3 to 4 times more) during the day.
I think what you ask for is included in the gdb trace above.
* Another question is WHICH bad magic number message this is? Can you run with a higher trace level so that we get the file/line info? (Again, I'm going to be committing a patch to improve this data).
It looks like the one from hash.c around 802 in lookupHost(), so it could be called from just about anywhere...
I will do that next and report here the next time it happens.
Next Steps: Grab the cvs (wait for the 621 and 622 commit messages) and rerun the test - when it goes bad, please do the gdb stuff above, and also grab the updated log messages. This should give us much more info.
Again, thank you for your support!
Regards,
Marc.
--
Marc Mazuhelli
Computer security analyst
Service des technologies de l'information
Universite de Sherbrooke
Sherbrooke, Quebec, Canada
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc Mazuhelli
Sent: Tuesday, August 09, 2005 9:25 AM
To: [email protected]
Subject: [Ntop] A lot of "**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...
<snip />
_______________________________________________
Ntop-dev mailing list
[email protected]
http://listgateway.unipi.it/mailman/listinfo/ntop-dev
