Title: Re: [Ntop-dev] A lot of "**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...
(1) Packet counts.  Yeah if NPS is dead/hung, received won't rise.
 
(2) StopCap appears in the standard page footer (in red):
 
Packet capture stopped
 
Plus, a bunch of the threads would have shutdown as there's no more work for them.  Bad thought...
 
(3) You missed the interesting part of the trace.  The util.c line 71 is just a utility routine - not much info.
 
Let's skip this for a second.  If you follow the analysis below, I have a rough hypothesis... that a host entry just happens to become active as it's being purged.  I think that makes it a locking problem.  Luca re-wrote that code, and he's the one who commented out the accessMutex() calls in _getFirstHost, et al, so I'm going to dump this in his lap...
 
However, you could try making this change in hash.c around 259:
 
void freeHostInfo(HostTraffic *host, int actualDeviceId) {
...
  if(host->magic != CONST_MAGIC_NUMBER) {
    traceEvent(CONST_TRACE_WARNING, "Error: bad magic number (expected=%d/real=%d) freeHostInfo()",
               CONST_MAGIC_NUMBER, host->magic);
  }
 
to
 
void freeHostInfo(HostTraffic *host, int actualDeviceId) {
...
  if(host->magic != CONST_MAGIC_NUMBER) {
    traceEvent(CONST_TRACE_WARNING, "Error: bad magic number (expected=%d/real=%d) freeHostInfo()",
               CONST_MAGIC_NUMBER, host->magic);
  } else
    host->magic = 999;
 
That way if somebody tries to update an entry as it's being freed, s/he should (might) trip one of the magic number tests elsewhere in the code (I *swear* that I had that in the code once upon a time, for just this reason).  That would then tell us WHICH process is getting stepped on.
 
 
 
 
OK, continue the analysis of (3), above...
 
Trace the code back further and it shows where it was called from - you find yourself in resolveAddress(). And then further back all the way to processIpPkt etc.
 
#7  0x400c296c in lookupHost (hostIpAddress=0x452fe4f8, ether_addr=0x452fe8c8 "", vlanId=-1, checkForMultihoming=1 '\001', forceUsingIPaddress=0 '\0',
    actualDeviceId=0) at hash.c:1101
#8  0x400ce65f in processIpPkt (bp=0x452fe926 "E", h=0x453009a8, length=522, ether_src=0x452fe8ce "", ether_dst=0x452fe8c8 "", actualDeviceId=0, vlanId=-1)
    at pbuf.c:1088
 
static void processIpPkt(const u_char *bp,
                         const struct pcap_pkthdr *h,
                         u_int length,
                         u_char *ether_src,
                         u_char *ether_dst,
                         int actualDeviceId,
                         int vlanId) {
 
is what handles a packet that has a TCP/IP payload.  You can see the packet buffers (bp and h) and the length of the packet 522.   That's all normal and doesn't look that odd.
 
#6  0x400bb858 in ipaddr2str (hostIpAddress=
              {hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, updateHost=1) at address.c:871
 
At this level we've figured out it's an IPv4 packet (hostFamily 2) with address 0x84d292c3 or 132.210.146.195
 
symbolic=0x452fe148 "132.210.146.195", type=19
 
Means we've so far been unable to resolve the name and so we're marking it as an IP address.  Again, all that's normal.  Here's the code in updateDeviceHostNameInfo():
 
  for(el=getFirstHost(actualDeviceId); el != NULL; el = getNextHost(actualDeviceId, el)) {
    if((el->hostNumIpAddress != NULL) && (addrcmp(&el->hostIpAddress, &addr) == 0)) {
      accessAddrResMutex("updateHostNameInfo");
 
      if(el != NULL) {
        unsigned short i;
 
        if(strlen(symbolic) >= (MAX_LEN_SYM_HOST_NAME-1))
          symbolic[MAX_LEN_SYM_HOST_NAME-2] = '\0';
 
        for(i=0; i<strlen(symbolic); i++)
          if(isupper(symbolic[i])) tolower(symbolic[i]);
 
        setResolvedName(el, symbolic, type);
      }
 
      releaseAddrResMutex();
    }
  }
 
If you read ntop-c, that's walking through the entire HostTraffic structure, looking for the entry which matches the given IP address to set the 'resolved name'.   And it's in this scan where it's finding the bad magic.  Not on this entry, but on one of them it checks along the walk.
 
So what we've learned is that it's not the NEW host that's the problem, it's just one of them in memory.  That's a worse problem, as it could be ANYTHING, except the thing we can actually see in the code.  All we know is that ntop somehow corrupted a HostTraffic entry.  We can pretty well assume that the problem really did happen around 0'dark&scary, since every new host goes through this process.
 
In your original log, it SEEMS to happen during idlePurge.
 
Aug  9 01:49:24 gbp ntop[30628]:  [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1385 [out of 10360] hosts selected
Aug  9 01:49:24 gbp ntop[30628]:  [MSGID9123622] **WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]
Aug  9 01:49:24 gbp last message repeated 11 times
Aug  9 01:49:24 gbp ntop[30628]:  [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1385/10358 hosts deleted, elapsed time is 0.044406 seconds (0.000032 per host)
 
Or in this code:
 
  traceEvent(CONST_TRACE_NOISY, "IDLE_PURGE: Device %d [%s] FINISHED selection, %d [out of %d] hosts selected",
             actDevice, myGlobals.device[actDevice].name, numHosts, scannedHosts);
            
  /* Now free the entries */
  for(idx=0; idx<numHosts; idx++) {
#ifdef IDLE_PURGE_DEBUG
    traceEvent(CONST_TRACE_INFO, "IDLE_PURGE_DEBUG: Purging host %d [last seen=%d]... %s",
               idx, theFlaggedHosts[idx]->lastSeen, theFlaggedHosts[idx]->hostResolvedName);
#endif
    freeHostInfo(theFlaggedHosts[idx], actDevice);
    numFreedBuckets++;
    ntop_conditional_sched_yield(); /* Allow other threads to run */
  }
 
  free(theFlaggedHosts);
 
  if(myGlobals.runningPref.enableSessionHandling)
    scanTimedoutTCPSessions(actDevice); /* let's check timedout sessions too */
 
  gettimeofday(&hiresTimeEnd, NULL);
  hiresDeltaTime = timeval_subtract(hiresTimeEnd, hiresTimeStart);
 
  if(numFreedBuckets > 0)
    traceEvent(CONST_TRACE_NOISY,
               "IDLE_PURGE: Device %d [%s]: %d/%d hosts deleted, elapsed time is %.6f seconds (%.6f per host)",
               actDevice, myGlobals.device[actDevice].name,
               numFreedBuckets, maxHosts,
               hiresDeltaTime,
               hiresDeltaTime / numFreedBuckets);
  else
    traceEvent(CONST_TRACE_NOISY, "IDLE_PURGE: Device %s: no hosts [out of %d] deleted",
               myGlobals.device[actDevice].name, maxHosts);
 
If I had to guess, it would be that some path in the code isn't locking HostTraffic properly and so it stepped on the free routine.  Or maybe freeHostInfo() needs to do the lock, or ????
 
But finding it is going to be a b*tch... 
 
 
 
-----Burton


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc Mazuhelli
Sent: Tuesday, August 09, 2005 1:59 PM
To: [email protected]
Subject: Re: [Ntop-dev] A lot of "**WARNING** Error: bad magic number[expected=1968/real=0][deviceId=0]" messages...

Hi Burton,

Thanks for the rapid and detailed response!

I didn’t restart my ntop instance yet in case it could be used for some tests, so it’s still available for debugging with gdb.

Regarding the packet counts:

I think you are right - the NPA (Network Packet Analyzer) or NPS (Network Packet Sniffing) thread(s) has/have died and that's why purge gradually ages every host out of the structures - it's not seeing any more traffic.

I'd bet NPA because of this:

Packets

Received.....310,400,744
Processed Immediately.....199,990,101 (64.4 %)
Queued.....0 (0.0 %)


After a few more hours, the counts are exactly the same:

Packets

Received.....310,400,744
Processed Immediately.....199,990,101 (64.4 %)
Queued.....0 (0.0 %)

(I got these from a different browser to make sure it didn’t come from the browser cache). It looks to me like the NPS thread (at least) is dead.


So, I wonder if we were in stopcap?  If so, it should have been indicated at the bottom of every web page.  (I'm going to commit a patch to add the run state to the info.html / textinfo.html page.)

I didn’t see any mention of this in any web page this morning (and I still don’t now). Where whould it appear? In the footer with the “Report created on ...” text?

You can use gdb to connect to the running ntop.  For example:
<snip>
Check the run mode:

I did that – see below.
I don’t know how to interpret the “$1 = 4” after the command: print myGlobals.ntopRunState.

I pasted (in dark red) the corresponding lines from the log on each thread. Interesting: it seems the NPA and NPS threads are not dead after all, but the NPS thread (#4) seems to be in a weird state.

It seems the “Bad magic number” message comes from the call of traceEvent at line 71 of util.c:

<snip>
Loaded symbols for /usr/local/ntop3cvs/lib/ntop/plugins/snmpPlugin.so
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0xffffe002 in ?? ()
(gdb) info stack
#0  0xffffe002 in ?? ()
#1  0x400eac32 in _ntopSleepWhileSameState (file=0x8052114 "main.c", line=588, uSeconds=4294967292) at util.c:3997
#2  0x0804ad2b in main (argc=4, argv=0xbfffe4e4) at main.c:588
#3  0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
(gdb) frame 1
#1  0x400eac32 in _ntopSleepWhileSameState (file=0x8052114 "main.c", line=588, uSeconds=4294967292) at util.c:3997
3997      rc = _ntopSleepMSWhileSameState(file, line, 1000L*uSeconds) / 1000L;
(gdb) print myGlobals.ntopRunState
$1 = 4
(gdb) info threads
  9 Thread 1117223728 (LWP 30631)  0xffffe002 in ?? () <==
THREADMGMT[t1117223728]: NPA: network packet analyzer (packet processor) thread running [p30628]
 8 Thread 1125616432 (LWP 30632)  0xffffe002 in ?? () <== THREADMGMT[t1125616432]: SFP: Fingerprint scan thread starting [p30628]
 7 Thread 1134009136 (LWP 30633)  0xffffe002 in ?? () <== THREADMGMT[t1134009136]: SIH: Idle host scan thread running [p30628]
 6 Thread 1142401840 (LWP 30634)  0xffffe002 in ?? () <== THREADMGMT[t1142401840]: DNSAR(1): Address resolution thread running [p30628]
 5 Thread 1150794544 (LWP 30636)  0xffffe002 in ?? () <== THREADMGMT[t1150794544]: WEB: Server connection thread starting [p30628]
 4 Thread 1160776496 (LWP 30637)  0xffffe002 in ?? () <== THREADMGMT[t1160776496]: NPS(1): Started thread for network packet sniffing
 3 Thread 1195379504 (LWP 31115)  0xffffe002 in ?? () <== THREADMGMT[t1195379504]: RRD: Data collection thread starting [p30628]
 2 Thread 1205865264 (LWP 31116)  0xffffe002 in ?? () <== THREADMGMT[t1205865264]: RRD: Throughput data collection: Thread starting [p30628]
 1 Thread 1102172480 (LWP 30628)  0xffffe002 in ?? () <== THREADMGMT[t1102172480]: Now running as a daemon
(gdb) thread 9
[Switching to thread 9 (Thread 1117223728 (LWP 30631))]#0  0xffffe002 in ?? ()
(gdb) info stack
#0  0xffffe002 in ?? ()
#1  0x400d2692 in dequeuePacket (notUsed=0x0) at pbuf.c:2407
#2  0x41819484 in start_thread () from /lib/tls/libpthread.so.0
(gdb) thread 4
[Switching to thread 4 (Thread 1160776496 (LWP 30637))]#0  0xffffe002 in ?? ()
(gdb) info stack
#0  0xffffe002 in ?? ()
#1  0x420daf2f in syslog () from /lib/tls/libc.so.6
#2  0x400e87d9 in traceEvent (eventTraceLevel=2, file=0x400fa00e "util.c", line=71,
    format=0x400fa820 "Error: bad magic number [expected=%d/real=%d][deviceId=%d]") at util.c:2665
#3  0x400e4a21 in _getFirstHost (actualDeviceId=0, beginIdx=1) at util.c:71
#4  0x400ba79f in updateDeviceHostNameInfo (addr=
              {hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, symbolic=0x452fe148 "132.210.146.195", actualDeviceId=0, type=19)
    at address.c:54
#5  0x400ba832 in updateHostNameInfo (addr=
              {hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, symbolic=0x452fe148 "132.210.146.195", type=19) at address.c:81
#6  0x400bb858 in ipaddr2str (hostIpAddress=
              {hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, updateHost=1) at address.c:871
#7  0x400c296c in lookupHost (hostIpAddress=0x452fe4f8, ether_addr=0x452fe8c8 "", vlanId=-1, checkForMultihoming=1 '\001', forceUsingIPaddress=0 '\0',
    actualDeviceId=0) at hash.c:1101
#8  0x400ce65f in processIpPkt (bp=0x452fe926 "E", h=0x453009a8, length=522, ether_src=0x452fe8ce "", ether_dst=0x452fe8c8 "", actualDeviceId=0, vlanId=-1)
    at pbuf.c:1088
#9  0x400d59e0 in processPacket (_deviceId=0x0, h=0x453009a8, p=0x452fe918 "") at pbuf.c:3528
#10 0x400d2356 in queuePacket (_deviceId=0x0, h=0x453009a8, p=0x80c7c72 "") at pbuf.c:2317
#11 0x419a15aa in pcap_read_packet (handle=0x80c7ad0, callback=0x400d21a8 <queuePacket>, userdata=0x42f8f110 "") at pcap-linux.c:638
#12 0x419a2b9a in pcap_loop (p=0x80c7ad0, cnt=-1, callback=0x400d21a8 <queuePacket>, user=0x0) at pcap.c:81
#13 0x400c9406 in pcapDispatch (_i=0x0) at ntop.c:97
#14 0x41819484 in start_thread () from /lib/tls/libpthread.so.0
(gdb)



Back to what's wrong...

* There are 10K hosts - so if it's something really, really generic, we would see it much more often and much sooner.

     How much memory on the host?
      Are you running out?
      Please scan back in the log for allocation failure messages.

The host has 4 gigs of memory (I wrote 1.5 in the original message but that was from memory - I just checked and it has 4 gigs of real memory). I haven’t seen it run out recently and no messages to that effect appear in the logs.

* When you started to get the bad magic number messages, there were 11 instances.

    What changed around 1:44?  It looks like there are 18 hosts that were bad and yet most of them were idle at 1:49.  

    Does 11 ring any bells?  You know, back to Mr. Rogers - some of these things are not like the others?  Which of these things do not belong??

I don’t see anything that could have happened at 1:44 that seems obviously wrong. As you know 1:44 is in the middle of the night (OK maybe the start of the night depending on who you ask!) but I was asleep at home and dreaming of other things than ntop... :-)

We are a university and we have students who have Internet access in the residence halls so there is always some traffic even in the middle of the night, but there is a lot more (3 to 4 times more) during the day.

* Another question is WHICH bad magic number message this is?  Can you run with a higher trace level so that we get the file/line info??  (Again, I'm going to be commiting a patch to improve this data).  

   It looks like the one from hash.c around 802 in lookupHost(), so it could be called from just about anywhere...

I think what you ask for is included in the gdb trace above.

 
Next Steps:  Grab the cvs (wait for the 621 and 622 commit messages) and rerun the test - when it goes bad, please do the gdb stuff above, and also grab the updated log messages.  This should give us much more info.

I will do that next and report here the next time it happens.

Again, thank you for your support!

Regards,

Marc.
--
Marc Mazuhelli
Computer security analyst
Service des technologies de l'information
Universite de Sherbrooke
Sherbrooke, Quebec, Canada


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc Mazuhelli
Sent: Tuesday, August 09, 2005 9:25 AM
To: [email protected]
Subject: [Ntop] A lot of "**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...

 <snip />
_______________________________________________
Ntop-dev mailing list
[email protected]
http://listgateway.unipi.it/mailman/listinfo/ntop-dev

Reply via email to