(ref: BMS0052) Looking into Mike Klinkert's [[EMAIL PROTECTED]] report that ntop was using 100% of the CPU, I found another case... (oh, yeah, Michael, I *have* seen your problem too, and will look at it next...)
This one is subtle. It was introduced by a code change in mid-February, plus one in early January that may never have really been right... but since it wasn't returning wrong results, it was never noticed.
But it's complex, so rather than just a patch file, I'm going to dump the analysis onto the mailing list too...
The symptom is 100% of available CPU used in thread 7, which is used for DNS address resolution.
Interrupting with gdb shows this:
(gdb) thread 7
[Switching to thread 7 (Thread 5126 (LWP 31227))]
#0  0x404cc0fc in get_next_key () from /usr/lib/libgdbm.so.2
(gdb) info stack
#0  0x404cc0fc in get_next_key () from /usr/lib/libgdbm.so.2
#1  0x404cc234 in gdbm_firstkey () from /usr/lib/libgdbm.so.2
#2  0x4027aef1 in dequeueAddress (notUsed=0x0) at address.c:436
#3  0x404d7c6f in pthread_start_thread (arg=0x4300bbe0) at manager.c:284
#4  0x404d7d5f in pthread_start_thread_event (arg=0x4300bbe0) at manager.c:308
(gdb) print myGlobals.addressQueueLen
$1 = 5 (or some non-zero value)
Part of the problem is that the value to look up isn't being properly
stored, but even after fixing that, there is still a problem. The "best"
way to trip this is to hit one of those annoying sites with tens of
popups/popunders... BUT: anything that causes a DNS lookup can trigger it,
IF it queues more than one lookup of the same host.
The lookup code is in address.c:
    static void queueAddress(struct in_addr elem)     @ line 337ff
    void* dequeueAddress(void* notUsed _UNUSED_)      @ line 399ff
Now the first thing to note is that the loop we actually use in
dequeueAddress is NOT guaranteed to retrieve every key. In fact, it's
EXACTLY the one warned about in the documentation! Surprised? Quoting from
gdbm manual (man 3 gdbm):
    "The next two routines allow for accessing all items in the
     database. This access is not key sequential, but it is
     guaranteed to visit every key in the database once. (The
     order has to do with the hash values.)

          key = gdbm_firstkey ( dbf )
          nextkey = gdbm_nextkey ( dbf, key )

     Dbf is the pointer returned by gdbm_open. Key is the key
     data.

     The return values are both of type datum. If the dptr
     element of the return value is NULL, there is no first key
     or next key. Again notice that dptr points to data allocated
     by malloc(3C) and gdbm will not free it for you.
     These functions were intended to visit the database in
     read-only algorithms, for instance, to validate the
     database or similar operations.

     File `visiting' is based on a `hash table'. gdbm_delete
     re-arranges the hash table to make sure that any collisions
     in the table do not leave some item `un-findable'.
     The original key order is NOT guaranteed to remain
     unchanged in ALL instances. It is possible that some key
     will not be visited if a loop like the following is
     executed:

          key = gdbm_firstkey ( dbf );
          while ( key.dptr ) {
               nextkey = gdbm_nextkey ( dbf, key );
               if ( some condition ) {
                    gdbm_delete ( dbf, key );
                    free ( key.dptr );
               }
               key = nextkey;
          }"
So it is possible for gdbm_firstkey/gdbm_nextkey to return a NULL dptr and
yet have a queue length > 0. However, we should pick it up on the next
cycle of the while loop! So that's not the problem...
Adding the obvious traceEvent shows it dequeuing a NULL value... which
wasn't right.
That was an easy fix once I figured out what had changed - I had to diff
the CVS back a couple of revisions to find this:
348a350
> /*
351a354,356
> */
> key_data.dptr = (void*)&elem.s_addr;
> key_data.dsize = 4;
which changed this:
sprintf(tmpBuf, "%u", elem.s_addr);
key_data.dptr = tmpBuf;
key_data.dsize = strlen(tmpBuf)+1;
to this:
key_data.dptr = (void*)&elem.s_addr;
key_data.dsize = 4;
but that means the sprintf that sets tmpBuf is also commented out. The
fix is to move the start of the comment down one line.
However, that left the question of visiting all the keys. So, I dumped some
traceEvents in there and behold, it does happen:
04/Apr/2002 11:06:12 Queued address '2205334533' [addr queue=1/max=1]
04/Apr/2002 11:06:12 Dequeued address... [2205334533][key=<>] (#addr=0)
04/Apr/2002 11:06:13 Queued address '2205334533' [addr queue=1/max=1]
04/Apr/2002 11:06:13 Queued address '2205291786' [addr queue=2/max=2]
04/Apr/2002 11:06:13 Resolved address 2205334533
04/Apr/2002 11:06:13 Dequeued address... [2205291786][key=<>] (#addr=1)
04/Apr/2002 11:06:13 Resolved address 2205291786
04/Apr/2002 11:06:13 *BMS* gdbm_nextkey() returned NULL but queue length is 1
04/Apr/2002 11:06:13 Dequeued address... [2205334533][key=<>] (#addr=0)
04/Apr/2002 11:06:13 Resolved address 2205334533
However, that block of code is wrapped in a loop that explicitly handles
this, and the only alternative would be to hold the mutex through the
entire resolve process, then go back and delete all the entries, and only
then free the mutex - which would end up hanging ntop.
So we pretty much have to do what is being done and live with it!
So be it...
Of course, it's never THAT simple, is it? With that problem stripped away,
the real one popped up. Basically, on every store into the gdbm database,
the queue length was being incremented - regardless of whether it stored a
new value or replaced an existing one. So, on a page that retrieved
multiple images (for example) from the same host, ntop ends up with a
queue length bigger than the actual # of database items and never gets
out of the loop.
Other than burning CPU, this really doesn't hurt anything (surprise!),
because if something IS stored into the database while we're looping, it
will be properly retrieved and resolved (and then the empty looping
continues). And since the mutex is freed inside this loop and there are
enough system calls in it, anything else that needs the CPU still gets
it...
However, this should all be fixed so that we don't get dinged for the CPU.
I took the opportunity to eliminate the fetch/store cycle and clean up the
indentation in the routine; the attached patch fixes all of this...
Comments welcomed!
-----Burton
BMS0052-addressres-100pct-cpu.patch
Description: Binary data
