Vlad,
Try increasing your per-thread stack size and see if the problem goes
away. If a thread uses more stack than it has been allotted, it can
corrupt other threads' stacks, which usually leads to the server
getting stuck. The parameter that controls this is below:
ns_param stacksize [expr 128*1024] ;# Per-thread stack size.
I don't think there is a reliable way to select a perfect stack size,
but I suggest doubling it to begin with, and if the problem happens
less frequently as a result, then you're on the right track.
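As a rough sketch, assuming your config still has the stock 128 KB
value and that the parameter sits in the ns/threads section as it does
in the sample config (check your own config file, since your build is
a modified 3.4), doubling it would look something like this:

```tcl
# Assumption: section name and starting value are taken from the
# stock sample config; adjust both to match your installation.
ns_section "ns/threads"
ns_param stacksize [expr 256*1024] ;# Per-thread stack size, doubled from 128 KB.
```

The server needs a restart to pick up the new value, and every thread
gets the larger stack, so keep an eye on memory use if you run many
threads.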
Should this not solve the problem, tell me if you're using nsopenssl
and if you're seeing your CPU(s) maxed out when the server hangs.
/s.
Scott Goodwin
e: [EMAIL PROTECTED]
k: 0x8CCA5533
On Jul 3, 2007, at 7:17 AM, Vlad Hociota wrote:
Hello folks.
I’m digging into this issue and thought maybe someone might
remember something from those days …
The software is a server based on the AOLserver 3.4 code, but with
some proprietary modules added on top.
The issue cannot be reproduced reliably, but sometimes we find that
one instance or another of the server (different installations, many
machines) is no longer servicing requests, even though the load is
very low. Upon inspecting a “locked” server (dumping core via an
attached gdb), we found that the conn threads were waiting to join
one another (as in a queue). I’m talking about the last sequence of
code in NsConnThread, where each thread that exits joins the one that
exited before it. The stack in those conn threads looks like this:
Thread 3 (Thread 1075616096 (LWP 9692)):
#0 0x0000003e7d508a7a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1 0x00000000004dbb94 in Ns_CondWait (condPtr=0x61b548, mutexPtr=0x61b540) at pthread.c:577
#2 0x00000000004d9098 in Ns_ThreadJoin (threadPtr=0x401c90c0, argPtr=0x0) at thread.c:186
#3 0x000000000043f126 in JoinConnThread (threadPtr=0x401c90c0) at serv.c:1000
#4 0x000000000043ebba in NsConnThread (ignored=0x0) at serv.c:738
#5 0x00000000004d912d in NsThreadMain (arg=0x803af00) at thread.c:225
#6 0x0000003e7d506137 in start_thread () from /lib64/tls/libpthread.so.0
#7 0x0000003e7b9c7113 in clone () from /lib64/tls/libc.so.6
You can safely ignore the line numbers in the source files; just
follow the sequence of calls and you get the idea.
The problem is that the first thread in line is waiting on the
condition variable, but the thread it is supposed to join no longer
exists (or so the core file indicates). Hence we get the deadlock.
I’m not necessarily implying that this is an issue in the nsd code
(serv.c or anything else); it could be something else. But has
anybody seen this kind of behavior before in AOLserver? Any hint
would be helpful.
Thanks,
Vlad
--
AOLserver - http://www.aolserver.com/
To Remove yourself from this list, simply send an email to
<[EMAIL PROTECTED]> with the
body of "SIGNOFF AOLSERVER" in the email message. You can leave the
Subject: field of your email blank.