Hi all,
I came across an issue in the shutdown sequence of Kannel smsbox that
looks to me like a potential deadlock during the shutdown phase.
On a loaded system bearerbox was SIGHUP'ed and hence instructed its
connected smsbox to go down too.
Bearerbox didn't shut down cleanly, so I forced it down with 'kill -9'.
The smsbox, though, kept running, so I looked into the gdb backtrace of
the process a bit more.
What I see is this (BTW, the line numbers don't match the svn trunk):
#1 0x000000000044596b in gwthread_join_every (func=0x41ba40
<obey_request_thread>) at gwlib/gwthread-pthread.c:744
#2 0x00000000004142c8 in main (argc=<value optimized out>,
argv=0x7fff05d24428) at gw/smsbox.c:3872
so main() was blocked in gwthread_join_every(), waiting for the
obey_request_thread()s.
Those threads were themselves blocked in:
#0 0x00007f809e117bd1 in sem_wait () from /lib/libpthread.so.0
#1 0x000000000041bdcb in obey_request_thread (arg=<value optimized
out>) at gw/smsbox.c:1346
i.e. in semaphore_down(max_pending_requests), right before the call to
http_start_request().
We know that the matching semaphore_up() is performed in
url_result_thread() once we get a response via
http_receive_result_real(), but that thread was itself blocked in:
#0 0x00007f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/libpthread.so.0
#1 0x000000000044e098 in gwlist_consume (list=0x1498e50) at
gwlib/list.c:478
#2 0x000000000044840c in http_receive_result_real (caller=0x1498e84,
status=0x44485054, final_url=0x44485018, headers=0x44484ff8,
body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
#3 0x000000000041a98e in url_result_thread (arg=<value optimized out>)
at gw/smsbox.c:1105
i.e. in gwlist_consume() on the HTTPCaller *caller.
Now, checking the shutdown sequence in main(), we see that we do:
...
gwthread_join_every(obey_request_thread);
http_caller_signal_shutdown(caller);
gwthread_join_every(url_result_thread);
...
so we remove the producer on HTTPCaller *caller only AFTER we have
joined the obey_request_thread()s, which are still blocked in
semaphore_down(). Nothing is left to wake url_result_thread() so that it
can perform the semaphore_up(), and this ends up in a deadlock IMO.
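To make the wait chain explicit, here is a small sketch in C that writes
it down as a wait-for graph and walks it. The enum names, the two tables
and is_deadlocked() are made up for illustration and are not part of
Kannel. It also assumes that, at this point, url_result_thread() can only
be woken by main(), since no obey_request_thread() gets as far as
http_start_request().

```c
#include <assert.h>

/* Hypothetical labels for the three parties in the backtraces. */
enum { MAIN_THREAD, OBEY_THREAD, RESULT_THREAD, NUM_THREADS, NOBODY = -1 };

/* waits_on[t] is the thread that must act before t can proceed. */
static const int original_order[NUM_THREADS] = {
    [MAIN_THREAD]   = OBEY_THREAD,   /* gwthread_join_every(obey_request_thread) */
    [OBEY_THREAD]   = RESULT_THREAD, /* semaphore_down() needs a semaphore_up() */
    [RESULT_THREAD] = MAIN_THREAD    /* gwlist_consume() needs the shutdown signal */
};

/* Assumed state once the shutdown signal has been sent before the join. */
static const int reordered[NUM_THREADS] = {
    [MAIN_THREAD]   = OBEY_THREAD,
    [OBEY_THREAD]   = RESULT_THREAD,
    [RESULT_THREAD] = NOBODY         /* already signalled, free to run */
};

/* Follow the chain from 'start'. Each thread waits on at most one other,
 * so taking NUM_THREADS steps without reaching NOBODY means a cycle. */
int is_deadlocked(const int waits_on[NUM_THREADS], int start)
{
    int t = start;
    int steps;

    for (steps = 0; steps < NUM_THREADS; steps++) {
        t = waits_on[t];
        if (t == NOBODY)
            return 0;
    }
    return 1;
}
```

With these tables, is_deadlocked(original_order, MAIN_THREAD) reports a
cycle, while is_deadlocked(reordered, MAIN_THREAD) does not.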
The resolution should simply be to move http_caller_signal_shutdown()
before gwthread_join_every(obey_request_thread) in the shutdown sequence.
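For anyone who wants to play with the two orderings outside of Kannel,
below is a toy pthread model, as a sketch only. All names in it are
hypothetical stand-ins, not Kannel code. It assumes a single blocked
worker and that the result thread frees one slot when it is woken for
shutdown; whether smsbox does that for every blocked
obey_request_thread() would need checking against the real source. The
one-second timeout exists only so the demo terminates instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

static sem_t pending;        /* models max_pending_requests, all slots taken */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;
static int shutting_down;    /* set by the model of the shutdown signal */
static int worker_stuck;     /* 1 if the worker never obtained a slot */

/* Models url_result_thread(): sleeps until shutdown is signalled. */
static void *result_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!shutting_down)
        pthread_cond_wait(&wake, &lock);
    pthread_mutex_unlock(&lock);
    sem_post(&pending);      /* assumption: one slot is freed on wake-up */
    return NULL;
}

/* Models obey_request_thread(): needs a slot before it can finish. */
static void *worker_thread(void *arg)
{
    struct timespec deadline;

    (void)arg;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;    /* demo-only timeout */
    while (sem_timedwait(&pending, &deadline) == -1) {
        if (errno == EINTR)
            continue;        /* interrupted, retry with the same deadline */
        worker_stuck = 1;    /* timed out: nobody freed a slot */
        break;
    }
    return NULL;
}

/* Models http_caller_signal_shutdown(). */
static void signal_shutdown(void)
{
    pthread_mutex_lock(&lock);
    shutting_down = 1;
    pthread_cond_broadcast(&wake);
    pthread_mutex_unlock(&lock);
}

/* Returns 1 if the worker stayed blocked, 0 if it shut down cleanly. */
int run_shutdown(int signal_first)
{
    pthread_t worker, result;

    shutting_down = 0;
    worker_stuck = 0;
    sem_init(&pending, 0, 0);
    pthread_create(&result, NULL, result_thread, NULL);
    pthread_create(&worker, NULL, worker_thread, NULL);

    if (signal_first) {      /* proposed order: signal, then join worker */
        signal_shutdown();
        pthread_join(worker, NULL);
    } else {                 /* current order: join worker, then signal */
        pthread_join(worker, NULL);
        signal_shutdown();
    }
    pthread_join(result, NULL);
    sem_destroy(&pending);
    return worker_stuck;
}
```

In this model run_shutdown(0) leaves the worker blocked until the timeout
fires, while run_shutdown(1) lets it finish right away.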
Any comments, reviews are highly welcome.
Stay safe all,
Stipe
--
Best Regards,
Stipe Tolj
-------------------------------------------------------------------
Düsseldorf, NRW, Germany
Kannel Foundation tolj.org system architecture
http://www.kannel.org/ http://www.tolj.org/
st...@kannel.org s...@tolj.org
-------------------------------------------------------------------