Hi all,

I came across an issue in the shutdown sequence of Kannel smsbox that looks to me like a potential deadlock during the shutdown phase.

On a loaded system, bearerbox was SIGHUP'ed and hence instructed its connected smsbox to go down too.

Bearerbox didn't shut down cleanly, so I forced it down with 'kill -9'. The smsbox, however, kept running, so I looked into the gdb backtrace of the process a bit more.

What I see is this (BTW, the line numbers don't match the svn trunk):

#1 0x000000000044596b in gwthread_join_every (func=0x41ba40 <obey_request_thread>) at gwlib/gwthread-pthread.c:744
#2 0x00000000004142c8 in main (argc=<value optimized out>, argv=0x7fff05d24428) at gw/smsbox.c:3872

So main() was blocking in gwthread_join_every() for the obey_request_thread()s.

They were themselves blocked in:

#0  0x00007f809e117bd1 in sem_wait () from /lib/libpthread.so.0
#1 0x000000000041bdcb in obey_request_thread (arg=<value optimized out>) at gw/smsbox.c:1346

i.e. in semaphore_down(max_pending_requests), all of them just before the http_start_request() call.

We know that semaphore_up() is performed in url_result_thread() once we get the response via http_receive_result_real(), but that thread was itself blocked in:

#0 0x00007f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x000000000044e098 in gwlist_consume (list=0x1498e50) at gwlib/list.c:478
#2 0x000000000044840c in http_receive_result_real (caller=0x1498e84, status=0x44485054, final_url=0x44485018, headers=0x44484ff8, body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
#3 0x000000000041a98e in url_result_thread (arg=<value optimized out>) at gw/smsbox.c:1105

so in the gwlist_consume() on the HTTPCaller *caller.

Now, checking the shutdown sequence in main(), we see that we do:

...
    gwthread_join_every(obey_request_thread);
    http_caller_signal_shutdown(caller);
    gwthread_join_every(url_result_thread);
...

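So we remove the producer on HTTPCaller *caller AFTER we join the obey_request_thread()s, which are blocked in semaphore_down(). That join can never return: the only thread that would call semaphore_up() is url_result_thread(), and it is waiting in gwlist_consume() for a shutdown signal that main() sends only after the join.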

This ends up in a dead-lock situation IMO.

The resolution should simply be to move http_caller_signal_shutdown() before gwthread_join_every(obey_request_thread) in the shutdown sequence.
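
To make the ordering argument concrete, here is a toy model in plain pthreads. It is not Kannel code: every toy_* name is made up for illustration, the semaphore stands in for max_pending_requests, and the condition variable stands in for the wait inside gwlist_consume(). It also assumes that the result thread hands back the pending slots once it has been woken by the shutdown signal. I have not verified that the real url_result_thread() does that on its way out, so take this as a sketch of the ordering only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

#define TOY_NUM_WORKERS 3

/* Stand-in for max_pending_requests; starts at 0, i.e. all slots in use. */
static sem_t toy_max_pending;

/* Stand-in for the producer/consumer wait inside gwlist_consume(). */
static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t toy_wake = PTHREAD_COND_INITIALIZER;
static int toy_shutting_down;

/* Models obey_request_thread(): blocks until a request slot is free. */
static void *toy_obey_request_thread(void *arg)
{
    (void) arg;
    while (sem_wait(&toy_max_pending) == -1 && errno == EINTR)
        continue; /* retry if interrupted by a signal */
    return NULL;
}

/* Models url_result_thread(): sleeps until shutdown is signalled. */
static void *toy_url_result_thread(void *arg)
{
    int i;

    (void) arg;
    pthread_mutex_lock(&toy_lock);
    while (!toy_shutting_down)
        pthread_cond_wait(&toy_wake, &toy_lock);
    pthread_mutex_unlock(&toy_lock);

    /* ASSUMPTION of this model: slots are released on the way out. */
    for (i = 0; i < TOY_NUM_WORKERS; i++)
        sem_post(&toy_max_pending);
    return NULL;
}

/* Models http_caller_signal_shutdown(): wakes the result thread. */
static void toy_signal_shutdown(void)
{
    pthread_mutex_lock(&toy_lock);
    toy_shutting_down = 1;
    pthread_cond_broadcast(&toy_wake);
    pthread_mutex_unlock(&toy_lock);
}

/* Runs the reordered shutdown; returns the number of workers joined. */
int toy_reordered_shutdown(void)
{
    pthread_t workers[TOY_NUM_WORKERS], result;
    int i, rc, joined = 0;

    toy_shutting_down = 0;
    rc = sem_init(&toy_max_pending, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&result, NULL, toy_url_result_thread, NULL);
    assert(rc == 0);
    for (i = 0; i < TOY_NUM_WORKERS; i++) {
        rc = pthread_create(&workers[i], NULL, toy_obey_request_thread, NULL);
        assert(rc == 0);
    }

    toy_signal_shutdown();                 /* signal FIRST ... */
    for (i = 0; i < TOY_NUM_WORKERS; i++)  /* ... then join the workers */
        if (pthread_join(workers[i], NULL) == 0)
            joined++;
    pthread_join(result, NULL);

    sem_destroy(&toy_max_pending);
    return joined;
}
```

Calling toy_reordered_shutdown() from a small main() (build with -pthread) returns once all workers are joined. Moving toy_signal_shutdown() below the worker join loop gives the current smsbox ordering, and in this model that call never returns.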

Any comments and reviews are highly welcome.

Stay safe all,
Stipe


--
Best Regards,
Stipe Tolj

-------------------------------------------------------------------
Düsseldorf, NRW, Germany

Kannel Foundation                 tolj.org system architecture
http://www.kannel.org/            http://www.tolj.org/

st...@kannel.org                  s...@tolj.org
-------------------------------------------------------------------
