[RFC] smsbox dead-lock on shutdown

Stipe Tolj Thu, 05 Nov 2020 01:42:25 -0800

Hi all,

I came across an issue in the shutdown sequence of Kannel smsbox that seems to me like a potential dead-lock situation during the shutdown phase.

On a loaded system, bearerbox was SIGHUP'ed and hence instructed its connected smsbox to go down too.

Bearerbox didn't shut down cleanly, so I forced a 'kill -9' to get it down. The smsbox, though, still kept running, and I looked into the gdb backtrace of the process a bit more.


What I see is this: (BTW, the line numbers don't match with the svn trunk).

#1 0x000000000044596b in gwthread_join_every (func=0x41ba40 <obey_request_thread>) at gwlib/gwthread-pthread.c:744
#2 0x00000000004142c8 in main (argc=<value optimized out>, argv=0x7fff05d24428) at gw/smsbox.c:3872

so main() was blocking in the gwthread_join_every() for the obey_request_thread()s.


They themselves blocked in:

#0  0x00007f809e117bd1 in sem_wait () from /lib/libpthread.so.0

#1 0x000000000041bdcb in obey_request_thread (arg=<value optimized out>) at gw/smsbox.c:1346

in the semaphore_down(max_pending_requests), all of them before a http_start_request().

We know that the semaphore_up() is performed in the url_result_thread() when we get the response via http_receive_result_real(), but that itself blocked in:

#0 0x00007f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x000000000044e098 in gwlist_consume (list=0x1498e50) at gwlib/list.c:478
#2 0x000000000044840c in http_receive_result_real (caller=0x1498e84, status=0x44485054, final_url=0x44485018, headers=0x44484ff8, body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
#3 0x000000000041a98e in url_result_thread (arg=<value optimized out>) at gw/smsbox.c:1105


so in the gwlist_consume() on the HTTPCaller *caller.

Now, checking the shutdown sequence in main(), we see that we do:

...
    gwthread_join_every(obey_request_thread);
    http_caller_signal_shutdown(caller);
    gwthread_join_every(url_result_thread);
...

so we remove the producer on HTTPCaller *caller AFTER we join the obey_request_thread()s, which are performing the semaphore_down().


This ends up in a dead-lock situation IMO.

Resolution should be simply to move the http_caller_signal_shutdown() before gwthread_join_every(obey_request_thread) in the shutdown sequence.
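
To make the ordering argument easier to review, here is a minimal sketch of the reordered shutdown as a self-contained pthreads model. It is not Kannel code: worker(), result_thread(), signal_shutdown() and run_shutdown_model() are hypothetical stand-ins for obey_request_thread(), url_result_thread(), http_caller_signal_shutdown() and the relevant part of main(). The model also assumes that the result thread releases one semaphore slot when its queue is shut down, and it uses only a single worker. If the real url_result_thread() does not release a slot on shutdown, or several workers are blocked at once, the reordering alone may not be enough.

```c
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical stand-ins for illustration only, NOT the Kannel structures. */
static sem_t max_pending;                 /* models max_pending_requests */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;
static int results = 0;                   /* queued HTTP results */
static int producers = 1;                 /* 0 once shutdown is signalled */

/* Models obey_request_thread(): needs a free slot before it can go on. */
static void *worker(void *arg)
{
    (void)arg;
    sem_wait(&max_pending);               /* models semaphore_down() */
    return NULL;
}

/* Models url_result_thread(): frees one slot per result and, by
 * ASSUMPTION, one more slot when the queue has been shut down. */
static void *result_thread(void *arg)
{
    int running = 1;
    (void)arg;
    while (running) {
        pthread_mutex_lock(&lock);
        while (results == 0 && producers > 0)
            pthread_cond_wait(&wake, &lock);   /* models gwlist_consume() */
        if (results > 0)
            results--;
        else
            running = 0;                  /* no producer left: leave loop */
        pthread_mutex_unlock(&lock);
        sem_post(&max_pending);           /* models semaphore_up() */
    }
    return NULL;
}

/* Models http_caller_signal_shutdown(): remove the producer, wake waiters. */
static void signal_shutdown(void)
{
    pthread_mutex_lock(&lock);
    producers = 0;
    pthread_cond_broadcast(&wake);
    pthread_mutex_unlock(&lock);
}

/* Runs the model with the proposed order; returns 0 when both joins return. */
int run_shutdown_model(void)
{
    pthread_t w, r;

    results = 0;
    producers = 1;
    if (sem_init(&max_pending, 0, 0) != 0)    /* all slots already taken */
        return -1;
    if (pthread_create(&w, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_create(&r, NULL, result_thread, NULL) != 0)
        return -1;

    signal_shutdown();        /* proposed order: wake the consumer first */
    pthread_join(w, NULL);    /* joining before signal_shutdown() would hang */
    pthread_join(r, NULL);

    sem_destroy(&max_pending);
    return 0;
}
```

In this model, calling run_shutdown_model() returns 0 once both threads are joined. Swapping signal_shutdown() and the first pthread_join() reproduces the hang seen in the backtraces above: the worker waits for a slot that only the result thread can free, and the result thread waits on a queue that nobody shuts down.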


Any comments and reviews are highly welcome.

Stay safe all,
Stipe


--
Best Regards,
Stipe Tolj

-------------------------------------------------------------------
Düsseldorf, NRW, Germany

Kannel Foundation                 tolj.org system architecture
http://www.kannel.org/            http://www.tolj.org/

st...@kannel.org                  s...@tolj.org
-------------------------------------------------------------------
