Hi, I agree that a 240-second timeout is too high. I think a maximum of 30 seconds should be enough.
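For reference, a minimal sketch of the smsbox group with that value, assuming the same http-timeout directive (in seconds) that Stipe used in his test quoted below; all other directives are elided:

```
group = smsbox
...
http-timeout = 30
```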
Alex

On 5 Nov 2020, 11:58 +0100, Stipe Tolj <st...@kannel.org> wrote:
> On 05.11.20, 10:41, Stipe Tolj wrote:
> > Hi all,
> >
> > I came across an issue in the shutdown sequence of Kannel smsbox that
> > looks to me like a potential deadlock during the shutdown phase.
> >
> > On a loaded system bearerbox was SIGHUP'ed and hence instructed its
> > connected smsbox to go down too.
> >
> > Bearerbox didn't shut down cleanly, so I forced a 'kill -9' to get it down.
> > The smsbox, though, still kept running, so I looked into the gdb
> > backtrace of the process a bit more.
> >
> > What I see is this (BTW, the line numbers don't match the svn trunk):
> >
> > #1 0x000000000044596b in gwthread_join_every (func=0x41ba40
> > <obey_request_thread>) at gwlib/gwthread-pthread.c:744
> > #2 0x00000000004142c8 in main (argc=<value optimized out>,
> > argv=0x7fff05d24428) at gw/smsbox.c:3872
> >
> > so main() was blocking in gwthread_join_every() for the
> > obey_request_thread()s.
> >
> > They themselves were blocked in:
> >
> > #0 0x00007f809e117bd1 in sem_wait () from /lib/libpthread.so.0
> > #1 0x000000000041bdcb in obey_request_thread (arg=<value optimized out>)
> > at gw/smsbox.c:1346
> >
> > in the semaphore_down(max_pending_requests); call, just before an
> > http_start_request().
> >
> > We know that the semaphore_up() is performed in the
> > url_result_thread() when we get the response via
> > http_receive_result_real(), but that itself was blocked in:
> >
> > #0 0x00007f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > /lib/libpthread.so.0
> > #1 0x000000000044e098 in gwlist_consume (list=0x1498e50) at
> > gwlib/list.c:478
> > #2 0x000000000044840c in http_receive_result_real (caller=0x1498e84,
> > status=0x44485054, final_url=0x44485018, headers=0x44484ff8,
> > body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
> > #3 0x000000000041a98e in url_result_thread (arg=<value optimized out>)
> > at gw/smsbox.c:1105
> >
> > so in the gwlist_consume() on the HTTPCaller *caller.
> >
> > Now, checking the shutdown sequence in main(), we see that we do:
> >
> >     ...
> >     gwthread_join_every(obey_request_thread);
> >     http_caller_signal_shutdown(caller);
> >     gwthread_join_every(url_result_thread);
> >     ...
> >
> > so we remove the producer on HTTPCaller *caller AFTER we join the
> > obey_request_thread()s, which are performing the semaphore_down().
> >
> > This ends up in a deadlock situation IMO.
> >
> > The resolution should simply be to move the http_caller_signal_shutdown()
> > before gwthread_join_every(obey_request_thread) in the shutdown sequence.
> >
> > Any comments and reviews are highly welcome.
>
> OK, this is NOT blocking fully. It does block for any HTTP requests that
> are performed against "bogus IP ranges", i.e. unrouted C-class 10.x.x.x
> ranges, and it blocks while our client timeout is running, which is
> 240 seconds by default.
>
> If we set
>
>     group = smsbox
>     ...
>     http-timeout = 10
>
> then it gets unblocked and shuts down cleanly.
>
> So, forget about the deadlock claim I made. The only thing that we MAY
> want here is to have a more realistic TCP connection timeout?
>
> Stipe
>
> --
> Best Regards,
> Stipe Tolj
>
> -------------------------------------------------------------------
> Düsseldorf, NRW, Germany
>
> Kannel Foundation         tolj.org system architecture
> http://www.kannel.org/    http://www.tolj.org/
>
> st...@kannel.org          s...@tolj.org
> -------------------------------------------------------------------
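On the question of a more realistic TCP connection timeout: below is a rough sketch of how the connect phase alone could be bounded with plain POSIX calls (non-blocking connect() plus poll()). This is not Kannel code. The helper name connect_with_timeout and its parameters are made up for illustration, and I am not claiming this is how gwlib's HTTP client or the http-timeout directive works internally.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, NOT part of Kannel's gwlib: connect to an IPv4
 * address and give up after timeout_ms milliseconds.
 * Returns a connected, blocking socket descriptor, or -1 on error or
 * timeout. */
static int connect_with_timeout(const char *ip, unsigned short port,
                                int timeout_ms)
{
    struct sockaddr_in addr;
    struct pollfd pfd;
    int fd, flags, err = 0;
    socklen_t len = sizeof(err);

    /* A negative poll() timeout would mean "wait forever". */
    assert(ip != NULL && timeout_ms >= 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                      /* not a valid IPv4 literal */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch to non-blocking mode so connect() returns immediately. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        goto connected;                 /* connected right away */
    if (errno != EINPROGRESS)
        goto fail;

    /* Wait until the socket becomes writable or the timeout expires.
     * poll() returns 0 on timeout and -1 on error (including EINTR,
     * which this sketch simply treats as a failure). */
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;
    if (poll(&pfd, 1, timeout_ms) != 1)
        goto fail;

    /* Writable does not imply success: fetch the connect() result. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        goto fail;

connected:
    /* Restore the original (blocking) flags for the caller. */
    if (fcntl(fd, F_SETFL, flags) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

With something like connect_with_timeout("10.1.2.3", 80, 30 * 1000), a request to an unrouted address would fail after at most 30 seconds instead of holding a semaphore slot for the full 240. Note that this only limits the connection setup; a separate limit would still be needed for slow responses.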