Hi,

I agree that a 240-second timeout is too high. I think a maximum of 30 seconds should be enough.
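
To make the idea of a bounded connect timeout concrete, here is a minimal sketch in plain POSIX C. It is not Kannel code, and I am not claiming this is how http-timeout is implemented in gwlib; connect_with_timeout() and its parameters are hypothetical names made up for illustration. It only bounds the TCP connect phase, using a non-blocking socket and poll().

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of Kannel): connect to an IPv4 address and
 * give up after timeout_ms milliseconds.
 * Returns a connected, blocking socket fd, or -1 with errno set. */
int connect_with_timeout(const char *ip, unsigned short port, int timeout_ms)
{
    struct sockaddr_in addr;
    struct pollfd pfd;
    int fd, flags, rc, saved, err = 0;
    socklen_t len = sizeof(err);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        errno = EINVAL;               /* not a dotted-quad IPv4 address */
        return -1;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch to non-blocking mode so connect() returns immediately. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc < 0 && errno != EINPROGRESS)
        goto fail;

    if (rc < 0) {
        /* Connect is in progress: wait until the socket becomes writable.
         * Simplification: an EINTR restart waits the full timeout again. */
        pfd.fd = fd;
        pfd.events = POLLOUT;
        do {
            rc = poll(&pfd, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR);
        if (rc == 0)
            errno = ETIMEDOUT;        /* nobody answered in time */
        if (rc <= 0)
            goto fail;

        /* Writable does not mean connected: fetch the real result. */
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
            goto fail;
        if (err != 0) {
            errno = err;
            goto fail;
        }
    }

    /* Restore the original (blocking) mode for the caller. */
    if (fcntl(fd, F_SETFL, flags) < 0)
        goto fail;
    return fd;

fail:
    saved = errno;                    /* close() may clobber errno */
    close(fd);
    errno = saved;
    return -1;
}
```

With something like connect_with_timeout("10.1.2.3", 80, 30000), a request against an unrouted address would fail with ETIMEDOUT after roughly 30 seconds instead of holding up the shutdown for the whole request timeout. The address and port here are just placeholders.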

Alex
On 5 Nov 2020, 11:58 +0100, Stipe Tolj <st...@kannel.org> wrote:
> On 05.11.20, 10:41, Stipe Tolj wrote:
> > Hi all,
> >
> > I came across an issue in the shutdown sequence of Kannel smsbox that
> > looks to me like a potential deadlock during the shutdown phase.
> >
> > On a loaded system, bearerbox was SIGHUP'ed and hence instructed its
> > connected smsbox to go down too.
> >
> > Bearerbox didn't shut down cleanly, so I forced a 'kill -9' to get it down.
> > The smsbox, though, still kept running, so I looked into the gdb
> > backtrace of the process a bit more.
> >
> > What I see is this (BTW, the line numbers don't match the svn trunk):
> >
> > #1 0x000000000044596b in gwthread_join_every (func=0x41ba40
> > <obey_request_thread>) at gwlib/gwthread-pthread.c:744
> > #2 0x00000000004142c8 in main (argc=<value optimized out>,
> > argv=0x7fff05d24428) at gw/smsbox.c:3872
> >
> > so main() was blocking in gwthread_join_every() for the
> > obey_request_thread()s.
> >
> > They were themselves blocked in:
> >
> > #0 0x00007f809e117bd1 in sem_wait () from /lib/libpthread.so.0
> > #1 0x000000000041bdcb in obey_request_thread (arg=<value optimized out>)
> > at gw/smsbox.c:1346
> >
> > in semaphore_down(max_pending_requests), all of them just before the
> > http_start_request() call.
> >
> > We know that semaphore_up() is performed in
> > url_result_thread() once we get the response via
> > http_receive_result_real(), but that was itself blocked in:
> >
> > #0 0x00007f809e115d29 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > /lib/libpthread.so.0
> > #1 0x000000000044e098 in gwlist_consume (list=0x1498e50) at
> > gwlib/list.c:478
> > #2 0x000000000044840c in http_receive_result_real (caller=0x1498e84,
> > status=0x44485054, final_url=0x44485018, headers=0x44484ff8,
> > body=0x44484fc8, blocking=1577) at gwlib/http.c:1764
> > #3 0x000000000041a98e in url_result_thread (arg=<value optimized out>)
> > at gw/smsbox.c:1105
> >
> > so in the gwlist_consume() on the HTTPCaller *caller.
> >
> > Now, checking the shutdown sequence in main(), we see that we do:
> >
> > ...
> > gwthread_join_every(obey_request_thread);
> > http_caller_signal_shutdown(caller);
> > gwthread_join_every(url_result_thread);
> > ...
> >
> > so we remove the producer on HTTPCaller *caller AFTER we join the
> > obey_request_thread()s, which are performing the semaphore_down.
> >
> > This ends up in a dead-lock situation IMO.
> >
> > Resolution should be simply to move the http_caller_signal_shutdown()
> > before gwthread_join_every(obey_request_thread) in the shutdown sequence.
> >
> > Any comments, reviews are highly welcome.
>
> ok, this is NOT blocking indefinitely. It does block for any HTTP requests
> that are performed against "bogus IP ranges", i.e. unrouted private 10.x.x.x
> ranges, and it blocks while our client timeout is running, which is
> 240 seconds by default.
>
> If we set
>
> group = smsbox
> ...
> http-timeout = 10
>
> then it unblocks and shuts down cleanly.
>
> So, forget about the dead-lock claim I made. The only thing that we MAY
> want here is to have a more realistic TCP connection timeout?
>
> Stipe
>
> --
> Best Regards,
> Stipe Tolj
>
> -------------------------------------------------------------------
> Düsseldorf, NRW, Germany
>
> Kannel Foundation tolj.org system architecture
> http://www.kannel.org/ http://www.tolj.org/
>
> st...@kannel.org s...@tolj.org
> -------------------------------------------------------------------
>
