On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
> Hi all,
> 
> I have come across a weird/bad bug, I believe.
> 
> I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6 and
> CentOS 7.
> 
> The application is a celery worker that runs 16 worker threads. Each
> worker thread instantiates a 0MQ-based client, gets data and then closes
> this client. The 0MQ-based client creates its own 0MQ context and
> terminates it on exit. Nothing is shared between the threads or clients;
> every client processes only one request and then it's fully terminated.
> 
> The client itself is a REQ socket which uses CURVE authentication to
> authenticate with a ROUTER socket on the server side. The REQ socket has
> linger=0. Almost always, the REQ socket issues a request, gets back a
> response, closes the socket and destroys its context, and all is good.
> Once every one or two days though, the REQ socket times out when waiting
> for the response from the ROUTER server; it then successfully closes the
> socket but hangs indefinitely when it goes on to destroy the context.
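
If I'm reading this right, each worker thread is doing roughly the following
for every single request (a hypothetical pyzmq sketch of the pattern you
describe; the endpoint, key handling and timeout handling are my assumptions,
not your actual code):

    import zmq

    def fetch_once(endpoint, server_key, client_public, client_secret, request):
        # A brand-new context and REQ socket per request, torn down right after.
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)
        sock.curve_serverkey = server_key
        sock.curve_publickey = client_public
        sock.curve_secretkey = client_secret
        sock.setsockopt(zmq.RCVTIMEO, 20000)   # 20 s wait for the reply
        sock.connect(endpoint)
        try:
            sock.send(request)
            return sock.recv()                 # raises zmq.Again after 20 s
        finally:
            sock.close()
            ctx.term()                         # this is where it reportedly hangs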

Note that these are two well-known anti-patterns. The context is meant to be
a single, shared instance per application that lives for as long as the
process does, and sockets are meant to be long-lived as well.

I would recommend refactoring and, at the very least, using a single context
for the duration of your application.
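
A minimal sketch of what I mean, assuming pyzmq (the endpoint, keys and the
fetch() helper name are placeholders, not your code): the context is created
once for the whole process, each worker thread opens and closes only its own
REQ socket, and the context is terminated once, at shutdown.

    import zmq

    # One process-wide context; zmq.Context.instance() returns a singleton,
    # so every worker thread ends up sharing the same context.
    CTX = zmq.Context.instance()

    def fetch(endpoint, server_key, client_public, client_secret, request,
              timeout_ms=20000):
        sock = CTX.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)
        sock.curve_serverkey = server_key
        sock.curve_publickey = client_public
        sock.curve_secretkey = client_secret
        sock.connect(endpoint)
        try:
            sock.send(request)
            # Poll so a lost reply costs only the timeout, not a blocked thread.
            if sock.poll(timeout_ms, zmq.POLLIN):
                return sock.recv()
            raise RuntimeError("no reply within %d ms" % timeout_ms)
        finally:
            sock.close()      # the per-request socket is closed here...

    # ...but CTX.term() is only called once, when the process shuts down.

Keeping the sockets themselves long-lived would be better still, but sharing
the context alone already removes the per-request handshake and termination
churn.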

> This runs in a data center on a 1 Gb/s LAN, so the responses usually
> finish in under a second; the timeout is 20 s. My theory is that the
> socket gets into a weird state and that's why it times out and blocks the
> context termination.
> 
> I ran a tcpdump and it turns out that the REQ client successfully
> authenticates with the ROUTER server but then goes completely silent for
> those 20-odd seconds.
> 
> Here is a tcpdump capture of a stuck REQ client -
> https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a normal
> communication - https://pastebin.com/qCi1jTp0. This is a full backtrace
> (after SIGABRT signal to the stuck application) -
> https://pastebin.com/jHdZS4VU
> 
> Here is ulimit:
> 
> [root@auhwbesap001 tomask]# cat /proc/311/limits
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            8388608              unlimited            bytes
> Max core file size        0                    unlimited            bytes
> Max resident set          unlimited            unlimited            bytes
> Max processes             31141                31141                processes
> Max open files            8196                 8196                 files
> Max locked memory         65536                65536                bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       31141                31141                signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us
> 
> 
> The application doesn't seem to hit any of the limits; it usually hovers
> between 100 and 200 open file handles.
> 
> I tried to swap the REQ socket for a DEALER socket, but that didn't help;
> the context eventually hung as well.
> 
> I also tried to set ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to 100ms,
> but the context still eventually hung.
> 
> I looked into the C++ code of libzmq but would need some guidance to
> troubleshoot this, as I am primarily a Python programmer.
> 
> I think we had a similar issue back in 2014 -
> https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html.
> From memory, the tcpdump capture also showed the client/REQ going silent
> after the successful initial CURVE authentication, but at that time the
> server/ROUTER application was crashing with an assertion.
> 
> I am happy to do any more debugging.
> 
> Thanks in advance for any help/pointers.
