I've seen something similar (I think) with Zyre, where DEALER sockets
connecting with the same identity behave unexpectedly. Try setting
ZMQ_ROUTER_HANDOVER on the ROUTER socket and see if that helps (you'll
need libzmq master).
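
In case it helps, here is a rough PyZMQ sketch of setting that option.
It assumes your PyZMQ build exposes zmq.ROUTER_HANDOVER and that the
underlying libzmq understands it; the helper name make_router is made
up for illustration and is not part of any library.

```python
import zmq


def make_router(ctx, endpoint):
    """Create a ROUTER socket that lets a reconnecting peer reclaim its identity.

    Hypothetical helper for illustration only.
    """
    router = ctx.socket(zmq.ROUTER)
    try:
        # Set before bind() so the option is in effect before any peer connects.
        router.setsockopt(zmq.ROUTER_HANDOVER, 1)
    except (AttributeError, zmq.ZMQError):
        # Older PyZMQ/libzmq builds do not know this option; carry on without it.
        print("ZMQ_ROUTER_HANDOVER is not available in this build")
    router.bind(endpoint)
    return router
```

You would call it as make_router(zmq.Context.instance(), "tcp://*:12501"),
using the port from your netstat output. I have not checked whether this
fixes your case; it is only how I would try the option.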

On Thu, Jun 12, 2014 at 4:15 AM, Sash Nagarkar <s...@dronedeploy.com> wrote:
> Hello ZMQ devs,
> We're using PyZMQ 14.3.0 and libzmq 4.0.4 with a ROUTER-DEALER pattern
> for a service we're providing.  Sorry if this is too verbose, and I
> hope this is the right place to ask the question.
> TL;DR: ROUTER socket doesn't receive messages from a DEALER even
> though netstat shows several megabytes in the TCP receive queue
> (nothing in the send queue).  Other connected DEALERs work fine.
> The ROUTER socket is running on a server with ample CPU & memory
> headroom, with several DEALER clients that connect, exchange messages,
> and can abruptly disconnect repeatedly.  We're exclusively using
> multipart messages with the first part always being the ZMQ socket
> identity, which persists across DEALER connect/disconnects.  In other
> words, each DEALER client uses the same socket identity across many
> connects and disconnects.
> Most of the time, things hum along smoothly (several thousand messages
> exchanged, several dozen connect/disconnects).  However, every once in
> a rare while, we see that one of the DEALER clients connects and sends
> messages to the ROUTER that end up never making it to the ROUTER
> process.  The ROUTER process continues to receive messages from other
> DEALER clients.
> Further debugging on the ROUTER server shows one (or more) TCP
> connections from the client DEALER that are in the CLOSE_WAIT state
> with several megabytes of data sitting in the receive queue to the
> ROUTER.  We also see one connection from the client DEALER in the
> ESTABLISHED state with a receive queue that is growing.
> It's clear that the DEALER client died abruptly once, but then
> returned with the same identity and resumed sending messages to the
> ROUTER.  However, none of the subsequent messages are delivered to the
> ROUTER process.  Any ideas on why this would be the case?
> I would have provided a test case, but we aren't able to consistently
> reproduce the issue.  I've copied the output from netstat (with
> obfuscated IPs) below, in case it helps.
> Questions:
> - What would cause the receive queue to fill up like this on a ROUTER
> while it continues to receive messages from other clients?  It's clear
> that the messages are all making it to the ROUTER machine.
> - Is it safe for DEALER sockets to abruptly disconnect and then reuse
> their socket identity?
> - How can we mitigate this situation?  The closest thing I see is
> ZMQ_LINGER, but that applies only to the outgoing queue and not the
> incoming one.
> - Is there anything I could investigate myself to figure out whether
> this is an issue in PyZMQ vs. libzmq?  Where should I start?
> Other potentially relevant info:
> - The ROUTER uses PyZMQ's zmq.Poller() to receive messages from the
> problem socket and some others.  All other nodes in the system
> continue to send and receive messages just fine.
> - The ROUTER's send queues are pretty much empty.
> - We see the same behavior with libzmq 4.0.4 and libzmq 2.2.x, on
> Ubuntu 14.04.
> $ netstat -a
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> tcp        0      0 *:12501                  *:*                     LISTEN
> tcp   1816956      0 server-ip.:12501 clientA-ip:42571 CLOSE_WAIT
> tcp   1551036      0 server-ip.:12501 clientA-ip:42858 CLOSE_WAIT
> tcp        0      0 server-ip.:12501 clientB-ip:34000 ESTABLISHED
> tcp   5265541      0 server-ip.:12501 clientA-ip:43469 ESTABLISHED
> Please let me know if further information would help.  Thank you for
> helping build ZMQ, it's been a huge pleasure to work with so far.
> Cheers,
> Sash
