Re: [zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue

2014-06-12 Thread Sash Nagarkar
Thanks Pieter!  I'll try that and see if we encounter it again.  Love
the work you guys are doing with ZMQ.

On Thu, Jun 12, 2014 at 7:53 AM, Pieter Hintjens p...@imatix.com wrote:
 I've seen something similar (I think) with Zyre, where dealer sockets
 connecting with the same identity do weird things. Try setting
 ZMQ_ROUTER_HANDOVER on the router socket, see if that helps (you'll
 need libzmq master).
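
 For readers following along in PyZMQ, the suggestion above could be sketched
 roughly as follows. This is a minimal sketch, not the poster's actual code:
 the helper names make_router and make_dealer, the endpoint, and the identity
 value are all hypothetical. It assumes a libzmq build that includes
 ZMQ_ROUTER_HANDOVER (at the time of this thread that meant libzmq master)
 and a PyZMQ that exposes it as zmq.ROUTER_HANDOVER.

```python
import zmq


def make_router(ctx, endpoint):
    """Bind a ROUTER that lets a reconnecting peer take over its old identity."""
    router = ctx.socket(zmq.ROUTER)
    # With handover enabled, a new connection that presents an identity
    # already in use replaces the old connection for that identity.
    # The option is set before bind so it applies to every incoming peer.
    router.setsockopt(zmq.ROUTER_HANDOVER, 1)
    router.bind(endpoint)
    return router


def make_dealer(ctx, endpoint, identity):
    """Connect a DEALER with a fixed identity (hypothetical helper)."""
    dealer = ctx.socket(zmq.DEALER)
    # The identity has to be set before connect() to take effect.
    dealer.setsockopt(zmq.IDENTITY, identity)
    # Do not block on close if the peer is gone.
    dealer.setsockopt(zmq.LINGER, 0)
    dealer.connect(endpoint)
    return dealer
```

 With this in place, a second DEALER that connects as the same identity while
 the first connection still looks alive to the ROUTER should have its messages
 delivered. Whether this resolves the CLOSE_WAIT buildup described below is
 not confirmed in this thread.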

 On Thu, Jun 12, 2014 at 4:15 AM, Sash Nagarkar s...@dronedeploy.com wrote:
 Hello ZMQ devs,

 We're using PyZMQ 14.3.0 and libzmq 4.0.4 with a ROUTER-DEALER pattern
 for a service we're providing.  Sorry if this is too verbose, and I
 hope this is the right place to ask the question.

 TL;DR: ROUTER socket doesn't receive messages from a DEALER even
 though netstat shows several megabytes in the TCP receive queue
 (nothing in the send queue).  Other connected DEALERs work fine.

 The ROUTER socket is running on a server with ample CPU and memory
 headroom, with several DEALER clients that connect, exchange messages,
 and can abruptly disconnect repeatedly.  We're exclusively using
 multipart messages with the first part always being the ZMQ socket
 identity, which persists across DEALER connect/disconnects.  In other
 words, each DEALER client uses the same socket identity across many
 connects and disconnects.

 Most of the time, things hum along smoothly (several thousand messages
 exchanged, several dozen connect/disconnects).  However, every once in
 a rare while, we see that one of the DEALER clients connects and sends
 messages to the ROUTER that end up never making it to the ROUTER
 process.  The ROUTER process continues to receive messages from other
 DEALER clients.

 Further debugging on the ROUTER server shows one (or more) TCP
 connections from the client DEALER that are in the CLOSE_WAIT state
 with several megabytes of data sitting in the receive queue to the
 ROUTER.  We also see one connection from the client DEALER in the
 ESTABLISHED state with a receive queue that is growing.

 It's clear that the DEALER client died abruptly once, but then
 returned with the same identity and resumed sending messages to the
 ROUTER.  However, none of the subsequent messages are delivered to the
 ROUTER process.  Any ideas on why this would be the case?

 I would have provided a test case, but we aren't able to consistently
 reproduce the issue.  I've copied the output from netstat (with
 obfuscated IPs) below, in case it helps.


 Questions:
 - What would cause the receive queue to fill up like this on a ROUTER
 while it continues to receive messages from other clients?  It's clear
 that the messages are all making it to the ROUTER machine.
 - Is it safe for DEALER sockets to abruptly disconnect and then reuse
 their socket identity?
 - How can we mitigate this situation?  The closest thing I see is
 ZMQ_LINGER, but that applies only to the outgoing queue and not the
 incoming one.
 - Is there anything I could investigate myself to figure out whether
 this is an issue in PyZMQ vs. libzmq?  Where should I start?


 Other potentially relevant info:
 - The ROUTER uses PyZMQ's zmq.Poller() to receive messages from the
 problem socket and some others.  All other nodes in the system
 continue to send and receive messages just fine.
 - The ROUTER's send queues are pretty much empty.
 - We see the same behavior with libzmq 4.0.4 and libzmq 2.2.x, on Ubuntu 14.04.


 $ netstat -a
 Active Internet connections (servers and established)
 Proto  Recv-Q  Send-Q  Local Address     Foreign Address   State
 tcp         0       0  *:12501           *:*               LISTEN
 tcp   1816956       0  server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
 tcp   1551036       0  server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
 tcp         0       0  server-ip.:12501  clientB-ip:34000  ESTABLISHED
 tcp   5265541       0  server-ip.:12501  clientA-ip:43469  ESTABLISHED


 Please let me know if further information would help.  Thank you for
 helping build ZMQ, it's been a huge pleasure to work with so far.

 Cheers,
 Sash
 ___
 zeromq-dev mailing list
 zeromq-dev@lists.zeromq.org
 http://lists.zeromq.org/mailman/listinfo/zeromq-dev


[zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue

2014-06-11 Thread Sash Nagarkar
Hello ZMQ devs,

We're using PyZMQ 14.3.0 and libzmq 4.0.4 with a ROUTER-DEALER pattern
for a service we're providing.  Sorry if this is too verbose, and I
hope this is the right place to ask the question.

TL;DR: ROUTER socket doesn't receive messages from a DEALER even
though netstat shows several megabytes in the TCP receive queue
(nothing in the send queue).  Other connected DEALERs work fine.

The ROUTER socket is running on a server with ample CPU and memory
headroom, with several DEALER clients that connect, exchange messages,
and can abruptly disconnect repeatedly.  We're exclusively using
multipart messages with the first part always being the ZMQ socket
identity, which persists across DEALER connect/disconnects.  In other
words, each DEALER client uses the same socket identity across many
connects and disconnects.

Most of the time, things hum along smoothly (several thousand messages
exchanged, several dozen connect/disconnects).  However, every once in
a rare while, we see that one of the DEALER clients connects and sends
messages to the ROUTER that end up never making it to the ROUTER
process.  The ROUTER process continues to receive messages from other
DEALER clients.

Further debugging on the ROUTER server shows one (or more) TCP
connections from the client DEALER that are in the CLOSE_WAIT state
with several megabytes of data sitting in the receive queue to the
ROUTER.  We also see one connection from the client DEALER in the
ESTABLISHED state with a receive queue that is growing.

It's clear that the DEALER client died abruptly once, but then
returned with the same identity and resumed sending messages to the
ROUTER.  However, none of the subsequent messages are delivered to the
ROUTER process.  Any ideas on why this would be the case?

I would have provided a test case, but we aren't able to consistently
reproduce the issue.  I've copied the output from netstat (with
obfuscated IPs) below, in case it helps.


Questions:
- What would cause the receive queue to fill up like this on a ROUTER
while it continues to receive messages from other clients?  It's clear
that the messages are all making it to the ROUTER machine.
- Is it safe for DEALER sockets to abruptly disconnect and then reuse
their socket identity?
- How can we mitigate this situation?  The closest thing I see is
ZMQ_LINGER, but that applies only to the outgoing queue and not the
incoming one.
- Is there anything I could investigate myself to figure out whether
this is an issue in PyZMQ vs. libzmq?  Where should I start?


Other potentially relevant info:
- The ROUTER uses PyZMQ's zmq.Poller() to receive messages from the
problem socket and some others.  All other nodes in the system
continue to send and receive messages just fine.
- The ROUTER's send queues are pretty much empty.
- We see the same behavior with libzmq 4.0.4 and libzmq 2.2.x, on Ubuntu 14.04.


$ netstat -a
Active Internet connections (servers and established)
Proto  Recv-Q  Send-Q  Local Address     Foreign Address   State
tcp         0       0  *:12501           *:*               LISTEN
tcp   1816956       0  server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
tcp   1551036       0  server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
tcp         0       0  server-ip.:12501  clientB-ip:34000  ESTABLISHED
tcp   5265541       0  server-ip.:12501  clientA-ip:43469  ESTABLISHED
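
To spot this condition without reading netstat output by eye, something like
the sketch below could flag connections whose receive queue is not being
drained. It is only an illustration: the function names and the 1 MB
threshold are assumptions, and it expects the six-column layout shown above.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    proto: str
    recv_q: int
    send_q: int
    local: str
    foreign: str
    state: str


def parse_netstat(text: str) -> List[Conn]:
    """Parse the TCP rows of `netstat -a` output into Conn records."""
    conns = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly six columns and start with the protocol;
        # the banner and column-header lines are skipped.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        proto, recv_q, send_q, local, foreign, state = fields
        conns.append(Conn(proto, int(recv_q), int(send_q), local, foreign, state))
    return conns


def backed_up(conns: List[Conn], min_recv_q: int = 1_000_000) -> List[Conn]:
    """Return connections holding at least min_recv_q unread bytes.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return [c for c in conns if c.recv_q >= min_recv_q]


SAMPLE = """\
Active Internet connections (servers and established)
Proto  Recv-Q  Send-Q  Local Address     Foreign Address   State
tcp         0       0  *:12501           *:*               LISTEN
tcp   1816956       0  server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
tcp   1551036       0  server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
tcp         0       0  server-ip.:12501  clientB-ip:34000  ESTABLISHED
tcp   5265541       0  server-ip.:12501  clientA-ip:43469  ESTABLISHED
"""

if __name__ == "__main__":
    for conn in backed_up(parse_netstat(SAMPLE)):
        print(conn.foreign, conn.state, conn.recv_q)
```

On the sample above this picks out the three clientA connections (two in
CLOSE_WAIT, one ESTABLISHED) and leaves clientB and the listening socket
alone.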


Please let me know if further information would help.  Thank you for
helping build ZMQ, it's been a huge pleasure to work with so far.

Cheers,
Sash