Re: [zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue

2014-06-12 Thread Sash Nagarkar
Thanks Pieter!  I'll try that and see if we encounter it again.  Love
the work you guys are doing with ZMQ.
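
For anyone following the thread, here is a minimal sketch of the
suggestion quoted below. It assumes a PyZMQ build that exposes
zmq.ROUTER_HANDOVER (which in turn needs a libzmq newer than 4.0.4).
The helper names, the identity b"clientA", and the loopback endpoint
are made up for illustration and are not from our service.

```python
import zmq


def router_with_handover(ctx, endpoint="tcp://127.0.0.1"):
    """Create a ROUTER with ZMQ_ROUTER_HANDOVER enabled on a random port."""
    router = ctx.socket(zmq.ROUTER)
    # Set before bind() so it applies to every incoming connection: a new
    # peer may then take over an identity that an older connection holds.
    router.setsockopt(zmq.ROUTER_HANDOVER, 1)
    port = router.bind_to_random_port(endpoint)
    return router, port


def dealer_with_identity(ctx, identity, port):
    """Create a DEALER that reuses a fixed identity (hypothetical helper)."""
    dealer = ctx.socket(zmq.DEALER)
    dealer.setsockopt(zmq.IDENTITY, identity)  # set before connect()
    dealer.setsockopt(zmq.LINGER, 0)
    dealer.connect("tcp://127.0.0.1:%d" % port)
    return dealer


def recv_or_none(router, timeout_ms=2000):
    """Return the next multipart message, or None if nothing arrives in time."""
    if router.poll(timeout_ms):
        return router.recv_multipart()
    return None


def demo():
    ctx = zmq.Context()
    router, port = router_with_handover(ctx)

    old = dealer_with_identity(ctx, b"clientA", port)
    old.send(b"one")
    first = recv_or_none(router)

    # Second connection reuses the identity while the first still exists,
    # roughly what happens when a client dies and comes back.
    new = dealer_with_identity(ctx, b"clientA", port)
    new.send(b"two")
    second = recv_or_none(router)

    for sock in (old, new, router):
        sock.close(linger=0)
    ctx.term()
    return first, second
```

Calling demo() returns the two multipart messages the ROUTER saw. If the
second one comes back as None, the build in use probably does not honour
the option.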

On Thu, Jun 12, 2014 at 7:53 AM, Pieter Hintjens wrote:
> I've seen something similar (I think) with Zyre, where dealer sockets
> connecting with the same identity do weird things. Try setting
> ZMQ_ROUTER_HANDOVER on the router socket, see if that helps (you'll
> need libzmq master).
>
> On Thu, Jun 12, 2014 at 4:15 AM, Sash Nagarkar wrote:
>> Hello ZMQ devs,
>>
>> We're using PyZMQ 14.3.0 and libzmq 4.0.4 with a ROUTER-DEALER pattern
>> for a service we're providing.  Sorry if this is too verbose, and I
>> hope this is the right place to ask the question.
>>
>> TL;DR: ROUTER socket doesn't receive messages from a DEALER even
>> though netstat shows several megabytes in the TCP receive queue
>> (nothing in the send queue).  Other connected DEALERs work fine.
>>
>> The ROUTER socket is running on a server with ample CPU & memory
>> headroom, with several DEALER clients that connect, exchange messages,
>> and can abruptly disconnect repeatedly.  We're exclusively using
>> multipart messages with the first part always being the ZMQ socket
>> identity, which persists across DEALER connect/disconnects.  In other
>> words, each DEALER client uses the same socket identity across many
>> connects and disconnects.
>>
>> Most of the time, things hum along smoothly (several thousand messages
>> exchanged, several dozen connect/disconnects).  However, every once in
>> a rare while, we see that one of the DEALER clients connects and sends
>> messages to the ROUTER that end up never making it to the ROUTER
>> process.  The ROUTER process continues to receive messages from other
>> DEALER clients.
>>
>> Further debugging on the ROUTER server shows one (or more) TCP
>> connections from the client DEALER that are in the CLOSE_WAIT state
>> with several megabytes of data sitting in the receive queue to the
>> ROUTER.  We also see one connection from the client DEALER in the
>> ESTABLISHED state with a receive queue that is growing.
>>
>> It's clear that the DEALER client died abruptly once, but then
>> returned with the same identity and resumed sending messages to the
>> ROUTER.  However, none of the subsequent messages are delivered to the
>> ROUTER process.  Any ideas on why this would be the case?
>>
>> I would have provided a test case, but we aren't able to consistently
>> reproduce the issue.  I've copied the output from netstat (with
>> obfuscated IPs) below, in case it helps.
>>
>>
>> Questions:
>> - What would cause the receive queue to fill up like this on a ROUTER
>> while it continues to receive messages from other clients?  It's clear
>> that the messages are all making it to the ROUTER machine.
>> - Is it safe for DEALER sockets to abruptly disconnect and then reuse
>> their socket identity?
>> - How can we mitigate this situation?  The closest thing I see is
>> ZMQ_LINGER, but that applies only to the outgoing queue and not the
>> incoming one.
>> - Is there anything I could investigate myself to figure out whether
>> this is an issue in PyZMQ vs. libzmq?  Where should I start?
>>
>>
>> Other potentially relevant info:
>> - The ROUTER uses PyZMQ's zmq.Poller() to receive messages from the
>> problem socket and some others.  All other nodes in the system
>> continue to send and receive messages just fine.
>> - The ROUTER's send queues are pretty much empty.
>> - We see the same behavior with libzmq 4.0.4 and libzmq 2.2.x, on
>> Ubuntu 14.04.
>>
>>
>> $ netstat -a
>> Active Internet connections (servers and established)
>> Proto  Recv-Q Send-Q Local Address     Foreign Address   State
>> tcp         0      0 *:12501           *:*               LISTEN
>> tcp   1816956      0 server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
>> tcp   1551036      0 server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
>> tcp         0      0 server-ip.:12501  clientB-ip:34000  ESTABLISHED
>> tcp   5265541      0 server-ip.:12501  clientA-ip:43469  ESTABLISHED
>>
>>
>> Please let me know if further information would help.  Thank you for
>> helping build ZMQ; it's been a huge pleasure to work with so far.
>>
>> Cheers,
>> Sash
>> ___
>> zeromq-dev mailing list
>> zeromq-dev@lists.zeromq.org
>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev


[zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue

2014-06-11 Thread Sash Nagarkar
Hello ZMQ devs,

We're using PyZMQ 14.3.0 and libzmq 4.0.4 with a ROUTER-DEALER pattern
for a service we're providing.  Sorry if this is too verbose, and I
hope this is the right place to ask the question.

TL;DR: ROUTER socket doesn't receive messages from a DEALER even
though netstat shows several megabytes in the TCP receive queue
(nothing in the send queue).  Other connected DEALERs work fine.

The ROUTER socket is running on a server with ample CPU & memory
headroom, with several DEALER clients that connect, exchange messages,
and can abruptly disconnect repeatedly.  We're exclusively using
multipart messages with the first part always being the ZMQ socket
identity, which persists across DEALER connect/disconnects.  In other
words, each DEALER client uses the same socket identity across many
connects and disconnects.

Most of the time, things hum along smoothly (several thousand messages
exchanged, several dozen connect/disconnects).  However, every once in
a rare while, we see that one of the DEALER clients connects and sends
messages to the ROUTER that end up never making it to the ROUTER
process.  The ROUTER process continues to receive messages from other
DEALER clients.

Further debugging on the ROUTER server shows one (or more) TCP
connections from the client DEALER that are in the CLOSE_WAIT state
with several megabytes of data sitting in the receive queue to the
ROUTER.  We also see one connection from the client DEALER in the
ESTABLISHED state with a receive queue that is growing.

It's clear that the DEALER client died abruptly once, but then
returned with the same identity and resumed sending messages to the
ROUTER.  However, none of the subsequent messages are delivered to the
ROUTER process.  Any ideas on why this would be the case?

I would have provided a test case, but we aren't able to consistently
reproduce the issue.  I've copied the output from netstat (with
obfuscated IPs) below, in case it helps.


Questions:
- What would cause the receive queue to fill up like this on a ROUTER
while it continues to receive messages from other clients?  It's clear
that the messages are all making it to the ROUTER machine.
- Is it safe for DEALER sockets to abruptly disconnect and then reuse
their socket identity?
- How can we mitigate this situation?  The closest thing I see is
ZMQ_LINGER, but that applies only to the outgoing queue and not the
incoming one.
- Is there anything I could investigate myself to figure out whether
this is an issue in PyZMQ vs. libzmq?  Where should I start?


Other potentially relevant info:
- The ROUTER uses PyZMQ's zmq.Poller() to receive messages from the
problem socket and some others.  All other nodes in the system
continue to send and receive messages just fine.
- The ROUTER's send queues are pretty much empty.
- We see the same behavior with libzmq 4.0.4 and libzmq 2.2.x, on Ubuntu 14.04.


$ netstat -a
Active Internet connections (servers and established)
Proto  Recv-Q Send-Q Local Address     Foreign Address   State
tcp         0      0 *:12501           *:*               LISTEN
tcp   1816956      0 server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
tcp   1551036      0 server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
tcp         0      0 server-ip.:12501  clientB-ip:34000  ESTABLISHED
tcp   5265541      0 server-ip.:12501  clientA-ip:43469  ESTABLISHED
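
In case it helps anyone watch for the same symptom, here is a rough
sketch of how the netstat output above could be scanned for connections
with a backed-up receive queue. The function name and the 1 MB threshold
are arbitrary choices for illustration. The sample text repeats the
obfuscated rows from this report, and the parser assumes the usual
six-column layout.

```python
# Sample rows in the usual `netstat -a` layout (obfuscated values from this report).
SAMPLE = """\
Active Internet connections (servers and established)
Proto  Recv-Q Send-Q Local Address     Foreign Address   State
tcp         0      0 *:12501           *:*               LISTEN
tcp   1816956      0 server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
tcp   1551036      0 server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
tcp         0      0 server-ip.:12501  clientB-ip:34000  ESTABLISHED
tcp   5265541      0 server-ip.:12501  clientA-ip:43469  ESTABLISHED
"""


def stuck_connections(netstat_text, min_recv_q=1_000_000):
    """Return (state, recv_q, foreign_address) for TCP rows with a large Recv-Q."""
    rows = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state.
        # Header and banner lines do not match this shape and are skipped.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        if not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        if recv_q >= min_recv_q:
            rows.append((fields[5], recv_q, fields[4]))
    return rows


if __name__ == "__main__":
    for state, recv_q, peer in stuck_connections(SAMPLE):
        print(state, recv_q, peer)
```

Rows in CLOSE_WAIT with a non-zero Recv-Q are the ones of interest here:
the peer has closed, but the application has neither read the data nor
closed its end.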


Please let me know if further information would help.  Thank you for
helping build ZMQ; it's been a huge pleasure to work with so far.

Cheers,
Sash