Re: [zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue
Thanks Pieter! I'll try that and see if we encounter it again. Love the work you guys are doing with ZMQ.

On Thu, Jun 12, 2014 at 7:53 AM, Pieter Hintjens wrote:
> I've seen something similar (I think) with Zyre, where dealer sockets
> connecting with the same identity do weird things. Try setting
> ZMQ_ROUTER_HANDOVER on the router socket, see if that helps (you'll
> need libzmq master).

___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
Re: [zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue
I've seen something similar (I think) with Zyre, where dealer sockets connecting with the same identity do weird things. Try setting ZMQ_ROUTER_HANDOVER on the router socket, see if that helps (you'll need libzmq master).

On Thu, Jun 12, 2014 at 4:15 AM, Sash Nagarkar wrote:
> TL;DR: ROUTER socket doesn't receive messages from a DEALER even
> though netstat shows several megabytes in the TCP receive queue
> (nothing in the send queue). Other connected DEALERs work fine.
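For readers who want to try the suggestion from PyZMQ, here is a minimal sketch. It assumes a libzmq/PyZMQ build that exposes `zmq.ROUTER_HANDOVER`, which the 4.0.4 release mentioned in the thread does not. The helper names (`make_router`, `make_dealer`, `recv_with_timeout`, `demo`), the identity `client-A`, and the loopback endpoint are made up for illustration. The stale peer is approximated by a DEALER that stays connected with reconnection disabled; this is not a faithful reproduction of the abrupt death described in the thread.

```python
import zmq


def make_router(ctx):
    """Create a ROUTER with handover enabled, bound to a random local port."""
    router = ctx.socket(zmq.ROUTER)
    # Allow a new connection to take over an identity that is still
    # attached to an older connection. Set before bind() so it applies
    # to every incoming connection.
    router.setsockopt(zmq.ROUTER_HANDOVER, 1)
    router.setsockopt(zmq.LINGER, 0)
    port = router.bind_to_random_port("tcp://127.0.0.1")
    return router, "tcp://127.0.0.1:%d" % port


def make_dealer(ctx, endpoint, identity, reconnect=True):
    """Create a DEALER with a fixed identity and connect it."""
    dealer = ctx.socket(zmq.DEALER)
    dealer.setsockopt(zmq.IDENTITY, identity)  # must be set before connect()
    dealer.setsockopt(zmq.LINGER, 0)
    if not reconnect:
        # -1 disables reconnection, so this peer does not fight to
        # regain the identity after the ROUTER drops it.
        dealer.setsockopt(zmq.RECONNECT_IVL, -1)
    dealer.connect(endpoint)
    return dealer


def recv_with_timeout(sock, timeout_ms=2000):
    """Return the next multipart message, or None if nothing arrives in time."""
    if sock.poll(timeout_ms, zmq.POLLIN):
        return sock.recv_multipart()
    return None


def demo():
    ctx = zmq.Context()
    router, endpoint = make_router(ctx)

    # First connection using the identity (stands in for the stale peer).
    old = make_dealer(ctx, endpoint, b"client-A", reconnect=False)
    old.send(b"first")
    first = recv_with_timeout(router)

    # Second connection reusing the same identity while the first is
    # still attached on the ROUTER side.
    new = make_dealer(ctx, endpoint, b"client-A")
    new.send(b"second")
    second = recv_with_timeout(router)

    for sock in (old, new, router):
        sock.close()
    ctx.term()
    return first, second


if __name__ == "__main__":
    print(demo())
```

Without the `ROUTER_HANDOVER` line, the ROUTER keeps the identity bound to the first connection, so messages from the second one should not be delivered while the first is still attached. Commenting that line out is a quick way to compare the two behaviours.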
[zeromq-dev] ROUTER tcp socket stuck in CLOSE_WAIT with large receive queue
Hello ZMQ devs,

We're using PyZMQ 14.3.0 and libzmq 4.0.4 with a ROUTER-DEALER pattern for a service we're providing. Sorry if this is too verbose, and I hope this is the right place to ask the question.

TL;DR: ROUTER socket doesn't receive messages from a DEALER even though netstat shows several megabytes in the TCP receive queue (nothing in the send queue). Other connected DEALERs work fine.

The ROUTER socket is running on a server with ample CPU & memory headroom, with several DEALER clients that connect, exchange messages, and can abruptly disconnect repeatedly. We're exclusively using multipart messages with the first part always being the ZMQ socket identity, which persists across DEALER connect/disconnects. In other words, each DEALER client uses the same socket identity across many connects and disconnects.

Most of the time, things hum along smoothly (several thousand messages exchanged, several dozen connect/disconnects). However, every once in a rare while, we see that one of the DEALER clients connects and sends messages to the ROUTER that end up never making it to the ROUTER process. The ROUTER process continues to receive messages from other DEALER clients.

Further debugging on the ROUTER server shows one (or more) TCP connections from the client DEALER that are in the CLOSE_WAIT state with several megabytes of data sitting in the receive queue to the ROUTER. We also see one connection from the client DEALER in the ESTABLISHED state with a receive queue that is growing.

It's clear that the DEALER client died abruptly once, but then returned with the same identity and resumed sending messages to the ROUTER. However, none of the subsequent messages are delivered to the ROUTER process. Any ideas on why this would be the case?

I would have provided a test case, but we aren't able to consistently reproduce the issue. I've copied the output from netstat (with obfuscated IPs) below, in case it helps.

Questions:
- What would cause the receive queue to fill up like this on a ROUTER while it continues to receive messages from other clients? It's clear that the messages are all making it to the ROUTER machine.
- Is it safe for DEALER sockets to abruptly disconnect and then reuse their socket identity?
- How can we mitigate this situation? The closest thing I see is ZMQ_LINGER, but that applies only to the outgoing queue and not the incoming one.
- Is there anything I could investigate myself to figure out whether this is an issue in PyZMQ vs. libzmq? Where should I start?

Other potentially relevant info:
- The ROUTER uses PyZMQ's zmq.Poller() to receive messages from the problem socket and some others. All other nodes in the system continue to send and receive messages just fine.
- The ROUTER's send queues are pretty much empty.
- We see the same behavior with libzmq 4.0.4 and libzmq 2.2.x, on Ubuntu 14.04.

$ netstat -a
Active Internet connections (servers and established)
Proto  Recv-Q Send-Q Local Address     Foreign Address   State
tcp         0      0 *:12501           *:*               LISTEN
tcp   1816956      0 server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
tcp   1551036      0 server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
tcp         0      0 server-ip.:12501  clientB-ip:34000  ESTABLISHED
tcp   5265541      0 server-ip.:12501  clientA-ip:43469  ESTABLISHED

Please let me know if further information would help. Thank you for helping build ZMQ, it's been a huge pleasure to work with so far.

Cheers,
Sash
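Since the issue is hard to reproduce, one way to catch it when it happens is to scan netstat output for the symptom shown above: TCP connections whose receive queue has grown large. The following is a rough sketch. The names `Conn`, `parse_netstat` and `stuck_connections`, and the 1,000,000-byte threshold, are assumptions chosen for illustration. The sample text is the table from the message with the column spacing restored.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str


def parse_netstat(text: str) -> List[Conn]:
    """Parse TCP rows of the form: proto recv-q send-q local foreign state."""
    conns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue  # skip headers, prompts and non-TCP rows
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # skip rows whose queue columns are not numeric
        conns.append(
            Conn(int(fields[1]), int(fields[2]), fields[3], fields[4], fields[5])
        )
    return conns


def stuck_connections(conns: List[Conn], min_recv_q: int = 1_000_000) -> List[Conn]:
    """Return connections whose receive queue is at or above the threshold."""
    return [c for c in conns if c.recv_q >= min_recv_q]


SAMPLE = """\
Proto  Recv-Q Send-Q Local Address     Foreign Address   State
tcp         0      0 *:12501           *:*               LISTEN
tcp   1816956      0 server-ip.:12501  clientA-ip:42571  CLOSE_WAIT
tcp   1551036      0 server-ip.:12501  clientA-ip:42858  CLOSE_WAIT
tcp         0      0 server-ip.:12501  clientB-ip:34000  ESTABLISHED
tcp   5265541      0 server-ip.:12501  clientA-ip:43469  ESTABLISHED
"""

if __name__ == "__main__":
    for conn in stuck_connections(parse_netstat(SAMPLE)):
        print(conn.state, conn.remote, conn.recv_q)
```

On the sample this flags the three clientA connections and leaves the LISTEN socket and clientB alone. In practice the text would come from running netstat on the ROUTER host, and the threshold would need tuning to the normal traffic level.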