[ https://issues.apache.org/jira/browse/THRIFT-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Fankhauser updated THRIFT-4080:
-------------------------------------
    Description:
I had the problem that, if the network connection is really bad, the server sometimes does not accept any more connections. "Really bad" means that a simple ping event sent via Thrift could take 15 seconds.

After having this issue for nearly 2 years, I finally figured it out:
There is no timeout when the socket receives data. After a connection is established and the socket object is created, the connection can drop, which leaves the socket blocking forever.

I added a timeout in the TSocket accept function which makes the socket throw a "Resource temporarily unavailable" error after 30 seconds:

    def accept(self):
        client, addr = self.handle.accept()
        -- added timeout of 30.0 seconds
        client.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, struct.pack('LL', 30, 0))
        result = TSocket()
        result.setHandle(client)
        return result

Gives this error:

    buff = self.handle.recv(sz)
    error: [Errno 11] Resource temporarily unavailable

I also tried using the Python socket's settimeout() function, which does not work. Only setting the receive timeout times out dropped connections.

This bug does not appear on stable connections. However, I have 4 devices that are connected via WiFi and my ThreadedServer gets stuck about 4-5 times a day. The ThreadedServer has 5 threads, thus all 5 sockets get stuck all the time...

FYI here is the strace of the stuck socket:

    [pid 2698] futex(0x7faf50000d80, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
    [pid 2697] read(4, <unfinished ...>
    [pid 2693] accept(7, {sa_family=AF_INET6, sin6_port=htons(39911), inet_pton(AF_INET6, "::ffff:46.125.249.41", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 6
    [pid 2693] recvfrom(6,

  was:
I had the problem that, if the network connection is really bad, the server sometimes does not accept any more connections. "Really bad" means that a simple ping event sent via Thrift could take 15 seconds.

After having this issue for nearly 2 years, I finally figured it out:
There is no timeout when the socket receives data. After a connection is established and the socket object is created, the connection can drop, which leaves the socket blocking forever.

I added a timeout in the TSocket accept function which makes the socket throw a "Resource temporarily unavailable" error after 30 seconds:

    def accept(self):
        client, addr = self.handle.accept()
        # added timeout of 30.0 seconds
        client.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, struct.pack('LL', 30, 0))
        result = TSocket()
        result.setHandle(client)
        return result

Gives this error:

    buff = self.handle.recv(sz)
    error: [Errno 11] Resource temporarily unavailable

I also tried using the Python socket's settimeout() function, which does not work. Only setting the receive timeout times out dropped connections.

This bug does not appear on stable connections. However, I have 4 devices that are connected via WiFi and my ThreadedServer gets stuck about 4-5 times a day. The ThreadedServer has 5 threads, thus all 5 sockets get stuck all the time...

FYI here is the strace of the stuck socket:

    [pid 2698] futex(0x7faf50000d80, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
    [pid 2697] read(4, <unfinished ...>
    [pid 2693] accept(7, {sa_family=AF_INET6, sin6_port=htons(39911), inet_pton(AF_INET6, "::ffff:46.125.249.41", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 6
    [pid 2693] recvfrom(6,


> Unix sockets can get stuck forever
> ----------------------------------
>
>                 Key: THRIFT-4080
>                 URL: https://issues.apache.org/jira/browse/THRIFT-4080
>             Project: Thrift
>          Issue Type: Bug
>          Components: Python - Library
>    Affects Versions: 0.10.0
>         Environment: Ubuntu 14.04
>            Reporter: David Fankhauser
>            Priority: Critical
>
> I had the problem that, if the network connection is really bad, the server
> sometimes does not accept any more connections. "Really bad" means that a
> simple ping event sent via Thrift could take 15 seconds.
>
> After having this issue for nearly 2 years, I finally figured it out:
> There is no timeout when the socket receives data. After a connection is
> established and the socket object is created, the connection can drop, which
> leaves the socket blocking forever.
>
> I added a timeout in the TSocket accept function which makes the socket throw
> a "Resource temporarily unavailable" error after 30 seconds:
>
>     def accept(self):
>         client, addr = self.handle.accept()
>         -- added timeout of 30.0 seconds
>         client.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, struct.pack('LL', 30, 0))
>         result = TSocket()
>         result.setHandle(client)
>         return result
>
> Gives this error:
>
>     buff = self.handle.recv(sz)
>     error: [Errno 11] Resource temporarily unavailable
>
> I also tried using the Python socket's settimeout() function, which does not
> work. Only setting the receive timeout times out dropped connections.
>
> This bug does not appear on stable connections. However, I have 4 devices
> that are connected via WiFi and my ThreadedServer gets stuck about 4-5 times
> a day. The ThreadedServer has 5 threads, thus all 5 sockets get stuck all
> the time...
>
> FYI here is the strace of the stuck socket:
>
>     [pid 2698] futex(0x7faf50000d80, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
>     [pid 2697] read(4, <unfinished ...>
>     [pid 2693] accept(7, {sa_family=AF_INET6, sin6_port=htons(39911), inet_pton(AF_INET6, "::ffff:46.125.249.41", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 6
>     [pid 2693] recvfrom(6,

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
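
For anyone who wants to try the receive-timeout idea from the description outside of Thrift, here is a minimal standalone sketch. It is not the Thrift code and not a proposed patch: the helper names (set_recv_timeout, recv_with_deadline, demo) are made up for illustration, a socketpair stands in for an accepted client whose peer has gone silent, and packing the timeval as two native longs is an assumption that should hold on common Linux builds but may not be portable.

```python
import errno
import socket
import struct
import time


def set_recv_timeout(sock, seconds):
    """Set SO_RCVTIMEO so that a blocked recv() gives up after `seconds`."""
    whole = int(seconds)
    micros = int(round((seconds - whole) * 1000000))
    # Assumption: struct timeval is two native longs (tv_sec, tv_usec),
    # which matches common Linux builds but is not guaranteed everywhere.
    timeval = struct.pack('ll', whole, micros)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)


def recv_with_deadline(sock, size):
    """Return the received bytes, or None if the receive timeout expired."""
    try:
        return sock.recv(size)
    except OSError as exc:
        # An expired SO_RCVTIMEO surfaces as EAGAIN / EWOULDBLOCK
        # ("Resource temporarily unavailable").
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise


def demo(timeout=0.2):
    """Wait on a silent peer and report what recv returned and how long it took."""
    # Hypothetical stand-in for a connection that dropped without a FIN/RST.
    server_side, client_side = socket.socketpair()
    try:
        set_recv_timeout(server_side, timeout)
        start = time.monotonic()
        data = recv_with_deadline(server_side, 1024)
        elapsed = time.monotonic() - start
        return data, elapsed
    finally:
        server_side.close()
        client_side.close()


if __name__ == '__main__':
    print(demo())
```

In a server loop, a None result from recv_with_deadline would be the point to close the connection and free the worker thread, instead of letting it sit in recv indefinitely.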