[ 
https://issues.apache.org/jira/browse/THRIFT-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Fankhauser updated THRIFT-4080:
-------------------------------------
    Description: 
When the network connection is really bad, the server sometimes stops 
accepting new connections. "Really bad" means that a simple ping sent via 
Thrift can take 15 seconds.

After nearly 2 years with this issue, I finally figured it out: there is no 
timeout when the socket receives data. After a connection is established and 
the socket object is created, the connection can drop, which leaves the socket 
blocking forever.

I added a timeout in the TSocket accept function, which makes the socket raise 
a "Resource temporarily unavailable" error after 30 seconds:

import socket
import struct

def accept(self):
    client, addr = self.handle.accept()
    # added receive timeout of 30.0 seconds (struct timeval: 30 s, 0 us)
    client.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO,
                      struct.pack('LL', 30, 0))
    result = TSocket()
    result.setHandle(client)
    return result

With this change, a dropped connection produces the following error:
buff = self.handle.recv(sz)
error: [Errno 11] Resource temporarily unavailable
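
The behaviour of the receive timeout can be reproduced outside of Thrift. The 
following is only a minimal sketch, not Thrift code: set_recv_timeout and 
recv_or_none are hypothetical helper names, the local socket pair stands in 
for a real client connection, and the 'll' layout assumes that struct timeval 
is two C longs, as on typical Linux systems.

```python
import errno
import socket
import struct


def set_recv_timeout(sock, seconds):
    """Set SO_RCVTIMEO on sock (hypothetical helper).

    Assumes struct timeval is laid out as two C longs (typical Linux).
    """
    sec = int(seconds)
    usec = int((seconds - sec) * 1_000_000)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO,
                    struct.pack('ll', sec, usec))


def recv_or_none(sock, size):
    """Return the received bytes, or None if the receive timeout expired."""
    try:
        return sock.recv(size)
    except OSError as exc:
        # An expired SO_RCVTIMEO surfaces as EAGAIN / EWOULDBLOCK
        # ("Resource temporarily unavailable").
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise


if __name__ == '__main__':
    a, b = socket.socketpair()      # stand-in for an accepted connection
    set_recv_timeout(a, 0.2)
    print(recv_or_none(a, 16))      # the peer is silent, so recv times out
    b.sendall(b'ping')
    print(recv_or_none(a, 16))      # data is available now
    a.close()
    b.close()
```

Catching the error and closing the connection is one possible way to release 
the handler; whether that fits Thrift's own error handling is not covered here.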

I also tried Python's socket.settimeout(), which did not work. Only setting 
the receive timeout (SO_RCVTIMEO) makes dropped connections time out.

This bug does not appear on stable connections. However, I have 4 devices 
connected via WiFi, and my ThreadedServer gets stuck about 4-5 times a day.  
The ThreadedServer has 5 threads, so all 5 sockets keep getting stuck...
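
To illustrate how a small, fixed number of threads can all end up blocked, 
here is a rough simulation. It is not the Thrift server implementation: 
worker and count_stuck_workers are hypothetical names, local socket pairs 
play the role of clients that connected and then went silent, and the 'll' 
timeval layout assumes a typical Linux system.

```python
import socket
import struct
import threading
import time


def worker(conn, recv_timeout, log):
    """Hypothetical worker: wait for one request on conn, then finish."""
    if recv_timeout is not None:
        sec = int(recv_timeout)
        usec = int((recv_timeout - sec) * 1_000_000)
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO,
                        struct.pack('ll', sec, usec))
    try:
        conn.recv(16)               # blocks until data, EOF, or the timeout
        log.append('finished')
    except OSError:
        log.append('timed out')
    finally:
        conn.close()


def count_stuck_workers(n_workers, recv_timeout, wait=1.0):
    """Start n_workers against silent peers; return how many still block."""
    log, peers, threads = [], [], []
    for _ in range(n_workers):
        conn, peer = socket.socketpair()
        peers.append(peer)          # keep the peer open, never send anything
        t = threading.Thread(target=worker,
                             args=(conn, recv_timeout, log), daemon=True)
        t.start()
        threads.append(t)
    time.sleep(wait)
    stuck = sum(t.is_alive() for t in threads)
    for peer in peers:              # closing the peers lets blocked workers see EOF
        peer.close()
    for t in threads:
        t.join(wait)
    return stuck


if __name__ == '__main__':
    print(count_stuck_workers(5, None, wait=0.5))   # no receive timeout
    print(count_stuck_workers(5, 0.2, wait=1.0))    # 0.2 s receive timeout
```

Without a receive timeout every worker stays blocked for as long as its peer 
stays silent; with one, each worker gives up after the timeout and becomes 
available again.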


FYI here is the strace of the stuck socket:
[pid  2698] futex(0x7faf50000d80, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  2697] read(4,  <unfinished ...>
[pid  2693] accept(7, {sa_family=AF_INET6, sin6_port=htons(39911), inet_pton(AF_INET6, "::ffff:46.125.249.41", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 6
[pid  2693] recvfrom(6, 


> Unix sockets can get stuck forever
> ----------------------------------
>
>                 Key: THRIFT-4080
>                 URL: https://issues.apache.org/jira/browse/THRIFT-4080
>             Project: Thrift
>          Issue Type: Bug
>          Components: Python - Library
>    Affects Versions: 0.10.0
>         Environment: Ubuntu 14.04
>            Reporter: David Fankhauser
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
