[ https://issues.apache.org/jira/browse/TS-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sudheer Vinukonda reassigned TS-3299:
-------------------------------------

    Assignee: Sudheer Vinukonda

> InactivityCop broken
> --------------------
>
>                 Key: TS-3299
>                 URL: https://issues.apache.org/jira/browse/TS-3299
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 5.2.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 5.3.0
>
>
> The patch in TS-3196 seems to result in an fd leak in our prod. There are a
> bunch of hung sockets (in CLOSE_WAIT state) that are stuck forever (they
> remain leaked for days after stopping the traffic). Debugging further, it
> appears that the InactivityCop is broken by this patch. [NOTE: We have spdy
> enabled in prod, but I am not entirely sure whether this bug only affects
> spdy connections.]
> Some info below for the leaked sockets (in CLOSE_WAIT state):
> {code}
> $ ss -s ; sudo traffic_line -r proxy.process.net.connections_currently_open; sudo traffic_line -r proxy.process.http.current_client_connections; sudo traffic_line -r proxy.process.http.current_server_connections; sudo ls -l /proc/$(pidof traffic_server)/fd/ 3>/dev/null | wc -l
> Total: 29367 (kernel 29437)
> TCP: 78235 (estab 5064, closed 46593, orphaned 0, synrecv 0, timewait 15/0), ports 918
>
> Transport Total  IP     IPv6
> *         29437  -      -
> RAW       0      0      0
> UDP       16     13     3
> TCP       31642  31637  5
> INET      31658  31650  8
> FRAG      0      0      0
>
> Password:
> 27689
> 1
> 1
> 27939
>
> A snippet from "lsof -p $(pidof traffic_server)":
> [ET_NET 10024 nobody 240u IPv4 2138575429 0t0 TCP 67.195.33.62:443->66.87.139.114:29381 (CLOSE_WAIT)
> [ET_NET 10024 nobody 241u IPv4 2137093945 0t0 TCP 67.195.33.62:443->201.122.209.180:49274 (CLOSE_WAIT)
> [ET_NET 10024 nobody 243u IPv4 2136018789 0t0 TCP 67.195.33.62:443->173.225.111.9:38133 (CLOSE_WAIT)
> [ET_NET 10024 nobody 245u IPv4 2135996293 0t0 TCP 67.195.33.62:443->172.243.79.180:52701 (CLOSE_WAIT)
> [ET_NET 10024 nobody 248u IPv4 2136468896 0t0 TCP 67.195.33.62:443->173.225.111.82:42273 (CLOSE_WAIT)
> [ET_NET 10024 nobody 253u IPv4 2140213864 0t0 TCP 67.195.33.62:443->174.138.185.120:34936 (CLOSE_WAIT)
> [ET_NET 10024 nobody 259u IPv4 2137861176 0t0 TCP 67.195.33.62:443->76.199.250.133:60631 (CLOSE_WAIT)
> [ET_NET 10024 nobody 260u IPv4 2139081493 0t0 TCP 67.195.33.62:443->187.74.154.214:58800 (CLOSE_WAIT)
> [ET_NET 10024 nobody 261u IPv4 2134948565 0t0 TCP 67.195.33.62:443->23.242.49.117:4127 (CLOSE_WAIT)
> [ET_NET 10024 nobody 262u IPv4 2135708046 0t0 TCP 67.195.33.62:443->66.241.71.243:50318 (CLOSE_WAIT)
> [ET_NET 10024 nobody 263u IPv4 2138896897 0t0 TCP 67.195.33.62:443->73.35.151.106:52414 (CLOSE_WAIT)
> [ET_NET 10024 nobody 264u IPv4 2135589029 0t0 TCP 67.195.33.62:443->96.251.12.27:62426 (CLOSE_WAIT)
> [ET_NET 10024 nobody 265u IPv4 2134930235 0t0 TCP 67.195.33.62:443->207.118.3.196:50690 (CLOSE_WAIT)
> [ET_NET 10024 nobody 267u IPv4 2137837515 0t0 TCP 67.195.33.62:443->98.112.195.98:52028 (CLOSE_WAIT)
> [ET_NET 10024 nobody 269u IPv4 2135272855 0t0 TCP 67.195.33.62:443->24.1.230.25:57265 (CLOSE_WAIT)
> [ET_NET 10024 nobody 270u IPv4 2135820802 0t0 TCP 67.195.33.62:443->24.75.122.66:14345 (CLOSE_WAIT)
> [ET_NET 10024 nobody 271u IPv4 2135475042 0t0 TCP 67.195.33.62:443->65.102.35.112:49188 (CLOSE_WAIT)
> [ET_NET 10024 nobody 272u IPv4 2135328974 0t0 TCP 67.195.33.62:443->209.242.195.252:54890 (CLOSE_WAIT)
> [ET_NET 10024 nobody 273u IPv4 2137542791 0t0 TCP 67.195.33.62:443->76.79.183.188:47048 (CLOSE_WAIT)
> [ET_NET 10024 nobody 274u IPv4 2134806135 0t0 TCP 67.195.33.62:443->189.251.149.36:58106 (CLOSE_WAIT)
> [ET_NET 10024 nobody 275u IPv4 2140126017 0t0 TCP 67.195.33.62:443->68.19.173.44:1397 (CLOSE_WAIT)
> [ET_NET 10024 nobody 276u IPv4 2134636089 0t0 TCP 67.195.33.62:443->67.44.192.72:22112 (CLOSE_WAIT)
> [ET_NET 10024 nobody 278u IPv4 2134708339 0t0 TCP 67.195.33.62:443->107.220.216.155:51242 (CLOSE_WAIT)
> [ET_NET 10024 nobody 279u IPv4 2134580888 0t0 TCP 67.195.33.62:443->50.126.116.209:59432 (CLOSE_WAIT)
> [ET_NET 10024 nobody 281u IPv4 2134868131 0t0 TCP 67.195.33.62:443->108.38.255.44:4612 (CLOSE_WAIT)
> [ET_NET 10024 nobody 282u IPv4 2139275601 0t0 TCP 67.195.33.62:443->168.99.78.5:63345 (CLOSE_WAIT)
> [ET_NET 10024 nobody 283u IPv4 2138019644 0t0 TCP 67.195.33.62:443->209.242.204.18:47547 (CLOSE_WAIT)
> [ET_NET 10024 nobody 284u IPv4 2134873211 0t0 TCP 67.195.33.62:443->99.47.247.0:58672 (CLOSE_WAIT)
> [ET_NET 10024 nobody 285u IPv4 2137679393 0t0 TCP 67.195.33.62:443->174.138.184.227:43908 (CLOSE_WAIT)
> [ET_NET 10024 nobody 287u IPv4 2137417894 0t0 TCP 67.195.33.62:443->50.46.200.8:4880 (CLOSE_WAIT)
> [ET_NET 10024 nobody 288u IPv4 2135541215 0t0 TCP 67.195.33.62:443->216.201.190.60:41016 (CLOSE_WAIT)
> [ET_NET 10024 nobody 289u IPv4 2137315400 0t0 TCP 67.195.33.62:443->98.180.195.225:53754 (CLOSE_WAIT)
> [ET_NET 10024 nobody 291u IPv4 2136931822 0t0 TCP 67.195.33.62:443->12.199.206.210:7271 (CLOSE_WAIT)
> [ET_NET 10024 nobody 292u IPv4 2137892940 0t0 TCP 67.195.33.62:443->166.173.60.100:16766 (CLOSE_WAIT)
> [ET_NET 10024 nobody 293u IPv4 2136627603 0t0 TCP 67.195.33.62:443->71.131.5.198:51552 (CLOSE_WAIT)
> [ET_NET 10024 nobody 294u IPv4 2135059554 0t0 TCP 67.195.33.62:443->209.95.183.181:39707 (CLOSE_WAIT)
> {code}
> Below are the traces showing how the inactivity cop is stuck in a loop,
> basically resetting the timer every time it expires, and never closing the
> vc:
> {code}
> [Jan 13 23:08:32.120] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
> [Jan 13 23:08:33.164] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
> .....
> [Jan 13 23:13:46.833] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> [Jan 13 23:13:47.845] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> [Jan 13 23:08:47.639] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop) vc: 0x2ae574f13730 inactivity timeout not set, setting a default of 300
> [Jan 13 23:08:48.666] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> ........
> {code}
> Attached gdb to track what's resetting the inactivity timer. It seems that
> when the inactivity timer expires for these connections, the cop's audit
> correctly generates a signal to the vc's continuation. However, since the
> patch in TS-3196 resets the read continuation to NULL, the event is simply
> ignored (in {{read_signal_and_update()}}), while at the same time the
> inactivity timer is reset. This loops forever, so these inactive connections
> are never released. Below is the debug output from gdb:
> {code}
> (gdb) cont
> Continuing.
> Hardware watchpoint 22: *(0x2ae564011a28)
> Old value = 0
> New value = -794113713
> UnixNetVConnection::set_inactivity_timeout (this=0x2ae564011830, timeout=300000000000) at P_UnixNetVConnection.h:324
> 324 P_UnixNetVConnection.h: No such file or directory.
> in P_UnixNetVConnection.h
> (gdb) cont
> Continuing.
> Hardware watchpoint 22: *(0x2ae564011a28)
> Old value = -794113713
> New value = 0
> UnixNetVConnection::mainEvent (this=0x2ae564011830, event=1, e=0x1f37c90) at UnixNetVConnection.cc:1060
> 1060 UnixNetVConnection.cc: No such file or directory.
> in UnixNetVConnection.cc
> (gdb) break 1068
> Breakpoint 23 at 0x738358: file UnixNetVConnection.cc, line 1068.
> Breakpoint 23, UnixNetVConnection::mainEvent (this=0x2ae575679c90, event=1, e=0x1f34a50) at UnixNetVConnection.cc:1068
> 1068 in UnixNetVConnection.cc
> (gdb) p closed
> $24 = 0
> (gdb) p read.vio
> $25 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae575679c90, mutex = {m_ptr = 0x2ae6241a3720}}
> Breakpoint 24, read_signal_and_update (event=105, vc=0x2ae5758ffab0) at UnixNetVConnection.cc:137
> 137 in UnixNetVConnection.cc
> (gdb) p vc->read.vio
> $33 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae5758ffab0, mutex = {m_ptr = 0x2ae6100a3a20}}
> (gdb) cont
> Continuing.
> [Switching to Thread 0x2ae55e928700 (LWP 10088)]
> {code}

-- 
This message was sent by Atlassian JIRA
(v6.3.4#6332)