[ https://issues.apache.org/jira/browse/TS-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sudheer Vinukonda reassigned TS-3299:
-------------------------------------

    Assignee: Sudheer Vinukonda

> InactivityCop broken
> --------------------
>
>                 Key: TS-3299
>                 URL: https://issues.apache.org/jira/browse/TS-3299
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 5.2.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 5.3.0
>
>
> The patch in TS-3196 seems to result in an fd leak in our prod. There are a
> bunch of hung sockets (in CLOSE_WAIT state) that are stuck forever (they
> remain leaked for days after stopping the traffic). Debugging further, it
> appears that the InactivityCop is broken by this patch. [NOTE: We have spdy
> enabled in prod, but I am not entirely sure whether this bug only affects
> spdy connections.]
> Some info below for the leaked sockets (in CLOSE_WAIT state):
> {code}
> $ ss -s ; sudo traffic_line -r proxy.process.net.connections_currently_open; sudo traffic_line -r proxy.process.http.current_client_connections; sudo traffic_line -r proxy.process.http.current_server_connections; sudo ls -l /proc/$(pidof traffic_server)/fd/ 3>/dev/null | wc -l
> Total: 29367 (kernel 29437)
> TCP: 78235 (estab 5064, closed 46593, orphaned 0, synrecv 0, timewait 15/0), ports 918
>
> Transport Total  IP     IPv6
> *         29437  -      -
> RAW       0      0      0
> UDP       16     13     3
> TCP       31642  31637  5
> INET      31658  31650  8
> FRAG      0      0      0
>
> Password:
> 27689
> 1
> 1
> 27939
>
> A snippet from "lsof -p $(pidof traffic_server)":
> [ET_NET 10024 nobody 240u IPv4 2138575429 0t0 TCP 67.195.33.62:443->66.87.139.114:29381 (CLOSE_WAIT)
> [ET_NET 10024 nobody 241u IPv4 2137093945 0t0 TCP 67.195.33.62:443->201.122.209.180:49274 (CLOSE_WAIT)
> [ET_NET 10024 nobody 243u IPv4 2136018789 0t0 TCP 67.195.33.62:443->173.225.111.9:38133 (CLOSE_WAIT)
> [ET_NET 10024 nobody 245u IPv4 2135996293 0t0 TCP 67.195.33.62:443->172.243.79.180:52701 (CLOSE_WAIT)
> [ET_NET 10024 nobody 248u IPv4 2136468896 0t0 TCP 67.195.33.62:443->173.225.111.82:42273 (CLOSE_WAIT)
> [ET_NET 10024 nobody 253u IPv4 2140213864 0t0 TCP 67.195.33.62:443->174.138.185.120:34936 (CLOSE_WAIT)
> [ET_NET 10024 nobody 259u IPv4 2137861176 0t0 TCP 67.195.33.62:443->76.199.250.133:60631 (CLOSE_WAIT)
> [ET_NET 10024 nobody 260u IPv4 2139081493 0t0 TCP 67.195.33.62:443->187.74.154.214:58800 (CLOSE_WAIT)
> [ET_NET 10024 nobody 261u IPv4 2134948565 0t0 TCP 67.195.33.62:443->23.242.49.117:4127 (CLOSE_WAIT)
> [ET_NET 10024 nobody 262u IPv4 2135708046 0t0 TCP 67.195.33.62:443->66.241.71.243:50318 (CLOSE_WAIT)
> [ET_NET 10024 nobody 263u IPv4 2138896897 0t0 TCP 67.195.33.62:443->73.35.151.106:52414 (CLOSE_WAIT)
> [ET_NET 10024 nobody 264u IPv4 2135589029 0t0 TCP 67.195.33.62:443->96.251.12.27:62426 (CLOSE_WAIT)
> [ET_NET 10024 nobody 265u IPv4 2134930235 0t0 TCP 67.195.33.62:443->207.118.3.196:50690 (CLOSE_WAIT)
> [ET_NET 10024 nobody 267u IPv4 2137837515 0t0 TCP 67.195.33.62:443->98.112.195.98:52028 (CLOSE_WAIT)
> [ET_NET 10024 nobody 269u IPv4 2135272855 0t0 TCP 67.195.33.62:443->24.1.230.25:57265 (CLOSE_WAIT)
> [ET_NET 10024 nobody 270u IPv4 2135820802 0t0 TCP 67.195.33.62:443->24.75.122.66:14345 (CLOSE_WAIT)
> [ET_NET 10024 nobody 271u IPv4 2135475042 0t0 TCP 67.195.33.62:443->65.102.35.112:49188 (CLOSE_WAIT)
> [ET_NET 10024 nobody 272u IPv4 2135328974 0t0 TCP 67.195.33.62:443->209.242.195.252:54890 (CLOSE_WAIT)
> [ET_NET 10024 nobody 273u IPv4 2137542791 0t0 TCP 67.195.33.62:443->76.79.183.188:47048 (CLOSE_WAIT)
> [ET_NET 10024 nobody 274u IPv4 2134806135 0t0 TCP 67.195.33.62:443->189.251.149.36:58106 (CLOSE_WAIT)
> [ET_NET 10024 nobody 275u IPv4 2140126017 0t0 TCP 67.195.33.62:443->68.19.173.44:1397 (CLOSE_WAIT)
> [ET_NET 10024 nobody 276u IPv4 2134636089 0t0 TCP 67.195.33.62:443->67.44.192.72:22112 (CLOSE_WAIT)
> [ET_NET 10024 nobody 278u IPv4 2134708339 0t0 TCP 67.195.33.62:443->107.220.216.155:51242 (CLOSE_WAIT)
> [ET_NET 10024 nobody 279u IPv4 2134580888 0t0 TCP 67.195.33.62:443->50.126.116.209:59432 (CLOSE_WAIT)
> [ET_NET 10024 nobody 281u IPv4 2134868131 0t0 TCP 67.195.33.62:443->108.38.255.44:4612 (CLOSE_WAIT)
> [ET_NET 10024 nobody 282u IPv4 2139275601 0t0 TCP 67.195.33.62:443->168.99.78.5:63345 (CLOSE_WAIT)
> [ET_NET 10024 nobody 283u IPv4 2138019644 0t0 TCP 67.195.33.62:443->209.242.204.18:47547 (CLOSE_WAIT)
> [ET_NET 10024 nobody 284u IPv4 2134873211 0t0 TCP 67.195.33.62:443->99.47.247.0:58672 (CLOSE_WAIT)
> [ET_NET 10024 nobody 285u IPv4 2137679393 0t0 TCP 67.195.33.62:443->174.138.184.227:43908 (CLOSE_WAIT)
> [ET_NET 10024 nobody 287u IPv4 2137417894 0t0 TCP 67.195.33.62:443->50.46.200.8:4880 (CLOSE_WAIT)
> [ET_NET 10024 nobody 288u IPv4 2135541215 0t0 TCP 67.195.33.62:443->216.201.190.60:41016 (CLOSE_WAIT)
> [ET_NET 10024 nobody 289u IPv4 2137315400 0t0 TCP 67.195.33.62:443->98.180.195.225:53754 (CLOSE_WAIT)
> [ET_NET 10024 nobody 291u IPv4 2136931822 0t0 TCP 67.195.33.62:443->12.199.206.210:7271 (CLOSE_WAIT)
> [ET_NET 10024 nobody 292u IPv4 2137892940 0t0 TCP 67.195.33.62:443->166.173.60.100:16766 (CLOSE_WAIT)
> [ET_NET 10024 nobody 293u IPv4 2136627603 0t0 TCP 67.195.33.62:443->71.131.5.198:51552 (CLOSE_WAIT)
> [ET_NET 10024 nobody 294u IPv4 2135059554 0t0 TCP 67.195.33.62:443->209.95.183.181:39707 (CLOSE_WAIT)
> {code}
> Below are the traces showing how the inactivity cop is stuck in a loop,
> basically resetting the timer every time it expires, and never closing the
> vc:
> {code}
> [Jan 13 23:08:32.120] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
> [Jan 13 23:08:33.164] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190525 timeout in: 300
> .....
> [Jan 13 23:13:46.833] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> [Jan 13 23:13:47.845] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> [Jan 13 23:08:47.639] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop) vc: 0x2ae574f13730 inactivity timeout not set, setting a default of 300
> [Jan 13 23:08:48.666] Server {0x2ae55fc3b700} DEBUG: (inactivity_cop_verbose) vc: 0x2ae574f13730 timeout at: 1421190827 timeout in: 300
> ........
> {code}
> Attached gdb to track what's resetting the inactivity timer. It seems that
> when the inactivity timer expires for these connections, the cop's audit
> correctly generates a signal to the vc's continuation. However, since the
> patch in TS-3196 resets the read continuation to NULL, the event is simply
> ignored (in {{read_signal_and_update()}}), while at the same time the
> inactivity timer is reset. This loops forever, so these inactive connections
> are never released. Below is the debug output from gdb:
> {code}
> (gdb) cont
> Continuing.
> Hardware watchpoint 22: *(0x2ae564011a28)
> Old value = 0
> New value = -794113713
> UnixNetVConnection::set_inactivity_timeout (this=0x2ae564011830, timeout=300000000000) at P_UnixNetVConnection.h:324
> 324 P_UnixNetVConnection.h: No such file or directory.
> in P_UnixNetVConnection.h
> (gdb) cont
> Continuing.
> Hardware watchpoint 22: *(0x2ae564011a28)
> Old value = -794113713
> New value = 0
> UnixNetVConnection::mainEvent (this=0x2ae564011830, event=1, e=0x1f37c90) at UnixNetVConnection.cc:1060
> 1060 UnixNetVConnection.cc: No such file or directory.
> in UnixNetVConnection.cc
> (gdb) break 1068
> Breakpoint 23 at 0x738358: file UnixNetVConnection.cc, line 1068.
> Breakpoint 23, UnixNetVConnection::mainEvent (this=0x2ae575679c90, event=1, e=0x1f34a50) at UnixNetVConnection.cc:1068
> 1068 in UnixNetVConnection.cc
> (gdb) p closed
> $24 = 0
> (gdb) p read.vio
> $25 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae575679c90, mutex = {m_ptr = 0x2ae6241a3720}}
> Breakpoint 24, read_signal_and_update (event=105, vc=0x2ae5758ffab0) at UnixNetVConnection.cc:137
> 137 in UnixNetVConnection.cc
> (gdb) p vc->read.vio
> $33 = {_cont = 0x0, nbytes = 0, ndone = 0, op = 1, buffer = {mbuf = 0x0, entry = 0x0}, vc_server = 0x2ae5758ffab0, mutex = {m_ptr = 0x2ae6100a3a20}}
> (gdb) cont
> Continuing.
> [Switching to Thread 0x2ae55e928700 (LWP 10088)]
> {code}

-- 
This message was sent by Atlassian JIRA
(v6.3.4#6332)