Peter Chou created TS-4475: ------------------------------ Summary: Crash in Log-Collation client after using inactivity-cop. Key: TS-4475 URL: https://issues.apache.org/jira/browse/TS-4475 Project: Traffic Server Issue Type: Bug Components: Logging Reporter: Peter Chou
Background: We recently tried making use of inactivity-cop by setting it to 300s instead of the default one-day setting. This was to address an issue where, under heavy load, ATS would become un-responsive to client requests, and the condition would persist after traffic was stopped with the active queue saying 0 connections but 'netstat -na' showing a bunch of established connections (up to the throttle limit approximately). Inactivity cop seemed to help ATS handle this situation, but we have since experienced a couple of core dumps over the last four day period. It seems occasionally the Log Collation Client State Machine will have event value 105 or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() it tries to call the continuation handler which down the line does not know about this event thus causing core dump !"unexpcted state" [sic]. Here is the back-trace -- (gdb) bt #0 0x00002b67cd5405f7 in raise () from /lib64/libc.so.6 #1 0x00002b67cd541e28 in abort () from /lib64/libc.so.6 #2 0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43 #3 0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65 #4 0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: failed assert `%s`") at ink_error.cc:73 #5 0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted state\"", file=0x7fb35b "LogCollationClientSM.cc", line=445) at ink_assert.cc:37 #6 0x000000000069c86b in LogCollationClientSM::client_idle (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445 #7 0x000000000069b427 in LogCollationClientSM::client_handler (this=0x2b681400bb00, event=105, data=0x2b680c017020) at LogCollationClientSM.cc:119 #8 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, event=105, data=0x2b680c017020) at ../iocore/eventsystem/I_Continuation.h:153 #9 0x0000000000783d40 in read_signal_and_update (event=105, vc=0x2b680c016f00) at UnixNetVConnection.cc:150 #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, event=1, e=0x127ad60) at UnixNetVConnection.cc:1188 #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, event=1, data=0x127ad60) at ../iocore/eventsystem/I_Continuation.h:153 #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, event=2, e=0x127ad60) at UnixNet.cc:102 #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, data=0x127ad60) at ../iocore/eventsystem/I_Continuation.h:153 #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, e=0x127ad60, calling_code=2) at UnixEThread.cc:128 #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at UnixEThread.cc:207 #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918 I believe it takes a wrong turn here -- #9 0x0000000000783d40 in read_signal_and_update (event=105, vc=0x2b680c016f00) at UnixNetVConnection.cc:150 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); (gdb) list 145 static inline int 146 read_signal_and_update(int event, UnixNetVConnection *vc) 147 { 148 vc->recursion++; 149 if (vc->read.vio._cont) { 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); 151 } else { 152 switch (event) { 153 case VC_EVENT_EOS: 154 case VC_EVENT_ERROR: (gdb) list 155 case VC_EVENT_ACTIVE_TIMEOUT: 156 case VC_EVENT_INACTIVITY_TIMEOUT: 157 Debug("inactivity_cop", "event %d: null read.vio cont, closing vc %p", event, vc); 158 vc->closed = 1; 159 break; 160 default: 161 Error("Unexpected event %d for vc %p", event, vc); 162 ink_release_assert(0); 163 break; 164 } Note: I understand that there were several issues related to TS-3196 concerning inactivity_cop and this section of code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)