[ https://issues.apache.org/jira/browse/TS-4475?focusedWorklogId=28719&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-28719 ]
ASF GitHub Bot logged work on TS-4475: -------------------------------------- Author: ASF GitHub Bot Created on: 10/Sep/16 12:36 Start Date: 10/Sep/16 12:36 Worklog Time Spent: 10m Work Description: Github user shinrich commented on the issue: https://github.com/apache/trafficserver/pull/831 Sorry for the delay. Looks good to me as well. Issue Time Tracking ------------------- Worklog Id: (was: 28719) Time Spent: 2h 50m (was: 2h 40m) > Crash in Log-Collation client after using inactivity-cop. > --------------------------------------------------------- > > Key: TS-4475 > URL: https://issues.apache.org/jira/browse/TS-4475 > Project: Traffic Server > Issue Type: Bug > Components: Logging > Affects Versions: 6.1.1 > Reporter: Peter Chou > Fix For: sometime > > Time Spent: 2h 50m > Remaining Estimate: 0h > > Background: We recently tried making use of inactivity-cop by setting it to > 300s instead of the default one-day setting. This was to address an issue > where, under heavy load, ATS would become un-responsive to client requests, > and the condition would persist after traffic was stopped with the active > queue saying 0 connections but 'netstat -na' showing a bunch of established > connections (up to the throttle limit approximately). > Inactivity cop seemed to help ATS handle this situation, but we have since > experienced a couple of core dumps over the last four day period. It seems > occasionally the Log Collation Client State Machine will have event value 105 > or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() > it tries to call the continuation handler which down the line does not know > about this event thus causing core dump !"unexpcted state" [sic]. > Here is the back-trace -- > (gdb) bt > #0 0x00002b67cd5405f7 in raise () from /lib64/libc.so.6 > #1 0x00002b67cd541e28 in abort () from /lib64/libc.so.6 > #2 0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43 > #3 0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed > assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65 > #4 0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: > failed assert `%s`") at ink_error.cc:73 > #5 0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted > state\"", file=0x7fb35b "LogCollationClientSM.cc", > line=445) at ink_assert.cc:37 > #6 0x000000000069c86b in LogCollationClientSM::client_idle > (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445 > #7 0x000000000069b427 in LogCollationClientSM::client_handler > (this=0x2b681400bb00, event=105, data=0x2b680c017020) > at LogCollationClientSM.cc:119 > #8 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, > event=105, data=0x2b680c017020) > at ../iocore/eventsystem/I_Continuation.h:153 > #9 0x0000000000783d40 in read_signal_and_update (event=105, > vc=0x2b680c016f00) at UnixNetVConnection.cc:150 > #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, > event=1, e=0x127ad60) at UnixNetVConnection.cc:1188 > #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, > event=1, data=0x127ad60) > at ../iocore/eventsystem/I_Continuation.h:153 > #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, > event=2, e=0x127ad60) at UnixNet.cc:102 > #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, > data=0x127ad60) > at ../iocore/eventsystem/I_Continuation.h:153 > #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, > e=0x127ad60, calling_code=2) at UnixEThread.cc:128 > #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at > UnixEThread.cc:207 > #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918 > I believe it takes a wrong turn here -- > #9 0x0000000000783d40 in read_signal_and_update (event=105, > vc=0x2b680c016f00) at UnixNetVConnection.cc:150 > 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); > (gdb) list > 145 static inline int > 146 read_signal_and_update(int event, UnixNetVConnection *vc) > 147 { > 148 vc->recursion++; > 149 if (vc->read.vio._cont) { > 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); > 151 } else { > 152 switch (event) { > 153 case VC_EVENT_EOS: > 154 case VC_EVENT_ERROR: > (gdb) list > 155 case VC_EVENT_ACTIVE_TIMEOUT: > 156 case VC_EVENT_INACTIVITY_TIMEOUT: > 157 Debug("inactivity_cop", "event %d: null read.vio cont, closing > vc %p", event, vc); > 158 vc->closed = 1; > 159 break; > 160 default: > 161 Error("Unexpected event %d for vc %p", event, vc); > 162 ink_release_assert(0); > 163 break; > 164 } > Note: I understand that there were several issues related to TS-3196 > concerning inactivity_cop and this section of code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)