[ https://issues.apache.org/jira/browse/TS-4475?focusedWorklogId=29198&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29198 ]
ASF GitHub Bot logged work on TS-4475: -------------------------------------- Author: ASF GitHub Bot Created on: 15/Sep/16 20:22 Start Date: 15/Sep/16 20:22 Worklog Time Spent: 10m Work Description: GitHub user pbchou opened a pull request: https://github.com/apache/trafficserver/pull/1026 TS-4475 Fix Log Collation Crash (back-port). @shinrich @oknet -- Susan, this PR is just a back-port to 6.2.x of the Log Collation crash fix. I believe Oknet already +1 the back-port as part of the previous PR #831 . I think this should be back-ported as this is a crash bug and 6.2.x is a LTS release. You can merge this pull request into a Git repository by running: $ git pull https://github.com/pbchou/trafficserver TS-4475-backport-6.2.x Alternatively you can review and apply these changes as the patch at: https://github.com/apache/trafficserver/pull/1026.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1026 ---- commit 1cc28eb9ca06ae6834a779ee889f8b9c3ab8c2b9 Author: Peter Chou <pbc...@labs.att.com> Date: 2016-06-24T23:21:05Z TS-4475: Modified the event handlers in LogCollationClientSM.cc to handle and ignore the VC_EVENT_INACTIVITY_TIMEOUT event. Previously, this event was un-handled and would cause a core dump. (cherry picked from commit 0aefbe5349432e44031401ca6a60535a8516adcb) commit c4344440715e72fc87fa5c3ad4b788317f013e80 Author: Peter Chou <pbc...@labs.att.com> Date: 2016-08-03T19:20:45Z TS-4475 : Handle VC_EVENT_INACTIVITY_TIMEOUT in LogCollationClientSM.cc Based on suggestions from Oknet Xu, we now close the client connection on the timeout event instead of ignoring it. Also, we set the net_vc associated with the client connection to have a timeout of one day or 86400 seconds instead of the default value. (cherry picked from commit efd3582b92d825649216705521dbe92784a7279b) commit 569407f970489f3d16b835cc7f990d902e946300 Author: Peter Chou <pbc...@labs.att.com> Date: 2016-08-05T23:35:09Z TS-4475 : Handle VC_EVENT_INACTIVITY_TIMEOUT in LogCollationClientSM.cc Based on suggestions from Susan Hinrich, we now also handle the VC_EVENT_ACTIVE_TIMEOUT event. In addition, we also perform the same clean-up activities performed for the EOS and ERROR events in addition to closing the client connection. (cherry picked from commit 483412986c5875ed2451b72328e8548dfc7ef207) commit 6f3795631011da4232ebe99c94206168318a9b47 Author: Peter Chou <pbc...@labs.att.com> Date: 2016-08-09T23:52:47Z TS-4475: Fix un-handled event crash in LogCollationClientSM.cc I modified the comment and used the post-assignment pointer (instead of the pre-assignment pointer) to implement an explicit inactivity time-out value for Log Collation client net-vcs so that they do not use the default defined in the config. (cherry picked from commit cb2fa3c5159f17dc5a8e75bfaa4e3cacd8251071) commit a23a620c53c50e1f15e10cbdc3057877697a59ad Author: Peter Chou <pbc...@labs.att.com> Date: 2016-08-22T19:56:18Z TS-4475: Fix un-handled time-out event crash in Log Collation 1. For collation hosts and clients, handle VC_EVENT_ACTIVE_TIMEOUT and VC_EVENT_INACTIVITY_TIMEOUT events and treat them the same as other error conditions (EOS and ERROR) including performing any clean-up operations. 2. Added two configuration options for log collation VC time-outs -- - proxy.config.log.collation_host_timeout INT 86390 - proxy.config.log.collation_client_timeout INT 86400 These options control how long before the host and client ends of the log collation net-vc will time-out after being inactive. The host end defaults to 10 fewer seconds to avoid any race condition. If the host disconnects first, the client will re-establish with a fresh connection. (cherry picked from commit b6ed91805d9769394131b577045fb6331a25f5c8) commit fb4ad2cdc92f6023b7cf61b8836a3535f3e41a0b Author: Peter Chou <pbc...@labs.att.com> Date: 2016-09-01T19:06:24Z TS-4475: Fix un-handled time-out event crash in Log Collation Updated the documenation to explain the difference in host/client default time-outs. (cherry picked from commit 4c03b08851eb55ee92537029fbc99cf760542ed6) ---- Issue Time Tracking ------------------- Worklog Id: (was: 29198) Time Spent: 3h 10m (was: 3h) > Crash in Log-Collation client after using inactivity-cop. > --------------------------------------------------------- > > Key: TS-4475 > URL: https://issues.apache.org/jira/browse/TS-4475 > Project: Traffic Server > Issue Type: Bug > Components: Logging > Affects Versions: 6.1.1 > Reporter: Peter Chou > Fix For: sometime > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Background: We recently tried making use of inactivity-cop by setting it to > 300s instead of the default one-day setting. This was to address an issue > where, under heavy load, ATS would become un-responsive to client requests, > and the condition would persist after traffic was stopped with the active > queue saying 0 connections but 'netstat -na' showing a bunch of established > connections (up to the throttle limit approximately). > Inactivity cop seemed to help ATS handle this situation, but we have since > experienced a couple of core dumps over the last four day period. It seems > occasionally the Log Collation Client State Machine will have event value 105 > or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() > it tries to call the continuation handler which down the line does not know > about this event thus causing core dump !"unexpcted state" [sic]. > Here is the back-trace -- > (gdb) bt > #0 0x00002b67cd5405f7 in raise () from /lib64/libc.so.6 > #1 0x00002b67cd541e28 in abort () from /lib64/libc.so.6 > #2 0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43 > #3 0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed > assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65 > #4 0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: > failed assert `%s`") at ink_error.cc:73 > #5 0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted > state\"", file=0x7fb35b "LogCollationClientSM.cc", > line=445) at ink_assert.cc:37 > #6 0x000000000069c86b in LogCollationClientSM::client_idle > (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445 > #7 0x000000000069b427 in LogCollationClientSM::client_handler > (this=0x2b681400bb00, event=105, data=0x2b680c017020) > at LogCollationClientSM.cc:119 > #8 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, > event=105, data=0x2b680c017020) > at ../iocore/eventsystem/I_Continuation.h:153 > #9 0x0000000000783d40 in read_signal_and_update (event=105, > vc=0x2b680c016f00) at UnixNetVConnection.cc:150 > #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, > event=1, e=0x127ad60) at UnixNetVConnection.cc:1188 > #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, > event=1, data=0x127ad60) > at ../iocore/eventsystem/I_Continuation.h:153 > #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, > event=2, e=0x127ad60) at UnixNet.cc:102 > #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, > data=0x127ad60) > at ../iocore/eventsystem/I_Continuation.h:153 > #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, > e=0x127ad60, calling_code=2) at UnixEThread.cc:128 > #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at > UnixEThread.cc:207 > #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918 > I believe it takes a wrong turn here -- > #9 0x0000000000783d40 in read_signal_and_update (event=105, > vc=0x2b680c016f00) at UnixNetVConnection.cc:150 > 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); > (gdb) list > 145 static inline int > 146 read_signal_and_update(int event, UnixNetVConnection *vc) > 147 { > 148 vc->recursion++; > 149 if (vc->read.vio._cont) { > 150 vc->read.vio._cont->handleEvent(event, &vc->read.vio); > 151 } else { > 152 switch (event) { > 153 case VC_EVENT_EOS: > 154 case VC_EVENT_ERROR: > (gdb) list > 155 case VC_EVENT_ACTIVE_TIMEOUT: > 156 case VC_EVENT_INACTIVITY_TIMEOUT: > 157 Debug("inactivity_cop", "event %d: null read.vio cont, closing > vc %p", event, vc); > 158 vc->closed = 1; > 159 break; > 160 default: > 161 Error("Unexpected event %d for vc %p", event, vc); > 162 ink_release_assert(0); > 163 break; > 164 } > Note: I understand that there were several issues related to TS-3196 > concerning inactivity_cop and this section of code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)