[ 
https://issues.apache.org/jira/browse/TS-4475?focusedWorklogId=29198&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-29198
 ]

ASF GitHub Bot logged work on TS-4475:
--------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/16 20:22
            Start Date: 15/Sep/16 20:22
    Worklog Time Spent: 10m 
      Work Description: GitHub user pbchou opened a pull request:

    https://github.com/apache/trafficserver/pull/1026

    TS-4475 Fix Log Collation Crash (back-port).

    @shinrich @oknet -- Susan, this PR is just a back-port to 6.2.x of the Log 
Collation crash fix. I believe Oknet already +1 the back-port as part of the 
previous PR #831 . I think this should be back-ported as this is a crash bug 
and 6.2.x is a LTS release.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pbchou/trafficserver TS-4475-backport-6.2.x

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafficserver/pull/1026.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1026
    
----
commit 1cc28eb9ca06ae6834a779ee889f8b9c3ab8c2b9
Author: Peter Chou <pbc...@labs.att.com>
Date:   2016-06-24T23:21:05Z

    TS-4475: Modified the event handlers in LogCollationClientSM.cc to
    handle and ignore the VC_EVENT_INACTIVITY_TIMEOUT event. Previously,
    this event was un-handled and would cause a core dump.
    
    (cherry picked from commit 0aefbe5349432e44031401ca6a60535a8516adcb)

commit c4344440715e72fc87fa5c3ad4b788317f013e80
Author: Peter Chou <pbc...@labs.att.com>
Date:   2016-08-03T19:20:45Z

    TS-4475 : Handle VC_EVENT_INACTIVITY_TIMEOUT in LogCollationClientSM.cc
    Based on suggestions from Oknet Xu, we now close the client connection
    on the timeout event instead of ignoring it. Also, we set the
    net_vc associated with the client connection to have a timeout of
    one day or 86400 seconds instead of the default value.
    
    (cherry picked from commit efd3582b92d825649216705521dbe92784a7279b)

commit 569407f970489f3d16b835cc7f990d902e946300
Author: Peter Chou <pbc...@labs.att.com>
Date:   2016-08-05T23:35:09Z

    TS-4475 : Handle VC_EVENT_INACTIVITY_TIMEOUT in LogCollationClientSM.cc
    Based on suggestions from Susan Hinrich, we now also handle the
    VC_EVENT_ACTIVE_TIMEOUT event. In addition, we also perform
    the same clean-up activities performed for the EOS and ERROR events
    in addition to closing the client connection.
    
    (cherry picked from commit 483412986c5875ed2451b72328e8548dfc7ef207)

commit 6f3795631011da4232ebe99c94206168318a9b47
Author: Peter Chou <pbc...@labs.att.com>
Date:   2016-08-09T23:52:47Z

    TS-4475: Fix un-handled event crash in LogCollationClientSM.cc
    
    I modified the comment and used the post-assignment pointer
    (instead of the pre-assignment pointer) to implement an explicit
    inactivity time-out value for Log Collation client net-vcs so
    that they do not use the default defined in the config.
    
    (cherry picked from commit cb2fa3c5159f17dc5a8e75bfaa4e3cacd8251071)

commit a23a620c53c50e1f15e10cbdc3057877697a59ad
Author: Peter Chou <pbc...@labs.att.com>
Date:   2016-08-22T19:56:18Z

    TS-4475: Fix un-handled time-out event crash in Log Collation
    
    1. For collation hosts and clients, handle VC_EVENT_ACTIVE_TIMEOUT
       and VC_EVENT_INACTIVITY_TIMEOUT events and treat them the same
       as other error conditions (EOS and ERROR) including performing
       any clean-up operations.
    
    2. Added two configuration options for log collation VC time-outs --
      - proxy.config.log.collation_host_timeout INT 86390
      - proxy.config.log.collation_client_timeout INT 86400
    
      These options control how long before the host and client ends of
      the log collation net-vc will time-out after being inactive. The
      host end defaults to 10 fewer seconds to avoid any race condition.
      If the host disconnects first, the client will re-establish with
      a fresh connection.
    
    (cherry picked from commit b6ed91805d9769394131b577045fb6331a25f5c8)

commit fb4ad2cdc92f6023b7cf61b8836a3535f3e41a0b
Author: Peter Chou <pbc...@labs.att.com>
Date:   2016-09-01T19:06:24Z

    TS-4475: Fix un-handled time-out event crash in Log Collation
    
    Updated the documenation to explain the difference in host/client
    default time-outs.
    
    (cherry picked from commit 4c03b08851eb55ee92537029fbc99cf760542ed6)

----


Issue Time Tracking
-------------------

    Worklog Id:     (was: 29198)
    Time Spent: 3h 10m  (was: 3h)

> Crash in Log-Collation client after using inactivity-cop.
> ---------------------------------------------------------
>
>                 Key: TS-4475
>                 URL: https://issues.apache.org/jira/browse/TS-4475
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Logging
>    Affects Versions: 6.1.1
>            Reporter: Peter Chou
>             Fix For: sometime
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Background: We recently tried making use of inactivity-cop by setting it to 
> 300s instead of the default one-day setting. This was to address an issue 
> where, under heavy load, ATS would become un-responsive to client requests, 
> and the condition would persist after traffic was stopped with the active 
> queue saying 0 connections but 'netstat -na' showing a bunch of established 
> connections (up to the throttle limit approximately).
> Inactivity cop seemed to help ATS handle this situation, but we have since 
> experienced a couple of core dumps over the last four day period. It seems 
> occasionally the Log Collation Client State Machine will have event value 105 
> or VC_EVENT_INACTIVITY_TIMEOUT, but when it reaches read_signal_and_update() 
> it tries to call the continuation handler which down the line does not know 
> about this event thus causing core dump !"unexpcted state" [sic].
> Here is the back-trace --
> (gdb) bt
> #0  0x00002b67cd5405f7 in raise () from /lib64/libc.so.6
> #1  0x00002b67cd541e28 in abort () from /lib64/libc.so.6
> #2  0x00002b67cb032921 in ink_die_die_die () at ink_error.cc:43
> #3  0x00002b67cb0329da in ink_fatal_va (fmt=0x2b67cb0442dc "%s:%d: failed 
> assert `%s`", ap=0x7ffc690e7ba8) at ink_error.cc:65
> #4  0x00002b67cb032a79 in ink_fatal (message_format=0x2b67cb0442dc "%s:%d: 
> failed assert `%s`") at ink_error.cc:73
> #5  0x00002b67cb0305a6 in _ink_assert (expression=0x7fb422 "!\"unexpcted 
> state\"", file=0x7fb35b "LogCollationClientSM.cc",
>     line=445) at ink_assert.cc:37
> #6  0x000000000069c86b in LogCollationClientSM::client_idle 
> (this=0x2b681400bb00, event=105) at LogCollationClientSM.cc:445
> #7  0x000000000069b427 in LogCollationClientSM::client_handler 
> (this=0x2b681400bb00, event=105, data=0x2b680c017020)
>     at LogCollationClientSM.cc:119
> #8  0x0000000000502cc6 in Continuation::handleEvent (this=0x2b681400bb00, 
> event=105, data=0x2b680c017020)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> #10 0x0000000000787a22 in UnixNetVConnection::mainEvent (this=0x2b680c016f00, 
> event=1, e=0x127ad60) at UnixNetVConnection.cc:1188
> #11 0x0000000000502cc6 in Continuation::handleEvent (this=0x2b680c016f00, 
> event=1, data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #12 0x000000000077d943 in InactivityCop::check_inactivity (this=0x1209a00, 
> event=2, e=0x127ad60) at UnixNet.cc:102
> #13 0x0000000000502cc6 in Continuation::handleEvent (this=0x1209a00, event=2, 
> data=0x127ad60)
>     at ../iocore/eventsystem/I_Continuation.h:153
> #14 0x00000000007a5df6 in EThread::process_event (this=0x2b67cf7bb010, 
> e=0x127ad60, calling_code=2) at UnixEThread.cc:128
> #15 0x00000000007a61f5 in EThread::execute (this=0x2b67cf7bb010) at 
> UnixEThread.cc:207
> #16 0x0000000000534430 in main (argv=0x7ffc690e82e8) at Main.cc:1918
> I believe it takes a wrong turn here --
> #9  0x0000000000783d40 in read_signal_and_update (event=105, 
> vc=0x2b680c016f00) at UnixNetVConnection.cc:150
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> (gdb) list
> 145     static inline int
> 146     read_signal_and_update(int event, UnixNetVConnection *vc)
> 147     {
> 148       vc->recursion++;
> 149       if (vc->read.vio._cont) {
> 150         vc->read.vio._cont->handleEvent(event, &vc->read.vio);
> 151       } else {
> 152         switch (event) {
> 153         case VC_EVENT_EOS:
> 154         case VC_EVENT_ERROR:
> (gdb) list
> 155         case VC_EVENT_ACTIVE_TIMEOUT:
> 156         case VC_EVENT_INACTIVITY_TIMEOUT:
> 157           Debug("inactivity_cop", "event %d: null read.vio cont, closing 
> vc %p", event, vc);
> 158           vc->closed = 1;
> 159           break;
> 160         default:
> 161           Error("Unexpected event %d for vc %p", event, vc);
> 162           ink_release_assert(0);
> 163           break;
> 164         }
> Note: I understand that there were several issues related to TS-3196 
> concerning inactivity_cop and this section of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to