[ 
https://issues.apache.org/jira/browse/TS-3871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susan Hinrichs reassigned TS-3871:
----------------------------------

    Assignee: Susan Hinrichs

> VC Migration Can Lose Events
> ----------------------------
>
>                 Key: TS-3871
>                 URL: https://issues.apache.org/jira/browse/TS-3871
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: HTTP
>            Reporter: Susan Hinrichs
>            Assignee: Susan Hinrichs
>
> Found this in my stress testing.  Sometimes the POST or GET response is 
> completely empty.  No header and no body.  The packet capture shows that ATS 
> closes the connection 70 seconds after the last POST or GET of the connection 
> was received.  This matches the value of 
> proxy.config.http.keep_alive_no_activity_timeout_in on my test box.
> When I moved from the global pool to the local pool, the problem went away.
> I eventually tracked it down to a problem in the epoll update.  ep.start() 
> during the migration would fail sometimes with EEXIST error.  This means that 
> the file descriptor is already associated with the epoll.  If we are 
> migrating from thread A to thread B this should not be the case, unless we 
> went from thread B to thread A and back to thread B without cleaning up the 
> original thread B epoll.  If this is happening, then multiple threads will be 
> processing network events which seems like a recipe for disaster and dropped 
> events.
> Originally, I left the ep.stop(), which removes the descriptor from the 
> original thread's epoll structure, to be done by the original thread.  But 
> under stress that seems to be a bad idea: too much drift.  With some more 
> research, it appears that the epoll calls are thread safe.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-03/msg00084.html
> I rearranged the code to do both the ep.stop() and ep.start() in the 
> migration target thread, and my stress test had no more problems.
> I've run this patch on a production machine for over 12 hours with no crashes 
> and no performance discrepancies.  We will be expanding this testing.
> To repeat, this is not a problem we saw in production, but only in my "make 
> it fall over" stress test.
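
The EEXIST failure and the stop-then-start reordering described above can be reproduced outside of Traffic Server. The following is only a minimal single-threaded sketch, assuming Linux and Python's select.epoll as a stand-in for the real epoll wrappers; start(), stop(), the two migrate_* helpers, and the "A"/"B" labels are hypothetical names for illustration, not ATS code.

```python
import errno
import os
import select


def start(ep, fd):
    """Register fd with an epoll instance (stand-in for ep.start())."""
    ep.register(fd, select.EPOLLIN)


def stop(ep, fd):
    """Remove fd from an epoll instance (stand-in for ep.stop())."""
    ep.unregister(fd)


def migrate_deferred_stop(fd, new_ep):
    """Simplified old behavior: only start() on the target; the stop() on
    the previous owner is left for later. Returns the errno raised, or None."""
    try:
        start(new_ep, fd)
    except OSError as exc:
        return exc.errno
    return None


def migrate_stop_then_start(fd, old_ep, new_ep):
    """Reordered behavior: the migrating side does both stop() and start()."""
    stop(old_ep, fd)
    start(new_ep, fd)


def owners(ep_a, ep_b, fd):
    """Names of the epoll instances that currently report fd as readable."""
    names = []
    for name, ep in (("A", ep_a), ("B", ep_b)):
        if any(ready == fd for ready, _ in ep.poll(0)):
            names.append(name)
    return names


def demo():
    r, w = os.pipe()
    ep_a, ep_b = select.epoll(), select.epoll()
    try:
        # Deferred cleanup: B -> A -> B while B's stop() has not run yet.
        start(ep_b, r)
        first = migrate_deferred_stop(r, ep_a)   # B still holds r
        second = migrate_deferred_stop(r, ep_b)  # r is already on B
        os.write(w, b"x")
        stale = owners(ep_a, ep_b, r)            # who sees the event now?
        os.read(r, 1)                            # drain the pipe

        # Leave r registered on B only, then repeat B -> A -> B with
        # stop() and start() done together on every hop.
        stop(ep_a, r)
        migrate_stop_then_start(r, ep_b, ep_a)
        migrate_stop_then_start(r, ep_a, ep_b)
        os.write(w, b"y")
        clean = owners(ep_a, ep_b, r)
        return {"first": first, "second": second,
                "stale": stale, "clean": clean}
    finally:
        ep_a.close()
        ep_b.close()
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print(demo())
```

In the deferred-cleanup half, the second hop back to B should fail with EEXIST and both epoll instances report the same readable descriptor, which is the "multiple threads processing network events" situation. In the reordered half, only the final owner reports it. The sketch does not model threads or timing, so it shows the registration state only, not the race itself.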



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
