[ https://issues.apache.org/jira/browse/TS-3871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Susan Hinrichs reassigned TS-3871:
----------------------------------

    Assignee: Susan Hinrichs

> VC Migration Can Lose Events
> ----------------------------
>
>                 Key: TS-3871
>                 URL: https://issues.apache.org/jira/browse/TS-3871
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: HTTP
>            Reporter: Susan Hinrichs
>            Assignee: Susan Hinrichs
>
> Found this in my stress testing. Sometimes the POST or GET response is
> completely empty. No header and no body. The packet capture shows that ATS
> closes the connection 70 seconds after the last POST or GET of the connection
> was received. This corresponds to the
> proxy.config.http.keep_alive_no_activity_timeout_in setting on my test box.
> I moved from global pool to local pool and the problem went away.
> I eventually tracked it down to a problem in the epoll update. ep.start()
> during the migration would sometimes fail with an EEXIST error. This means
> that the file descriptor is already associated with the epoll. If we are
> migrating from thread A to thread B this should not be the case, unless we
> went from thread B to thread A and back to thread B without cleaning up the
> original thread B epoll. If this is happening, then multiple threads will be
> processing network events for the same connection, which seems like a recipe
> for disaster and dropped events.
> Originally, I left the ep.stop(), which clears the file descriptor from the
> original thread's epoll structure, to be done by the original thread. But
> under stress that seems to be a bad idea. Too much drift. With some more
> research, it appears that the epoll calls are thread safe.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2006-03/msg00084.html
> I rearranged the code to do both the ep.stop() and ep.start() in the same
> migrating target thread, and my stress test had no more problems.
> I've run this patch on a production machine for over 12 hours with no crashes
> and no performance discrepancies. We will be expanding this testing.
> To repeat, this is not a problem we saw in production, but only in my "make
> it fall over" stress test.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
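
The EEXIST failure described above can be reproduced outside ATS with raw epoll calls. What follows is only a minimal sketch, assuming Linux. The names ep_start, ep_stop, migrate, and round_trip are hypothetical stand-ins for ep.start(), ep.stop(), and the migration step; they are not the Traffic Server code or the actual patch. Two epoll instances stand in for the per-thread epoll structures of threads A and B.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// Hypothetical stand-in for ep.start(): register fd with an epoll instance.
// Returns 0 on success, otherwise the errno value from epoll_ctl().
int ep_start(int epfd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0 ? 0 : errno;
}

// Hypothetical stand-in for ep.stop(): remove fd from an epoll instance.
int ep_stop(int epfd, int fd) {
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, nullptr) == 0 ? 0 : errno;
}

// Migration as described in the report: the target thread performs both the
// removal from the old epoll and the registration with its own epoll.
int migrate(int from_epfd, int to_epfd, int fd) {
    int rc = ep_stop(from_epfd, fd);
    if (rc != 0 && rc != ENOENT) {
        return rc;  // ENOENT only means the fd was not registered there
    }
    return ep_start(to_epfd, fd);
}

// Move one fd B -> A -> B and return the result of the final registration.
// With clean_up == false the old epoll is never cleared, which models the
// deferred ep.stop() not having run yet on the original thread.
int round_trip(bool clean_up) {
    int ep_a = epoll_create1(0);
    int ep_b = epoll_create1(0);
    int p[2];
    int ok = pipe(p);
    assert(ep_a >= 0 && ep_b >= 0 && ok == 0);
    int fd = p[0];

    int rc = ep_start(ep_b, fd);  // the connection starts on thread B
    if (rc == 0) {                // B -> A
        rc = clean_up ? migrate(ep_b, ep_a, fd) : ep_start(ep_a, fd);
    }
    if (rc == 0) {                // A -> B
        rc = clean_up ? migrate(ep_a, ep_b, fd) : ep_start(ep_b, fd);
    }

    close(p[0]);
    close(p[1]);
    close(ep_a);
    close(ep_b);
    return rc;
}
```

In this sketch round_trip(false) returns EEXIST, because the fd is still registered with B's epoll when it comes back, while round_trip(true) returns 0. The kernel allows one fd to sit in several epoll instances at once, which is why the stale registration goes unnoticed until the second add.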