[ https://issues.apache.org/jira/browse/TS-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278549#comment-15278549 ]
Thomas Jackson commented on TS-3959:
------------------------------------

The check for "is a request retryable" has been changed to what [~amc] mentioned in the ticket (not retryable if bytes were sent to the origin). So I would expect this issue to be fixed by https://issues.apache.org/jira/browse/TS-4328. [~nickm], would you be able to apply the patch and see if the issue is resolved for you? If so, it will be fixed in 6.2.

> Dropped keep-alive connections not being re-established
> -------------------------------------------------------
>
>                 Key: TS-3959
>                 URL: https://issues.apache.org/jira/browse/TS-3959
>             Project: Traffic Server
>          Issue Type: Bug
>    Affects Versions: 6.0.0
>            Reporter: Nick Muerdter
>            Assignee: Alan M. Carroll
>            Priority: Blocker
>              Labels: regression
>             Fix For: 7.0.0
>
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with
> connection retrying and outgoing keep-alive connections. I believe the
> changes in behavior might be related to this issue:
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it
> sounded more like a regression on the mailing list
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%3cba85d5a2-8b29-44a9-acdc-e7fa8d21f...@apache.org%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive
> connections already opened, but then one of the keep-alive connections is
> closed, the next request to TrafficServer may generate a 502 Server Hangup
> response when attempting to reuse that connection. Previously, I think
> TrafficServer was retrying when it encountered a closed keep-alive
> connection, but that is no longer the case.
> So if you have a backend that
> might unexpectedly close its open keep-alive connections, the only way I've
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing
> keep-alive (proxy.config.http.keep_alive_enabled_out and
> proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular
> stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer
> among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think
> what's happening is that TrafficServer may fail when an old keep-alived
> connection is reused (it's not common, so it depends on the timing of things
> and if the connection is from an old nginx worker that has since been shut
> down). In TrafficServer 5.3.1 these connection failures were retried, but in
> 6.0.0, no retries occur in this case.
> Here are some debug logs that show the difference in behavior between 6.0.0
> and 5.3.1. Note that the differences seem to stem from how each version
> eventually handles the "VC_EVENT_EOS" event following
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0:
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like
> TrafficServer is reporting an odd empty response from these connections
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0).
> However, as far as I can
> tell from TCP dumps on the system, nginx is not actually sending any form of
> response.
> In these example cases the backend server isn't sending back any data (at
> least as far as I can tell), so from what I understand (and the logic
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe
> to retry.
> Let me know if I can provide any other details. Or if exact scripts to
> reproduce the issues against the example nginx backend I described above
> would be useful, I could get that together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
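
For anyone wanting to try the reproduction described in the issue, the steps (constant request stream through TrafficServer while nginx is repeatedly sent SIGHUP) could be sketched roughly as below. This is a hypothetical script, not the reporter's: the function names, the proxy URL, the nginx master PID, the request count and the reload interval are all assumptions. It also assumes the backend response is not served from the TrafficServer cache, and that the script has permission to signal the nginx master process.

```python
import collections
import http.client
import os
import signal
import threading
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status of one GET, or None on a connection-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including a 502 from the proxy) arrive here.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None


def reload_loop(nginx_pid, interval, stop):
    """Send SIGHUP to the nginx master process until `stop` is set."""
    while not stop.is_set():
        os.kill(nginx_pid, signal.SIGHUP)
        stop.wait(interval)


def summarize(statuses):
    """Tally status codes; None stands for a connection-level failure."""
    return collections.Counter(statuses)


def run(proxy_url, nginx_pid, total_requests=5000, reload_interval=1.0):
    """Issue requests through the proxy while nginx is being reloaded."""
    stop = threading.Event()
    reloader = threading.Thread(
        target=reload_loop, args=(nginx_pid, reload_interval, stop), daemon=True
    )
    reloader.start()
    try:
        statuses = [fetch_status(proxy_url) for _ in range(total_requests)]
    finally:
        stop.set()
        reloader.join()
    return summarize(statuses)
```

Calling something like `run("http://127.0.0.1:8080/", nginx_pid)` (both values are placeholders) returns a tally of status codes. A non-zero count under `502` would be consistent with the behavior reported for 6.0.0, though the timing-dependent nature of the issue means a clean run does not prove the problem is absent.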