[ 
https://issues.apache.org/jira/browse/TS-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-3959.
--------------------------------
    Resolution: Fixed

> Dropped keep-alive connections not being re-established
> -------------------------------------------------------
>
>                 Key: TS-3959
>                 URL: https://issues.apache.org/jira/browse/TS-3959
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, Network
>    Affects Versions: 6.0.0
>            Reporter: Nick Muerdter
>            Assignee: Thomas Jackson
>            Priority: Blocker
>              Labels: regression
>             Fix For: 7.0.0
>
>         Attachments: trafficserver-closed.png, trafficserver-reset.png
>
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with 
> connection retrying and outgoing keep-alive connections. I believe the 
> changes in behavior might be related to this issue: 
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it 
> sounded more like a regression on the mailing list 
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%3cba85d5a2-8b29-44a9-acdc-e7fa8d21f...@apache.org%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive 
> connections already opened, but then one of the keep-alive connections is 
> closed, the next request to TrafficServer may generate a 502 Server Hangup 
> response when attempting to reuse that connection. Previously, I think 
> TrafficServer was retrying when it encountered a closed keep-alive 
> connection, but that is no longer the case. So if you have a backend that 
> might unexpectedly close its open keep-alive connections, the only way I've 
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing 
> keepalive (proxy.config.http.keep_alive_enabled_out and 
> proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly 
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections 
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular 
> stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer 
> among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think 
> what's happening is that TrafficServer may fail when an old keep-alived 
> connection is reused (it's not common, so it depends on the timing of things 
> and if the connection is from an old nginx worker that has since been shut 
> down). In TrafficServer 5.3.1 these connection failures were retried, but in 
> 6.0.0, no retries occur in this case.
> Here's some debug logs that show the difference in behavior between 6.0.0 and 
> 5.3.1. Note that differences seem to stem from how each version eventually 
> handles the "VC_EVENT_EOS" event following 
> "&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understand the log files correctly, it looks like 
> TraffficServer is reporting an odd empty response from these connections 
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can 
> tell from TCP dumps on the system, nginx is not actually sending any form of 
> response.
> In these example cases the backend server isn't sending back any data (at 
> least as far as I can tell), so from what I understand (and the logic 
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe 
> to retry.
> Let me know if I can provide any other details. Or if exact scripts to 
> reproduce the issues against the example nginx backend I described above 
> would be useful, I could get that together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to