[ https://issues.apache.org/jira/browse/TS-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Susan Hinrichs updated TS-4372:
-------------------------------
    Attachment: ts-4372-example.pcap

ts-4372-example.pcap contains the packets exchanged between traffic_server, traffic_cop, and traffic_manager during the heartbeat checks. Look for tcp.flags.reset == 1 to find the failure cases.

> Traffic server heart beat fails with 6.1
> ----------------------------------------
>
>                 Key: TS-4372
>                 URL: https://issues.apache.org/jira/browse/TS-4372
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cop, Manager
>            Reporter: Susan Hinrichs
>            Assignee: Susan Hinrichs
>         Attachments: ts-4372-example.pcap
>
>
> When running 6.1 in a loaded production environment, traffic_server will run
> for a while (30 minutes or so), then server heartbeats will start failing
> intermittently. Eventually two will fail in a row, causing traffic_cop to
> restart traffic_server (or traffic_manager and then traffic_server; I'm still
> a bit unclear there).
> {code}
> traffic_cop[18078]: (test) read failed [104 'Connection reset by peer']
> {code}
> There are no particular resource limitations on the production machine in
> this state. The number of open sockets is around 50-60K, which is consistent
> with its 5.3.x peer. The memory usage is nowhere near the limit. The CPU
> usage is high, but again, not near the limit (perhaps half the entire machine
> usage).
> If we look at the packets exchanged on the loopback interface during this
> heartbeat failing interval, we see some interesting things. I'll attach an
> example pcap file. The interesting traffic is on ports 8084 and 8083.
> Traffic_cop sends a GET http://127.0.0.1:8083/synthetic.txt request to
> traffic_server over port 8084. Traffic_server should proxy the request and
> send the request GET /synthetic.txt to traffic_manager listening on port 8083.
> Traffic_manager returns a 200 response with some data. Traffic_server relays
> that response to traffic_cop.
> However, in the failure cases, traffic_cop sends the request and
> traffic_manager sends a RESET after the connection has been established and
> the request has been sent to it. I'm guessing that there is logic in
> traffic_server that closes the socket before reading the GET request, causing
> the reset to be sent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
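
For anyone who wants to reproduce the heartbeat exchange described in the issue without waiting for traffic_cop, here is a minimal Python sketch of the same probe. It is not traffic_cop's actual code. The function names, the HTTP/1.0 request line, the timeout, and the returned status strings are all assumptions made for illustration; only the ports and the synthetic.txt URL come from the description above.

```python
import socket


def build_request(target_url: str) -> bytes:
    # Absolute-form request target, as a client talking to a forward proxy
    # would send it. HTTP/1.0 is an assumption; it makes the peer close the
    # connection after the response, which keeps the read loop simple.
    return f"GET {target_url} HTTP/1.0\r\n\r\n".encode("ascii")


def probe_heartbeat(proxy_host: str = "127.0.0.1",
                    proxy_port: int = 8084,
                    target_url: str = "http://127.0.0.1:8083/synthetic.txt",
                    timeout: float = 5.0) -> str:
    """Send one heartbeat-style request and classify the outcome.

    Returns "ok" for a 200 response, "reset" if the peer reset the
    connection (errno 104 on Linux), "bad response" for anything else
    that was read, or "error: ..." for other socket failures.
    """
    try:
        with socket.create_connection((proxy_host, proxy_port),
                                      timeout=timeout) as sock:
            sock.sendall(build_request(target_url))
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # peer closed the connection cleanly
                    break
                chunks.append(data)
    except ConnectionResetError:
        # Same condition traffic_cop logs as "Connection reset by peer".
        return "reset"
    except OSError as exc:  # refused, timed out, and so on
        return f"error: {exc}"

    status_line = b"".join(chunks).split(b"\r\n", 1)[0]
    parts = status_line.split()
    if len(parts) >= 2 and parts[1] == b"200":
        return "ok"
    return "bad response"
```

Calling `probe_heartbeat()` in a loop on the affected host while capturing on the loopback interface should make it easier to line up a "reset" result with the matching RST packet in the capture.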