[ 
https://issues.apache.org/jira/browse/TS-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254096#comment-15254096
 ] 

Susan Hinrichs commented on TS-4372:
------------------------------------

Tracked down the problem to the active_queue getting corrupted by unintentional 
multi-threaded activity.  I put in asserts that vc->thread matched the current 
thread whenever the active_queue manipulation methods were called.  The assert 
fired very quickly from an Http2 call via FetchSM.

I then moved to 6.2.x, which includes TS-3612 to eliminate FetchSM, and the 
assert no longer fires and I don't see the heartbeat failures.  However, on 
that build the number of sockets grows.  I assume that we are missing 
inactivity timeouts for client-side connections, and some clients don't 
initiate the close for a very long time.  I ran with Http2 disabled (and SPDY 
not built), so the leak occurs with Http1.x-only traffic as well.

I must move on to other things today, so I'm going to reinstall the 6.1 build 
and disable Http2 and SPDY to verify that they were the cause of the 
multi-threading.

[~bcall] if you have some spare cycles, could you review the 
keep_alive_queue/active_queue logic on 6.2.x?  Perhaps I messed things up with 
the TS-3612 integration.

> Traffic server heart beat fails with 6.1
> ----------------------------------------
>
>                 Key: TS-4372
>                 URL: https://issues.apache.org/jira/browse/TS-4372
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cop, Manager
>            Reporter: Susan Hinrichs
>            Assignee: Susan Hinrichs
>         Attachments: ts-4372-example.pcap
>
>
> When running 6.1 in a loaded production environment, traffic server will run 
> for a while (30 minutes or so), then server heartbeats will start failing 
> intermittently.  Eventually two will fail in a row, causing traffic_cop to 
> restart traffic_server (or traffic_manager and then traffic_server; I'm still 
> a bit unclear there).
> {code}
> traffic_cop[18078]: (test) read failed [104 'Connection reset by peer']
> {code}
> There are no particular resource limitations on the production machine in 
> this state.  The number of open sockets is around 50-60K, which is consistent 
> with its 5.3.x peer.  The memory usage is nowhere near the limit.  The CPU 
> usage is high, but again, not near the limit (perhaps half the entire machine 
> usage).
> If we look at the packets exchanged on the loopback interface during this 
> heartbeat failing interval, we see some interesting things.  I'll attach an 
> example pcap file.  The interesting traffic is on ports 8084 and 8083.  
> Traffic_cop sends a GET http://127.0.0.1:8083/synthetic.txt request to 
> traffic_server over port 8084.  Traffic server should proxy the request and 
> send the request GET /synthetic.txt to traffic_manager listening on port 8083.  
> Traffic manager returns a 200 response with some data.  Traffic_server relays 
> that response to traffic_cop.
> However, in the failure cases, traffic_cop sends the request and 
> traffic_manager sends a RESET after the connection has been established and 
> the request has been sent to it.  I'm guessing that there is logic in 
> traffic_server that closes the socket before reading the GET request, causing 
> the reset to be sent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
