Hi Folks,
In a 2-tier CDN production environment using ATS, we recently had a series of
issues that I want to reach out to the community for help.
The symptoms usually shows up on edge caches as:
1) elevated CPU - noticeable, but not 'through the roof'
2) unusually high ingress bw, often order of magnitudes higher than
egress. i.e. edge caches is pulling a lot of data from mids but not sending
much to end clients. We often saw 3-4Gbps ingress vs. 500Mbps egress. The edge
cache uses separate network interfaces - the ingress traffic is mainly from
upstream interface.
3) huge number of TIME_WAIT sockets on edge cache, 70-80K, all of them
destined to mid. This number stays at that level until we kill/restart ATS
process. It looks like the edge is open/close sockets to mid, as indicated by
the debug log.
4) debug log shows SM 702933367 stuck in the loop repeating the
following, over and over and over. The regular transaction log, however, does
not have an entry for the url this SM is handling. So it feels like the the
stuck loop runs in the ‘background’, GET the same url, without caching or
sending to client.
...
[Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State
Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
[Sep 22 03:54:23.135] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open
connection to x.x.x.x 172.17.173.103:80
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[HttpSM::main_handler, NET_EVENT_OPEN]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_http_server_open, NET_EVENT_OPEN]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling
plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.137] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.187] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[HttpSM::main_handler, VC_EVENT_READ_READY]
[Sep 22 03:54:23.226] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
[Sep 22 03:54:23.227] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling
plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
[Sep 22 03:54:23.231] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.232] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367]
[HttpTransact::HandleResponse]
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367]
[HttpTransact::HandleResponse] Response received
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367]
[HandleResponse] response_received_time: 1600746863
[Sep 22 03:54:23.233] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367]
[is_response_valid] No errors in response
[Sep 22 03:54:23.234] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367]
[HttpTransact::HandleResponse] Response valid
[Sep 22 03:54:23.235] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367]
[initialize_state_variables_from_response]Server is keep-alive.
[Sep 22 03:54:23.236] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] State
Transition: SM_ACTION_ORIGIN_SERVER_OPEN -> SM_ACTION_ORIGIN_SERVER_OPEN
[Sep 22 03:54:23.240] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] open
connection to x.x.x.x 172.17.173.103:80
[Sep 22 03:54:23.241] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[HttpSM::main_handler, NET_EVENT_OPEN]
[Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_http_server_open, NET_EVENT_OPEN]
[Sep 22 03:54:23.242] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling
plugin on hook TS_HTTP_SEND_REQUEST_HDR_HOOK at hook 0x2b7e844b49e0
[Sep 22 03:54:23.243] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.244] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[HttpSM::main_handler, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.281] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE]
[Sep 22 03:54:23.285] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[HttpSM::main_handler, VC_EVENT_READ_READY]
[Sep 22 03:54:23.286] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_read_server_response_header, VC_EVENT_READ_READY]
[Sep 22 03:54:23.288] Server {0x2b7e51d2d700} DEBUG: (http) [702933367] calling
plugin on hook TS_HTTP_READ_RESPONSE_HDR_HOOK at hook 0x2b7e844b4a60
[Sep 22 03:54:23.289] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http) [702933367]
[&HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367]
[HttpTransact::HandleResponse]
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367]
[HttpTransact::HandleResponse] Response received
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367]
[HandleResponse] response_received_time: 1600746863
[Sep 22 03:54:23.290] Server {0x2b7e51d2d700} DEBUG: (http_trans) [702933367]
[is_response_valid] No errors in response
[Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_seq) [702933367]
[HttpTransact::HandleResponse] Response valid
[Sep 22 03:54:23.292] Server {0x2b7e51d2d700} DEBUG: (http_hdrs) [702933367]
[initialize_state_variables_from_response]Server is keep-alive.
…
5) the transaction log on the mid confirms there were request storms
for the same url from a the edge, but all resulted in crc=ERR_CLIENT_ABORT,
cfsc=INTR.
Our initial investigation confirmed the this traffic storm situation usually
follows a repeated series of parent proxy (mid tier cache) loss and recovery
events - recovery is usually 10 seconds after initial mark down, and the
loss/recover repeats 3 or 4 times in about a minute.
The ATS is 7.1.5 based, process running on bare metal server with Centos 7.7.
The only way to recover today is to restart ATS process. Obviously this is
service impacting.
Was this problem reported before? Any verdict on root cause? Debugging tip or
pointers to resolution?
Thanks,
-Hongfei