Hi,
In reverse order as it’s easier for me to assemble the answers:
3. It’s definitely not the disk IO on the chassis, as I’ve eliminated that by
using an 8GB file located on a ramdisk as the cache file. The chassis has 24
Intel SSDs in it, and has more than enough capacity to perform as required. I
have a number of these systems, and the IO does not stutter on any of them when
doing either writes or reads with other applications. Presenting large files via
Apache HTTPD straight off the disk arrays, I’m able to get a stable, sustained
36 Gbps with a single HTTP GET.
I’ve also been able to confirm this on different systems, with different
chipsets and both spinning disk and SSD drive configurations. The behaviour
also occurs against different backends, on different networks, with different
Linux distributions (RHEL 5, RHEL 6, and Ubuntu) and different versions of ATS
using every filesystem format I’ve tried. This is what makes me suspect
something else, such as my configuration, rather than the hardware.
2. I can confirm it’s not the network beyond the servers or end points. There
are no firewalls or middle-boxes. It may be the TCP tuning on the server, but
I’m not sure why it would only affect ATS, and not Apache HTTPD, netcat or even
iperf.
I have managed to capture a stalled request and the traffic on both sides of the
ATS server (of note, because the object is in cache and not invalidated, there’s
no HTTP traffic to the backend). The capture is only 34kB, but I’ll provide it
on request rather than emailing it to the list. Please let me know if it’s
wanted.
In summary, what I see on the connection is, after the usual TCP-style timeouts,
repeated retransmissions of the request, with duplicate ACKs, until eventually
the client side gives up and sends a FIN, in this case after 900 seconds.
Successful requests look completely normal at both the TCP and HTTP level.
1. I enabled this, and checked the {http} function. Just by luck the very first
request stalled. I’ve filtered the backend address as these are public
archives. The output read:
• Details for SM id 5013333
• Current State: SERVE_FROM_CACHE
Client Request
GET http://aaa.bbb.ccc.ddd:80/pub/ubuntu/archive/dists/saucy/main/binary-i386/Packages.bz2 HTTP/1.0
User-Agent: Wget/1.11.4 Red Hat modified
Accept: */*
Host: bne-b-ccs1.cdn.aarnet.edu.au
Connection: Keep-Alive
Client Response
HTTP/1.0 200 OK
Date: Tue, 07 Jan 2014 03:10:59 GMT
Server: ATS/4.1.2
Last-Modified: Thu, 17 Oct 2013 12:37:14 GMT
ETag: "16e217c7-12e30e-4e8ef13cef280"
Accept-Ranges: bytes
Content-Length: 1237774
Content-Type: application/x-bzip2
Age: 73027
Connection: keep-alive
Tunneling Info
Producers
  cache read    1    1048504    1237774
Consumers
  user agent    1    101360     1238052    947422
History
  HttpSM.cc:518         0        1
  HttpSM.cc:678         100      1
  HttpSM.cc:1309        60000    2
  HttpSM.cc:1350        60000    2
  HttpSM.cc:6919        65535    2
  HttpSM.cc:4308        35728    2
  HttpCacheSM.cc:108    1102     -1
  HttpSM.cc:2447        1102     1
  HttpSM.cc:5685        65535    1
Of interest, in the time it took for the request to time out client side, the
cache read never incremented past 1048504. To once again eliminate the SSDs on
the system, I set up another 8GB cache file on a tmpfs/ramdisk mount (the server
has 256GB of RAM), and I get the same behaviour.
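The tunnel figures above can be cross-checked with a little arithmetic. The
sketch below assumes, without confirmation from the ATS source, that the numeric
columns are an alive flag, bytes done and bytes total, and that the extra
"user agent" value is the data buffered for the client. Under that reading, the
gap between the two totals equals the size of the response headers shown, and
the buffered figure is exactly the data read from cache but not yet written to
the client.

```python
# Figures copied from the http_ui dump. The column meanings are an assumption
# (alive flag, bytes done, bytes total, bytes buffered), not confirmed.
producer_done, producer_total = 1048504, 1237774   # "cache read" row
consumer_done, consumer_total = 101360, 1238052    # "user agent" row
consumer_buffered = 947422                         # last "user agent" value

# Response headers exactly as shown in the dump; each line ends in CRLF.
response_headers = [
    "HTTP/1.0 200 OK",
    "Date: Tue, 07 Jan 2014 03:10:59 GMT",
    "Server: ATS/4.1.2",
    "Last-Modified: Thu, 17 Oct 2013 12:37:14 GMT",
    'ETag: "16e217c7-12e30e-4e8ef13cef280"',
    "Accept-Ranges: bytes",
    "Content-Length: 1237774",
    "Content-Type: application/x-bzip2",
    "Age: 73027",
    "Connection: keep-alive",
]
# +2 per line for CRLF, +2 for the blank line that ends the header block.
header_bytes = sum(len(line) + 2 for line in response_headers) + 2

# Data read from cache (plus headers) that has not been written to the client.
unsent = producer_done + header_bytes - consumer_done
# Body bytes the cache read has still not produced.
unread = producer_total - producer_done

print("header bytes:", header_bytes)
print("totals differ by header size:", consumer_total - producer_total == header_bytes)
print("unsent matches buffered value:", unsent == consumer_buffered)
print("body bytes not yet read from cache:", unread)
```

If that reading of the columns is right, the tunnel is holding roughly 947 kB
that the client never received. That would be consistent with the
retransmissions seen in the capture, though it does not by itself say why the
write side stopped.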
I repeated the test a few times, especially to grab a full TCP dump of a
stalled session. I noticed that if I didn’t restart ATS between tests the
http_ui module was reporting multiple stalled sessions, all in the state
SERVE_FROM_CACHE with the same cache read values. It didn’t matter if I used a
ramdisk cache or the SSD cache. In all cases, the http_ui reported identical
numbers in the producers and consumers sections.
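For anyone trying to reproduce this, a repeatable test client helps more than
waiting 900 seconds for wget to give up. The following is a minimal sketch, not
the tool I used: the function names and the 30-second threshold are
illustrative, and it simply treats "no bytes for N seconds" as a stall.

```python
import socket


def watch_for_stall(sock, stall_seconds):
    """Read from a connected socket until EOF or until it goes quiet.

    Returns (bytes_received, stalled). stalled is True when no data
    arrived for stall_seconds before the peer closed the connection.
    """
    sock.settimeout(stall_seconds)
    received = 0
    while True:
        try:
            chunk = sock.recv(65536)
        except socket.timeout:
            return received, True
        if not chunk:  # orderly close by the peer
            return received, False
        received += len(chunk)


def fetch_and_watch(host, port, path, stall_seconds=30.0, host_header=None):
    """Send a plain HTTP/1.0 GET and watch the response for a stall.

    host, port and path are whatever the proxy under test expects; the
    byte count returned includes the response headers.
    """
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host_header or host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=stall_seconds) as sock:
        sock.sendall(request)
        return watch_for_stall(sock, stall_seconds)
```

Calling `fetch_and_watch()` in a loop against the proxy address and object path,
and logging the byte count whenever `stalled` is true, gives a figure that can
be compared directly with the "user agent" row in http_ui.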
From: Yongming Zhao [mailto:[email protected]]
Sent: Tuesday, 7 January 2014 5:28 PM
To: [email protected]
Subject: Re: Stalling ATS 4.1.2 and Ubuntu archives
David:
Can you help us get more detail on why it is stalled? Here are some ideas:
1. On the server side, ATS has an http_ui cache inspector with an http module
that can list all the in-flight HTTP transactions. If some connections are
stalled, you can find their details there.
https://cwiki.apache.org/confluence/display/TS/FAQ#FAQ-http_ui is the FAQ on
how to enable http_ui; please use the {http} function.
2. This is more likely a network-related issue. Can you dump some of the
connections with tcpdump or Wireshark and outline the detailed messages? I
think we are interested in both the client and server sides.
3. Please check the IO stats on the disks and elsewhere while you are checking
the problem. We need to identify which part of ATS to review; at the least,
we can confirm that IO is not what is troubling us.
This is a very interesting problem for us. Please help us investigate it and we
will try to reproduce it. Thanks.