Hi,

In reverse order as it’s easier for me to assemble the answers:

3. It’s definitely not the disk IO on the chassis, as I’ve eliminated that by 
using an 8GB file located on a ramdisk as the cache file. The chassis has 24 
Intel SSDs in it, and has more than enough capacity to perform as required. I 
have a number of these systems, and the IO does not stutter on any of them when 
doing either write or read with other applications. Presenting large files via 
Apache HTTPD straight off the disk arrays, I’m able to get a stable 
long-duration 36 Gbps with a single HTTP GET.

I’ve also been able to confirm this on different systems, with different 
chipsets and both spinning disk and SSD drive configurations. The behaviour 
also occurs against different backends, on different networks, with different 
Linux distributions (RHEL 5, RHEL 6, and Ubuntu) and different versions of ATS 
using every filesystem format I’ve tried. This makes me suspect something 
other than the hardware, such as my configuration.

2. I can confirm it’s not the network beyond the servers or endpoints. There 
are no firewalls or middle-boxes. It may be the TCP tuning on the server, but 
I’m not sure why it would only affect ATS and not Apache HTTPD, netcat, or 
even iperf.
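
For reference, the sort of raw-TCP baseline I compare against looks roughly 
like this (the hostname and port here are just placeholders):

```shell
# On the ATS chassis, start an iperf server:
iperf -s -p 5001

# From the client, run a 60-second single-stream test, reporting
# every 5 seconds, to rule out the network path and TCP stack:
iperf -c server.example.com -p 5001 -t 60 -i 5
```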

I have managed to capture a stalled request and the traffic on both sides of 
the ATS server (of note: because the object is in cache and not invalidated, 
there’s no HTTP traffic to the backend). It’s only 34kB, but I’ll provide it 
on request rather than emailing it to the list. Please let me know if it’s 
wanted.

In summary, what I see in the connection is, after the usual TCP-style 
timeouts, repeated retransmissions of the request with duplicate ACKs, until 
the client side eventually gives up and sends a FIN, in this case after 900 
seconds.

Successful requests look completely normal at a TCP and HTTP level.
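
If anyone wants to check the capture the same way, the stalled flow shows up 
with the standard Wireshark analysis filters (the interface and filenames 
below are examples):

```shell
# Capture full packets on the client-facing interface:
tcpdump -i eth0 -s 0 -w ats-stall.pcap port 80

# List the retransmissions and duplicate ACKs in the capture:
tshark -r ats-stall.pcap -Y "tcp.analysis.retransmission"
tshark -r ats-stall.pcap -Y "tcp.analysis.duplicate_ack"

# Confirm the client-side FIN that finally ends the session:
tshark -r ats-stall.pcap -Y "tcp.flags.fin == 1"
```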

1. I enabled this, and checked the {http} function. Just by luck the very first 
request stalled. I’ve filtered the backend address as these are public 
archives. The output read:


Details for SM id 5013333
Current State: SERVE_FROM_CACHE
Client Request
GET 
http://aaa.bbb.ccc.ddd:80/pub/ubuntu/archive/dists/saucy/main/binary-i386/Packages.bz2
 HTTP/1.0
User-Agent: Wget/1.11.4 Red Hat modified
Accept: */*
Host: bne-b-ccs1.cdn.aarnet.edu.au
Connection: Keep-Alive

Client Response
HTTP/1.0 200 OK
Date: Tue, 07 Jan 2014 03:10:59 GMT
Server: ATS/4.1.2
Last-Modified: Thu, 17 Oct 2013 12:37:14 GMT
ETag: "16e217c7-12e30e-4e8ef13cef280"
Accept-Ranges: bytes
Content-Length: 1237774
Content-Type: application/x-bzip2
Age: 73027
Connection: keep-alive

Tunneling Info
  Producers
    cache read    1    1048504    1237774
  Consumers
    user agent    1    101360     1238052    947422

History
  HttpSM.cc:518          0      1
  HttpSM.cc:678          100    1
  HttpSM.cc:1309         60000  2
  HttpSM.cc:1350         60000  2
  HttpSM.cc:6919         65535  2
  HttpSM.cc:4308         35728  2
  HttpCacheSM.cc:108     1102   -1
  HttpSM.cc:2447         1102   1
  HttpSM.cc:5685         65535  1

Of interest, in the time it took for the request to time out client side, the 
cache read never incremented past 1048504. To once again eliminate the SSDs, 
I set up another 8GB cache file on a tmpfs/ramdisk mount (the server has 
256GB of RAM), and I get the same behaviour.
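
For completeness, the ramdisk cache was set up along these lines (the paths 
are examples; check the storage.config documentation for your version):

```shell
# Mount a tmpfs and pre-allocate an 8GB cache file on it:
mkdir -p /mnt/ats-ramcache
mount -t tmpfs -o size=9g tmpfs /mnt/ats-ramcache
dd if=/dev/zero of=/mnt/ats-ramcache/cache.db bs=1M count=8192

# Point ATS at the file in storage.config (path and size in bytes),
# replacing the normal disk entries:
echo "/mnt/ats-ramcache/cache.db 8589934592" >> /etc/trafficserver/storage.config

# Restart so the new cache span is initialised:
service trafficserver restart
```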

I repeated the test a few times, especially to grab a full TCP dump of a 
stalled session. I noticed that if I didn’t restart ATS between tests the 
http_ui module was reporting multiple stalled sessions, all in the state 
SERVE_FROM_CACHE with the same cache read values. It didn’t matter if I used 
a ramdisk cache or the SSD cache. In all cases, the http_ui reported 
identical numbers in the producers and consumers sections.

From: Yongming Zhao [mailto:[email protected]]
Sent: Tuesday, 7 January 2014 5:28 PM
To: [email protected]
Subject: Re: Stalling ATS 4.1.2 and Ubuntu archives

David:
 Can you help us get more detail on why it is stalled? Here are some ideas:
1. On the server side, ATS has an http_ui cache inspector with an http 
module that can outline all the working HTTP transactions; if some 
connections are stalled, you can find their details.
  https://cwiki.apache.org/confluence/display/TS/FAQ#FAQ-http_ui is the FAQ 
on how to enable http_ui; please use the {http} functions.
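
Roughly, enabling it amounts to the following (the path and hostname are 
just examples; please check the FAQ page above for the exact settings in 
your version):

```shell
# In records.config, enable the inspector UI:
#   CONFIG proxy.config.http_ui_enabled INT 1
#
# In remap.config, map a URL onto the internal {http} function:
#   map http://your-cache-host/myhttpui http://{http}

# Reload the configuration, then browse to /myhttpui:
traffic_line -x
```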

2. This is more likely a network-related issue. Can you help us dump some of 
the connections with tcpdump/wireshark and outline the detailed messages? I 
think we are interested in both the client and server sides.

3. Please help us check the IO stats on the disks and elsewhere while you 
investigate, so we can identify which part of ATS we need to review; at the 
least we can confirm the disks are not a factor.

This is a very interesting problem for us; please help us investigate it and 
we will try to reproduce it. Thanks.

