Re: Questions about OSD recovery

2012-02-09 Thread Tommi Virtanen
On Wed, Feb 8, 2012 at 19:14, Josh Durgin josh.dur...@dreamhost.com wrote:
 It's possible to do what the current repair code does
 automatically, but this would be a bad idea since it just takes
 the first osd (with primary before replicas) to have the object
 as authoritative, and copies it to all the relevant osds. If the
 primary has a corrupt copy, this corruption will spread to other
 osds. In your case, since you removed the object entirely, repair
 could correct it.

At the risk of saying the obvious... if you have >=3 copies, you could
hash them all, and let the majority decide which is the good copy.

An admin could do this manually, just deleting the bad one and letting
scrub repair it, and later on we might be able to automate it.

I'm not sure if Dynamo's/Cassandra's anti-entropy feature does this,
or if it's a simple master overwrites slaves, and I realize the
multi-party communication is sort of hard to coordinate, but it's
definitely possible. I loves me some Merkle trees.

Of course, there might be cases where e.g. all 3 replicas have
different content.
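(To make the idea concrete, here's a minimal sketch of majority-by-hash,
assuming you can already fetch each replica's bytes somehow -- the fetch
helper and the repair plumbing are left out, so this is illustration only,
not the actual scrub/repair code:)

import hashlib
from collections import Counter

def pick_authoritative(replicas):
    # replicas: dict of osd id -> object bytes (however you fetched them)
    hashes = {osd: hashlib.sha1(data).hexdigest() for osd, data in replicas.items()}
    winner, votes = Counter(hashes.values()).most_common(1)[0]
    if votes <= len(replicas) // 2:
        return None  # no majority, e.g. all 3 replicas differ -- punt to an admin
    bad = [osd for osd, h in hashes.items() if h != winner]
    return winner, bad

# e.g. pick_authoritative({0: b"good", 1: b"good", 2: b"bad"}) -> (sha1 of "good", [2])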

In many ways, getting a hash stored alongside the object is
significantly better, and might be a better route to go -- our objects
are big enough, as opposed to typical Dynamo/Cassandra cells that are
often smaller than a sha1.
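(And a toy illustration of the stored-hash variant, using a plain file plus
a Linux xattr purely to show the shape of the idea -- not how the OSD's
object store actually keeps attributes; needs Python 3 on Linux:)

import hashlib, os

def write_with_checksum(path, data):
    with open(path, "wb") as f:
        f.write(data)
    # keep the content hash right next to the data
    os.setxattr(path, "user.sha1", hashlib.sha1(data).hexdigest().encode())

def verify(path):
    with open(path, "rb") as f:
        data = f.read()
    return os.getxattr(path, "user.sha1").decode() == hashlib.sha1(data).hexdigest()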


Re: can't read/write after inserting new crushmap

2012-02-09 Thread Josh Durgin

On 02/08/2012 01:58 AM, huang jun wrote:

hi, all
 We tested with 8 OSDs grouped into 2 racks, each rack with 4 OSDs.
 We wrote a crushmap file and imported it into the ceph cluster (crushmap
file attached).
 All PGs were distributed following the crush rule.

Then one group (4 OSDs) was powered off, and ceph -w shows:
  2012-02-08 17:03:57.518285   pg v633: 1584 pgs: 1092 active, 490 active+clean+degraded, 2 degraded+peering; 3992 MB data, 6665 MB used, 4642 GB / 4657 GB avail; 349/2040 degraded (17.108%)
  2012-02-08 17:03:57.520698   mds e4: 1/1/1 up {0=0=up:active}
  2012-02-08 17:03:57.520729   osd e86: 8 osds: 4 up, 4 in
  2012-02-08 17:03:57.521199   log 2012-02-08 15:26:21.761073 mon0 192.168.0.116:6789/0 27 : [INF] osd7 out (down for 304.299392)
  2012-02-08 17:03:57.521249   mon e1: 1 mons at {0=192.168.0.116:6789/0}
Two PGs are in the degraded+peering state, and it seems they never go to the
normal active+clean state. (We use ceph v0.35; maybe that matters.)

 Check the pg dump output:
 2.1p3  0  0  0  0  0  0  0  0  active  0'0  4'120  [3]  [3,0]  0'0
 2.0p2  0  0  0  0  0  0  0  0  active  0'0  3'119  [2]  [2,0]  0'0
 0.1p1  0  0  0  0  0  0  0  0  active  0'0  3'103  [1]  [1,0]  0'0
 0.0p0  0  0  0  0  0  0  0  0  active  0'0  2'126  [0]  [0,1]  0'0
 1.1p0  0  0  0  0  0  0  0  0  active  0'0  2'123  [0]  [0,2]  0'0
 1.0p1  0  0  0  0  0  0  0  0  active  0'0  3'102  [1]  [1,0]  0'0
 2.0p3  0  0  0  0  0  0  0  0  active  0'0  4'122  [3]  [3,0]  0'0
 2.1p2  0  0  0  0  0  0  0  0  active  0'0  3'122  [2]  [2,0]  0'0
 0.0p1  0  0  0  0  0  0  0  0  active  0'0  3'116  [1]  [1,0]  0'0
 0.1p0  0  0  0  0  0  0  0  0  active  0'0  2'115  [0]  [0,2]  0'0
 1.0p0  0  0  0  0  0  0  0  0  active  0'0  2'115  [0]  [0,2]  0'0
 1.1p1  0  0  0  0  0  0  0  0  active  0'0  3'116  [1]  [1,0]  0'0
 2.1p1  0  0  0  0  0  0  0  0  active  0'0  3'121  [1]  [1,0]  0'0
 2.0p0  0  0  0  0  0  0  0  0  active  0'0  2'121  [0]  [0,2]  0'0
 0.1p3  0  0  0  0  0  0  0  0  active  0'0  4'115  [3]  [3,0]  0'0
 0.0p2  0  0  0  0  0  0  0  0  active  0'0  3'117  [2]  [2,0]  0'0
 1.1p2  0  0  0  0  0  0  0  0  active  0'0  3'119  [2]  [2,0]  0'0
 1.0p3  0  0  0  0  0  0  0  0  active  0'0  4'115  [3]  [3,0]  0'0
 2.0p1  0  0  0  0  0  0  0  0  active  0'0  3'116  [1]  [1,0]  0'0
 2.1p0  0  0  0  0  0  0  0  0  active  0'0  2'124  [0]  [0,1]  0'0
 Let's take pg 2.1p3 as an example:
 why are the up and acting sets not equal?
 Does data migration occur in the OSD cluster under this condition?
That is what we are most concerned about.
 If so, the data distribution didn't follow the rules set by the crushmap.


With many down nodes, the distribution can become bad due to the way the 
current crush implementation works (bug #2047). It will only retry 
within the local subtree when it encounters a down device, so it can end 
up getting more down nodes and giving up. A workaround for this is to 
remove the host level from your hierarchy and put the devices directly 
in the racks.
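For illustration, the rack buckets would then hold the devices directly,
something like this in the decompiled map (bucket ids and weights below are
placeholders, not taken from your attached map):

rack rack0 {
        id -2           # placeholder bucket id
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
        item osd.2 weight 1.000
        item osd.3 weight 1.000
}
rack rack1 {
        id -3
        alg straw
        hash 0
        item osd.4 weight 1.000
        item osd.5 weight 1.000
        item osd.6 weight 1.000
        item osd.7 weight 1.000
}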


You can test the distribution with crushtool to see what happens with 
different osds down. For example, to test with osds 0 and 1 down:


ceph osd getcrushmap -o /tmp/crushmap
crushtool -i /tmp/crushmap --test --weight 0 0 --weight 1 0
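
Since the case you hit is an entire rack going away, you can also simulate
that directly; assuming osds 0-3 are the rack that was powered off:

crushtool -i /tmp/crushmap --test --weight 0 0 --weight 1 0 --weight 2 0 --weight 3 0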


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-09 Thread Jim Schutt

On 02/06/2012 11:35 AM, Gregory Farnum wrote:

On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt jasc...@sandia.gov wrote:

On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:


On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote:





The above suggests to me that the slowdown is a result
of requests not getting submitted at the same rate as
when things are running well.



Yeah, it really looks like that. My suggestions wouldn't help there.

I do see that when things go well the number of writes per device is
capped at ~200 writes per second and the throughput per device is
~100MB/sec. Is 100MB/sec the expected device throughput?



Pretty much, at least for the outer tracks on a drive.  I've seen
~108 MB/s with dd to a block device.  Also, I've got 8 drives per
SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
disk subsystem is any sort of significant bottleneck.


Well, you might try changing your throttling settings on the OSDs.
ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
and is used to limit the amount of client data in memory (i.e., messages
are included in this throttler for their entire lifetime, not just
while waiting for dispatch).
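
(For reference, a sketch of how those knobs might look in ceph.conf --
option names as above, values purely illustrative:)

[osd]
        ; illustrative values only (200 MB and ~250 MB respectively)
        ms dispatch throttle bytes = 209715200
        osd max client bytes = 262144000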




I've made a little progress isolating this.

osd client message size cap =  500 makes the stall
completely reproducible (which also means I can reproduce it
on two different network types, ethernet and IPoIB), and I
am able to generate graphs of throttled/receive/process time
for each request received by an OSD (see attached SVG plot).

Such plots suggest to me my problem is caused by stalled
receives.  Using debug ms = 30 on my OSDs turns up instances
of this:

osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500
osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader got front 128
osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304
osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073

which I take to mean that the reader thread sat in poll() for 56 secs, in
this case.
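
In case it's useful to anyone else, the gaps are easy to pull out
mechanically; a rough sketch (assumes the log has already been grepped down
to a single pipe's reader thread, as in the lines above):

import re, sys
from datetime import datetime

stamp = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*\.reader ')

prev = None
for line in open(sys.argv[1]):
    m = stamp.search(line)
    if not m:
        continue
    t = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S.%f')
    if prev is not None and (t - prev).total_seconds() > 10:
        # a long gap between consecutive reader events == a stalled receive
        print('%.1f sec stall before: %s' % ((t - prev).total_seconds(), line.strip()))
    prev = t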

I was able to correlate such stalls with tcpdump output collected on
clients.  Here's an example from another run:

15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
15:09:37.706950 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-09 Thread Tommi Virtanen
On Thu, Feb 9, 2012 at 15:15, Jim Schutt jasc...@sandia.gov wrote:
 I suspect a bug in the stack, as at an application level I get
 the same sort of stalls whether I use IP over ethernet or IPoIB.
 I need to get traces for both cases to prove that it is the same
 stall...

Hi. I just wanted to confirm that what your tcpdump shows is a packet
loss between the client and the osd. If this wasn't packet loss, you'd
expect the TCP window size to drop to 0 -- not remain at 64k as it is
in your dump -- and you'd expect to see ACKs that don't advance the
sequence number. Something like this:

16:25:16.914407 IP 127.0.0.1.60336 > 127.0.0.1.: Flags [P.], seq 90150:93094, ack 1, win 257, options [nop,nop,TS val 3732293 ecr 3732270], length 2944
16:25:16.914416 IP 127.0.0.1. > 127.0.0.1.60336: Flags [.], ack 93094, win 0, options [nop,nop,TS val 3732293 ecr 3732293], length 0
16:25:17.144409 IP 127.0.0.1.60336 > 127.0.0.1.: Flags [.], ack 1, win 257, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.144421 IP 127.0.0.1. > 127.0.0.1.60336: Flags [.], ack 93094, win 0, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.604409 IP 127.0.0.1.60336 > 127.0.0.1.: Flags [.], ack 1, win 257, options [nop,nop,TS val 3732362 ecr 3732316], length 0
16:25:17.604419 IP 127.0.0.1. > 127.0.0.1.60336: Flags [.], ack 93094, win 0, options [nop,nop,TS val 3732362 ecr 3732293], length 0
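
(If it helps, here's a rough way to scan a text capture for the two
signatures -- zero-window stalls vs. ACKs that stop advancing while the
window stays open; it only understands the pure-ACK lines in dumps like the
ones above:)

import re, sys

ack_line = re.compile(r'IP (\S+) > (\S+): Flags \[[^\]]*\], ack (\d+), win (\d+)')

last_ack = {}
for line in open(sys.argv[1]):
    m = ack_line.search(line)
    if not m:
        continue  # skips data segments (they carry a seq field before the ack)
    src, dst, ack, win = m.group(1), m.group(2), int(m.group(3)), int(m.group(4))
    if win == 0:
        print('zero window (receiver not draining, not packet loss):', line.strip())
    elif last_ack.get((src, dst)) == ack:
        print('ack not advancing with open window (smells like loss):', line.strip())
    last_ack[(src, dst)] = ack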

As pointed out by Sridhar, various TCP offload mechanisms (and
firewalling!) may make your tcpdump not see the underlying reality.

You might also be just actually losing packets, and the osd settings
might, perhaps, influence the performance of the machine enough to
make it lose packets -- though that sounds a bit far fetched.

You might also be suffering from a Path MTU Discovery black hole, and
need the osd size cap to get full-frame packets out. I see your
tcpdump indicated jumbo frames (at least until the TSO engine!), which
might be its own source of pain.
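
A quick way to check for a PMTU black hole from a client is to force
don't-fragment pings at jumbo size toward the osd (8972 = 9000-byte MTU
minus IP+ICMP headers); if those die while small pings work, that's your
black hole:

ping -M do -s 8972 -c 3 172.17.131.32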

