Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-28 Thread Jim Schutt
Hi Sage,

On 02/26/2013 12:36 PM, Sage Weil wrote:
 On Tue, 26 Feb 2013, Jim Schutt wrote:
 I think the right solution is to make an option that will setsockopt on 
 SO_RECVBUF to some value (say, 256KB).  I pushed a branch that does this, 
 wip-tcp.  Do you mind checking to see if this addresses the issue (without 
 manually adjusting things in /proc)?

 I'll be happy to test it out...
 
 That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
 rcvbuf'.

I've verified that I can reproduce the slowdown with the
default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf,
and 'ms tcp rcvbuf' at 0.

I've also verified that I could not reproduce any slowdown when
I configure 'ms tcp rcvbuf' to 256 KiB on OSDs.

So, that's great news - sorry for the delay in testing.

Also, FWIW, I ended up testing with commits cb15e6e0f4 and
c346282940 cherry-picked on top of next as of a day or
so ago (commit f58601d681), as for some reason wip-tcp
wouldn't work for me - ceph-mon was non-responsive in
some way I didn't dig into.

-- Jim

 
 Thanks!
 sage
 
 




Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-28 Thread Sage Weil
On Thu, 28 Feb 2013, Jim Schutt wrote:
 Hi Sage,
 
 On 02/26/2013 12:36 PM, Sage Weil wrote:
  On Tue, 26 Feb 2013, Jim Schutt wrote:
  I think the right solution is to make an option that will setsockopt on 
  SO_RECVBUF to some value (say, 256KB).  I pushed a branch that does this, 
  wip-tcp.  Do you mind checking to see if this addresses the issue 
  (without 
  manually adjusting things in /proc)?
 
  I'll be happy to test it out...
  
  That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
  rcvbuf'.
 
 I've verified that I can reproduce the slowdown with the
 default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf,
 and 'ms tcp rcvbuf' at 0.
 
 I've also verified that I could not reproduce any slowdown when
 I configure 'ms tcp rcvbuf' to 256 KiB on OSDs.
 
 So, that's great news - sorry for the delay in testing.

Awesome--thanks so much for testing that!  Pulling it into master now.
 
 Also, FWIW, I ended up testing with commits cb15e6e0f4 and
 c346282940 cherry-picked on top of next as of a day or
 so ago (commit f58601d681), as for some reason wip-tcp
 wouldn't work for me - ceph-mon was non-responsive in
 some way I didn't dig into.

Yeah, sorry about that.  I rebased wip-tcp a few days ago but you may have 
picked up the previous version.

sage


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-26 Thread Jim Schutt
Hi Sage,

On 02/20/2013 05:12 PM, Sage Weil wrote:
 Hi Jim,
 
 I'm resurrecting an ancient thread here, but: we've just observed this on 
 another big cluster and remembered that this hasn't actually been fixed.

Sorry for the delayed reply - I missed this in a backlog
of unread email...

 
 I think the right solution is to make an option that will setsockopt on 
 SO_RECVBUF to some value (say, 256KB).  I pushed a branch that does this, 
 wip-tcp.  Do you mind checking to see if this addresses the issue (without 
 manually adjusting things in /proc)?

I'll be happy to test it out...

 
 And perhaps we should consider making this default to 256KB...

That's the value I've been using with my /proc adjustments
since I figured out what was going on.  My servers use
a 10 GbE port for each of the cluster and public networks,
with cephfs clients using 1 GbE, and I've not detected any
issues resulting from that value.  So, it seems like a decent
starting point for a default...

-- Jim

 
 Thanks!
 sage
 




Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-26 Thread Sage Weil
On Tue, 26 Feb 2013, Jim Schutt wrote:
  I think the right solution is to make an option that will setsockopt on 
  SO_RECVBUF to some value (say, 256KB).  I pushed a branch that does this, 
  wip-tcp.  Do you mind checking to see if this addresses the issue (without 
  manually adjusting things in /proc)?
 
 I'll be happy to test it out...

That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
rcvbuf'.

Thanks!
sage


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-20 Thread Sage Weil
Hi Jim,

I'm resurrecting an ancient thread here, but: we've just observed this on 
another big cluster and remembered that this hasn't actually been fixed.

I think the right solution is to make an option that will setsockopt on 
SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this, 
wip-tcp.  Do you mind checking to see if this addresses the issue (without 
manually adjusting things in /proc)?

And perhaps we should consider making this default to 256KB...
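
For anyone following along, this is roughly what such an option boils down
to at the socket level.  A minimal Python sketch for illustration only (not
the actual wip-tcp C++ messenger code; the address and port are made up):

import socket

MS_TCP_RCVBUF = 256 * 1024   # the 256KB value discussed above

# Setting SO_RCVBUF explicitly before listen() disables the kernel's
# receive-buffer autotuning for this socket, and accepted connections
# inherit the fixed size.  A throttled reader can then only ever have
# ~256KB queued in the kernel on its behalf.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, MS_TCP_RCVBUF)
listener.bind(("0.0.0.0", 6800))     # illustrative address/port only
listener.listen(128)

# The kernel doubles the requested value to account for bookkeeping
# overhead, so getsockopt() will report roughly 512KB here.
print(listener.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))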

Thanks!
sage



On Fri, 24 Feb 2012, Jim Schutt wrote:

 On 02/02/2012 10:52 AM, Gregory Farnum wrote:
  On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:
   I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
   per OSD.  During a test I watch both OSD servers with both
   vmstat and iostat.
   
   During a good period, vmstat says the server is sustaining > 2 GB/s
   for multiple tens of seconds.  Since I use replication factor 2, that
   means that server is sustaining > 500 MB/s aggregate client throughput,
   right?  During such a period vmstat also reports ~10% CPU idle.
   
   During a bad period, vmstat says the server is doing ~200 MB/s,
   with lots of idle cycles.  It is during these periods that
   messages stuck in the policy throttler build up such long
   wait times.  Sometimes I see really bad periods with aggregate
   throughput per server < 100 MB/s.
   
   The typical pattern I see is that a run starts with tens of seconds
   of aggregate throughput > 2 GB/s.  Then it drops and bounces around
   500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
   it ramps back up near 2 GB/s again.
  
  Hmm. 100MB/s is awfully low for this theory, but have you tried to
  correlate the drops in throughput with the OSD journals running out of
  space? I assume from your setup that they're sharing the disk with the
  store (although it works either way), and your description makes me
  think that throughput is initially constrained by sequential journal
  writes but then the journal runs out of space and the OSD has to wait
  for the main store to catch up (with random IO), and that sends the IO
  patterns all to hell. (If you can say that random 4MB IOs are
  hellish.)
  I'm also curious about memory usage as a possible explanation for the
  more dramatic drops.
 
 I've finally figured out what is going on with this behaviour.
 Memory usage was on the right track.
 
 It turns out to be an unfortunate interaction between the
 number of OSDs/server, number of clients, TCP socket buffer
 autotuning, the policy throttler, and limits on the total
 memory used by the TCP stack (net/ipv4/tcp_mem sysctl).
 
 What happens is that for throttled reader threads, the
 TCP stack will continue to receive data as long as there
 is available socket buffer, and the sender has data to send.
 
 As each reader thread receives successive messages, the
 TCP socket buffer autotuning increases the size of the
 socket buffer.  Eventually, due to the number of OSDs
 per server and the number of clients trying to write,
 all the memory the TCP stack is allowed by net/ipv4/tcp_mem
 to use is consumed by the socket buffers of throttled
 reader threads.  When this happens, TCP processing is affected
 to the point that the TCP stack cannot send ACKs on behalf
 of the reader threads that aren't throttled.  At that point
 the OSD stalls until the TCP retransmit count on some connection
 is exceeded, causing it to be reset.
 
 Since my OSD servers don't run anything else, the simplest
 solution for me is to turn off socket buffer autotuning
 (net/ipv4/tcp_moderate_rcvbuf), and set the default socket
 buffer size to something reasonable.  256k seems to be
 working well for me right now.
 
 -- Jim
 
  -Greg
  
  
 
 


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-24 Thread Jim Schutt

On 02/02/2012 10:52 AM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:

I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
per OSD.  During a test I watch both OSD servers with both
vmstat and iostat.

During a good period, vmstat says the server is sustaining > 2 GB/s
for multiple tens of seconds.  Since I use replication factor 2, that
means that server is sustaining > 500 MB/s aggregate client throughput,
right?  During such a period vmstat also reports ~10% CPU idle.

During a bad period, vmstat says the server is doing ~200 MB/s,
with lots of idle cycles.  It is during these periods that
messages stuck in the policy throttler build up such long
wait times.  Sometimes I see really bad periods with aggregate
throughput per server < 100 MB/s.

The typical pattern I see is that a run starts with tens of seconds
of aggregate throughput > 2 GB/s.  Then it drops and bounces around
500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
it ramps back up near 2 GB/s again.


Hmm. 100MB/s is awfully low for this theory, but have you tried to
correlate the drops in throughput with the OSD journals running out of
space? I assume from your setup that they're sharing the disk with the
store (although it works either way), and your description makes me
think that throughput is initially constrained by sequential journal
writes but then the journal runs out of space and the OSD has to wait
for the main store to catch up (with random IO), and that sends the IO
patterns all to hell. (If you can say that random 4MB IOs are
hellish.)
I'm also curious about memory usage as a possible explanation for the
more dramatic drops.


I've finally figured out what is going on with this behaviour.
Memory usage was on the right track.

It turns out to be an unfortunate interaction between the
number of OSDs/server, number of clients, TCP socket buffer
autotuning, the policy throttler, and limits on the total
memory used by the TCP stack (net/ipv4/tcp_mem sysctl).

What happens is that for throttled reader threads, the
TCP stack will continue to receive data as long as there
is available socket buffer, and the sender has data to send.

As each reader thread receives successive messages, the
TCP socket buffer autotuning increases the size of the
socket buffer.  Eventually, due to the number of OSDs
per server and the number of clients trying to write,
all the memory the TCP stack is allowed by net/ipv4/tcp_mem
to use is consumed by the socket buffers of throttled
reader threads.  When this happens, TCP processing is affected
to the point that the TCP stack cannot send ACKs on behalf
of the reader threads that aren't throttled.  At that point
the OSD stalls until the TCP retransmit count on some connection
is exceeded, causing it to be reset.
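
A rough way to watch this condition develop is to compare the TCP buffer
memory the kernel reports in use against the tcp_mem thresholds.  Below is
a small Python sketch of that check; it only reads the standard /proc files
named above plus /proc/net/sockstat, but the field layout is worth
double-checking on your kernel:

def tcp_mem_pressure():
    # /proc/sys/net/ipv4/tcp_mem holds three page counts: low, pressure, high.
    with open("/proc/sys/net/ipv4/tcp_mem") as f:
        low, pressure, high = (int(x) for x in f.read().split())
    # /proc/net/sockstat has a line like:
    #   TCP: inuse 27 orphan 0 tw 1 alloc 30 mem 4200
    # where 'mem' is the number of pages of buffer memory TCP is using.
    used = 0
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("TCP:"):
                fields = line.split()
                used = int(fields[fields.index("mem") + 1])
                break
    return used, pressure, high

used, pressure, high = tcp_mem_pressure()
print("TCP pages in use: %d (pressure at %d, hard limit %d)" % (used, pressure, high))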

Since my OSD servers don't run anything else, the simplest
solution for me is to turn off socket buffer autotuning
(net/ipv4/tcp_moderate_rcvbuf), and set the default socket
buffer size to something reasonable.  256k seems to be
working well for me right now.
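
The adjustments look something like this (an illustrative Python sketch;
the exact sysctls and numbers are assumptions rather than the precise
settings used above, and for TCP sockets the middle field of
net/ipv4/tcp_rmem is what sets the default receive buffer):

def disable_tcp_rcvbuf_autotuning(default_bytes=256 * 1024, max_bytes=4 * 1024 * 1024):
    # Turn off per-socket receive-buffer autotuning.
    with open("/proc/sys/net/ipv4/tcp_moderate_rcvbuf", "w") as f:
        f.write("0\n")
    # min / default / max receive buffer, in bytes.  With autotuning off,
    # sockets stay at the default unless an application calls setsockopt().
    with open("/proc/sys/net/ipv4/tcp_rmem", "w") as f:
        f.write("4096 %d %d\n" % (default_bytes, max_bytes))

disable_tcp_rcvbuf_autotuning()   # needs root, of course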

-- Jim


-Greg







Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-24 Thread Tommi Virtanen
On Fri, Feb 24, 2012 at 07:38, Jim Schutt jasc...@sandia.gov wrote:
 I've finally figured out what is going on with this behaviour.
 Memory usage was on the right track.

 It turns out to be an unfortunate interaction between the
 number of OSDs/server, number of clients, TCP socket buffer
 autotuning, the policy throttler, and limits on the total
 memory used by the TCP stack (net/ipv4/tcp_mem sysctl).

 What happens is that for throttled reader threads, the
 TCP stack will continue to receive data as long as there
 is available socket buffer, and the sender has data to send.

Ohh! Yes, if userspace stops reading a socket, the kernel will keep
buffering data as per SO_RCVBUF etc. And since TCP has global limits,
that is going to push the stack uncomfortably close to them.

Ceph *could* manipulate SO_RCVBUF size at the time it decides to
throttle a client, that would limit the TCP buffer space consumed by
throttled clients (except for a race where the data got received
before Ceph called setsockopt). I recall seeing a trick like that
pulled off somewhere, but I can't find an example right now.
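
The trick would look something like the sketch below (hypothetical Python
with made-up helper names, not anything that exists in the messenger
today): shrink SO_RCVBUF when the policy throttler blocks a client, and
restore it when the client is released.  Data that already made it into
the old, larger buffer stays there, which is the race noted above.

import socket

THROTTLED_RCVBUF = 64 * 1024    # small cap while a client is throttled
NORMAL_RCVBUF = 256 * 1024      # value to restore afterwards (illustrative)

def throttle_rcvbuf(conn):
    # Shrink the kernel receive buffer of a throttled connection, so the
    # TCP stack stops absorbing much more data on its behalf.  Returns the
    # previous size so the caller can decide what to restore later.
    old = conn.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, THROTTLED_RCVBUF)
    return old

def unthrottle_rcvbuf(conn, restore_to=NORMAL_RCVBUF):
    # Give the buffer back once the policy throttler releases the client.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, restore_to)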

Or perhaps we just say "sorry, your server is swamped with too much
work for the resources it's given; you need more of them." That's not
nice, though, when throttling can slow down the non-throttled
connections.


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-24 Thread Tommi Virtanen
I created ticket http://tracker.newdream.net/issues/2100 for this.

On Fri, Feb 24, 2012 at 10:31, Tommi Virtanen
tommi.virta...@dreamhost.com wrote:
 On Fri, Feb 24, 2012 at 07:38, Jim Schutt jasc...@sandia.gov wrote:
 I've finally figured out what is going on with this behaviour.
 Memory usage was on the right track.

 It turns out to be an unfortunate interaction between the
 number of OSDs/server, number of clients, TCP socket buffer
 autotuning, the policy throttler, and limits on the total
 memory used by the TCP stack (net/ipv4/tcp_mem sysctl).

 What happens is that for throttled reader threads, the
 TCP stack will continue to receive data as long as there
 is available socket buffer, and the sender has data to send.

 Ohh! Yes, if the userspace stops reading a socket, kernel will buffer
 data as per SO_RCVBUF etc. And TCP has global limits, and that is
 going to push it uncomfortably close to the global limit.

 Ceph *could* manipulate SO_RCVBUF size at the time it decides to
 throttle a client, that would limit the TCP buffer space consumed by
 throttled clients (except for a race where the data got received
 before Ceph called setsockopt). I recall seeing a trick like that
 pulled off somewhere, but I can't find an example right now.

 Or perhaps we just say sorry your server is swamped with too much
 work for the resources it's given; you need more of them. That's not
 nice though, when throttling can slow down the non-throttled
 connections.


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-13 Thread Jim Schutt

On 02/10/2012 05:05 PM, sridhar basam wrote:

  But the server never ACKed that packet.  Too busy?

  I was collecting vmstat data during the run; here's the important bits:

  Fri Feb 10 11:56:51 MST 2012
  vmstat -w 8 16
  procs ---memory-- ---swap-- -io
  --system-- -cpu---
r  b   swpd   free   buff  cache   si   sobibo   in
 cs  us sy  id wa st
  13 10  0 250272944   3785908000 7  5346 1098
444   2  5  92  1  0
  88  8  0 260472944   3672877600 0 1329838
  257602 68861  19 73   5  4  0
  100 10  0 241952944   36066536 0 0 0 1635891 340724
  85570  22 68   6  4  0
  105  9  0 250288944   34750820 0 0 0 1584816 433223
  111462  21 73   4  3  0
  126  3  0 259908944   3384169600 0 749648
  225707 86716   9 83   4  3  0
  157  2  0 245032944   31572536 0 0 0 736841 252406
  99083   9 81   5  5  0
  45 17  0 246720944   2887764000 1 755085
  282177 116551   8 77   9  5  0

Holy crap! That might explain why you aren't seeing anything. You are
writing out over 1.6 million blocks/sec, and that's averaged over an
8 second interval. I bet the missed ACKs happen while this is going on.
What sort of I/O load is going through this system during those times?
What sort of filesystem and Linux system are these OSDs on?


Dual socket Nehalem EP @ 3 GHz, 24 ea. 7200RPM SAS drives w/ 64 MB cache,
3 LSI SAS HBAs w/8 drives per HBA, btrfs, 3.2.0 kernel.  Each OSD
has a ceph journal and a ceph data store on a single drive.

I'm running 24 OSDs on such a box; all that write load is the result
of dd from 166 linux ceph clients.

FWIW, I've seen these boxes sustain > 2 GB/s for 60 sec or so under
this load, when I have TSO/GSO/GRO turned on, and am writing to
a freshly created ceph filesystem.

That lasts until my OSDs get stalled reading from a socket, as
documented by those packet traces I posted.

If you compare the timestamps on the retransmits to the times
that vmstat is dumping reports, at least some of the retransmits
hit the system when it is ~80% idle.

-- Jim



  Sridhar








Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-10 Thread Jim Schutt

On 02/09/2012 06:26 PM, sridhar basam wrote:

Do you mind capturing to a pcap file and providing that. Makes it
easier to analyse things. If not, i understand. If you can make do the
capture on both ends, do it with a snaplen of 68 so that you get all
of the headers and there shouldn't be too much payload information in
the file.


I've got a pcap file for this run for this client, with snaplen 128
(I thought I might need to look for ceph message headers).  It's 13 MB
compressed.  How shall I get it to you?

In the meantime, I'll try to capture this from both sides.



I will take a look at the additional output and see if anything pops
out. I am assuming the below output was immediately after what you
posted in your earlier email.


Yes.

Thanks -- Jim




  Sridhar






Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-10 Thread sridhar basam
On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt jasc...@sandia.gov wrote:
 On 02/09/2012 06:26 PM, sridhar basam wrote:

 Do you mind capturing to a pcap file and providing that. Makes it
 easier to analyse things. If not, i understand. If you can make do the
 capture on both ends, do it with a snaplen of 68 so that you get all
 of the headers and there shouldn't be too much payload information in
 the file.


 I've got a pcap file for this run for this client, with snaplen 128
 (I thought I might need to look for ceph message headers).  It's 13 MB
 compressed.  How shall I get it to you?


Can I grab it off some webserver you control? Or you could temporarily
drop it into docs.google.com and add access for my email account?

 In the meantime, I'll try to capture this from both sides.



 I will take a look at the additional output and see if anything pops
 out. I am assuming the below output was immediately after what you
 posted in your earlier email.

I don't see anything out of the ordinary once things recover; the
sender even starts to do TSO after a short while.

 Sridhar




 Yes.

 Thanks -- Jim




  Sridhar





Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-10 Thread Jim Schutt


[ added Cc:netdev
  See http://www.spinics.net/lists/ceph-devel/msg04804.html
  for the start of the thread.
  -- Jim
]

On 02/10/2012 10:13 AM, sridhar basam wrote:

On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt jasc...@sandia.gov wrote:

On 02/09/2012 06:26 PM, sridhar basam wrote:


Do you mind capturing to a pcap file and providing that. Makes it
easier to analyse things. If not, i understand. If you can make do the
capture on both ends, do it with a snaplen of 68 so that you get all
of the headers and there shouldn't be too much payload information in
the file.



I've got a pcap file for this run for this client, with snaplen 128
(I thought I might need to look for ceph message headers).  It's 13 MB
compressed.  How shall I get it to you?



Can i grab it off some webserver you control? Or you can temporarily
drop it into docs.google.com and add accesss from my email account?


I tabled this for the moment while I worked on collecting
packet traces from both ends.  But you'll probably want
to see the pcap files for what I'm about to show.  Also,
I think I need to add netdev to this discussion ...




In the meantime, I'll try to capture this from both sides.


Here's another example, captured from both sides, with
TSO/GSO/GRO all off, snaplen 68.

This was captured from the client side.  Same pattern, in
that the client sends many retransmits over a period of
a couple minutes.  It's different in that the client
seems to give up and reconnect ...

11:57:35.984024 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18109890, win 5792, options [nop,nop,TS[|tcp]
11:57:35.984032 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984038 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984120 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18112786, win 5792, options [nop,nop,TS[|tcp]
11:57:35.984129 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984135 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984143 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18115682, win 5792, options [nop,nop,TS[|tcp]
11:57:35.984148 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984153 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984270 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18118578, win 5792, options [nop,nop,TS[|tcp]
11:57:35.984278 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984283 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:35.984420 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [.], ack 18120026, win 5792, options [nop,nop,TS[|tcp]
11:57:35.984428 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:36.184945 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:36.587936 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:37.393937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:39.003937 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:42.227933 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:57:48.675931 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:58:01.555935 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:58:27.347945 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
11:59:18.867935 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
12:00:22.673029 IP 172.17.131.32.6807 > 172.17.135.4.57589: Flags [P.], seq 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]
12:00:22.712933 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], ack 1036, win 19832, options [nop,nop,TS[|tcp]
12:01:02.035951 IP 172.17.135.4.57589 > 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-10 Thread sridhar basam
On Fri, Feb 10, 2012 at 6:09 PM, Jim Schutt jasc...@sandia.gov wrote:

 [ added Cc:netdev
  See http://www.spinics.net/lists/ceph-devel/msg04804.html
  for the start of the thread.
  -- Jim
 ]


 On 02/10/2012 10:13 AM, sridhar basam wrote:

 On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt jasc...@sandia.gov wrote:

 On 02/09/2012 06:26 PM, sridhar basam wrote:


 Do you mind capturing to a pcap file and providing that. Makes it
 easier to analyse things. If not, i understand. If you can make do the
 capture on both ends, do it with a snaplen of 68 so that you get all
 of the headers and there shouldn't be too much payload information in
 the file.



 I've got a pcap file for this run for this client, with snaplen 128
 (I thought I might need to look for ceph message headers).  It's 13 MB
 compressed.  How shall I get it to you?


 Can i grab it off some webserver you control? Or you can temporarily
 drop it into docs.google.com and add accesss from my email account?


 I tabled this for the moment while I worked on collecting
 packet traces from both ends.  But you'll probably want
 to see the pcap files for what I'm about to show.  Also,
 I think I need to add netdev to this discussion ...



 In the meantime, I'll try to capture this from both sides.


 Here's another example, captured from both sides, with
 TSO/GSO/GRO all off, snaplen 68.

 This was captured from the client side.  Same pattern, in
 that the client sends many retransmits over a period of
 a couple minutes.  It's different in that the client
 seems to give up and reconnect ...

 11:57:35.984024 IP 172.17.131.32.6807  172.17.135.4.57589: Flags [.], ack
 18109890, win 5792, options [nop,nop,TS[|tcp]
 11:57:35.984032 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984038 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984120 IP 172.17.131.32.6807  172.17.135.4.57589: Flags [.], ack
 18112786, win 5792, options [nop,nop,TS[|tcp]
 11:57:35.984129 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984135 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984143 IP 172.17.131.32.6807  172.17.135.4.57589: Flags [.], ack
 18115682, win 5792, options [nop,nop,TS[|tcp]
 11:57:35.984148 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984153 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984270 IP 172.17.131.32.6807  172.17.135.4.57589: Flags [.], ack
 18118578, win 5792, options [nop,nop,TS[|tcp]
 11:57:35.984278 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984283 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:35.984420 IP 172.17.131.32.6807  172.17.135.4.57589: Flags [.], ack
 18120026, win 5792, options [nop,nop,TS[|tcp]
 11:57:35.984428 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:36.184945 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:36.587936 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:37.393937 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:39.003937 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:42.227933 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:57:48.675931 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:58:01.555935 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:58:27.347945 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 11:59:18.867935 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], seq
 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp]
 12:00:22.673029 IP 172.17.131.32.6807  172.17.135.4.57589: Flags [P.], seq
 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp]
 12:00:22.712933 IP 172.17.135.4.57589  172.17.131.32.6807: Flags [.], ack
 1036, win 19832, options [nop,nop,TS[|tcp]
 12:01:02.035951 IP 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-09 Thread Jim Schutt

On 02/06/2012 11:35 AM, Gregory Farnum wrote:

On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt jasc...@sandia.gov wrote:

On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:


On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote:





The above suggests to me that the slowdown is a result
of requests not getting submitted at the same rate as
when things are running well.



Yeah, it really looks like that. My suggestions wouldn't help there.

I do see that when things go well the number of writes per device is
capped at ~200 writes per second and the throughput per device is
~100MB/sec. Is 100MB/sec the expected device throughput?



Pretty much, at least for the outer tracks on a drive.  I've seen
~108 MB/s with dd to a block device.  Also, I've got 8 drives per
SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
disk subsystem is any sort of significant bottleneck.


Well, you might try changing your throttling settings on the OSDs.
ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
and is used to limit the amount of client data in memory (ie; messages
are included in this throttler for their entire lifetime, not just
while waiting for dispatch).




I've made a little progress isolating this.

osd client message size cap =  500 makes the stall
completely reproducible (which also means I can reproduce
on two different network types, ethernet and IPoIB.), and I
am able to generate graphs of throttled/receive/process time
for each request received by an OSD (see attached SVG plot).

Such plots suggest to me my problem is caused by stalled
receives.  Using debug ms = 30 on my OSDs turns up instances
of this:

osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500
osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader got front 128
osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304
osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 >> 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073

which I take to mean that the reader thread sat in poll() for 56 secs, in
this case.

I was able to correlate such stalls with tcpdump output collected on
clients.  Here's an example from another run:

15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
15:09:37.706950 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-09 Thread Jim Schutt

On 02/09/2012 03:40 PM, sridhar basam wrote:

On Thu, Feb 9, 2012 at 3:53 PM, Jim Schutt jasc...@sandia.gov wrote:

On 02/06/2012 11:35 AM, Gregory Farnum wrote:


On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt jasc...@sandia.gov wrote:


On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:



On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote:






The above suggests to me that the slowdown is a result
of requests not getting submitted at the same rate as
when things are running well.



Yeah, it really looks like that. My suggestions wouldn't help there.

I do see that when things go well the number of writes per device is
capped at ~200 writes per second and the throughput per device is
~100MB/sec. Is 100MB/sec the expected device throughput?




Pretty much, at least for the outer tracks on a drive.  I've seen
~108 MB/s with dd to a block device.  Also, I've got 8 drives per
SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
disk subsystem is any sort of significant bottleneck.



Well, you might try changing your throttling settings on the OSDs.
ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
and is used to limit the amount of client data in memory (ie; messages
are included in this throttler for their entire lifetime, not just
while waiting for dispatch).




I've made a little progress isolating this.

osd client message size cap =  500 makes the stall
completely reproducible (which also means I can reproduce
on two different network types, ethernet and IPoIB.), and I
am able to generate graphs of throttled/receive/process time
for each request received by an OSD (see attached SVG plot).

Such plots suggest to me my problem is caused by stalled
receives.  Using debug ms = 30 on my OSDs turns up instances
of this:

osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 --
172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef sd=173
pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500
osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 --
172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef sd=173
pgs=7 cs=1 l=1).reader got front 128
osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 --
172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef sd=173
pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 --
172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef sd=173
pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 --
172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef sd=173
pgs=7 cs=1 l=1).reader read 1369231 of 4194304
osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 --
172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef sd=173
pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 --
172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef sd=173
pgs=7 cs=1 l=1).reader read 1448 of 2825073

which I take to mean that the reader thread sat in poll() for 56 secs, in
this case.

I was able to correlate such stalls with tcpdump output collected on
clients.  Here's an example from another run:

15:09:37.584600 IP 172.17.131.32.6808  172.17.135.7.37045: Flags [.], ack
23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.], seq
23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.], seq
23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808  172.17.135.7.37045: Flags [.], ack
23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.], seq
23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [P.], seq
23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808  172.17.135.7.37045: Flags [.], ack
23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.], seq
23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.], seq
23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808  172.17.135.7.37045: Flags [.], 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-09 Thread Tommi Virtanen
On Thu, Feb 9, 2012 at 15:15, Jim Schutt jasc...@sandia.gov wrote:
 I suspect a bug in the stack, as at an application level I get
 the same sort of stalls whether I use IP over ethernet or IPoIB.
 I need to get traces for both cases to prove that it is the same
 stall...

Hi. I just wanted to confirm that what your tcpdump shows is packet
loss between the client and the osd. If this wasn't packet loss, you'd
expect the TCP window size to drop to 0 -- not remain at 64k as it is
in your dump -- and you'd expect to see ACKs that don't advance the
sequence number. Something like this:

16:25:16.914407 IP 127.0.0.1.60336  127.0.0.1.: Flags [P.], seq
90150:93094, ack 1, win 257, options [nop,nop,TS val 3732293 ecr
3732270], length 2944
16:25:16.914416 IP 127.0.0.1.  127.0.0.1.60336: Flags [.], ack
93094, win 0, options [nop,nop,TS val 3732293 ecr 3732293], length 0
16:25:17.144409 IP 127.0.0.1.60336  127.0.0.1.: Flags [.], ack 1,
win 257, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.144421 IP 127.0.0.1.  127.0.0.1.60336: Flags [.], ack
93094, win 0, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.604409 IP 127.0.0.1.60336  127.0.0.1.: Flags [.], ack 1,
win 257, options [nop,nop,TS val 3732362 ecr 3732316], length 0
16:25:17.604419 IP 127.0.0.1.  127.0.0.1.60336: Flags [.], ack
93094, win 0, options [nop,nop,TS val 3732362 ecr 3732293], length 0

As pointed out by Sridhar, various TCP offload mechanisms (and
firewalling!) may make your tcpdump not see the underlying reality.

You might also be just actually losing packets, and the osd settings
might, perhaps, influence the performance of the machine enough to
make it lose packets -- though that sounds a bit far fetched.

You might also be suffering from a Path MTU Discovery black hole, and
need the osd size cap to get full-frame packets out. I see your
tcpdump indicated jumbo frames (at least until the TSO engine!); that
might be its own source of pain.


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-09 Thread sridhar basam
On Thu, Feb 9, 2012 at 6:15 PM, Jim Schutt jasc...@sandia.gov wrote:
 On 02/09/2012 03:40 PM, sridhar basam wrote:

 On Thu, Feb 9, 2012 at 3:53 PM, Jim Schutt jasc...@sandia.gov wrote:

 On 02/06/2012 11:35 AM, Gregory Farnum wrote:


  On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt jasc...@sandia.gov wrote:


 On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:



  On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote:





 The above suggests to me that the slowdown is a result
 of requests not getting submitted at the same rate as
 when things are running well.


 Yeah, it really looks like that. My suggestions wouldn't help there.

 I do see that when things go well the number of writes per device is
 capped at ~200 writes per second and the throughput per device is
 ~100MB/sec. Is 100MB/sec the expected device throughput?




 Pretty much, at least for the outer tracks on a drive.  I've seen
 ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
 SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
 disk subsystem is any sort of significant bottleneck.



 Well, you might try changing your throttling settings on the OSDs.
 ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
 throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
 and is used to limit the amount of client data in memory (ie; messages
 are included in this throttler for their entire lifetime, not just
 while waiting for dispatch).



 I've made a little progress isolating this.

 osd client message size cap =  500 makes the stall
 completely reproducible (which also means I can reproduce
 on two different network types, ethernet and IPoIB.), and I
 am able to generate graphs of throttled/receive/process time
 for each request received by an OSD (see attached SVG plot).

 Such plots suggest to me my problem is caused by stalled
 receives.  Using debug ms = 30 on my OSDs turns up instances
 of this:

 osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 --
 172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef
 sd=173
 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500
 osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 --
 172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef
 sd=173
 pgs=7 cs=1 l=1).reader got front 128
 osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 --
 172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef
 sd=173
 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0
 osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 --
 172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef
 sd=173
 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304
 osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 --
 172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef
 sd=173
 pgs=7 cs=1 l=1).reader read 1369231 of 4194304
 osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 --
 172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef
 sd=173
 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073
 osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 --
 172.17.131.32:6800/15199  172.17.135.85:0/2712733083 pipe(0x2ef
 sd=173
 pgs=7 cs=1 l=1).reader read 1448 of 2825073

 which I take to mean that the reader thread sat in poll() for 56 secs, in
 this case.

 I was able to correlate such stalls with tcpdump output collected on
 clients.  Here's an example from another run:

 15:09:37.584600 IP 172.17.131.32.6808  172.17.135.7.37045: Flags [.],
 ack
 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length
 0
 15:09:37.584613 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.],
 seq
 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615
 ecr
 1096144], length 31856
 15:09:37.584655 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.],
 seq
 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615
 ecr
 1096144], length 31856
 15:09:37.624476 IP 172.17.131.32.6808  172.17.135.7.37045: Flags [.],
 ack
 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length
 0
 15:09:37.624489 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.],
 seq
 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655
 ecr
 1096184], length 31856
 15:09:37.624532 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [P.],
 seq
 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655
 ecr
 1096184], length 31856
 15:09:37.664454 IP 172.17.131.32.6808  172.17.135.7.37045: Flags [.],
 ack
 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length
 0
 15:09:37.664468 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.],
 seq
 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695
 ecr
 1096224], length 31856
 15:09:37.664506 IP 172.17.135.7.37045  172.17.131.32.6808: Flags [.],
 seq
 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-06 Thread Jim Schutt

On 02/03/2012 05:03 PM, Yehuda Sadeh Weinraub wrote:

On Fri, Feb 3, 2012 at 3:33 PM, Jim Schutt jasc...@sandia.gov wrote:




You can try running 'iostat -t -kx -d 1' on the osds, and see whether %util
reaches 100%, and when it happens whether it's due to number of io
operations that are thrashing, or whether it's due to high amount of data.
FWIW, you may try setting  'filestore flusher = false', and set
/proc/sys/vm/dirty_background_ratio' to a small number (e.g., 1M).


Here's some iostat data from early in a run, when things are
running well:


02/02/2012 09:14:13 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  23.240.00   61.997.380.007.38

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sdc   0.00 0.000.00  206.00 0.00   101.57  1009.79
54.80  251.27   4.86 100.10
sdd   0.00 0.000.00  202.00 0.0098.10   994.61
27.85  132.42   4.96 100.10
sde   0.00 4.000.00  212.00 0.00   105.09  1015.25
96.06  588.43   4.72 100.10
sdh   0.00 0.000.00  200.00 0.0097.11   994.40
69.77  535.01   5.00 100.10
sdg   0.00 2.000.00  221.00 0.00   109.59  1015.60
82.05  298.71   4.53 100.10
sda   0.00 1.000.00  212.00 0.0083.93   810.75
18.26   84.82   4.68  99.30
sdf   0.00 0.000.00  208.00 0.00   102.55  1009.73
77.23  383.19   4.50  93.70
sdb   0.00 0.000.00  205.00 0.0098.66   985.68
19.97  133.98   4.84  99.20
sdj   0.00 0.000.00  202.00 0.0099.59  1009.66
69.97  257.47   4.95 100.00
sdk   0.00 0.000.00  204.00 0.0098.10   984.86
20.83  100.34   4.87  99.30
sdm   0.00 0.000.00  216.00 0.00   106.55  1010.22
77.73  268.67   4.63 100.00
sdn   0.00 0.000.00  205.00 0.0098.60   985.05
19.33   95.88   4.81  98.60
sdo   0.00 0.000.00  232.00 0.00   106.25   937.93
23.26   82.19   4.29  99.50
sdl   0.00 0.000.00  181.00 0.0085.12   963.09
24.73  131.71   4.80  86.80
sdp   0.00 4.000.00  207.00 0.0087.41   864.77
37.01  111.13   4.49  93.00
sdi   0.00 0.000.00  208.00 0.00   103.04  1014.54
72.30  263.72   4.70  97.70
sdr   0.00 0.000.00  191.00 0.0076.75   822.95
11.51   83.69   4.59  87.60
sds   0.00 0.000.00  209.00 0.00   101.91   998.58
49.95  278.08   4.70  98.20
sdt   0.00 0.000.00  209.00 0.0099.57   975.69
27.31  157.44   4.79 100.10
sdu   0.00 0.000.00  216.00 0.00   107.09  1015.41
79.82  345.88   4.63 100.10
sdw   0.00 0.000.00  208.00 0.00   103.09  1015.00
74.55  308.15   4.81 100.10
sdv   0.00 0.000.00  201.00 0.0098.05   999.08
76.87  265.88   4.98 100.10
sdx   0.00 0.000.00  202.00 0.00   100.50  1018.93   
110.40  327.68   4.96 100.10
sdq   0.00 0.000.00  228.00 0.00   112.59  1011.30
54.84  281.04   4.39 100.10

02/02/2012 09:14:14 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  22.110.00   54.03   15.380.008.48

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
avgqu-sz   await  svctm  %util
sdc   0.00 0.000.00  233.00 0.0099.68   876.15
95.98  384.42   4.29 100.00
sdd   0.00 0.000.00  205.00 0.0096.64   965.46
20.37  108.51   4.84  99.30
sde   0.00 0.000.00  225.00 0.0099.54   906.03
92.38  420.67   4.44 100.00
sdh   0.00 0.000.00  198.00 0.0097.05  1003.84
79.39  410.56   5.05 100.00
sdg   0.00 0.000.00  245.00 0.00   108.38   905.99
84.40  385.47   4.08 100.00
sda   0.00 4.000.00  220.00 0.0096.23   895.78
63.24  294.59   4.44  97.60
sdf   0.00 0.000.00  216.00 0.00   107.09  1015.41
87.67  399.14   4.57  98.80
sdb   0.00 0.000.00  156.00 0.0072.05   945.95
11.61   58.94   4.84  75.50
sdj   0.00 0.000.00  199.00 0.0095.41   981.95
56.28  366.11   4.84  96.40
sdk   0.00 0.000.00  206.00 0.00   100.14   995.57
54.69  241.41   4.86 100.10
sdm   0.00 0.000.00  200.00 0.0099.09  1014.72
79.51  506.47   4.74  94.70
sdn   0.00 0.000.00  191.00 0.0091.29   978.81
26.82  128.39   5.18  98.90
sdo   0.00 0.000.00  234.00 0.00   106.75   934.32
49.82  231.07   4.27 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-06 Thread Yehuda Sadeh Weinraub
On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote:
 On 02/03/2012 05:03 PM, Yehuda Sadeh Weinraub wrote:

 On Fri, Feb 3, 2012 at 3:33 PM, Jim Schutt jasc...@sandia.gov wrote:



 You can try running 'iostat -t -kx -d 1' on the osds, and see whether
 %util
 reaches 100%, and when it happens whether it's due to number of io
 operations that are thrashing, or whether it's due to high amount of data.
 FWIW, you may try setting  'filestore flusher = false', and set
 /proc/sys/vm/dirty_background_ratio' to a small number (e.g., 1M).


 Here's some iostat data from early in a run, when things are
 running well:


 02/02/2012 09:14:13 AM
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          23.24    0.00   61.99    7.38    0.00    7.38

 Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
 avgqu-sz   await  svctm  %util
 sdc               0.00     0.00    0.00  206.00     0.00   101.57  1009.79
  54.80  251.27   4.86 100.10
 sdd               0.00     0.00    0.00  202.00     0.00    98.10   994.61
  27.85  132.42   4.96 100.10
 sde               0.00     4.00    0.00  212.00     0.00   105.09  1015.25
  96.06  588.43   4.72 100.10
 sdh               0.00     0.00    0.00  200.00     0.00    97.11   994.40
  69.77  535.01   5.00 100.10
 sdg               0.00     2.00    0.00  221.00     0.00   109.59  1015.60
  82.05  298.71   4.53 100.10
 sda               0.00     1.00    0.00  212.00     0.00    83.93   810.75
  18.26   84.82   4.68  99.30
 sdf               0.00     0.00    0.00  208.00     0.00   102.55  1009.73
  77.23  383.19   4.50  93.70
 sdb               0.00     0.00    0.00  205.00     0.00    98.66   985.68
  19.97  133.98   4.84  99.20
 sdj               0.00     0.00    0.00  202.00     0.00    99.59  1009.66
  69.97  257.47   4.95 100.00
 sdk               0.00     0.00    0.00  204.00     0.00    98.10   984.86
  20.83  100.34   4.87  99.30
 sdm               0.00     0.00    0.00  216.00     0.00   106.55  1010.22
  77.73  268.67   4.63 100.00
 sdn               0.00     0.00    0.00  205.00     0.00    98.60   985.05
  19.33   95.88   4.81  98.60
 sdo               0.00     0.00    0.00  232.00     0.00   106.25   937.93
  23.26   82.19   4.29  99.50
 sdl               0.00     0.00    0.00  181.00     0.00    85.12   963.09
  24.73  131.71   4.80  86.80
 sdp               0.00     4.00    0.00  207.00     0.00    87.41   864.77
  37.01  111.13   4.49  93.00
 sdi               0.00     0.00    0.00  208.00     0.00   103.04  1014.54
  72.30  263.72   4.70  97.70
 sdr               0.00     0.00    0.00  191.00     0.00    76.75   822.95
  11.51   83.69   4.59  87.60
 sds               0.00     0.00    0.00  209.00     0.00   101.91   998.58
  49.95  278.08   4.70  98.20
 sdt               0.00     0.00    0.00  209.00     0.00    99.57   975.69
  27.31  157.44   4.79 100.10
 sdu               0.00     0.00    0.00  216.00     0.00   107.09  1015.41
  79.82  345.88   4.63 100.10
 sdw               0.00     0.00    0.00  208.00     0.00   103.09  1015.00
  74.55  308.15   4.81 100.10
 sdv               0.00     0.00    0.00  201.00     0.00    98.05   999.08
  76.87  265.88   4.98 100.10
 sdx               0.00     0.00    0.00  202.00     0.00   100.50  1018.93
 110.40  327.68   4.96 100.10
 sdq               0.00     0.00    0.00  228.00     0.00   112.59  1011.30
  54.84  281.04   4.39 100.10

 02/02/2012 09:14:14 AM
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.11    0.00   54.03   15.38    0.00    8.48

 Device:    rrqm/s  wrqm/s   r/s     w/s   rMB/s    wMB/s  avgrq-sz  avgqu-sz   await  svctm   %util
 sdc          0.00    0.00  0.00  233.00    0.00    99.68    876.15     95.98  384.42   4.29  100.00
 sdd          0.00    0.00  0.00  205.00    0.00    96.64    965.46     20.37  108.51   4.84   99.30
 sde          0.00    0.00  0.00  225.00    0.00    99.54    906.03     92.38  420.67   4.44  100.00
 sdh          0.00    0.00  0.00  198.00    0.00    97.05   1003.84     79.39  410.56   5.05  100.00
 sdg          0.00    0.00  0.00  245.00    0.00   108.38    905.99     84.40  385.47   4.08  100.00
 sda          0.00    4.00  0.00  220.00    0.00    96.23    895.78     63.24  294.59   4.44   97.60
 sdf          0.00    0.00  0.00  216.00    0.00   107.09   1015.41     87.67  399.14   4.57   98.80
 sdb          0.00    0.00  0.00  156.00    0.00    72.05    945.95     11.61   58.94   4.84   75.50
 sdj          0.00    0.00  0.00  199.00    0.00    95.41    981.95     56.28  366.11   4.84   96.40
 sdk          0.00    0.00  0.00  206.00    0.00   100.14    995.57     54.69  241.41   4.86  100.10
 sdm          0.00    0.00  0.00  200.00    0.00    99.09   1014.72     79.51  506.47   4.74   94.70
 sdn          0.00    0.00  0.00  191.00    0.00    91.29    978.81     26.82  128.39   5.18   98.90
 sdo          0.00    0.00  0.00  234.00

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-06 Thread Jim Schutt

On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:

On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote:




The above suggests to me that the slowdown is a result
of requests not getting submitted at the same rate as
when things are running well.



Yeah, it really looks like that. My suggestions wouldn't help there.

I do see that when things go well the number of writes per device is
capped at ~200 writes per second and the throughput per device is
~100MB/sec. Is 100MB/sec the expected device throughput?


Pretty much, at least for the outer tracks on a drive.  I've seen
~108 MB/s with dd to a block device.  Also, I've got 8 drives per
SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
disk subsystem is any sort of significant bottleneck.

-- Jim




Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-06 Thread Gregory Farnum
On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt jasc...@sandia.gov wrote:
 On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:

 On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote:



 The above suggests to me that the slowdown is a result
 of requests not getting submitted at the same rate as
 when things are running well.


 Yeah, it really looks like that. My suggestions wouldn't help there.

 I do see that when things go well the number of writes per device is
 capped at ~200 writes per second and the throughput per device is
 ~100MB/sec. Is 100MB/sec the expected device throughput?


 Pretty much, at least for the outer tracks on a drive.  I've seen
 ~108 MB/s with dd to a block device.  Also, I've got 8 drives per
 SAS adapter with 6 Gb/s links, so it seems unlikely to me that my
 disk subsystem is any sort of significant bottleneck.

Well, you might try changing your throttling settings on the OSDs.
ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB)
and is used to limit the amount of client data in memory (i.e., messages
are included in this throttler for their entire lifetime, not just
while waiting for dispatch).
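
If you want to experiment with those, something along these lines in the
[osd] section of ceph.conf would do it (the values here are only
illustrative, roughly halving the defaults, not a recommendation):

    ms dispatch throttle bytes = 52428800     ; 50 MB (default is 100<<20)
    osd max client bytes = 262144000          ; 250 MB (default is 500<<20)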


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-03 Thread Jim Schutt

On 02/02/2012 05:28 PM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt jasc...@sandia.gov wrote:

I found 0 instances of waiting for commit in all my OSD logs for my last
run.

So I never waited on the journal?


Looks like it. Interesting.



So far I'm looking at two behaviours I've noticed that seem anomalous to
me.

One is that I instrumented ms_dispatch(), and I see it take
a half-second or more several hundred times, out of several
thousand messages.  Is that expected?



How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)



Yep, I wrapped the whole function, and also instrumented taking osd_lock
while I was there.  About half the time that ms_dispatch() takes more than
0.5 seconds, taking osd_lock is responsible for the delay.  There's two
dispatch threads, one for ops and one for rep_ops, right?  So one's
waiting on the other?


There's just one main dispatcher; no split for the ops and rep_ops .
The reason for that dispatch_running is that if there are requests
waiting then the tick() function will run through them if the
messenger dispatch thread is currently idle.
But it is possible for the Messenger to try and dispatch, and for that
to be blocked while some amount of (usually trivial) work is being
done by a different thread, yes. I don't think we've ever observed it
being a problem for anything other than updating OSD maps, though...


Ah, OK.

I guess I was confused by my log output, e.g.:

osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch 
ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch 
ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.

I need to go study that code some more





Another is that once a message receive starts, I see ~50 messages
that take tens of seconds to receive, when the nominal receive time is
a half-second or less.  I'm in the process of tooling up to collect
tcpdump data on all my clients to try to catch what is going on with
that.



Again, how are you instrumenting that?



I post-process the logs, looking at the time difference between
reader got .* policy throttler and reader got .* osd_op(client.


I guess the logging output must have changed a bit at some pointer (or
was that one of your patches?). master has reader wants not reader
got for the policy throttler. (Just got a little confused when
checking the code.)


Yep, I added an extra message to make post-processing logs easier, sorry.




When I find a candidate message, I grep the log for just that reader thread,
and see, e.g., this:

osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000
seq 828/828 waiters 157/149 for src client.4301 tid=247
osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
41944358/
osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader got front 128
osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172

Note the ~2 minute delay (and ~300,000 lines of logging) between the first
and second reads.

During that time 129 sockets were processed - what makes sd=215 special?


Hrm. Well, you can try turning up the messenger debugging to 30 and
taking advantage of the reader reading reader read pair right
around tcp_read_nonblocking.


OK, I'll give that a try as well, thanks.



I've added tracepoints in my client kernel try_write(), and nothing seems
unusual (that's with running the patch to ceph_write_space() I posted
earlier):

 kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
tid 179 seq 3 sent 4194304
 kworker/0:2-1790  

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-03 Thread Gregory Farnum
On Feb 3, 2012, at 8:18 AM, Jim Schutt jasc...@sandia.gov wrote:

 On 02/02/2012 05:28 PM, Gregory Farnum wrote:
 On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt jasc...@sandia.gov wrote:
 I found 0 instances of waiting for commit in all my OSD logs for my last
 run.

 So I never waited on the journal?

 Looks like it. Interesting.


 So far I'm looking at two behaviours I've noticed that seem anomalous to
 me.

 One is that I instrumented ms_dispatch(), and I see it take
 a half-second or more several hundred times, out of several
 thousand messages.  Is that expected?


 How did you instrument it? If you wrapped the whole function it's
 possible that those longer runs are actually chewing through several
 messages that had to get waitlisted for some reason previously.
 (That's the call to do_waiters().)


 Yep, I wrapped the whole function, and also instrumented taking osd_lock
 while I was there.  About half the time that ms_dispatch() takes more than
 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
 dispatch threads, one for ops and one for rep_ops, right?  So one's
 waiting on the other?

 There's just one main dispatcher; no split for the ops and rep_ops .
 The reason for that dispatch_running is that if there are requests
 waiting then the tick() function will run through them if the
 messenger dispatch thread is currently idle.
 But it is possible for the Messenger to try and dispatch, and for that
 to be blocked while some amount of (usually trivial) work is being
 done by a different thread, yes. I don't think we've ever observed it
 being a problem for anything other than updating OSD maps, though...

 Ah, OK.

 I guess I was confused by my log output, e.g.:

D'oh. Sorry, you confused me with your reference to repops, which
aren't special-cased or anything. But there are two messengers on the
OSD, each with their own dispatch thread. One of those messengers is
for clients and one is for other OSDs.

And now that you point that out, I wonder if the problem is lack of
Cond signaling in ms_dispatch. I'm on my phone right now but I believe
there's a chunk of commented-out code (why commented instead of
deleted? I don't know) that we want to uncomment for reasons that will
become clear when you look at it. :)
Try that and see how many of your problems disappear?



 osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 
 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
 osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 
 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680

 I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.

 I need to go study that code some more



 Another is that once a message receive starts, I see ~50 messages
 that take tens of seconds to receive, when the nominal receive time is
 a half-second or less.  I'm in the process of tooling up to collect
 tcpdump data on all my clients to try to catch what is going on with
 that.


 Again, how are you instrumenting that?


 I post-process the logs, looking at the time difference between
 reader got .* policy throttler and reader got .* osd_op(client.

 I guess the logging output must have changed a bit at some pointer (or
 was that one of your patches?). master has reader wants not reader
 got for the policy throttler. (Just got a little confused when
 checking the code.)

 Yep, I added an extra message to make post-processing logs easier, sorry.


 When I find a candidate message, I grep the log for just that reader thread,
 and see, e.g., this:

 osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000
 seq 828/828 waiters 157/149 for src client.4301 tid=247
 osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
 41944358/
 osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader got front 128
 osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
 osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
 osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
 osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-03 Thread Sage Weil
On Fri, 3 Feb 2012, Jim Schutt wrote:
 On 02/02/2012 05:28 PM, Gregory Farnum wrote:
  On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt jasc...@sandia.gov wrote:
   I found 0 instances of waiting for commit in all my OSD logs for my last
   run.
   
   So I never waited on the journal?
  
  Looks like it. Interesting.
  
  
 So far I'm looking at two behaviours I've noticed that seem anomalous
 to
 me.
 
 One is that I instrumented ms_dispatch(), and I see it take
 a half-second or more several hundred times, out of several
 thousand messages.  Is that expected?


How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)
   
   
   Yep, I wrapped the whole function, and also instrumented taking osd_lock
   while I was there.  About half the time that ms_dispatch() takes more than
   0.5 seconds, taking osd_lock is responsible for the delay.  There's two
   dispatch threads, one for ops and one for rep_ops, right?  So one's
   waiting on the other?
  
  There's just one main dispatcher; no split for the ops and rep_ops .
  The reason for that dispatch_running is that if there are requests
  waiting then the tick() function will run through them if the
  messenger dispatch thread is currently idle.
  But it is possible for the Messenger to try and dispatch, and for that
  to be blocked while some amount of (usually trivial) work is being
  done by a different thread, yes. I don't think we've ever observed it
  being a problem for anything other than updating OSD maps, though...
 
 Ah, OK.
 
 I guess I was confused by my log output, e.g.:
 
 osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch
 ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400
 osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch
 ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680
 
 I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads.
 
 I need to go study that code some more

Oh... they are separate threads.  In the OSD's case two different 
messengers (the public and cluster ones) are wired up to the same 
dispatcher (OSD::ms_dispatch).  MOSDOps come in on the public thread, 
MOSDSubOps on the cluster one, but they're both fed to the same function.  
That's why there's some funkiness going on in, say, handle_osd_map().

But like Greg said, I don't think we've seen any significant latencies 
there except from map processing.  If you have a log, that would be 
interesting to look at!

sage


 
  
  
 Another is that once a message receive starts, I see ~50 messages
 that take tens of seconds to receive, when the nominal receive time is
 a half-second or less.  I'm in the process of tooling up to collect
 tcpdump data on all my clients to try to catch what is going on with
 that.


Again, how are you instrumenting that?
   
   
   I post-process the logs, looking at the time difference between
   reader got .* policy throttler and reader got .* osd_op(client.
  
  I guess the logging output must have changed a bit at some pointer (or
  was that one of your patches?). master has reader wants not reader
  got for the policy throttler. (Just got a little confused when
  checking the code.)
 
 Yep, I added an extra message to make post-processing logs easier, sorry.
 
  
   When I find a candidate message, I grep the log for just that reader
   thread,
   and see, e.g., this:
   
   osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader got 2670720 from policy throttler
   48809510/5000
   seq 828/828 waiters 157/149 for src client.4301 tid=247
   osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
   41944358/
   osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader got front 128
   osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
   osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
   osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
   172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680
   sd=215
   pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
   osd.0.log:1561092:2012-02-02 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Jim Schutt

(resent because I forgot the list on my original reply)

On 02/01/2012 03:33 PM, Gregory Farnum wrote:

On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt jasc...@sandia.gov wrote:

Hi,

FWIW, I've been trying to understand op delays under very heavy write
load, and have been working a little with the policy throttler in hopes of
using throttling delays to help track down which ops were backing up.
Without much success, unfortunately.

When I saw the wip-osd-op-tracking branch, I wondered if any of this
stuff might be helpful.  Here it is, just in case.


In general these patches are dumping information to the logs, and part
of the wip-osd-op-tracking branch is actually keeping track of most of
the message queueing wait times as part of the message itself
(although not the information about number of waiters and sleep/wake
seqs). I'm inclined to prefer that approach to log dumping.


I agree - I've just been using log dumping because I can extract
any relationships I can write a perl script to find :)  So far,
not too helpful.


Are there any patches you recommend for merging? I'm a little curious
about the ordered wakeup one — do you have data about when that's a
problem?


I've been trying to push the client:osd ratio, and in my testbed
I can run up to 166 linux clients. Right now I'm running them
against 48 OSDs.  The clients are 1 Gb/s ethernet, and the OSDs
have a 10 Gb/s ethernet for clients and another for the cluster.

During sustained write loads I see a factor of 10 oscillation
in aggregate throughput, and during that time I see clients
stuck in the policy throttler for hundreds of seconds, and I
see a number of waiters equal to
  number of clients - (throttler limit) / (msg size)
If I do a histogram of throttler wait times I see a handful of
messages that wait for an extra couple hundred seconds
without the ordered wakeup.

I'm not sure what this will look like if my throughput
variations can be fixed.  But, for our HPC loads I expect
we'll often see periods where offered load is much higher
that aggregate bandwidth of any system we can afford to
build, so ordered wakeup may be useful in such cases for
client fairness.

So I'd recommend the ordered wakeup patch if you don't
see any downsides.

Sorry for the noise on the others - mostly I just wanted
to share the sort of things I've been looking at.  I'll
be learning to use your new stuff soon...

-- Jim


-Greg







Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Gregory Farnum
On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:
 I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
 per OSD.  During a test I watch both OSD servers with both
 vmstat and iostat.

 During a good period, vmstat says the server is sustaining > 2 GB/s
 for multiple tens of seconds.  Since I use replication factor 2, that
 means that server is sustaining > 500 MB/s aggregate client throughput,
 right?  During such a period vmstat also reports ~10% CPU idle.

 During a bad period, vmstat says the server is doing ~200 MB/s,
 with lots of idle cycles.  It is during these periods that
 messages stuck in the policy throttler build up such long
 wait times.  Sometimes I see really bad periods with aggregate
 throughput per server < 100 MB/s.

 The typical pattern I see is that a run starts with tens of seconds
 of aggregate throughput > 2 GB/s.  Then it drops and bounces around
 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
 it ramps back up near 2 GB/s again.

Hmm. 100MB/s is awfully low for this theory, but have you tried to
correlate the drops in throughput with the OSD journals running out of
space? I assume from your setup that they're sharing the disk with the
store (although it works either way), and your description makes me
think that throughput is initially constrained by sequential journal
writes but then the journal runs out of space and the OSD has to wait
for the main store to catch up (with random IO), and that sends the IO
patterns all to hell. (If you can say that random 4MB IOs are
hellish.)
I'm also curious about memory usage as a possible explanation for the
more dramatic drops.
-Greg


Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Jim Schutt

On 02/02/2012 10:52 AM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:

I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
per OSD.  During a test I watch both OSD servers with both
vmstat and iostat.

During a good period, vmstat says the server is sustaining > 2 GB/s
for multiple tens of seconds.  Since I use replication factor 2, that
means that server is sustaining > 500 MB/s aggregate client throughput,
right?  During such a period vmstat also reports ~10% CPU idle.

During a bad period, vmstat says the server is doing ~200 MB/s,
with lots of idle cycles.  It is during these periods that
messages stuck in the policy throttler build up such long
wait times.  Sometimes I see really bad periods with aggregate
throughput per server < 100 MB/s.

The typical pattern I see is that a run starts with tens of seconds
of aggregate throughput > 2 GB/s.  Then it drops and bounces around
500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
it ramps back up near 2 GB/s again.


Hmm. 100MB/s is awfully low for this theory, but have you tried to
correlate the drops in throughput with the OSD journals running out of
space?


 A spot check of logs from my last run doesn't seem to have any
 "journal throttle: waited for" messages during a slowdown.
Is that what you mean?

 During the fast part of a run I see lots of journal messages
with this pattern:

2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops 
and 50346596 bytes, now 22 ops and 90041106 bytes
2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops 
and 50346596 bytes, now 21 ops and 85845571 bytes

which I think means my journal is doing 50 MB/s, right?


I assume from your setup that they're sharing the disk with the
store (although it works either way),


I've got a 4 GB journal partition on the outer tracks of the disk.


and your description makes me
think that throughput is initially constrained by sequential journal
writes but then the journal runs out of space and the OSD has to wait
for the main store to catch up (with random IO), and that sends the IO
patterns all to hell. (If you can say that random 4MB IOs are
hellish.)


iostat 1 during the fast part of a run shows both journal and data
partitions running at 45-50 MB/s.  During the slow part of a run
they both show similar but low data rates.


I'm also curious about memory usage as a possible explanation for the
more dramatic drops.


My OSD servers have 48 GB memory.  During a run I rarely see less than
24 GB used by the page cache, with the rest mostly used by anonymous memory.
I don't run with any swap.

So far I'm looking at two behaviours I've noticed that seem anomalous to me.

One is that I instrumented ms_dispatch(), and I see it take
a half-second or more several hundred times, out of several
thousand messages.  Is that expected?

Another is that once a message receive starts, I see ~50 messages
that take tens of seconds to receive, when the nominal receive time is
a half-second or less.  I'm in the process of tooling up to collect
tcpdump data on all my clients to try to catch what is going on with that.

Any other ideas on what to look for would be greatly appreciated.

-- Jim


-Greg







Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Sage Weil
On Thu, 2 Feb 2012, Jim Schutt wrote:
 On 02/02/2012 10:52 AM, Gregory Farnum wrote:
  On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:
   I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
   per OSD.  During a test I watch both OSD servers with both
   vmstat and iostat.
   
    During a good period, vmstat says the server is sustaining > 2 GB/s
    for multiple tens of seconds.  Since I use replication factor 2, that
    means that server is sustaining > 500 MB/s aggregate client throughput,
   right?  During such a period vmstat also reports ~10% CPU idle.
   
   During a bad period, vmstat says the server is doing ~200 MB/s,
   with lots of idle cycles.  It is during these periods that
   messages stuck in the policy throttler build up such long
   wait times.  Sometimes I see really bad periods with aggregate
    throughput per server < 100 MB/s.
   
   The typical pattern I see is that a run starts with tens of seconds
    of aggregate throughput > 2 GB/s.  Then it drops and bounces around
   500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
   it ramps back up near 2 GB/s again.
  
  Hmm. 100MB/s is awfully low for this theory, but have you tried to
  correlate the drops in throughput with the OSD journals running out of
  space?
 
 A spot check of logs from my last run doesn't seem to have any
 journal throttle: waited for messages during a slowdown.
 Is that what you mean?
 
 During the fast part of I run I see lots of journal messages
 with this pattern:
 
 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
 and 50346596 bytes, now 22 ops and 90041106 bytes
 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
 and 50346596 bytes, now 21 ops and 85845571 bytes

It occurs to me that part of the problem may be the current sync 
io behavior in the journal.  It ends up doing really big writes, which 
makes things bursty, and will get stuff blocked up behind the throttler.  
You might try making 'journal max write bytes' smaller?  Hmm, although 
it's currently 10MB, which isn't too bad.  So unless you've changed it 
from the default, that's probably not it.

 which I think means my journal is doing 50 MB/s, right?
 
  I assume from your setup that they're sharing the disk with the
  store (although it works either way),
 
 I've got a 4 GB journal partition on the outer tracks of the disk.

This is on the same disk as the osd data?  As an experiment, you could 
try putting the journal on a separate disk (half disks for journals, half 
for data).  That's obviously not what you want in the real world, but it 
would be interesting to see if contention for the spindle is responsible 
for this.
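
For that experiment, pointing 'osd journal' at a partition on a different
spindle is all it takes; roughly something like this per OSD (device names
and paths here are just placeholders):

    [osd.0]
        osd data = /srv/osd.0        ; object store stays on its own disk
        osd journal = /dev/sdq1      ; journal partition on a second disk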

 So far I'm looking at two behaviours I've noticed that seem anomalous to me.
 
 One is that I instrumented ms_dispatch(), and I see it take
 a half-second or more several hundred times, out of several
 thousand messages.  Is that expected?

I don't think so, but it could happen if there's cpu/memory contention, or in 
the case of osdmap updates where we block on io in that thread.

 Another is that once a message receive starts, I see ~50 messages
 that take tens of seconds to receive, when the nominal receive time is
 a half-second or less.  I'm in the process of tooling up to collect
 tcpdump data on all my clients to try to catch what is going on with that.
 
 Any other ideas on what to look for would be greatly appreciated.

I'd rule out the journal+data on same disk as the source of pain first.  
If that's what's going on, we can take a closer look at specifically how 
to make it behave better!

sage


Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Gregory Farnum
On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt jasc...@sandia.gov wrote:
 On 02/02/2012 10:52 AM, Gregory Farnum wrote:

 On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:
 The typical pattern I see is that a run starts with tens of seconds
 of aggregate throughput > 2 GB/s.  Then it drops and bounces around
 500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
 it ramps back up near 2 GB/s again.


 Hmm. 100MB/s is awfully low for this theory, but have you tried to
 correlate the drops in throughput with the OSD journals running out of
 space?


 A spot check of logs from my last run doesn't seem to have any
 journal throttle: waited for messages during a slowdown.
 Is that what you mean?

I'd expect to see those, yes, but I actually meant the on-disk journal
itself getting full. I believe that should result in output like:
write_thread_entry full, going to sleep (waiting for commit)
...although I now notice that's a much higher log level (20) than the
other messages (1/5).

 During the fast part of I run I see lots of journal messages
 with this pattern:

 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
 and 50346596 bytes, now 22 ops and 90041106 bytes
 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
 and 50346596 bytes, now 21 ops and 85845571 bytes

 which I think means my journal is doing 50 MB/s, right?

Generally, yes — although that'll also pop up if the store manages to
commit faster than the journal (unlikely). :)

 and your description makes me
 think that throughput is initially constrained by sequential journal
 writes but then the journal runs out of space and the OSD has to wait
 for the main store to catch up (with random IO), and that sends the IO
 patterns all to hell. (If you can say that random 4MB IOs are
 hellish.)


 iostat 1 during the fast part of a run shows both journal and data
 partitions running at 45-50 MB/s.  During the slow part of a run
 they both show similar but low data rates.

All right. That's actually not that surprising; random 4MB writes are
pretty nice to a modern drive.

 I'm also curious about memory usage as a possible explanation for the
 more dramatic drops.

 My OSD servers have 48 GB memory.  During a run I rarely see less than
 24 GB used by the page cache, with the rest mostly used by anonymous memory.
 I don't run with any swap.

 So far I'm looking at two behaviours I've noticed that seem anomalous to me.

 One is that I instrumented ms_dispatch(), and I see it take
 a half-second or more several hundred times, out of several
 thousand messages.  Is that expected?

How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)

 Another is that once a message receive starts, I see ~50 messages
 that take tens of seconds to receive, when the nominal receive time is
 a half-second or less.  I'm in the process of tooling up to collect
 tcpdump data on all my clients to try to catch what is going on with that.

Again, how are you instrumenting that?

-Greg


Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Jim Schutt

On 02/02/2012 12:15 PM, Sage Weil wrote:

On Thu, 2 Feb 2012, Jim Schutt wrote:

On 02/02/2012 10:52 AM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:

I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive
per OSD.  During a test I watch both OSD servers with both
vmstat and iostat.

During a good period, vmstat says the server is sustaining > 2 GB/s
for multiple tens of seconds.  Since I use replication factor 2, that
means that server is sustaining > 500 MB/s aggregate client throughput,
right?  During such a period vmstat also reports ~10% CPU idle.

During a bad period, vmstat says the server is doing ~200 MB/s,
with lots of idle cycles.  It is during these periods that
messages stuck in the policy throttler build up such long
wait times.  Sometimes I see really bad periods with aggregate
throughput per server < 100 MB/s.

The typical pattern I see is that a run starts with tens of seconds
of aggregate throughput > 2 GB/s.  Then it drops and bounces around
500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
it ramps back up near 2 GB/s again.


Hmm. 100MB/s is awfully low for this theory, but have you tried to
correlate the drops in throughput with the OSD journals running out of
space?


A spot check of logs from my last run doesn't seem to have any
journal throttle: waited for messages during a slowdown.
Is that what you mean?

During the fast part of I run I see lots of journal messages
with this pattern:

2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
and 50346596 bytes, now 22 ops and 90041106 bytes
2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
and 50346596 bytes, now 21 ops and 85845571 bytes


It occurs to me that part of the problem may be the current sync
io behavior in the journal.  It ends up doing really big writes, which
makes things bursty, and will get stuff blocked up behind the throttler.
You might try making 'journal max write bytes' smaller?  Hmm, although
it's currently 10MB, which isn't too bad.  So unless you've changed it
from the default, that's probably not it.


I have changed it recently, but I was seeing this type of
behaviour before making that change.

FWIW, here's my current non-standard tunings.  I'm using
these because the standard ones work even worse for me
on this test case:

osd op threads = 48
filestore queue max ops = 16
osd client message size cap = 5000
client oc max dirty = 5000
journal max write bytes = 5000
ms dispatch throttle bytes =  
client oc size =  1
journal queue max bytes = 12500
filestore queue max bytes =   12500
objector inflight op bytes =  2


FWIW, turning down filestore queue max ops and turning up osd op threads
made the biggest positive impact on my performance levels on this test.

With default values for those I was seeing much worse stalling behaviour.




which I think means my journal is doing 50 MB/s, right?


I assume from your setup that they're sharing the disk with the
store (although it works either way),


I've got a 4 GB journal partition on the outer tracks of the disk.


This is on the same disk as the osd data?  As an experiement, you could
try putting the journal on a separate disk (half disks for journals, half
for data).  That's obviously not what you want in the real world, but it
would be interesting to see if contention for the spindle is responsible
for this.


So far I'm looking at two behaviours I've noticed that seem anomalous to me.

One is that I instrumented ms_dispatch(), and I see it take
a half-second or more several hundred times, out of several
thousand messages.  Is that expected?


I don't think so, but it could happen if the cpu/memory contention, or in
the case of osdmap updates where we block on io in that thread.


Hmmm.  Would those go into the filestore op queue?

I didn't think to check ms_dispatch() ET until after I had tuned
down filestore queue max ops.  Is it worth checking this?




Another is that once 

Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Jim Schutt

On 02/02/2012 12:32 PM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt jasc...@sandia.gov wrote:

On 02/02/2012 10:52 AM, Gregory Farnum wrote:


On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:

The typical pattern I see is that a run starts with tens of seconds
of aggregate throughput > 2 GB/s.  Then it drops and bounces around
500 - 1000 MB/s, with occasional excursions under 100 MB/s.  Then
it ramps back up near 2 GB/s again.



Hmm. 100MB/s is awfully low for this theory, but have you tried to
correlate the drops in throughput with the OSD journals running out of
space?



A spot check of logs from my last run doesn't seem to have any
journal throttle: waited for messages during a slowdown.
Is that what you mean?


I'd expect to see those, yes, but I actually meant the on-disk journal
itself getting full. I believe that should result in output like:
 write_thread_entry full, going to sleep (waiting for commit)
...although I now notice that's a much higher log level (20) than the
other messages (1/5).


So I've been running OSDs with
debug osd = 20
debug journal = 20   ; local journaling
debug filestore = 20 ; local object storage
debug objector = 20
debug ms = 20
debug = 1

I found 0 instances of "waiting for commit" in all my OSD logs for my last run.

So I never waited on the journal?




During the fast part of I run I see lots of journal messages
with this pattern:

2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
and 50346596 bytes, now 22 ops and 90041106 bytes
2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
and 50346596 bytes, now 21 ops and 85845571 bytes

which I think means my journal is doing 50 MB/s, right?


Generally, yes — although that'll also pop up if the store manages to
commit faster than the journal (unlikely). :)


and your description makes me
think that throughput is initially constrained by sequential journal
writes but then the journal runs out of space and the OSD has to wait
for the main store to catch up (with random IO), and that sends the IO
patterns all to hell. (If you can say that random 4MB IOs are
hellish.)



iostat 1 during the fast part of a run shows both journal and data
partitions running at 45-50 MB/s.  During the slow part of a run
they both show similar but low data rates.


All right. That's actually not that surprising; random 4MB writes are
pretty nice to a modern drive.


I'm also curious about memory usage as a possible explanation for the
more dramatic drops.


My OSD servers have 48 GB memory.  During a run I rarely see less than
24 GB used by the page cache, with the rest mostly used by anonymous memory.
I don't run with any swap.

So far I'm looking at two behaviours I've noticed that seem anomalous to me.

One is that I instrumented ms_dispatch(), and I see it take
a half-second or more several hundred times, out of several
thousand messages.  Is that expected?


How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)


Yep, I wrapped the whole function, and also instrumented taking osd_lock
while I was there.  About half the time that ms_dispatch() takes more than
0.5 seconds, taking osd_lock is responsible for the delay.  There's two
dispatch threads, one for ops and one for rep_ops, right?  So one's
waiting on the other?




Another is that once a message receive starts, I see ~50 messages
that take tens of seconds to receive, when the nominal receive time is
a half-second or less.  I'm in the process of tooling up to collect
tcpdump data on all my clients to try to catch what is going on with that.


Again, how are you instrumenting that?


I post-process the logs, looking at the time difference between
reader got .* policy throttler and reader got .* osd_op(client.

When I find a candidate message, I grep the log for just that reader thread,
and see, e.g., this:

osd.0.log:1280693:2012-02-02 09:17:57.704508 

Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Jim Schutt

On 02/02/2012 01:22 PM, Jim Schutt wrote:

On 02/02/2012 12:32 PM, Gregory Farnum wrote:

On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt jasc...@sandia.gov wrote:

On 02/02/2012 10:52 AM, Gregory Farnum wrote:


On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote:

The typical pattern I see is that a run starts with tens of seconds
of aggregate throughput > 2 GB/s. Then it drops and bounces around
500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then
it ramps back up near 2 GB/s again.



Hmm. 100MB/s is awfully low for this theory, but have you tried to
correlate the drops in throughput with the OSD journals running out of
space?



A spot check of logs from my last run doesn't seem to have any
journal throttle: waited for messages during a slowdown.
Is that what you mean?


I'd expect to see those, yes, but I actually meant the on-disk journal
itself getting full. I believe that should result in output like:
write_thread_entry full, going to sleep (waiting for commit)
...although I now notice that's a much higher log level (20) than the
other messages (1/5).


So I've been running OSDs with
debug osd = 20
debug journal = 20 ; local journaling
debug filestore = 20 ; local object storage
debug objector = 20
debug ms = 20
debug = 1

I found 0 instances of waiting for commit in all my OSD logs for my last run.

So I never waited on the journal?




During the fast part of I run I see lots of journal messages
with this pattern:

2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops
and 50346596 bytes, now 22 ops and 90041106 bytes
2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes
2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes
2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes
2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes
2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes
2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes
2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes
2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes
2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes
2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes
2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes
2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops
and 50346596 bytes, now 21 ops and 85845571 bytes

which I think means my journal is doing 50 MB/s, right?


Generally, yes — although that'll also pop up if the store manages to
commit faster than the journal (unlikely). :)


and your description makes me
think that throughput is initially constrained by sequential journal
writes but then the journal runs out of space and the OSD has to wait
for the main store to catch up (with random IO), and that sends the IO
patterns all to hell. (If you can say that random 4MB IOs are
hellish.)



iostat 1 during the fast part of a run shows both journal and data
partitions running at 45-50 MB/s. During the slow part of a run
they both show similar but low data rates.


All right. That's actually not that surprising; random 4MB writes are
pretty nice to a modern drive.


I'm also curious about memory usage as a possible explanation for the
more dramatic drops.


My OSD servers have 48 GB memory. During a run I rarely see less than
24 GB used by the page cache, with the rest mostly used by anonymous memory.
I don't run with any swap.

So far I'm looking at two behaviours I've noticed that seem anomalous to me.

One is that I instrumented ms_dispatch(), and I see it take
a half-second or more several hundred times, out of several
thousand messages. Is that expected?


How did you instrument it? If you wrapped the whole function it's
possible that those longer runs are actually chewing through several
messages that had to get waitlisted for some reason previously.
(That's the call to do_waiters().)


Yep, I wrapped the whole function, and also instrumented taking osd_lock
while I was there. About half the time that ms_dispatch() takes more than
0.5 seconds, taking osd_lock is responsible for the delay. There's two
dispatch threads, one for ops and one for rep_ops, right? So one's
waiting on the other?




Another is that once a message receive starts, I see ~50 messages
that take tens of seconds to receive, when the nominal receive time is
a half-second or less. I'm in the process of tooling up to collect
tcpdump data on all my clients to try to catch what is going on with that.


Again, how are you instrumenting that?


I post-process the logs, looking at the time difference between
reader got .* policy throttler and reader got .* osd_op(client.

When I find a candidate message, I grep the log for just that reader thread,
and see, e.g., this:

osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 

Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-02 Thread Gregory Farnum
On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt jasc...@sandia.gov wrote:
 I found 0 instances of waiting for commit in all my OSD logs for my last
 run.

 So I never waited on the journal?

Looks like it. Interesting.


 So far I'm looking at two behaviours I've noticed that seem anomalous to
 me.

 One is that I instrumented ms_dispatch(), and I see it take
 a half-second or more several hundred times, out of several
 thousand messages.  Is that expected?


 How did you instrument it? If you wrapped the whole function it's
 possible that those longer runs are actually chewing through several
 messages that had to get waitlisted for some reason previously.
 (That's the call to do_waiters().)


 Yep, I wrapped the whole function, and also instrumented taking osd_lock
 while I was there.  About half the time that ms_dispatch() takes more than
 0.5 seconds, taking osd_lock is responsible for the delay.  There's two
 dispatch threads, one for ops and one for rep_ops, right?  So one's
 waiting on the other?

There's just one main dispatcher; no split for the ops and rep_ops.
The reason for that dispatch_running is that if there are requests
waiting then the tick() function will run through them if the
messenger dispatch thread is currently idle.
But it is possible for the Messenger to try and dispatch, and for that
to be blocked while some amount of (usually trivial) work is being
done by a different thread, yes. I don't think we've ever observed it
being a problem for anything other than updating OSD maps, though...
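
In rough outline (just a sketch of the shape of that logic, not the actual
OSD code), the interaction between the dispatch thread and tick() looks
something like this:

    // Sketch only: one dispatch entry point, a waiting list, and a tick()
    // that drains the waiting list when the dispatch thread is idle.
    #include <list>
    #include <mutex>

    struct Message {};                    // stand-in for the real message type

    class Dispatcher {
      std::mutex osd_lock;
      bool dispatch_running = false;
      std::list<Message*> waiting;        // requests that had to be waitlisted
                                          // (e.g., waiting for a newer map)

      void handle(Message*) { /* per-message work */ }

      void do_waiters() {                 // called with osd_lock held
        while (!waiting.empty()) {
          handle(waiting.front());
          waiting.pop_front();
        }
      }

    public:
      void ms_dispatch(Message *m) {      // messenger dispatch thread
        std::lock_guard<std::mutex> l(osd_lock);
        dispatch_running = true;
        do_waiters();                     // chew through anything waitlisted earlier
        handle(m);
        dispatch_running = false;
      }

      void tick() {                       // periodic timer
        std::lock_guard<std::mutex> l(osd_lock);
        if (!dispatch_running)            // only if the dispatch thread is idle
          do_waiters();
      }
    };

So a long ms_dispatch() call can mean one message doing a lot of work, or it
can mean the thread spent that time draining waiters, or simply blocked
taking osd_lock behind another thread.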


 Another is that once a message receive starts, I see ~50 messages
 that take tens of seconds to receive, when the nominal receive time is
 a half-second or less.  I'm in the process of tooling up to collect
 tcpdump data on all my clients to try to catch what is going on with
 that.


 Again, how are you instrumenting that?


 I post-process the logs, looking at the time difference between
 reader got .* policy throttler and reader got .* osd_op(client.

I guess the logging output must have changed a bit at some point (or
was that one of your patches?). master has "reader wants" not "reader
got" for the policy throttler. (Just got a little confused when
checking the code.)

 When I find a candidate message, I grep the log for just that reader thread,
 and see, e.g., this:

 osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000
 seq 828/828 waiters 157/149 for src client.4301 tid=247
 osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler
 41944358/
 osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader got front 128
 osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0
 osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592
 osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620
 osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 --
 172.17.131.32:6800/14974  172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172

 Note the ~2 minute delay (and ~300,000 lines of logging) between the first
 and second reads.

 During that time 129 sockets were processed - what makes sd=215 special?

Hrm. Well, you can try turning up the messenger debugging to 30 and
taking advantage of the "reader reading" / "reader read" pair right
around tcp_read_nonblocking.
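
(That just means bumping the messenger debug level in the debug block you
already have, e.g.:

    debug ms = 30    ; up from 20; very verbose

and re-running.)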

 I've added tracepoints in my client kernel try_write(), and nothing seems
 unusual (that's with running the patch to ceph_write_space() I posted
 earlier):

     kworker/0:2-1790  [000]  1543.200887: ceph_try_write_msg_done: peer osd0
 tid 179 seq 3 sent 4194304
     kworker/0:2-1790  [000]  1543.200901: ceph_prepare_write_msg: peer osd0
 tid 207 seq 4 sent 0
*snip*
     kworker/0:2-1790  [000]  1569.078614: ceph_try_write_msg_done: peer osd0
 tid 207 seq 4 sent 4194304

 There's a 25 second gap at 1543.236256, but nothing like the
 100 second gap in the reader.

 Hence, tcpdump seems like a good idea?

You do bring us interesting problems! Let us know what info you come up with.

Oh, and I keep forgetting to ask: what does the write workload look
like? At first I assumed this was a CephFS workload, but given that
you're changing 

[RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-01 Thread Jim Schutt
Hi,

FWIW, I've been trying to understand op delays under very heavy write
load, and have been working a little with the policy throttler in hopes of
using throttling delays to help track down which ops were backing up.
Without much success, unfortunately.

When I saw the wip-osd-op-tracking branch, I wondered if any of this
stuff might be helpful.  Here it is, just in case.

-- Jim

Jim Schutt (6):
  msgr: print message sequence number and tid when receiving message
envelope
  common/Throttle: track sleep/wake sequences in Throttle, report them
for policy throttler
  common/Throttle: throttle in FIFO order
  common/Throttle: FIFO throttler doesn't need to signal waiters when
max changes
  common/Throttle: make get() report number of waiters on entry/exit
  msg: log Message interactions with throttler

 src/common/Throttle.h  |   75 +++-
 src/msg/Message.h  |   71 +++--
 src/msg/SimpleMessenger.cc |   22 +---
 3 files changed, 129 insertions(+), 39 deletions(-)
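
The core idea of the FIFO change is just ordered wakeups, keyed by the
sleep/wake sequences.  A minimal standalone sketch of that idea (not the
patch itself, and ignoring details like requests larger than max) looks
something like:

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class FifoThrottle {
      std::mutex m;
      std::condition_variable cv;
      const uint64_t max;
      uint64_t cur = 0;
      uint64_t next_ticket = 0;  // "sleep" sequence handed out on arrival
      uint64_t serving = 0;      // "wake" sequence: ticket allowed to proceed

    public:
      explicit FifoThrottle(uint64_t max_bytes) : max(max_bytes) {}

      void get(uint64_t bytes) {
        std::unique_lock<std::mutex> l(m);
        const uint64_t ticket = next_ticket++;
        // Wait until it's our turn in line *and* there is room.
        cv.wait(l, [&] { return ticket == serving && cur + bytes <= max; });
        cur += bytes;
        ++serving;               // hand the line over to the next ticket
        cv.notify_all();
      }

      void put(uint64_t bytes) {
        std::lock_guard<std::mutex> l(m);
        cur -= bytes;
        cv.notify_all();         // space freed: head-of-line waiter re-checks
      }
    };

The point is that an early, large request can't be starved by a stream of
later, smaller requests that would otherwise keep fitting into the available
space ahead of it.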




Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2012-02-01 Thread Gregory Farnum
On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt jasc...@sandia.gov wrote:
 Hi,

 FWIW, I've been trying to understand op delays under very heavy write
 load, and have been working a little with the policy throttler in hopes of
 using throttling delays to help track down which ops were backing up.
 Without much success, unfortunately.

 When I saw the wip-osd-op-tracking branch, I wondered if any of this
 stuff might be helpful.  Here it is, just in case.

In general these patches are dumping information to the logs, and part
of the wip-osd-op-tracking branch is actually keeping track of most of
the message queueing wait times as part of the message itself
(although not the information about number of waiters and sleep/wake
seqs). I'm inclined to prefer that approach to log dumping.
Are there any patches you recommend for merging? I'm a little curious
about the ordered wakeup one — do you have data about when that's a
problem?
-Greg