Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Sage,

On 02/26/2013 12:36 PM, Sage Weil wrote:
> On Tue, 26 Feb 2013, Jim Schutt wrote:
>>> I think the right solution is to make an option that will setsockopt on
>>> SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this,
>>> wip-tcp. Do you mind checking to see if this addresses the issue (without
>>> manually adjusting things in /proc)?
>>
>> I'll be happy to test it out...
>
> That would be great!  It's branch wip-tcp, and the setting is 'ms tcp rcvbuf'.

I've verified that I can reproduce the slowdown with the default value of 1
for /proc/sys/net/ipv4/tcp_moderate_rcvbuf, and 'ms tcp rcvbuf' at 0. I've
also verified that I could not reproduce any slowdown when I configure
'ms tcp rcvbuf' to 256 KiB on OSDs.

So, that's great news - sorry for the delay in testing.

Also, FWIW, I ended up testing with commits cb15e6e0f4 and c346282940
cherry-picked on top of next as of a day or so ago (commit f58601d681), as for
some reason wip-tcp wouldn't work for me - ceph-mon was non-responsive in some
way I didn't dig into.

-- Jim
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Thu, 28 Feb 2013, Jim Schutt wrote:
> Hi Sage,
>
> On 02/26/2013 12:36 PM, Sage Weil wrote:
>> On Tue, 26 Feb 2013, Jim Schutt wrote:
>>>> I think the right solution is to make an option that will setsockopt on
>>>> SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this,
>>>> wip-tcp. Do you mind checking to see if this addresses the issue (without
>>>> manually adjusting things in /proc)?
>>>
>>> I'll be happy to test it out...
>>
>> That would be great!  It's branch wip-tcp, and the setting is 'ms tcp rcvbuf'.
>
> I've verified that I can reproduce the slowdown with the default value of 1
> for /proc/sys/net/ipv4/tcp_moderate_rcvbuf, and 'ms tcp rcvbuf' at 0. I've
> also verified that I could not reproduce any slowdown when I configure
> 'ms tcp rcvbuf' to 256 KiB on OSDs.
>
> So, that's great news - sorry for the delay in testing.

Awesome--thanks so much for testing that!  Pulling it into master now.

> Also, FWIW, I ended up testing with commits cb15e6e0f4 and c346282940
> cherry-picked on top of next as of a day or so ago (commit f58601d681), as
> for some reason wip-tcp wouldn't work for me - ceph-mon was non-responsive
> in some way I didn't dig into.

Yeah, sorry about that.  I rebased wip-tcp a few days ago but you may have
picked up the previous version.

sage
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Sage,

On 02/20/2013 05:12 PM, Sage Weil wrote:
> Hi Jim,
>
> I'm resurrecting an ancient thread here, but: we've just observed this on
> another big cluster and remembered that this hasn't actually been fixed.

Sorry for the delayed reply - I missed this in a backlog of unread email...

> I think the right solution is to make an option that will setsockopt on
> SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this,
> wip-tcp. Do you mind checking to see if this addresses the issue (without
> manually adjusting things in /proc)?

I'll be happy to test it out...

> And perhaps we should consider making this default to 256KB...

That's the value I've been using with my /proc adjustments since I figured
out what was going on. My servers use a 10 GbE port for each of the cluster
and public networks, with cephfs clients using 1 GbE, and I've not detected
any issues resulting from that value. So, it seems like a decent starting
point for a default...

-- Jim
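[For reference, here is a minimal sketch of what an option like 'ms tcp rcvbuf' amounts to at the socket level. This is not Ceph's messenger code; the helper name and structure are illustrative. On Linux, setting SO_RCVBUF explicitly has the side effect of disabling receive-buffer autotuning for that socket, which is why a fixed value avoids the /proc adjustments discussed above.

#include <sys/socket.h>
#include <cstdio>

// Illustrative helper (name and return convention are assumptions).
static int set_rcvbuf(int fd, int bytes)
{
  if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
    std::perror("setsockopt(SO_RCVBUF)");
    return -1;
  }
  // The kernel roughly doubles the requested value to cover bookkeeping
  // overhead; read it back to see what was actually granted.
  int actual = 0;
  socklen_t len = sizeof(actual);
  if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
    std::printf("requested %d, kernel granted %d\n", bytes, actual);
  return 0;
}

int main()
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { std::perror("socket"); return 1; }
  set_rcvbuf(fd, 256 * 1024);   // the 256KB value discussed in this thread
  return 0;
}
]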
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Tue, 26 Feb 2013, Jim Schutt wrote:
>> I think the right solution is to make an option that will setsockopt on
>> SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this,
>> wip-tcp. Do you mind checking to see if this addresses the issue (without
>> manually adjusting things in /proc)?
>
> I'll be happy to test it out...

That would be great!  It's branch wip-tcp, and the setting is 'ms tcp rcvbuf'.

Thanks!
sage
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Jim, I'm resurrecting an ancient thread here, but: we've just observed this on another big cluster and remembered that this hasn't actually been fixed. I think the right solution is to make an option that will setsockopt on SO_RECVBUF to some value (say, 256KB). I pushed a branch that does this, wip-tcp. Do you mind checking to see if this addresses the issue (without manually adjusting things in /proc)? And perhaps we should consider making this default to 256KB... Thanks! sage On Fri, 24 Feb 2012, Jim Schutt wrote: On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.gov wrote: I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive per OSD. During a test I watch both OSD servers with both vmstat and iostat. During a good period, vmstat says the server is sustaining 2 GB/s for multiple tens of seconds. Since I use replication factor 2, that means that server is sustaining 500 MB/s aggregate client throughput, right? During such a period vmstat also reports ~10% CPU idle. During a bad period, vmstat says the server is doing ~200 MB/s, with lots of idle cycles. It is during these periods that messages stuck in the policy throttler build up such long wait times. Sometimes I see really bad periods with aggregate throughput per server 100 MB/s. The typical pattern I see is that a run starts with tens of seconds of aggregate throughput 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? I assume from your setup that they're sharing the disk with the store (although it works either way), and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.) I'm also curious about memory usage as a possible explanation for the more dramatic drops. I've finally figured out what is going on with this behaviour. Memory usage was on the right track. It turns out to be an unfortunate interaction between the number of OSDs/server, number of clients, TCP socket buffer autotuning, the policy throttler, and limits on the total memory used by the TCP stack (net/ipv4/tcp_mem sysctl). What happens is that for throttled reader threads, the TCP stack will continue to receive data as long as there is available socket buffer, and the sender has data to send. As each reader thread receives successive messages, the TCP socket buffer autotuning increases the size of the socket buffer. Eventually, due to the number of OSDs per server and the number of clients trying to write, all the memory the TCP stack is allowed by net/ipv4/tcp_mem to use is consumed by the socket buffers of throttled reader threads. When this happens, TCP processing is affected to the point that the TCP stack cannot send ACKs on behalf of the reader threads that aren't throttled. At that point the OSD stalls until the TCP retransmit count on some connection is exceeded, causing it to be reset. Since my OSD servers don't run anything else, the simplest solution for me is to turn off socket buffer autotuning (net/ipv4/tcp_moderate_rcvbuf), and set the default socket buffer size to something reasonable. 
256k seems to be working well for me right now. -- Jim
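[As an aside, the knobs named in the explanation above are plain sysctls and can be inspected directly. A small stand-alone program, purely illustrative and not part of Ceph, that dumps the relevant values:

#include <fstream>
#include <iostream>
#include <string>

static void show(const std::string& path)
{
  std::ifstream f(path);
  std::string line;
  if (f && std::getline(f, line))
    std::cout << path << ": " << line << "\n";
  else
    std::cout << path << ": (unreadable)\n";
}

int main()
{
  // 1 = kernel autotunes each socket's receive buffer up to the tcp_rmem max
  show("/proc/sys/net/ipv4/tcp_moderate_rcvbuf");
  // min / default / max per-socket receive buffer, in bytes
  show("/proc/sys/net/ipv4/tcp_rmem");
  // low / pressure / high thresholds for total TCP memory, in pages
  show("/proc/sys/net/ipv4/tcp_mem");
  return 0;
}
]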
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.gov wrote: I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive per OSD. During a test I watch both OSD servers with both vmstat and iostat. During a good period, vmstat says the server is sustaining 2 GB/s for multiple tens of seconds. Since I use replication factor 2, that means that server is sustaining 500 MB/s aggregate client throughput, right? During such a period vmstat also reports ~10% CPU idle. During a bad period, vmstat says the server is doing ~200 MB/s, with lots of idle cycles. It is during these periods that messages stuck in the policy throttler build up such long wait times. Sometimes I see really bad periods with aggregate throughput per server 100 MB/s. The typical pattern I see is that a run starts with tens of seconds of aggregate throughput 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? I assume from your setup that they're sharing the disk with the store (although it works either way), and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.) I'm also curious about memory usage as a possible explanation for the more dramatic drops. I've finally figured out what is going on with this behaviour. Memory usage was on the right track. It turns out to be an unfortunate interaction between the number of OSDs/server, number of clients, TCP socket buffer autotuning, the policy throttler, and limits on the total memory used by the TCP stack (net/ipv4/tcp_mem sysctl). What happens is that for throttled reader threads, the TCP stack will continue to receive data as long as there is available socket buffer, and the sender has data to send. As each reader thread receives successive messages, the TCP socket buffer autotuning increases the size of the socket buffer. Eventually, due to the number of OSDs per server and the number of clients trying to write, all the memory the TCP stack is allowed by net/ipv4/tcp_mem to use is consumed by the socket buffers of throttled reader threads. When this happens, TCP processing is affected to the point that the TCP stack cannot send ACKs on behalf of the reader threads that aren't throttled. At that point the OSD stalls until the TCP retransmit count on some connection is exceeded, causing it to be reset. Since my OSD servers don't run anything else, the simplest solution for me is to turn off socket buffer autotuning (net/ipv4/tcp_moderate_rcvbuf), and set the default socket buffer size to something reasonable. 256k seems to be working well for me right now. -- Jim -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Fri, Feb 24, 2012 at 07:38, Jim Schutt <jasc...@sandia.gov> wrote:
> I've finally figured out what is going on with this behaviour. Memory usage
> was on the right track. It turns out to be an unfortunate interaction
> between the number of OSDs/server, number of clients, TCP socket buffer
> autotuning, the policy throttler, and limits on the total memory used by
> the TCP stack (net/ipv4/tcp_mem sysctl).
>
> What happens is that for throttled reader threads, the TCP stack will
> continue to receive data as long as there is available socket buffer, and
> the sender has data to send.

Ohh! Yes, if the userspace stops reading a socket, kernel will buffer data as
per SO_RCVBUF etc. And TCP has global limits, and that is going to push it
uncomfortably close to the global limit.

Ceph *could* manipulate SO_RCVBUF size at the time it decides to throttle a
client, that would limit the TCP buffer space consumed by throttled clients
(except for a race where the data got received before Ceph called
setsockopt). I recall seeing a trick like that pulled off somewhere, but I
can't find an example right now.

Or perhaps we just say sorry your server is swamped with too much work for
the resources it's given; you need more of them. That's not nice though, when
throttling can slow down the non-throttled connections.
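[A rough sketch of the clamp-SO_RCVBUF-while-throttled idea described above, using hypothetical names -- this is not Ceph's messenger API. The race noted above still applies: data the kernel accepted into the larger buffer before setsockopt() runs already counts against tcp_mem.

#include <sys/socket.h>

static void set_rcvbuf(int fd, int bytes)
{
  setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

// Hypothetical per-connection hook points.
struct ThrottledConnection {
  int fd;
  static const int kNormalRcvbuf    = 256 * 1024;
  static const int kThrottledRcvbuf = 16 * 1024;

  // Called when the policy throttler blocks this connection's reader:
  // shrink the buffer so the kernel stops absorbing data on its behalf.
  void on_throttle_wait()    { set_rcvbuf(fd, kThrottledRcvbuf); }

  // Called when throttle budget is available again.
  void on_throttle_release() { set_rcvbuf(fd, kNormalRcvbuf); }
};

int main()
{
  ThrottledConnection c{ socket(AF_INET, SOCK_STREAM, 0) };
  c.on_throttle_wait();
  c.on_throttle_release();
  return 0;
}
]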
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
I created ticket http://tracker.newdream.net/issues/2100 for this.

On Fri, Feb 24, 2012 at 10:31, Tommi Virtanen
<tommi.virta...@dreamhost.com> wrote:
> On Fri, Feb 24, 2012 at 07:38, Jim Schutt <jasc...@sandia.gov> wrote:
>> I've finally figured out what is going on with this behaviour. Memory
>> usage was on the right track. It turns out to be an unfortunate
>> interaction between the number of OSDs/server, number of clients, TCP
>> socket buffer autotuning, the policy throttler, and limits on the total
>> memory used by the TCP stack (net/ipv4/tcp_mem sysctl).
>>
>> What happens is that for throttled reader threads, the TCP stack will
>> continue to receive data as long as there is available socket buffer,
>> and the sender has data to send.
>
> Ohh! Yes, if the userspace stops reading a socket, kernel will buffer data
> as per SO_RCVBUF etc. And TCP has global limits, and that is going to push
> it uncomfortably close to the global limit.
>
> Ceph *could* manipulate SO_RCVBUF size at the time it decides to throttle
> a client, that would limit the TCP buffer space consumed by throttled
> clients (except for a race where the data got received before Ceph called
> setsockopt). I recall seeing a trick like that pulled off somewhere, but I
> can't find an example right now.
>
> Or perhaps we just say sorry your server is swamped with too much work for
> the resources it's given; you need more of them. That's not nice though,
> when throttling can slow down the non-throttled connections.
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/10/2012 05:05 PM, sridhar basam wrote: But the server never ACKed that packet. Too busy? I was collecting vmstat data during the run; here's the important bits: Fri Feb 10 11:56:51 MST 2012 vmstat -w 8 16 procs ---memory-- ---swap-- -io --system-- -cpu--- r b swpd free buff cache si sobibo in cs us sy id wa st 13 10 0 250272944 3785908000 7 5346 1098 444 2 5 92 1 0 88 8 0 260472944 3672877600 0 1329838 257602 68861 19 73 5 4 0 100 10 0 241952944 36066536 0 0 0 1635891 340724 85570 22 68 6 4 0 105 9 0 250288944 34750820 0 0 0 1584816 433223 111462 21 73 4 3 0 126 3 0 259908944 3384169600 0 749648 225707 86716 9 83 4 3 0 157 2 0 245032944 31572536 0 0 0 736841 252406 99083 9 81 5 5 0 45 17 0 246720944 2887764000 1 755085 282177 116551 8 77 9 5 0 Holy crap! That might explain why you aren't seeing anything. You are writing out over a 1.6 million blocks/sec. That too averaged over a 8 second interval. I bet the missed acks are when this is happening. What sort of I/O load is going through this system during those times? What sort of filesystem and Linux system are these OSDs on? Dual socket Nehalem EP @ 3 GHz, 24 ea. 7200RPM SAS drives w/ 64 MB cache, 3 LSI SAS HBAs w/8 drives per HBA, btrfs, 3.2.0 kernel. Each OSD has a ceph journal and a ceph data store on a single drive. I'm running 24 OSDs on such a box; all that write load is the result of dd from 166 linux ceph clients. FWIW, I've seen these boxes sustain 2 GB/s for 60 sec or so under this load, when I have TSO/GSO/GRO turned on, and am writing to a freshly created ceph filesystem. That lasts until my OSDs get stalled reading from a socket, as documented by those packet traces I posted. If you compare the timestamps on the retransmits to the times that vmstat is dumping reports, at least some of the retransmits hit the system when it is ~80% idle. -- Jim Sridhar -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
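[For scale, the "over a 1.6 million blocks/sec" figure is read straight off vmstat's bo column; assuming vmstat's usual 1 KiB block unit, that 8-second-averaged sample works out to roughly

\[ 1\,635\,891 \ \tfrac{\text{blocks}}{\text{s}} \times 1\ \text{KiB} \approx 1.6\ \text{GiB/s} \]

of block-device writes on one server, which matches the ~2 GB/s bursts described elsewhere in the thread.]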
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/09/2012 06:26 PM, sridhar basam wrote:
> Do you mind capturing to a pcap file and providing that. Makes it easier to
> analyse things. If not, i understand. If you can make do the capture on
> both ends, do it with a snaplen of 68 so that you get all of the headers
> and there shouldn't be too much payload information in the file.

I've got a pcap file for this run for this client, with snaplen 128 (I
thought I might need to look for ceph message headers). It's 13 MB
compressed. How shall I get it to you?

In the meantime, I'll try to capture this from both sides.

> I will take a look at the additional output and see if anything pops out.
> I am assuming the below output was immediately after what you posted in
> your earlier email.

Yes.

Thanks -- Jim
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Fri, Feb 10, 2012 at 10:32 AM, Jim Schutt <jasc...@sandia.gov> wrote:
> On 02/09/2012 06:26 PM, sridhar basam wrote:
>> Do you mind capturing to a pcap file and providing that. Makes it easier
>> to analyse things. If not, i understand. If you can make do the capture
>> on both ends, do it with a snaplen of 68 so that you get all of the
>> headers and there shouldn't be too much payload information in the file.
>
> I've got a pcap file for this run for this client, with snaplen 128 (I
> thought I might need to look for ceph message headers). It's 13 MB
> compressed. How shall I get it to you?

Can i grab it off some webserver you control? Or you can temporarily drop it
into docs.google.com and add access from my email account?

> In the meantime, I'll try to capture this from both sides.
>
>> I will take a look at the additional output and see if anything pops out.
>> I am assuming the below output was immediately after what you posted in
>> your earlier email.

i don't see anything out of the ordinary once things recover, the sender even
starts to do TSO after a short while.

Sridhar
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
[ added Cc:netdev See http://www.spinics.net/lists/ceph-devel/msg04804.html for the start of the thread. -- Jim ] On 02/10/2012 10:13 AM, sridhar basam wrote: On Fri, Feb 10, 2012 at 10:32 AM, Jim Schuttjasc...@sandia.gov wrote: On 02/09/2012 06:26 PM, sridhar basam wrote: Do you mind capturing to a pcap file and providing that. Makes it easier to analyse things. If not, i understand. If you can make do the capture on both ends, do it with a snaplen of 68 so that you get all of the headers and there shouldn't be too much payload information in the file. I've got a pcap file for this run for this client, with snaplen 128 (I thought I might need to look for ceph message headers). It's 13 MB compressed. How shall I get it to you? Can i grab it off some webserver you control? Or you can temporarily drop it into docs.google.com and add accesss from my email account? I tabled this for the moment while I worked on collecting packet traces from both ends. But you'll probably want to see the pcap files for what I'm about to show. Also, I think I need to add netdev to this discussion ... In the meantime, I'll try to capture this from both sides. Here's another example, captured from both sides, with TSO/GSO/GRO all off, snaplen 68. This was captured from the client side. Same pattern, in that the client sends many retransmits over a period of a couple minutes. It's different in that the client seems to give up and reconnect ... 11:57:35.984024 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18109890, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984032 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984038 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984120 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18112786, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984129 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984135 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984143 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18115682, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984148 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984153 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984270 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18118578, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984278 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984283 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984420 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18120026, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984428 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:36.184945 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:36.587936 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 
11:57:37.393937 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:39.003937 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:42.227933 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:48.675931 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:58:01.555935 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:58:27.347945 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:59:18.867935 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 12:00:22.673029 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [P.], seq 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp] 12:00:22.712933 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], ack 1036, win 19832, options [nop,nop,TS[|tcp] 12:01:02.035951 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 1036, win 19832, options [nop,nop,TS[|tcp]
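[Reading the timestamps off the trace above, the gaps between successive retransmissions of seq 18120026:18121474 are approximately

\[ \Delta t \approx 0.40,\ 0.81,\ 1.61,\ 3.22,\ 6.45,\ 12.88,\ 25.79,\ 51.52\ \text{s}, \]

each one double the last -- the standard exponential backoff of TCP's retransmission timer while the peer sends nothing. The connection only moves again when the OSD side finally transmits at 12:00:22.]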
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Fri, Feb 10, 2012 at 6:09 PM, Jim Schutt jasc...@sandia.gov wrote: [ added Cc:netdev See http://www.spinics.net/lists/ceph-devel/msg04804.html for the start of the thread. -- Jim ] On 02/10/2012 10:13 AM, sridhar basam wrote: On Fri, Feb 10, 2012 at 10:32 AM, Jim Schuttjasc...@sandia.gov wrote: On 02/09/2012 06:26 PM, sridhar basam wrote: Do you mind capturing to a pcap file and providing that. Makes it easier to analyse things. If not, i understand. If you can make do the capture on both ends, do it with a snaplen of 68 so that you get all of the headers and there shouldn't be too much payload information in the file. I've got a pcap file for this run for this client, with snaplen 128 (I thought I might need to look for ceph message headers). It's 13 MB compressed. How shall I get it to you? Can i grab it off some webserver you control? Or you can temporarily drop it into docs.google.com and add accesss from my email account? I tabled this for the moment while I worked on collecting packet traces from both ends. But you'll probably want to see the pcap files for what I'm about to show. Also, I think I need to add netdev to this discussion ... In the meantime, I'll try to capture this from both sides. Here's another example, captured from both sides, with TSO/GSO/GRO all off, snaplen 68. This was captured from the client side. Same pattern, in that the client sends many retransmits over a period of a couple minutes. It's different in that the client seems to give up and reconnect ... 11:57:35.984024 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18109890, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984032 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18112786:18114234, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984038 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18114234:18115682, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984120 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18112786, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984129 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18115682:18117130, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984135 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18117130:18118578, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984143 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18115682, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984148 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18118578:18120026, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984153 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984270 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18118578, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984278 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18121474:18122922, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984283 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18122922:18124370, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:35.984420 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [.], ack 18120026, win 5792, options [nop,nop,TS[|tcp] 11:57:35.984428 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18124370:18125818, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:36.184945 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:36.587936 IP 172.17.135.4.57589 172.17.131.32.6807: Flags 
[.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:37.393937 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:39.003937 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:42.227933 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:57:48.675931 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:58:01.555935 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:58:27.347945 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 11:59:18.867935 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], seq 18120026:18121474, ack 863, win 18760, options [nop,nop,TS[|tcp] 12:00:22.673029 IP 172.17.131.32.6807 172.17.135.4.57589: Flags [P.], seq 863:1036, ack 18120026, win 5792, options [nop,nop,TS[|tcp] 12:00:22.712933 IP 172.17.135.4.57589 172.17.131.32.6807: Flags [.], ack 1036, win 19832, options [nop,nop,TS[|tcp] 12:01:02.035951 IP
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/06/2012 11:35 AM, Gregory Farnum wrote: On Mon, Feb 6, 2012 at 10:20 AM, Jim Schuttjasc...@sandia.gov wrote: On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote: On Mon, Feb 6, 2012 at 8:20 AM, Jim Schuttjasc...@sandia.govwrote: The above suggests to me that the slowdown is a result of requests not getting submitted at the same rate as when things are running well. Yeah, it really looks like that. My suggestions wouldn't help there. I do see that when things go well the number of writes per device is capped at ~200 writes per second and the throughput per device is ~100MB/sec. Is 100MB/sec the expected device throughput? Pretty much, at least for the outer tracks on a drive. I've seen ~108 MB/s with dd to a block device. Also, I've got 8 drives per SAS adapter with 6 Gb/s links, so it seems unlikely to me that my disk subsystem is any sort of significant bottleneck. Well, you might try changing your throttling settings on the OSDs. ms_dispatch_throttle_bytes defaults to 10020 (100MB) and is used for throttling dispatch; osd_max_client_bytes defaults to 50020 (500MB) and is used to limit the amount of client data in memory (ie; messages are included in this throttler for their entire lifetime, not just while waiting for dispatch). I've made a little progress isolating this. osd client message size cap = 500 makes the stall completely reproducible (which also means I can reproduce on two different network types, ethernet and IPoIB.), and I am able to generate graphs of throttled/receive/process time for each request received by an OSD (see attached SVG plot). Such plots suggest to me my problem is caused by stalled receives. Using debug ms = 30 on my OSDs turns up instances of this: osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500 osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader got front 128 osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304 osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304 osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073 osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073 which I take to mean that the reader thread sat in poll() for 56 secs, in this case. I was able to correlate such stalls with tcpdump output collected on clients. 
Here's an example from another run: 15:09:37.584600 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0 15:09:37.584613 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.584655 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.624476 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0 15:09:37.624489 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.624532 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.664454 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0 15:09:37.664468 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856 15:09:37.664506 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856 15:09:37.706937 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0 15:09:37.706950
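[To make the "reader wants ... from dispatch throttler" and policy-throttler log lines above easier to follow, here is an illustrative byte-budget throttler. It is a minimal sketch, not Ceph's actual Throttle class: the point is that a reader thread blocked in get() stops calling recv() on its socket, so the kernel keeps absorbing inbound data for that connection up to its receive-buffer limit.

#include <condition_variable>
#include <cstdint>
#include <mutex>

class ByteThrottle {                 // hypothetical name
  std::mutex m_;
  std::condition_variable cv_;
  uint64_t max_;
  uint64_t cur_ = 0;
public:
  explicit ByteThrottle(uint64_t max) : max_(max) {}

  // Called by a reader thread before it receives a message payload of
  // 'bytes'.  Blocks until the budget has room; while blocked, that reader
  // is not draining its socket.  (A real implementation also has to cope
  // with requests larger than the whole budget.)
  void get(uint64_t bytes) {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [&] { return cur_ + bytes <= max_; });
    cur_ += bytes;
  }

  // Called once the message has been dispatched/processed.
  void put(uint64_t bytes) {
    std::lock_guard<std::mutex> l(m_);
    cur_ -= bytes;
    cv_.notify_all();
  }
};

int main()
{
  ByteThrottle policy_throttle(100 << 20);  // e.g. a 100MB budget
  policy_throttle.get(4194304);             // a 4MB message, as in the logs
  policy_throttle.put(4194304);
  return 0;
}
]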
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/09/2012 03:40 PM, sridhar basam wrote: On Thu, Feb 9, 2012 at 3:53 PM, Jim Schuttjasc...@sandia.gov wrote: On 02/06/2012 11:35 AM, Gregory Farnum wrote: On Mon, Feb 6, 2012 at 10:20 AM, Jim Schuttjasc...@sandia.govwrote: On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote: On Mon, Feb 6, 2012 at 8:20 AM, Jim Schuttjasc...@sandia.gov wrote: The above suggests to me that the slowdown is a result of requests not getting submitted at the same rate as when things are running well. Yeah, it really looks like that. My suggestions wouldn't help there. I do see that when things go well the number of writes per device is capped at ~200 writes per second and the throughput per device is ~100MB/sec. Is 100MB/sec the expected device throughput? Pretty much, at least for the outer tracks on a drive. I've seen ~108 MB/s with dd to a block device. Also, I've got 8 drives per SAS adapter with 6 Gb/s links, so it seems unlikely to me that my disk subsystem is any sort of significant bottleneck. Well, you might try changing your throttling settings on the OSDs. ms_dispatch_throttle_bytes defaults to 10020 (100MB) and is used for throttling dispatch; osd_max_client_bytes defaults to 50020 (500MB) and is used to limit the amount of client data in memory (ie; messages are included in this throttler for their entire lifetime, not just while waiting for dispatch). I've made a little progress isolating this. osd client message size cap = 500 makes the stall completely reproducible (which also means I can reproduce on two different network types, ethernet and IPoIB.), and I am able to generate graphs of throttled/receive/process time for each request received by an OSD (see attached SVG plot). Such plots suggest to me my problem is caused by stalled receives. Using debug ms = 30 on my OSDs turns up instances of this: osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500 osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader got front 128 osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304 osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304 osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073 osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073 which I take to mean that the reader thread sat in poll() for 56 secs, in this case. I was able to correlate such stalls with tcpdump output collected on clients. 
Here's an example from another run: 15:09:37.584600 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0 15:09:37.584613 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.584655 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.624476 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0 15:09:37.624489 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.624532 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.664454 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0 15:09:37.664468 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856 15:09:37.664506 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856 15:09:37.706937 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.],
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Thu, Feb 9, 2012 at 15:15, Jim Schutt <jasc...@sandia.gov> wrote:
> I suspect a bug in the stack, as at an application level I get the same
> sort of stalls whether I use IP over ethernet or IPoIB. I need to get
> traces for both cases to prove that it is the same stall...

Hi. I just wanted to confirm that what your tcpdump shows is a packet loss
between the client and the osd.

If this wasn't packet loss, you'd expect the TCP window size to drop to 0 --
not remain at 64k as it is in your dump -- and you'd expect to see ACKs that
don't advance the sequence number. Something like this:

16:25:16.914407 IP 127.0.0.1.60336 > 127.0.0.1.: Flags [P.], seq 90150:93094, ack 1, win 257, options [nop,nop,TS val 3732293 ecr 3732270], length 2944
16:25:16.914416 IP 127.0.0.1. > 127.0.0.1.60336: Flags [.], ack 93094, win 0, options [nop,nop,TS val 3732293 ecr 3732293], length 0
16:25:17.144409 IP 127.0.0.1.60336 > 127.0.0.1.: Flags [.], ack 1, win 257, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.144421 IP 127.0.0.1. > 127.0.0.1.60336: Flags [.], ack 93094, win 0, options [nop,nop,TS val 3732316 ecr 3732293], length 0
16:25:17.604409 IP 127.0.0.1.60336 > 127.0.0.1.: Flags [.], ack 1, win 257, options [nop,nop,TS val 3732362 ecr 3732316], length 0
16:25:17.604419 IP 127.0.0.1. > 127.0.0.1.60336: Flags [.], ack 93094, win 0, options [nop,nop,TS val 3732362 ecr 3732293], length 0

As pointed out by Sridhar, various TCP offload mechanisms (and firewalling!)
may make your tcpdump not see the underlying reality.

You might also be just actually losing packets, and the osd settings might,
perhaps, influence the performance of the machine enough to make it lose
packets -- though that sounds a bit far fetched.

You might also be suffering from a Path MTU Discovery black hole, and need
the osd size cap to get full-frame packets out. I see your tcpdump indicated
jumbo frames (at least until the TSO engine!), that might be its own source
of pain.
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Thu, Feb 9, 2012 at 6:15 PM, Jim Schutt jasc...@sandia.gov wrote: On 02/09/2012 03:40 PM, sridhar basam wrote: On Thu, Feb 9, 2012 at 3:53 PM, Jim Schuttjasc...@sandia.gov wrote: On 02/06/2012 11:35 AM, Gregory Farnum wrote: On Mon, Feb 6, 2012 at 10:20 AM, Jim Schuttjasc...@sandia.gov wrote: On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote: On Mon, Feb 6, 2012 at 8:20 AM, Jim Schuttjasc...@sandia.gov wrote: The above suggests to me that the slowdown is a result of requests not getting submitted at the same rate as when things are running well. Yeah, it really looks like that. My suggestions wouldn't help there. I do see that when things go well the number of writes per device is capped at ~200 writes per second and the throughput per device is ~100MB/sec. Is 100MB/sec the expected device throughput? Pretty much, at least for the outer tracks on a drive. I've seen ~108 MB/s with dd to a block device. Also, I've got 8 drives per SAS adapter with 6 Gb/s links, so it seems unlikely to me that my disk subsystem is any sort of significant bottleneck. Well, you might try changing your throttling settings on the OSDs. ms_dispatch_throttle_bytes defaults to 10020 (100MB) and is used for throttling dispatch; osd_max_client_bytes defaults to 50020 (500MB) and is used to limit the amount of client data in memory (ie; messages are included in this throttler for their entire lifetime, not just while waiting for dispatch). I've made a little progress isolating this. osd client message size cap = 500 makes the stall completely reproducible (which also means I can reproduce on two different network types, ethernet and IPoIB.), and I am able to generate graphs of throttled/receive/process time for each request received by an OSD (see attached SVG plot). Such plots suggest to me my problem is caused by stalled receives. Using debug ms = 30 on my OSDs turns up instances of this: osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader wants 4194432 from dispatch throttler 0/2500 osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader got front 128 osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x1656c000 len 4194304 osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1369231 of 4194304 osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader reading nonblocking into 0x166ba48f len 2825073 osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199 172.17.135.85:0/2712733083 pipe(0x2ef sd=173 pgs=7 cs=1 l=1).reader read 1448 of 2825073 which I take to mean that the reader thread sat in poll() for 56 secs, in this case. I was able to correlate such stalls with tcpdump output collected on clients. 
Here's an example from another run: 15:09:37.584600 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0 15:09:37.584613 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.584655 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr 1096144], length 31856 15:09:37.624476 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0 15:09:37.624489 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.624532 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [P.], seq 23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr 1096184], length 31856 15:09:37.664454 IP 172.17.131.32.6808 172.17.135.7.37045: Flags [.], ack 23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0 15:09:37.664468 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq 23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr 1096224], length 31856 15:09:37.664506 IP 172.17.135.7.37045 172.17.131.32.6808: Flags [.], seq
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/03/2012 05:03 PM, Yehuda Sadeh Weinraub wrote: On Fri, Feb 3, 2012 at 3:33 PM, Jim Schuttjasc...@sandia.gov wrote: You can try running 'iostat -t -kx -d 1' on the osds, and see whether %util reaches 100%, and when it happens whether it's due to number of io operations that are thrashing, or whether it's due to high amount of data. FWIW, you may try setting 'filestore flusher = false', and set /proc/sys/vm/dirty_background_ratio' to a small number (e.g., 1M). Here's some iostat data from early in a run, when things are running well: 02/02/2012 09:14:13 AM avg-cpu: %user %nice %system %iowait %steal %idle 23.240.00 61.997.380.007.38 Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 0.000.00 206.00 0.00 101.57 1009.79 54.80 251.27 4.86 100.10 sdd 0.00 0.000.00 202.00 0.0098.10 994.61 27.85 132.42 4.96 100.10 sde 0.00 4.000.00 212.00 0.00 105.09 1015.25 96.06 588.43 4.72 100.10 sdh 0.00 0.000.00 200.00 0.0097.11 994.40 69.77 535.01 5.00 100.10 sdg 0.00 2.000.00 221.00 0.00 109.59 1015.60 82.05 298.71 4.53 100.10 sda 0.00 1.000.00 212.00 0.0083.93 810.75 18.26 84.82 4.68 99.30 sdf 0.00 0.000.00 208.00 0.00 102.55 1009.73 77.23 383.19 4.50 93.70 sdb 0.00 0.000.00 205.00 0.0098.66 985.68 19.97 133.98 4.84 99.20 sdj 0.00 0.000.00 202.00 0.0099.59 1009.66 69.97 257.47 4.95 100.00 sdk 0.00 0.000.00 204.00 0.0098.10 984.86 20.83 100.34 4.87 99.30 sdm 0.00 0.000.00 216.00 0.00 106.55 1010.22 77.73 268.67 4.63 100.00 sdn 0.00 0.000.00 205.00 0.0098.60 985.05 19.33 95.88 4.81 98.60 sdo 0.00 0.000.00 232.00 0.00 106.25 937.93 23.26 82.19 4.29 99.50 sdl 0.00 0.000.00 181.00 0.0085.12 963.09 24.73 131.71 4.80 86.80 sdp 0.00 4.000.00 207.00 0.0087.41 864.77 37.01 111.13 4.49 93.00 sdi 0.00 0.000.00 208.00 0.00 103.04 1014.54 72.30 263.72 4.70 97.70 sdr 0.00 0.000.00 191.00 0.0076.75 822.95 11.51 83.69 4.59 87.60 sds 0.00 0.000.00 209.00 0.00 101.91 998.58 49.95 278.08 4.70 98.20 sdt 0.00 0.000.00 209.00 0.0099.57 975.69 27.31 157.44 4.79 100.10 sdu 0.00 0.000.00 216.00 0.00 107.09 1015.41 79.82 345.88 4.63 100.10 sdw 0.00 0.000.00 208.00 0.00 103.09 1015.00 74.55 308.15 4.81 100.10 sdv 0.00 0.000.00 201.00 0.0098.05 999.08 76.87 265.88 4.98 100.10 sdx 0.00 0.000.00 202.00 0.00 100.50 1018.93 110.40 327.68 4.96 100.10 sdq 0.00 0.000.00 228.00 0.00 112.59 1011.30 54.84 281.04 4.39 100.10 02/02/2012 09:14:14 AM avg-cpu: %user %nice %system %iowait %steal %idle 22.110.00 54.03 15.380.008.48 Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 0.000.00 233.00 0.0099.68 876.15 95.98 384.42 4.29 100.00 sdd 0.00 0.000.00 205.00 0.0096.64 965.46 20.37 108.51 4.84 99.30 sde 0.00 0.000.00 225.00 0.0099.54 906.03 92.38 420.67 4.44 100.00 sdh 0.00 0.000.00 198.00 0.0097.05 1003.84 79.39 410.56 5.05 100.00 sdg 0.00 0.000.00 245.00 0.00 108.38 905.99 84.40 385.47 4.08 100.00 sda 0.00 4.000.00 220.00 0.0096.23 895.78 63.24 294.59 4.44 97.60 sdf 0.00 0.000.00 216.00 0.00 107.09 1015.41 87.67 399.14 4.57 98.80 sdb 0.00 0.000.00 156.00 0.0072.05 945.95 11.61 58.94 4.84 75.50 sdj 0.00 0.000.00 199.00 0.0095.41 981.95 56.28 366.11 4.84 96.40 sdk 0.00 0.000.00 206.00 0.00 100.14 995.57 54.69 241.41 4.86 100.10 sdm 0.00 0.000.00 200.00 0.0099.09 1014.72 79.51 506.47 4.74 94.70 sdn 0.00 0.000.00 191.00 0.0091.29 978.81 26.82 128.39 5.18 98.90 sdo 0.00 0.000.00 234.00 0.00 106.75 934.32 49.82 231.07 4.27
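[A quick cross-check of one healthy sample above (sdc), taking iostat's avgrq-sz in its usual unit of 512-byte sectors:

\[ 206\ \tfrac{\text{writes}}{\text{s}} \times 1009.79\ \text{sectors} \times 512\ \text{B} \approx 105\ \text{MB/s}, \]

in line with the reported ~102 wMB/s and an average request of roughly 0.5 MB -- the drives are streaming large writes near their sequential limit rather than thrashing on small random I/O.]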
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt jasc...@sandia.gov wrote: On 02/03/2012 05:03 PM, Yehuda Sadeh Weinraub wrote: On Fri, Feb 3, 2012 at 3:33 PM, Jim Schuttjasc...@sandia.gov wrote: You can try running 'iostat -t -kx -d 1' on the osds, and see whether %util reaches 100%, and when it happens whether it's due to number of io operations that are thrashing, or whether it's due to high amount of data. FWIW, you may try setting 'filestore flusher = false', and set /proc/sys/vm/dirty_background_ratio' to a small number (e.g., 1M). Here's some iostat data from early in a run, when things are running well: 02/02/2012 09:14:13 AM avg-cpu: %user %nice %system %iowait %steal %idle 23.24 0.00 61.99 7.38 0.00 7.38 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 0.00 0.00 206.00 0.00 101.57 1009.79 54.80 251.27 4.86 100.10 sdd 0.00 0.00 0.00 202.00 0.00 98.10 994.61 27.85 132.42 4.96 100.10 sde 0.00 4.00 0.00 212.00 0.00 105.09 1015.25 96.06 588.43 4.72 100.10 sdh 0.00 0.00 0.00 200.00 0.00 97.11 994.40 69.77 535.01 5.00 100.10 sdg 0.00 2.00 0.00 221.00 0.00 109.59 1015.60 82.05 298.71 4.53 100.10 sda 0.00 1.00 0.00 212.00 0.00 83.93 810.75 18.26 84.82 4.68 99.30 sdf 0.00 0.00 0.00 208.00 0.00 102.55 1009.73 77.23 383.19 4.50 93.70 sdb 0.00 0.00 0.00 205.00 0.00 98.66 985.68 19.97 133.98 4.84 99.20 sdj 0.00 0.00 0.00 202.00 0.00 99.59 1009.66 69.97 257.47 4.95 100.00 sdk 0.00 0.00 0.00 204.00 0.00 98.10 984.86 20.83 100.34 4.87 99.30 sdm 0.00 0.00 0.00 216.00 0.00 106.55 1010.22 77.73 268.67 4.63 100.00 sdn 0.00 0.00 0.00 205.00 0.00 98.60 985.05 19.33 95.88 4.81 98.60 sdo 0.00 0.00 0.00 232.00 0.00 106.25 937.93 23.26 82.19 4.29 99.50 sdl 0.00 0.00 0.00 181.00 0.00 85.12 963.09 24.73 131.71 4.80 86.80 sdp 0.00 4.00 0.00 207.00 0.00 87.41 864.77 37.01 111.13 4.49 93.00 sdi 0.00 0.00 0.00 208.00 0.00 103.04 1014.54 72.30 263.72 4.70 97.70 sdr 0.00 0.00 0.00 191.00 0.00 76.75 822.95 11.51 83.69 4.59 87.60 sds 0.00 0.00 0.00 209.00 0.00 101.91 998.58 49.95 278.08 4.70 98.20 sdt 0.00 0.00 0.00 209.00 0.00 99.57 975.69 27.31 157.44 4.79 100.10 sdu 0.00 0.00 0.00 216.00 0.00 107.09 1015.41 79.82 345.88 4.63 100.10 sdw 0.00 0.00 0.00 208.00 0.00 103.09 1015.00 74.55 308.15 4.81 100.10 sdv 0.00 0.00 0.00 201.00 0.00 98.05 999.08 76.87 265.88 4.98 100.10 sdx 0.00 0.00 0.00 202.00 0.00 100.50 1018.93 110.40 327.68 4.96 100.10 sdq 0.00 0.00 0.00 228.00 0.00 112.59 1011.30 54.84 281.04 4.39 100.10 02/02/2012 09:14:14 AM avg-cpu: %user %nice %system %iowait %steal %idle 22.11 0.00 54.03 15.38 0.00 8.48 Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sdc 0.00 0.00 0.00 233.00 0.00 99.68 876.15 95.98 384.42 4.29 100.00 sdd 0.00 0.00 0.00 205.00 0.00 96.64 965.46 20.37 108.51 4.84 99.30 sde 0.00 0.00 0.00 225.00 0.00 99.54 906.03 92.38 420.67 4.44 100.00 sdh 0.00 0.00 0.00 198.00 0.00 97.05 1003.84 79.39 410.56 5.05 100.00 sdg 0.00 0.00 0.00 245.00 0.00 108.38 905.99 84.40 385.47 4.08 100.00 sda 0.00 4.00 0.00 220.00 0.00 96.23 895.78 63.24 294.59 4.44 97.60 sdf 0.00 0.00 0.00 216.00 0.00 107.09 1015.41 87.67 399.14 4.57 98.80 sdb 0.00 0.00 0.00 156.00 0.00 72.05 945.95 11.61 58.94 4.84 75.50 sdj 0.00 0.00 0.00 199.00 0.00 95.41 981.95 56.28 366.11 4.84 96.40 sdk 0.00 0.00 0.00 206.00 0.00 100.14 995.57 54.69 241.41 4.86 100.10 sdm 0.00 0.00 0.00 200.00 0.00 99.09 1014.72 79.51 506.47 4.74 94.70 sdn 0.00 0.00 0.00 191.00 0.00 91.29 978.81 26.82 128.39 5.18 98.90 sdo 0.00 0.00 0.00 234.00
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt <jasc...@sandia.gov> wrote:
>> The above suggests to me that the slowdown is a result of requests not
>> getting submitted at the same rate as when things are running well.
>
> Yeah, it really looks like that. My suggestions wouldn't help there. I do
> see that when things go well the number of writes per device is capped at
> ~200 writes per second and the throughput per device is ~100MB/sec. Is
> 100MB/sec the expected device throughput?

Pretty much, at least for the outer tracks on a drive. I've seen ~108 MB/s
with dd to a block device.

Also, I've got 8 drives per SAS adapter with 6 Gb/s links, so it seems
unlikely to me that my disk subsystem is any sort of significant bottleneck.

-- Jim
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt <jasc...@sandia.gov> wrote:
> On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:
>> On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt <jasc...@sandia.gov> wrote:
>>> The above suggests to me that the slowdown is a result of requests not
>>> getting submitted at the same rate as when things are running well.
>>
>> Yeah, it really looks like that. My suggestions wouldn't help there. I do
>> see that when things go well the number of writes per device is capped at
>> ~200 writes per second and the throughput per device is ~100MB/sec. Is
>> 100MB/sec the expected device throughput?
>
> Pretty much, at least for the outer tracks on a drive. I've seen ~108 MB/s
> with dd to a block device.
>
> Also, I've got 8 drives per SAS adapter with 6 Gb/s links, so it seems
> unlikely to me that my disk subsystem is any sort of significant bottleneck.

Well, you might try changing your throttling settings on the OSDs.
ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for
throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB) and is
used to limit the amount of client data in memory (ie; messages are included
in this throttler for their entire lifetime, not just while waiting for
dispatch).
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/02/2012 05:28 PM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 12:22 PM, Jim Schuttjasc...@sandia.gov wrote: I found 0 instances of waiting for commit in all my OSD logs for my last run. So I never waited on the journal? Looks like it. Interesting. So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? How did you instrument it? If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().) Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other? There's just one main dispatcher; no split for the ops and rep_ops . The reason for that dispatch_running is that if there are requests waiting then the tick() function will run through them if the messenger dispatch thread is currently idle. But it is possible for the Messenger to try and dispatch, and for that to be blocked while some amount of (usually trivial) work is being done by a different thread, yes. I don't think we've ever observed it being a problem for anything other than updating OSD maps, though... Ah, OK. I guess I was confused by my log output, e.g.: osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400 osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680 I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads. I need to go study that code some more Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Again, how are you instrumenting that? I post-process the logs, looking at the time difference between reader got .* policy throttler and reader got .* osd_op(client. I guess the logging output must have changed a bit at some pointer (or was that one of your patches?). master has reader wants not reader got for the policy throttler. (Just got a little confused when checking the code.) Yep, I added an extra message to make post-processing logs easier, sorry. 
When I find a candidate message, I grep the log for just that reader thread, and see, e.g., this: osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000 seq 828/828 waiters 157/149 for src client.4301 tid=247 osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/ osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128 osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592 osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620 osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172 Note the ~2 minute delay (and ~300,000 lines of logging) between the first and second reads. During that time 129 sockets were processed - what makes sd=215 special? Hrm. Well, you can try turning up the messenger debugging to 30 and taking advantage of the reader reading reader read pair right around tcp_read_nonblocking. OK, I'll give that a try as well, thanks. I've added tracepoints in my client kernel try_write(), and nothing seems unusual (that's with running the patch to ceph_write_space() I posted earlier): kworker/0:2-1790 [000] 1543.200887: ceph_try_write_msg_done: peer osd0 tid 179 seq 3 sent 4194304 kworker/0:2-1790
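The post-processing described above (diffing the policy-throttler timestamp against the completed osd_op timestamp for each reader thread) could be sketched roughly as below. The regex phrasing and field positions are assumptions: the exact messenger log text differs between versions ('reader got' vs 'reader wants'), and grep-style 'file:line:' prefixes like the ones in the excerpt would need stripping first.

import sys
from datetime import datetime

def stamp_of(fields):
    return datetime.strptime(fields[0] + ' ' + fields[1], '%Y-%m-%d %H:%M:%S.%f')

def receive_latencies(log_lines):
    pending = {}          # reader thread -> time it cleared the policy throttler
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            stamp = stamp_of(fields)
        except ValueError:
            continue      # not a standard timestamped log line
        thread = fields[2]
        if '.reader' in line and 'policy throttler' in line:
            pending[thread] = stamp
        elif 'osd_op(client' in line and thread in pending:
            yield thread, (stamp - pending.pop(thread)).total_seconds()

if __name__ == '__main__':
    for thread, secs in receive_latencies(sys.stdin):
        if secs > 10.0:
            print('%s: %.1f s from throttler to complete osd_op' % (thread, secs))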
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Feb 3, 2012, at 8:18 AM, Jim Schutt jasc...@sandia.gov wrote: On 02/02/2012 05:28 PM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 12:22 PM, Jim Schuttjasc...@sandia.gov wrote: I found 0 instances of waiting for commit in all my OSD logs for my last run. So I never waited on the journal? Looks like it. Interesting. So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? How did you instrument it? If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().) Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other? There's just one main dispatcher; no split for the ops and rep_ops . The reason for that dispatch_running is that if there are requests waiting then the tick() function will run through them if the messenger dispatch thread is currently idle. But it is possible for the Messenger to try and dispatch, and for that to be blocked while some amount of (usually trivial) work is being done by a different thread, yes. I don't think we've ever observed it being a problem for anything other than updating OSD maps, though... Ah, OK. I guess I was confused by my log output, e.g.: D'oh. Sorry, you confused me with your reference to repops, which aren't special-cased or anything. But there are two messengers on the OSD, each with their own dispatch thread. One of those messengers is for clients and one is for other OSDs. And now that you point that out, I wonder if the problem is lack of Cond signaling in ms_dispatch. I'm on my phone right now but I believe there's a chunk of commented-out code (why commented instead of deleted? I don't know) that we want to uncomment for reasons that will become clear when you look at it. :) Try that and see how many of your problems disappear? osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400 osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680 I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads. I need to go study that code some more Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Again, how are you instrumenting that? I post-process the logs, looking at the time difference between reader got .* policy throttler and reader got .* osd_op(client. I guess the logging output must have changed a bit at some pointer (or was that one of your patches?). master has reader wants not reader got for the policy throttler. (Just got a little confused when checking the code.) Yep, I added an extra message to make post-processing logs easier, sorry. 
When I find a candidate message, I grep the log for just that reader thread, and see, e.g., this: osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000 seq 828/828 waiters 157/149 for src client.4301 tid=247 osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/ osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128 osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592 osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620 osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Fri, 3 Feb 2012, Jim Schutt wrote: On 02/02/2012 05:28 PM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 12:22 PM, Jim Schuttjasc...@sandia.gov wrote: I found 0 instances of waiting for commit in all my OSD logs for my last run. So I never waited on the journal? Looks like it. Interesting. So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? How did you instrument it? If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().) Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other? There's just one main dispatcher; no split for the ops and rep_ops . The reason for that dispatch_running is that if there are requests waiting then the tick() function will run through them if the messenger dispatch thread is currently idle. But it is possible for the Messenger to try and dispatch, and for that to be blocked while some amount of (usually trivial) work is being done by a different thread, yes. I don't think we've ever observed it being a problem for anything other than updating OSD maps, though... Ah, OK. I guess I was confused by my log output, e.g.: osd.0.log:2277569:2012-02-02 09:23:41.666420 7fe5fe65e700 osd.0 31 ms_dispatch ET 0.990204 osd_lock ET 0.001438 msg 0xbe19400 osd.0.log:2277697:2012-02-02 09:23:41.669949 7fe5fee5f700 osd.0 31 ms_dispatch ET 0.993136 osd_lock ET 0.992708 msg 0x13afd680 I thought 7fe5fe65e700 and 7fe5fee5f700 identified the threads. I need to go study that code some more Oh... they are separate thread. In the OSD's case two different messengers (the public and cluster ones) are wired up to the same dispatcher (OSD::ms_dispatch). MOSDOps come in on the public thread, MOSDSubOps on the cluster one, but they're both fed to the same function. That's why there's some funkiness going on in, say, handle_osd_map(). But like Greg said, I don't think we've seen any significant latencies there except from map processing. If you have a log, that would be interesting to look at! sage Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Again, how are you instrumenting that? I post-process the logs, looking at the time difference between reader got .* policy throttler and reader got .* osd_op(client. I guess the logging output must have changed a bit at some pointer (or was that one of your patches?). master has reader wants not reader got for the policy throttler. (Just got a little confused when checking the code.) Yep, I added an extra message to make post-processing logs easier, sorry. 
When I find a candidate message, I grep the log for just that reader thread, and see, e.g., this: osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000 seq 828/828 waiters 157/149 for src client.4301 tid=247 osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/ osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128 osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592 osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620 osd.0.log:1561092:2012-02-02
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
(resent because I forgot the list on my original reply) On 02/01/2012 03:33 PM, Gregory Farnum wrote: On Wed, Feb 1, 2012 at 7:54 AM, Jim Schuttjasc...@sandia.gov wrote: Hi, FWIW, I've been trying to understand op delays under very heavy write load, and have been working a little with the policy throttler in hopes of using throttling delays to help track down which ops were backing up. Without much success, unfortunately. When I saw the wip-osd-op-tracking branch, I wondered if any of this stuff might be helpful. Here it is, just in case. In general these patches are dumping information to the logs, and part of the wip-osd-op-tracking branch is actually keeping track of most of the message queueing wait times as part of the message itself (although not the information about number of waiters and sleep/wake seqs). I'm inclined to prefer that approach to log dumping. I agree - I've just been using log dumping because I can extract any relationships I can write a perl script to find :) So far, not too helpful. Are there any patches you recommend for merging? I'm a little curious about the ordered wakeup one — do you have data about when that's a problem? I've been trying to push the client:osd ratio, and in my testbed I can run up to 166 linux clients. Right now I'm running them against 48 OSDs. The clients are 1 Gb/s ethernet, and the OSDs have a 10 Gb/s ethernet for clients and another for the cluster. During sustained write loads I see a factor of 10 oscillation in aggregate throughput, and during that time I see clients stuck in the policy throttler for hundreds of seconds, and I see a number of waiters equal to (number of clients) - (throttler limit) / (msg size). If I do a histogram of throttler wait times I see a handful of messages that wait for an extra couple hundreds of seconds without the ordered wakeup. I'm not sure what this will look like if my throughput variations can be fixed. But, for our HPC loads I expect we'll often see periods where offered load is much higher than aggregate bandwidth of any system we can afford to build, so ordered wakeup may be useful in such cases for client fairness. So I'd recommend the ordered wakeup patch if you don't see any downsides. Sorry for the noise on the others - mostly I just wanted to share the sort of things I've been looking at. I'll be learning to use your new stuff soon... -- Jim -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
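For reference, the ordered-wakeup ("throttle in FIFO order") idea boils down to waking blocked get() callers strictly in arrival order, so no client can be starved behind later arrivals. A minimal illustrative sketch of that ticketing scheme follows, in Python for brevity; the real code is the C++ Throttle class touched by the patches, and this is not a copy of it.

import threading
from collections import deque

class FifoThrottle:
    """Byte throttler that wakes blocked get() callers strictly in arrival
    order.  Assumes each request is <= max_bytes, otherwise it blocks forever."""

    def __init__(self, max_bytes):
        self.max = max_bytes
        self.count = 0
        self.lock = threading.Lock()
        self.waiters = deque()            # one Condition per blocked get(), FIFO

    def get(self, nbytes):
        with self.lock:
            if self.waiters or self.count + nbytes > self.max:
                cond = threading.Condition(self.lock)
                self.waiters.append(cond)
                # Sleep until we are at the head of the queue AND there is room.
                while self.waiters[0] is not cond or self.count + nbytes > self.max:
                    cond.wait()
                self.waiters.popleft()
                if self.waiters:
                    self.waiters[0].notify()   # let the next waiter re-check
            self.count += nbytes

    def put(self, nbytes):
        with self.lock:
            self.count -= nbytes
            if self.waiters:
                self.waiters[0].notify()       # wake only the oldest waiter

A reader thread would call get(msg_len) before buffering a message and put(msg_len) once the message has been dispatched. The trade-off is that a small request stuck behind a large one waits longer than strictly necessary, which is exactly the fairness-versus-utilization choice the ordered-wakeup patch makes explicit.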
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Thu, Feb 2, 2012 at 7:29 AM, Jim Schutt jasc...@sandia.gov wrote: I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive per OSD. During a test I watch both OSD servers with both vmstat and iostat. During a good period, vmstat says the server is sustaining 2 GB/s for multiple tens of seconds. Since I use replication factor 2, that means that server is sustaining 500 MB/s aggregate client throughput, right? During such a period vmstat also reports ~10% CPU idle. During a bad period, vmstat says the server is doing ~200 MB/s, with lots of idle cycles. It is during these periods that messages stuck in the policy throttler build up such long wait times. Sometimes I see really bad periods with aggregate throughput per server < 100 MB/s. The typical pattern I see is that a run starts with tens of seconds of aggregate throughput > 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? I assume from your setup that they're sharing the disk with the store (although it works either way), and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.) I'm also curious about memory usage as a possible explanation for the more dramatic drops. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
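The "2 GB/s on the server versus 500 MB/s of client writes" arithmetic is just write amplification: with 2x replication, and each OSD writing every byte twice (once to its journal and once to its store), one server's share of client throughput is about a quarter of its raw disk traffic. A trivial check using the round numbers quoted above (illustrative only):

replication = 2          # two copies of every object
writes_per_copy = 2      # each OSD writes a byte to its journal and to its store
server_disk_mb_s = 2000  # what vmstat reports on one server during a good period

client_share_mb_s = server_disk_mb_s / (replication * writes_per_copy)
print('client throughput attributable to this server: ~%d MB/s' % client_share_mb_s)  # ~500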
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.gov wrote: I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive per OSD. During a test I watch both OSD servers with both vmstat and iostat. During a good period, vmstat says the server is sustaining 2 GB/s for multiple tens of seconds. Since I use replication factor 2, that means that server is sustaining 500 MB/s aggregate client throughput, right? During such a period vmstat also reports ~10% CPU idle. During a bad period, vmstat says the server is doing ~200 MB/s, with lots of idle cycles. It is during these periods that messages stuck in the policy throttler build up such long wait times. Sometimes I see really bad periods with aggregate throughput per server 100 MB/s. The typical pattern I see is that a run starts with tens of seconds of aggregate throughput 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? A spot check of logs from my last run doesn't seem to have any journal throttle: waited for messages during a slowdown. Is that what you mean? During the fast part of I run I see lots of journal messages with this pattern: 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 22 ops and 90041106 bytes 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 21 ops and 85845571 bytes which I think means my journal is doing 50 MB/s, right? I assume from your setup that they're sharing the disk with the store (although it works either way), I've got a 4 GB journal partition on the outer tracks of the disk. and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.) iostat 1 during the fast part of a run shows both journal and data partitions running at 45-50 MB/s. During the slow part of a run they both show similar but low data rates. I'm also curious about memory usage as a possible explanation for the more dramatic drops. My OSD servers have 48 GB memory. During a run I rarely see less than 24 GB used by the page cache, with the rest mostly used by anonymous memory. I don't run with any swap. So far I'm looking at two behaviours I've noticed that seem anomalous to me. 
One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Any other ideas on what to look for would be greatly appreciated. -- Jim -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
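Going back to the journal put_throttle lines quoted a little earlier, the ~50 MB/s estimate can be reproduced by summing the completed bytes reported after the first sample and dividing by the elapsed time. A hypothetical sketch, assuming the log format matches the excerpt:

import re
from datetime import datetime

PUT = re.compile(r'^(\S+ \S+) .*journal put_throttle finished (\d+) ops and (\d+) bytes')

def journal_rate_mb_s(log_lines):
    first = None
    last = None
    total_bytes = 0
    for line in log_lines:
        m = PUT.match(line)
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S.%f')
        if first is None:
            first = stamp              # bytes finished before the window don't count
        else:
            total_bytes += int(m.group(3))
        last = stamp
    if first is None or last == first:
        return 0.0
    return total_bytes / (last - first).total_seconds() / 1e6

# For the two put_throttle samples in the excerpt this gives ~55 MB/s,
# in line with the "my journal is doing 50 MB/s" reading.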
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Thu, 2 Feb 2012, Jim Schutt wrote: On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.gov wrote: I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive per OSD. During a test I watch both OSD servers with both vmstat and iostat. During a good period, vmstat says the server is sustaining 2 GB/s for multiple tens of seconds. Since I use replication factor 2, that means that server is sustaining 500 MB/s aggregate client throughput, right? During such a period vmstat also reports ~10% CPU idle. During a bad period, vmstat says the server is doing ~200 MB/s, with lots of idle cycles. It is during these periods that messages stuck in the policy throttler build up such long wait times. Sometimes I see really bad periods with aggregate throughput per server 100 MB/s. The typical pattern I see is that a run starts with tens of seconds of aggregate throughput 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? A spot check of logs from my last run doesn't seem to have any journal throttle: waited for messages during a slowdown. Is that what you mean? During the fast part of I run I see lots of journal messages with this pattern: 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 22 ops and 90041106 bytes 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 21 ops and 85845571 bytes It occurs to me that part of the problem may be the current sync io behavior in the journal. It ends up doing really big writes, which makes things bursty, and will get stuff blocked up behind the throttler. You might try making 'journal max write bytes' smaller? Hmm, although it's currently 10MB, which isn't too bad. So unless you've changed it from the default, that's probably not it. which I think means my journal is doing 50 MB/s, right? I assume from your setup that they're sharing the disk with the store (although it works either way), I've got a 4 GB journal partition on the outer tracks of the disk. This is on the same disk as the osd data? As an experiement, you could try putting the journal on a separate disk (half disks for journals, half for data). That's obviously not what you want in the real world, but it would be interesting to see if contention for the spindle is responsible for this. So far I'm looking at two behaviours I've noticed that seem anomalous to me. 
One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? I don't think so, but it could happen if there's cpu/memory contention, or in the case of osdmap updates where we block on io in that thread. Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Any other ideas on what to look for would be greatly appreciated. I'd rule out the journal+data on the same disk as the source of pain first. If that's what's going on, we can take a closer look at specifically how to make it behave better! sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Thu, Feb 2, 2012 at 11:06 AM, Jim Schutt jasc...@sandia.gov wrote: On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.gov wrote: The typical pattern I see is that a run starts with tens of seconds of aggregate throughput 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? A spot check of logs from my last run doesn't seem to have any journal throttle: waited for messages during a slowdown. Is that what you mean? I'd expect to see those, yes, but I actually meant the on-disk journal itself getting full. I believe that should result in output like: write_thread_entry full, going to sleep (waiting for commit) ...although I now notice that's a much higher log level (20) than the other messages (1/5). During the fast part of I run I see lots of journal messages with this pattern: 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 22 ops and 90041106 bytes 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 21 ops and 85845571 bytes which I think means my journal is doing 50 MB/s, right? Generally, yes — although that'll also pop up if the store manages to commit faster than the journal (unlikely). :) and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.) iostat 1 during the fast part of a run shows both journal and data partitions running at 45-50 MB/s. During the slow part of a run they both show similar but low data rates. All right. That's actually not that surprising; random 4MB writes are pretty nice to a modern drive. I'm also curious about memory usage as a possible explanation for the more dramatic drops. My OSD servers have 48 GB memory. During a run I rarely see less than 24 GB used by the page cache, with the rest mostly used by anonymous memory. I don't run with any swap. So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? How did you instrument it? 
If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().) Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Again, how are you instrumenting that? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/02/2012 12:15 PM, Sage Weil wrote: On Thu, 2 Feb 2012, Jim Schutt wrote: On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.gov wrote: I'm currently running 24 OSDs/server, one 1TB 7200 RPM SAS drive per OSD. During a test I watch both OSD servers with both vmstat and iostat. During a good period, vmstat says the server is sustaining 2 GB/s for multiple tens of seconds. Since I use replication factor 2, that means that server is sustaining 500 MB/s aggregate client throughput, right? During such a period vmstat also reports ~10% CPU idle. During a bad period, vmstat says the server is doing ~200 MB/s, with lots of idle cycles. It is during these periods that messages stuck in the policy throttler build up such long wait times. Sometimes I see really bad periods with aggregate throughput per server 100 MB/s. The typical pattern I see is that a run starts with tens of seconds of aggregate throughput 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? A spot check of logs from my last run doesn't seem to have any journal throttle: waited for messages during a slowdown. Is that what you mean? During the fast part of I run I see lots of journal messages with this pattern: 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 22 ops and 90041106 bytes 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 21 ops and 85845571 bytes It occurs to me that part of the problem may be the current sync io behavior in the journal. It ends up doing really big writes, which makes things bursty, and will get stuff blocked up behind the throttler. You might try making 'journal max write bytes' smaller? Hmm, although it's currently 10MB, which isn't too bad. So unless you've changed it from the default, that's probably not it. I have changed it recently, but I was seeing this type of behaviour before making that change. FWIW, here's my current non-standard tunings. 
I'm using these because the standard ones work even worse for me on this test case:

osd op threads = 48
filestore queue max ops = 16
osd client message size cap = 5000
client oc max dirty = 5000
journal max write bytes = 5000
ms dispatch throttle bytes =
client oc size = 1
journal queue max bytes = 12500
filestore queue max bytes = 12500
objecter inflight op bytes = 2

FWIW, turning down filestore queue max ops and turning up osd op threads made the biggest positive impact on my performance levels on this test. With default values for those I was seeing much worse stalling behaviour. which I think means my journal is doing 50 MB/s, right? I assume from your setup that they're sharing the disk with the store (although it works either way), I've got a 4 GB journal partition on the outer tracks of the disk. This is on the same disk as the osd data? As an experiment, you could try putting the journal on a separate disk (half disks for journals, half for data). That's obviously not what you want in the real world, but it would be interesting to see if contention for the spindle is responsible for this. So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? I don't think so, but it could happen if there's cpu/memory contention, or in the case of osdmap updates where we block on io in that thread. Hmmm. Would those go into the filestore op queue? I didn't think to check ms_dispatch() ET until after I had tuned down filestore queue max ops. Is it worth checking this? Another is that once
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/02/2012 12:32 PM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 11:06 AM, Jim Schuttjasc...@sandia.gov wrote: On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.govwrote: The typical pattern I see is that a run starts with tens of seconds of aggregate throughput2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? A spot check of logs from my last run doesn't seem to have any journal throttle: waited for messages during a slowdown. Is that what you mean? I'd expect to see those, yes, but I actually meant the on-disk journal itself getting full. I believe that should result in output like: write_thread_entry full, going to sleep (waiting for commit) ...although I now notice that's a much higher log level (20) than the other messages (1/5). So I've been running OSDs with debug osd = 20 debug journal = 20 ; local journaling debug filestore = 20 ; local object storage debug objector = 20 debug ms = 20 debug = 1 I found 0 instances of waiting for commit in all my OSD logs for my last run. So I never waited on the journal? During the fast part of I run I see lots of journal messages with this pattern: 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 22 ops and 90041106 bytes 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 21 ops and 85845571 bytes which I think means my journal is doing 50 MB/s, right? Generally, yes — although that'll also pop up if the store manages to commit faster than the journal (unlikely). :) and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.) iostat 1 during the fast part of a run shows both journal and data partitions running at 45-50 MB/s. During the slow part of a run they both show similar but low data rates. All right. That's actually not that surprising; random 4MB writes are pretty nice to a modern drive. I'm also curious about memory usage as a possible explanation for the more dramatic drops. My OSD servers have 48 GB memory. During a run I rarely see less than 24 GB used by the page cache, with the rest mostly used by anonymous memory. I don't run with any swap. 
So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? How did you instrument it? If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().) Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other? Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Again, how are you instrumenting that? I post-process the logs, looking at the time difference between reader got .* policy throttler and reader got .* osd_op(client. When I find a candidate message, I grep the log for just that reader thread, and see, e.g., this: osd.0.log:1280693:2012-02-02 09:17:57.704508
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On 02/02/2012 01:22 PM, Jim Schutt wrote: On 02/02/2012 12:32 PM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 11:06 AM, Jim Schuttjasc...@sandia.gov wrote: On 02/02/2012 10:52 AM, Gregory Farnum wrote: On Thu, Feb 2, 2012 at 7:29 AM, Jim Schuttjasc...@sandia.gov wrote: The typical pattern I see is that a run starts with tens of seconds of aggregate throughput 2 GB/s. Then it drops and bounces around 500 - 1000 MB/s, with occasional excursions under 100 MB/s. Then it ramps back up near 2 GB/s again. Hmm. 100MB/s is awfully low for this theory, but have you tried to correlate the drops in throughput with the OSD journals running out of space? A spot check of logs from my last run doesn't seem to have any journal throttle: waited for messages during a slowdown. Is that what you mean? I'd expect to see those, yes, but I actually meant the on-disk journal itself getting full. I believe that should result in output like: write_thread_entry full, going to sleep (waiting for commit) ...although I now notice that's a much higher log level (20) than the other messages (1/5). So I've been running OSDs with debug osd = 20 debug journal = 20 ; local journaling debug filestore = 20 ; local object storage debug objector = 20 debug ms = 20 debug = 1 I found 0 instances of waiting for commit in all my OSD logs for my last run. So I never waited on the journal? During the fast part of I run I see lots of journal messages with this pattern: 2012-02-02 09:16:18.376996 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 22 ops and 90041106 bytes 2012-02-02 09:16:18.417507 7fe5eb436700 journal throttle: waited for bytes 2012-02-02 09:16:18.417656 7fe5e742e700 journal throttle: waited for bytes 2012-02-02 09:16:18.417756 7fe5f2444700 journal throttle: waited for bytes 2012-02-02 09:16:18.422157 7fe5ea434700 journal throttle: waited for bytes 2012-02-02 09:16:18.422186 7fe5e9c33700 journal throttle: waited for bytes 2012-02-02 09:16:18.424195 7fe5e642c700 journal throttle: waited for bytes 2012-02-02 09:16:18.427106 7fe5fb456700 journal throttle: waited for bytes 2012-02-02 09:16:18.427139 7fe5f7c4f700 journal throttle: waited for bytes 2012-02-02 09:16:18.427159 7fe5e5c2b700 journal throttle: waited for bytes 2012-02-02 09:16:18.427176 7fe5ee43c700 journal throttle: waited for bytes 2012-02-02 09:16:18.428299 7fe5f744e700 journal throttle: waited for bytes 2012-02-02 09:16:19.297369 7fe602e67700 journal put_throttle finished 12 ops and 50346596 bytes, now 21 ops and 85845571 bytes which I think means my journal is doing 50 MB/s, right? Generally, yes — although that'll also pop up if the store manages to commit faster than the journal (unlikely). :) and your description makes me think that throughput is initially constrained by sequential journal writes but then the journal runs out of space and the OSD has to wait for the main store to catch up (with random IO), and that sends the IO patterns all to hell. (If you can say that random 4MB IOs are hellish.) iostat 1 during the fast part of a run shows both journal and data partitions running at 45-50 MB/s. During the slow part of a run they both show similar but low data rates. All right. That's actually not that surprising; random 4MB writes are pretty nice to a modern drive. I'm also curious about memory usage as a possible explanation for the more dramatic drops. My OSD servers have 48 GB memory. During a run I rarely see less than 24 GB used by the page cache, with the rest mostly used by anonymous memory. I don't run with any swap. 
So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? How did you instrument it? If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().) Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other? Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Again, how are you instrumenting that? I post-process the logs, looking at the time difference between reader got .* policy throttler and reader got .* osd_op(client. When I find a candidate message, I grep the log for just that reader thread, and see, e.g., this: osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 --
Re: [EXTERNAL] Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Thu, Feb 2, 2012 at 12:22 PM, Jim Schutt jasc...@sandia.gov wrote: I found 0 instances of waiting for commit in all my OSD logs for my last run. So I never waited on the journal? Looks like it. Interesting. So far I'm looking at two behaviours I've noticed that seem anomalous to me. One is that I instrumented ms_dispatch(), and I see it take a half-second or more several hundred times, out of several thousand messages. Is that expected? How did you instrument it? If you wrapped the whole function it's possible that those longer runs are actually chewing through several messages that had to get waitlisted for some reason previously. (That's the call to do_waiters().) Yep, I wrapped the whole function, and also instrumented taking osd_lock while I was there. About half the time that ms_dispatch() takes more than 0.5 seconds, taking osd_lock is responsible for the delay. There's two dispatch threads, one for ops and one for rep_ops, right? So one's waiting on the other? There's just one main dispatcher; no split for the ops and rep_ops . The reason for that dispatch_running is that if there are requests waiting then the tick() function will run through them if the messenger dispatch thread is currently idle. But it is possible for the Messenger to try and dispatch, and for that to be blocked while some amount of (usually trivial) work is being done by a different thread, yes. I don't think we've ever observed it being a problem for anything other than updating OSD maps, though... Another is that once a message receive starts, I see ~50 messages that take tens of seconds to receive, when the nominal receive time is a half-second or less. I'm in the process of tooling up to collect tcpdump data on all my clients to try to catch what is going on with that. Again, how are you instrumenting that? I post-process the logs, looking at the time difference between reader got .* policy throttler and reader got .* osd_op(client. I guess the logging output must have changed a bit at some pointer (or was that one of your patches?). master has reader wants not reader got for the policy throttler. (Just got a little confused when checking the code.) 
When I find a candidate message, I grep the log for just that reader thread, and see, e.g., this: osd.0.log:1280693:2012-02-02 09:17:57.704508 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got 2670720 from policy throttler 48809510/5000 seq 828/828 waiters 157/149 for src client.4301 tid=247 osd.0.log:1280694:2012-02-02 09:17:57.704525 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader wants 2670720 from dispatch throttler 41944358/ osd.0.log:1280701:2012-02-02 09:17:57.704654 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader got front 128 osd.0.log:1280705:2012-02-02 09:17:57.704752 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader allocating new rx buffer at offset 0 osd.0.log:1280710:2012-02-02 09:17:57.704873 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11922000 len 2670592 osd.0.log:1559767:2012-02-02 09:19:40.726589 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6a5cc len 1325620 osd.0.log:1561092:2012-02-02 09:19:40.927559 7fe5c9099700 -- 172.17.131.32:6800/14974 172.17.135.85:0/1283168808 pipe(0xbdc9680 sd=215 pgs=49 cs=1 l=1).reader reading nonblocking into 0x11a6ab74 len 1324172 Note the ~2 minute delay (and ~300,000 lines of logging) between the first and second reads. During that time 129 sockets were processed - what makes sd=215 special? Hrm. Well, you can try turning up the messenger debugging to 30 and taking advantage of the reader reading reader read pair right around tcp_read_nonblocking. I've added tracepoints in my client kernel try_write(), and nothing seems unusual (that's with running the patch to ceph_write_space() I posted earlier): kworker/0:2-1790 [000] 1543.200887: ceph_try_write_msg_done: peer osd0 tid 179 seq 3 sent 4194304 kworker/0:2-1790 [000] 1543.200901: ceph_prepare_write_msg: peer osd0 tid 207 seq 4 sent 0 *snip* kworker/0:2-1790 [000] 1569.078614: ceph_try_write_msg_done: peer osd0 tid 207 seq 4 sent 4194304 There's a 25 second gap at 1543.236256, but nothing like the 100 second gap in the reader. Hence, tcpdump seems like a good idea? You do bring us interesting problems! Let us know what info you come up with. Oh, and I keep forgetting to ask: what does the write workload look like? At first I assumed this was a CephFS workload, but given that you're changing
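One generic way to hunt for stalls like the two-minute reader gap or the 25-second client-side gap mentioned above is to flag large jumps between consecutive timestamps in a single thread's log lines. An illustrative sketch, assuming standard 'YYYY-MM-DD HH:MM:SS.ffffff <thread> ...' OSD log lines with any grep 'file:line:' prefixes already stripped:

import sys
from datetime import datetime

def gaps(lines, threshold_s=5.0):
    prev = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            stamp = datetime.strptime(fields[0] + ' ' + fields[1],
                                      '%Y-%m-%d %H:%M:%S.%f')
        except ValueError:
            continue
        if prev is not None and (stamp - prev).total_seconds() > threshold_s:
            yield prev, stamp, (stamp - prev).total_seconds()
        prev = stamp

if __name__ == '__main__':
    for start, end, secs in gaps(sys.stdin):
        print('%.1f s gap: %s -> %s' % (secs, start, end))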
[RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi, FWIW, I've been trying to understand op delays under very heavy write load, and have been working a little with the policy throttler in hopes of using throttling delays to help track down which ops were backing up. Without much success, unfortunately. When I saw the wip-osd-op-tracking branch, I wondered if any of this stuff might be helpful. Here it is, just in case. -- Jim

Jim Schutt (6):
  msgr: print message sequence number and tid when receiving message envelope
  common/Throttle: track sleep/wake sequences in Throttle, report them for policy throttler
  common/Throttle: throttle in FIFO order
  common/Throttle: FIFO throttler doesn't need to signal waiters when max changes
  common/Throttle: make get() report number of waiters on entry/exit
  msg: log Message interactions with throttler

 src/common/Throttle.h      | 75 +++-
 src/msg/Message.h          | 71 +++--
 src/msg/SimpleMessenger.cc | 22 +---
 3 files changed, 129 insertions(+), 39 deletions(-)

-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Wed, Feb 1, 2012 at 7:54 AM, Jim Schutt jasc...@sandia.gov wrote: Hi, FWIW, I've been trying to understand op delays under very heavy write load, and have been working a little with the policy throttler in hopes of using throttling delays to help track down which ops were backing up. Without much success, unfortunately. When I saw the wip-osd-op-tracking branch, I wondered if any of this stuff might be helpful. Here it is, just in case. In general these patches are dumping information to the logs, and part of the wip-osd-op-tracking branch is actually keeping track of most of the message queueing wait times as part of the message itself (although not the information about number of waiters and sleep/wake seqs). I'm inclined to prefer that approach to log dumping. Are there any patches you recommend for merging? I'm a little curious about the ordered wakeup one — do you have data about when that's a problem? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html