Re: poor write performance
Hi, > Unless Sylvain implemented this in his tool > explicitly, it won't happen there either. The small bench tool submits requests using the asynchronous API as fast as possible, using a 1M chunk. Then it just waits for all the completions to be done. Sylvain -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
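For anyone wanting to reproduce Sylvain's numbers: the gist is plain C against librbd, so building and running it should look roughly like this (the source file name is a placeholder, and the argument order is only inferred from James's "./a.out admin xen test" run elsewhere in the thread, i.e. client id, pool, image):

    gcc -Wall -o rbd_bench rbd_bench.c -lrbd -lrados
    ./rbd_bench admin <pool> <image>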
Re: poor write performance
> You may want to try increasing your read_ahead_kb on the OSD data disks and > see if that helps read speeds. Jumping into this thread late, so I'm not sure if this was covered, but: Remember that readahead on the OSDs will only help up to the size of the object (4MB). To get good read performance in general, what is really needed is for the librbd user to do readahead so that the next object(s) are being fetched before they are needed. I don't think this happens with 'dd' (opening a block device as a file does not trigger the kernel VM readahead code, IIRC). Unless Sylvain implemented this in his tool explicitly, it won't happen there either. sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
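As a quick check of the per-object ceiling Sage mentions, the object size can be read back from RBD itself (pool and image names are placeholders); the output should include a line like "order 22 (4096 KB objects)", which is the most that OSD-side readahead can ever cover per request:

    rbd -p <pool> info <image>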
Re: poor write performance
On 04/22/2013 07:01 AM, Mark Nelson wrote: On 04/22/2013 06:48 AM, James Harper wrote: My read speed is consistently around 40MB/second, and my write speed is consistently around 22MB/second. I had expected better of read... You may want to try increasing your read_ahead_kb on the OSD data disks and see if that helps read speeds. Default appears to be 128 and I was getting 40MB/second Increasing to 256 takes me up to 48MB/second Increasing to 512 takes me up to 53MB/second Any further increases don't do anything that I can measure Is increasing read_ahead_kb good for general performance, or just for impressing people with benchmarks? If the kernel spent time reading ahead would it hurt random read/write performance? Potentially yes, but it depends on a lot of factors. I suspect that increasing it may be acceptable on modern drives, but you'll need to do some testing to see how it goes in practice. If anyone on the list knows how many sectors per track is typical for modern 1-3TB drives I'm dying to know. That would help us guess at how much data can be written/read on average without imposing any head movement. :) Aha, sorry to reply to my own mail. I found some specifications for Hitachi drives at least: http://www.hgst.com/tech/techlib.nsf/products/Ultrastar_7K4000 look at section 4.2 of the "Ultrastar 7K4000 OEM Specification" document. It specifies 310ktpi, or 310,000 tracks/inch. Via Google I found that this drive is using 5 x 800GB platters, meaning there are 10 heads in this drive. Using Hitachi's specifications: (7,814,037,168 sectors / (310,000 tracks / inch * 3.5 inches)) / 10 heads * 512 bytes / sector = ~360KB per track per head So assuming my math is right, it looks like we can read up to around 360KB of data before hitting a head switch. Now unfortunately (or maybe fortunately!) this is just the average case. Outer tracks will store more data than inner tracks, so depending on what portion of the disk you are doing the read from, you might introduce head switches more or less often. It looks like even with a 256k or 512k read_ahead you probably won't introduce a next-cylinder seek that often, though from what I can find it's not going to be all that much more expensive vs a head switch (2-3ms vs 1-2ms). Mark Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
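Mark's estimate can be reproduced in one line for anyone who wants to plug in other drive specs (the 3.5-inch figure and 10 heads are his assumptions from the spec sheet, not mine):

    echo "7814037168 / (310000 * 3.5) / 10 * 512 / 1024" | bc -l
    # prints ~360, i.e. roughly 360KB per track per head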
Re: poor write performance
On 04/22/2013 06:48 AM, James Harper wrote: My read speed is consistently around 40MB/second, and my write speed is consistently around 22MB/second. I had expected better of read... You may want to try increasing your read_ahead_kb on the OSD data disks and see if that helps read speeds. Default appears to be 128 and I was getting 40MB/second Increasing to 256 takes me up to 48MB/second Increasing to 512 takes me up to 53MB/second Any further increases don't do anything that I can measure Is increasing read_ahead_kb good for general performance, or just for impressing people with benchmarks? If the kernel spent time reading ahead would it hurt random read/write performance? Potentially yes, but it depends on a lot of factors. I suspect that increasing it may be acceptable on modern drives, but you'll need to do some testing to see how it goes in practice. If anyone on the list knows how many sectors per track is typical for modern 1-3TB drives I'm dying to know. That would help us guess at how much data can be written/read on average without imposing any head movement. :) Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: poor write performance
> > My read speed is consistently around 40MB/second, and my write speed is > > consistently around 22MB/second. I had expected better of read... > > You may want to try increasing your read_ahead_kb on the OSD data disks > and see if that helps read speeds. > Default appears to be 128 and I was getting 40MB/second Increasing to 256 takes me up to 48MB/second Increasing to 512 takes me up to 53MB/second Any further increases don't do anything that I can measure Is increasing read_ahead_kb good for general performance, or just for impressing people with benchmarks? If the kernel spent time reading ahead would it hurt random read/write performance? Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
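For reference, the knob being tuned here is per block device and does not survive a reboot; it is usually set like this (sdb stands in for the OSD data disk):

    echo 512 > /sys/block/sdb/queue/read_ahead_kb
    # equivalently, in 512-byte sectors:
    blockdev --setra 1024 /dev/sdb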
RE: poor write performance
> > I upgraded to 0.60 and that seems to have made a big difference. If I kill > > off > > one of my OSD's I get around 20MB/second throughput in live testing (test > > restore of Xen Windows VM from USB backup), which is pretty much the > > limit of the USB disk. If I reactivate the second OSD throughput drops back > > to > > ~10MB/second which isn't as good but is much better than I was getting. > > > > Ah, are these disks both connected through USB(2?)? > I guess I was a bit brief :) Both my OSD disks are SATA attached. Inside a VM I have attached another disk which is attached to the host via USB. This disk contains a backup of a server (using Windows Server Backup) and I am doing a test restore of it, with ceph holding the C: drive of the virtual server (i.e. the write target). What I was saying is that I would never expect more than about 20-30MB/s write speed in this test because that is going to be approximately the limit of the USB interface that the data is coming from. This is more a production test than a benchmark, and I was just using iostat to monitor the throughput of the /dev/rbdX interfaces while doing the restore. James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
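A sketch of the kind of monitoring James describes, with device names as placeholders (-x for extended stats, -m to report MB/s, sampled every 5 seconds):

    iostat -x -m 5 rbd0 sdb sdc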
Re: poor write performance
On 04/22/2013 06:34 AM, James Harper wrote: Hi, Correct, but that's the theoretical maximum I was referring to. If I calculate that I should be able to get 50MB/second then 30MB/second is acceptable but 500KB/second is not :) I have written a small benchmark for RBD : https://gist.github.com/smunaut/5433222 It uses the librbd API directly without kernel client and queue requests long in advance and this should give an "upper" bound to what you can get at best. It reads and writes the whole image, so I usually just create a 1 or 2 G image for testing. Using two OSDs on two distinct recent 7200rpm drives (with journal on the same disk as data), I get : Read: 89.52 Mb/s (2147483648 bytes in 22877 ms) Write: 10.62 Mb/s (2147483648 bytes in 192874 ms) I like your benchmark tool! How many replicas? With two OSD's with xfs on ~3yo 1TB disks with two replicas I get: # ./a.out admin xen test Read: 111.99 Mb/s (1073741824 bytes in 9144 ms) Write: 29.68 Mb/s (1073741824 bytes in 34507 ms) Which means I forgot to drop caches on the OSD's so I'm seeing the limit on my public network (single gigabit interface). After dropping caches I consistently get: # ./a.out admin xen test Read: 39.98 Mb/s (1073741824 bytes in 25614 ms) Write: 23.11 Mb/s (1073741824 bytes in 44316 ms) Journal is on the same disk. Network is... confusing :) but is basically public on a single gigabit and cluster on a bonded pair of gigabit links. The whole network thing is shared with my existing drbd cluster so performance may vary over time. My read speed is consistently around 40MB/second, and my write speed is consistently around 22MB/second. I had expected better of read... You may want to try increasing your read_ahead_kb on the OSD data disks and see if that helps read speeds. While running, iostat on each osd reports a read rate of around 20MB/second (1/2 total on each) during read test and a rate of 40-60MB/second (~2x total on each) during write test, which is pretty much exactly right. iperf on the cluster network (pair of gigabits bonded) gives me about 1.97Gbits/second. iperf between osd and client is around 0.94Gbits/second. changing the scheduler on the harddisk doesn't seem to make any difference, even when I set it to cfq which normally really sucks. What ceph version are you using and what filesystem? Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
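For anyone repeating these runs: the cache drop has to happen on every OSD host before the read pass, which can be scripted along these lines (hostnames are made up):

    for h in osd1 osd2; do ssh root@$h 'sync; echo 3 > /proc/sys/vm/drop_caches'; done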
RE: poor write performance
> Hi, > > > Correct, but that's the theoretical maximum I was referring to. If I > > calculate > that I should be able to get 50MB/second then 30MB/second is acceptable > but 500KB/second is not :) > > I have written a small benchmark for RBD : > > https://gist.github.com/smunaut/5433222 > > It uses the librbd API directly without kernel client and queue > requests long in advance and this should give an "upper" bound to what > you can get at best. > It reads and writes the whole image, so I usually just create a 1 or 2 > G image for testing. > > Using two OSDs on two distinct recent 7200rpm drives (with journal on > the same disk as data), I get : > > Read: 89.52 Mb/s (2147483648 bytes in 22877 ms) > Write: 10.62 Mb/s (2147483648 bytes in 192874 ms) > I like your benchmark tool! How many replicas? With two OSD's with xfs on ~3yo 1TB disks with two replicas I get: # ./a.out admin xen test Read: 111.99 Mb/s (1073741824 bytes in 9144 ms) Write: 29.68 Mb/s (1073741824 bytes in 34507 ms) Which means I forgot to drop caches on the OSD's so I'm seeing the limit on my public network (single gigabit interface). After dropping caches I consistently get: # ./a.out admin xen test Read: 39.98 Mb/s (1073741824 bytes in 25614 ms) Write: 23.11 Mb/s (1073741824 bytes in 44316 ms) Journal is on the same disk. Network is... confusing :) but is basically public on a single gigabit and cluster on a bonded pair of gigabit links. The whole network thing is shared with my existing drbd cluster so performance may vary over time. My read speed is consistently around 40MB/second, and my write speed is consistently around 22MB/second. I had expected better of read... While running, iostat on each osd reports a read rate of around 20MB/second (1/2 total on each) during read test and a rate of 40-60MB/second (~2x total on each) during write test, which is pretty much exactly right. iperf on the cluster network (pair of gigabits bonded) gives me about 1.97Gbits/second. iperf between osd and client is around 0.94Gbits/second. changing the scheduler on the harddisk doesn't seem to make any difference, even when I set it to cfq which normally really sucks. What ceph version are you using and what filesystem? Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
On 04/22/2013 12:32 AM, James Harper wrote: On 04/19/2013 08:30 PM, James Harper wrote: rados -p -b 4096 bench 300 seq -t 64 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 read got -2 error during benchmark: -5 error 5: (5) Input/output error not sure what that's about... Oops... I typo'd --no-cleanup. Now I get: sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 Total time run:0.243709 Total reads made: 1292 Read size:4096 Bandwidth (MB/sec):20.709 Average Latency: 0.0118838 Max latency: 0.031942 Min latency: 0.001445 So it finishes instantly without seeming to do much actual testing... My bad. I forgot to tell you to do a sync/flush on the OSDs after the write test. All of those reads are probably coming from pagecache. The good news is that this is demonstrating that reading 4k objects from pagecache isn't insanely bad on your setup (for larger sustained loads I see 4k object reads from pagecache hit up to around 100MB/s with multiple clients on my test nodes). On your OSD nodes try: sync echo 3 > /proc/sys/vm/drop_caches right before you run the read test. I tell it to test for 300 seconds and it tests for 0 seconds so I must be doing something else wrong. It will try to read for up to 300 seconds, but if it runs out of data it stops. Since you only wrote out something like 1300 4k objects, and you were reading at 20+MB/s, the test ran for under a second. Whatever issue you are facing is probably down at the filestore level or possible lower down yet. How do your drives benchmark with something like fio doing random 4k writes? Are your drives dedicated for ceph? What filesystem? Also what is the journal device you are using? Drives are dedicated for ceph. I originally put my journals on /, but that was ext3 and my throughput went down even further so the journal shares the osd disk for now. I upgraded to 0.60 and that seems to have made a big difference. If I kill off one of my OSD's I get around 20MB/second throughput in live testing (test restore of Xen Windows VM from USB backup), which is pretty much the limit of the USB disk. If I reactivate the second OSD throughput drops back to ~10MB/second which isn't as good but is much better than I was getting. Ah, are these disks both connected through USB(2?)? Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
Hi, > Correct, but that's the theoretical maximum I was referring to. If I > calculate that I should be able to get 50MB/second then 30MB/second is > acceptable but 500KB/second is not :) I have written a small benchmark for RBD : https://gist.github.com/smunaut/5433222 It uses the librbd API directly, without the kernel client, and queues requests well in advance, so this should give an "upper" bound on what you can get at best. It reads and writes the whole image, so I usually just create a 1 or 2 G image for testing. Using two OSDs on two distinct recent 7200rpm drives (with journal on the same disk as data), I get : Read: 89.52 Mb/s (2147483648 bytes in 22877 ms) Write: 10.62 Mb/s (2147483648 bytes in 192874 ms) The raw disk does about 45 MB/s when written in 1M chunks. But when written in 4k chunks, this falls to ~500 kB/s ...

# dd if=/dev/zero of=/dev/xen-disks/test bs=1M oflag=direct
2049+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 49.3943 s, 43.5 MB/s

# dd if=/dev/zero of=/dev/xen-disks/test bs=4k oflag=direct
^C61667+0 records in
61667+0 records out
252588032 bytes (253 MB) copied, 539.123 s, 469 kB/s

Cheers, Sylvain -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: poor write performance
> > On 04/19/2013 08:30 PM, James Harper wrote: > >>> rados -p -b 4096 bench 300 seq -t 64 > >> > >> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > >> 0 0 0 0 0 0 - 0 > >> read got -2 > >> error during benchmark: -5 > >> error 5: (5) Input/output error > >> > >> not sure what that's about... > >> > > > > Oops... I typo'd --no-cleanup. Now I get: > > > > sec Cur ops started finished avg MB/s cur MB/s last lat avg lat > > 0 0 0 0 0 0 - 0 > > Total time run:0.243709 > > Total reads made: 1292 > > Read size:4096 > > Bandwidth (MB/sec):20.709 > > > > Average Latency: 0.0118838 > > Max latency: 0.031942 > > Min latency: 0.001445 > > > > So it finishes instantly without seeming to do much actual testing... > > My bad. I forgot to tell you to do a sync/flush on the OSDs after the > write test. All of those reads are probably coming from pagecache. The > good news is that this is demonstrating that reading 4k objects from > pagecache isn't insanely bad on your setup (for larger sustained loads I > see 4k object reads from pagecache hit up to around 100MB/s with > multiple clients on my test nodes). > > On your OSD nodes try: > > sync > echo 3 > /proc/sys/vm/drop_caches > > right before you run the read test. > I tell it to test for 300 seconds and it tests for 0 seconds so I must be doing something else wrong. > Whatever issue you are facing is probably down at the filestore level or > possible lower down yet. > > How do your drives benchmark with something like fio doing random 4k > writes? Are your drives dedicated for ceph? What filesystem? Also > what is the journal device you are using? > Drives are dedicated for ceph. I originally put my journals on /, but that was ext3 and my throughput went down even further so the journal shares the osd disk for now. I upgraded to 0.60 and that seems to have made a big difference. If I kill off one of my OSD's I get around 20MB/second throughput in live testing (test restore of Xen Windows VM from USB backup), which is pretty much the limit of the USB disk. If I reactivate the second OSD throughput drops back to ~10MB/second which isn't as good but is much better than I was getting. Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: poor write performance
> Hi, > > > My goal is 4 OSD's, each on separate machines, with 1 drive in each for a > start, but I want to see performance of at least the same order of magnitude > as the theoretical maximum on my hardware before I think about replacing > my existing setup. > > My current understanding is that it's not even possible, you always > have a min 2/3x slow down in the best case. > > If you do sustained sequential write benchmark, and have a single > drive, then that drive ends up writing the data twice (journal + final > storage area) which with the seeks will more than divide by 2 the peak > perf of the drive. And since it's sequential, it will only write to 1 > PG at a time (so not divided among several OSD). > > Also AFAIU the OSD receiving the data will also have to send the data > to the other OSD in the PG and wait for them to say everything is > written before confirming the write, which slows it even more. > Correct, but that's the theoretical maximum I was referring to. If I calculate that I should be able to get 50MB/second then 30MB/second is acceptable but 500KB/second is not :) James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
Hi, > My goal is 4 OSD's, each on separate machines, with 1 drive in each for a > start, but I want to see performance of at least the same order of magnitude > as the theoretical maximum on my hardware before I think about replacing my > existing setup. My current understanding is that it's not even possible; you always have at least a 2-3x slowdown, even in the best case. If you do a sustained sequential write benchmark and have a single drive, then that drive ends up writing the data twice (journal + final storage area), which, together with the seeks, will more than halve the peak performance of the drive. And since it's sequential, it will only write to 1 PG at a time (so the load is not divided among several OSDs). Also AFAIU the OSD receiving the data will also have to send the data to the other OSDs in the PG and wait for them to say everything is written before confirming the write, which slows it even more. Cheers, Sylvain -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
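A rough sanity check of that point, assuming a drive that streams about 45 MB/s sequentially (Sylvain's measured figure; your drive will differ):

    # journal + data on the same spindle: every byte is written twice
    echo "45 / 2" | bc -l   # ~22.5 MB/s ceiling per OSD before any seek penalty
    # seeking between the journal and data areas eats into that further, and with
    # 2x replication the client write is only acked once both OSDs have journaled it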
Re: poor write performance
On 04/19/2013 08:30 PM, James Harper wrote: rados -p -b 4096 bench 300 seq -t 64 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 read got -2 error during benchmark: -5 error 5: (5) Input/output error not sure what that's about... Oops... I typo'd --no-cleanup. Now I get: sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 Total time run: 0.243709 Total reads made: 1292 Read size: 4096 Bandwidth (MB/sec): 20.709 Average Latency: 0.0118838 Max latency: 0.031942 Min latency: 0.001445 So it finishes instantly without seeming to do much actual testing... My bad. I forgot to tell you to do a sync/flush on the OSDs after the write test. All of those reads are probably coming from pagecache. The good news is that this is demonstrating that reading 4k objects from pagecache isn't insanely bad on your setup (for larger sustained loads I see 4k object reads from pagecache hit up to around 100MB/s with multiple clients on my test nodes). On your OSD nodes try: sync echo 3 > /proc/sys/vm/drop_caches right before you run the read test. Whatever issue you are facing is probably down at the filestore level or possibly lower down yet. How do your drives benchmark with something like fio doing random 4k writes? Are your drives dedicated for ceph? What filesystem? Also what is the journal device you are using? Mark James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
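To answer Mark's fio question reproducibly, something along these lines exercises random 4k direct writes against the OSD's filesystem (the test-file path is a placeholder; use a scratch file rather than the raw device on a live OSD):

    fio --name=rand4kwrite --filename=/srv/osd.0/fio-test --size=1G \
        --rw=randwrite --bs=4k --direct=1 --ioengine=sync --runtime=60 --time_based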
Re: poor write performance
James Harper wrote: Hi James, do you have VLAN interfaces configured on your bonding interfaces? Because I saw a similar situation in my setup. No VLAN's on my bonding interface, although extensively used elsewhere. What the OP described is *exactly* like a problem I've been struggling with. I thought the blame lay elsewhere, but maybe not. My setup: 4 Ceph nodes, with 6 OSDs each and dual (bonded) 10GbE, with VLANs, running Precise. OSDs are using XFS. Replica count of 3. 3 of these are mons. 4 compute nodes, with dual (bonded) 10GbE, with VLANs, running a base of Precise along with a 3.6.3 Ceph-provided kernel, running KVM-based VMs. 2 of these are also mons. VMs are Precise and accessing RBD through the kernel client. (Eventually there will be 12 Ceph nodes. 5 mons seemed an appropriate number and when I've run into issues in the past I've actually gotten to cases where > 3 mons were knocked out, so 5 is a comfortable number unless it's problematic.) In the VMs, I/O with ext4 is fine -- 10-15MB/s sustained. However, using ZFS (via ZFSonLinux, not FUSE), I see write speeds of about 150kb/sec, just like the OP. I had figured that the problem lay with ZFS inside the VM (I've used ZFSonLinux on many bare metal machines without a problem for a couple of years now). The VMs were using virtio, and I'd heard that pre-1.4 Qemu versions could have some serious problems with virtio (which I didn't know at the time); also, I know that the kernel client is not the preferred client, and the version I'm using is a rather older version of the Ceph-provided builds. As a result, my plan was to try the updated Qemu version along with native Qemu librados RBD support once Raring was out, as I figured that the problem was either something in ZFSonLinux (though I reported the issue and nobody had ever heard of any such problem, or had any idea why it would be happening) or something specifically about ZFS running inside Qemu, as ext4 in the VMs is fine. But, this thread has made me wonder if what's actually happening is in fact something else -- either something, as someone else saw, to do with using VLANs on the bonded interface (although I don't see such a write problem with any other traffic going through these VLANs); or, something about how ZFS inside the VM is writing to the RBD disk causing some kind of giant slowdown in Ceph. The numbers that the OP cited were exactly in line with what I was seeing. I don't know offhand what the block sizes are that the kernel client was using, or that the different filesystems inside the VMs might be using when trying to write to their virtual disks (I'm guessing that if you are using virtio, as I am, it potentially could be anything). But perhaps ZFS writes extremely small blocks and ext4 doesn't. Unfortunately, I don't have access to this testbed for the next few weeks, so for the moment I can only recount my experience and not actually test out any suggestions (unless I can corral someone with access to it to run tests). Thanks, Jeff -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: poor write performance
> > Hi James, > > do you VLAN's interfaces configured on your bonding interfaces? Because > I saw a similar situation in my setup. > No VLAN's on my bonding interface, although extensively used elsewhere. Thanks James
Re: poor write performance
Hi James, do you have VLAN interfaces configured on your bonding interfaces? Because I saw a similar situation in my setup. Kind Regards Harald Roessler On Fri, 2013-04-19 at 01:11 +0200, James Harper wrote: > > > > Hi James, > > > > This is just pure speculation, but can you assure that the bonding works > > correctly? Maybe you have issues there. I have seen a lot of incorrectly > > configured bonding throughout my life as unix admin. > > > > The bonding gives me iperf performance consistent with 2 x 1GB links so I > think it's okay. > > James > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Mit freundlichen Grüßen, Harald Rößler BTD System GmbH Tel.: +49 (89) - 20 05 - 44 30 Tel.: +49 (89) - 660 291 - 251 Mob.: +49 (151) - 11 70 17 59 Fax: +49 (89) 89 - 20 05 - 44 11 harald.roess...@btd.de www.btd.de Projektbüro Allianz-Arena • Ebene 4 Werner-Heisenberg-Allee 25 • D-80939 München Goethestraße 34 • D-80336 München HRB München 154370 Geschäftsführer: Stefan Leibhard, Kersten Kröhl, Harald Rößler
RE: poor write performance
>
> rados -p -b 4096 bench 300 seq -t 64
>
> sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
>   0       0        0        0        0        0        -       0
> read got -2
> error during benchmark: -5
> error 5: (5) Input/output error
>
> not sure what that's about...

Oops... I typo'd --no-cleanup. Now I get:

sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
  0       0        0        0        0        0        -       0
Total time run:        0.243709
Total reads made:      1292
Read size:             4096
Bandwidth (MB/sec):    20.709

Average Latency:       0.0118838
Max latency:           0.031942
Min latency:           0.001445

So it finishes instantly without seeming to do much actual testing...

James
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: poor write performance
> > On 04/19/2013 06:09 AM, James Harper wrote: > > I just tried a 3.8 series kernel and can now get 25mbytes/second using dd > with a 4mb block size, instead of the 700kbytes/second I was getting with the > debian 3.2 kernel. > > That's unexpected. Was this the kernel on the client, the OSDs, or > both? Kernel on the client. I can't easily change the kernel on the OSD's although if you think it will make a big difference I can arrange it. > > > > I'm still getting 120kbytes/second with a dd 4kb block size though... is > > that > expected? > > that's still quite a bit lower than I'd expect as well. What were your > fs mount options on the OSDs? I didn't explicitly set any, so I guess these are the defaults: xfs (rw,noatime,attr2,delaylog,inode64,noquota) > Can you try some rados bench read/write > tests on your pool? Something like: > > rados -p -b 4096 bench 300 write --no-cleanup -t 64 Ah. It's the --no-cleanup that explains why my previous seq tests didn't work!

Total time run:          300.430516
Total writes made:       26726
Write size:              4096
Bandwidth (MB/sec):      0.347
Stddev Bandwidth:        0.322983
Max bandwidth (MB/sec):  1.34375
Min bandwidth (MB/sec):  0
Average Latency:         0.719337
Stddev Latency:          0.985265
Max latency:             7.2241
Min latency:             0.018218

But then it just hung and I had to hit ctrl-c. What is the unit of measure for latency and for write size? > rados -p -b 4096 bench 300 seq -t 64 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 read got -2 error during benchmark: -5 error 5: (5) Input/output error not sure what that's about... > > with 2 drives and 2x replication I wouldn't expect much without RBD > cache, but 120kb/s is rather excessively bad. :) > What is rbd cache? I've seen it mentioned but haven't found documentation for it anywhere... My goal is 4 OSD's, each on separate machines, with 1 drive in each for a start, but I want to see performance of at least the same order of magnitude as the theoretical maximum on my hardware before I think about replacing my existing setup. Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
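(For the unit question: the -b value and the reported "Write size" are in bytes, and the latency figures are in seconds.) As for rbd cache: it is a client-side writeback cache implemented in librbd, so it helps QEMU/KVM-style librbd users but not the kernel client that a Xen setup like this one is using. A minimal way to enable it for librbd clients, assuming a recent enough version, is a ceph.conf entry like:

    [client]
        rbd cache = true

or per disk on a QEMU command line, e.g. -drive file=rbd:rbd/vm-disk:rbd_cache=true,format=raw,if=virtio,cache=writeback (pool/image names here are placeholders and option spellings can vary between versions).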
Re: poor write performance
On 04/19/2013 06:09 AM, James Harper wrote: I just tried a 3.8 series kernel and can now get 25mbytes/second using dd with a 4mb block size, instead of the 700kbytes/second I was getting with the debian 3.2 kernel. That's unexpected. Was this the kernel on the client, the OSDs, or both? I'm still getting 120kbytes/second with a dd 4kb block size though... is that expected? that's still quite a bit lower than I'd expect as well. What were your fs mount options on the OSDs? Can you try some rados bench read/write tests on your pool? Something like: rados -p -b 4096 bench 300 write --no-cleanup -t 64 rados -p -b 4096 bench 300 seq -t 64 with 2 drives and 2x replication I wouldn't expect much without RBD cache, but 120kb/s is rather excessively bad. :) James Mark -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: poor write performance
I just tried a 3.8 series kernel and can now get 25mbytes/second using dd with a 4mb block size, instead of the 700kbytes/second I was getting with the debian 3.2 kernel. I'm still getting 120kbytes/second with a dd 4kb block size though... is that expected? James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
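The comparison James describes, spelled out (rbd0 is a placeholder for the mapped device, and writing to it is destructive, so only do this on a scratch image):

    dd if=/dev/zero of=/dev/rbd0 bs=4M count=256 oflag=direct    # large sequential writes
    dd if=/dev/zero of=/dev/rbd0 bs=4k count=25600 oflag=direct  # small sequential writes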
RE: poor write performance
>
> I did an strace -c to gather some performance info, if that helps:
>

Oops. Forgot to say that that's an strace -c of the osd process!

> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  78.13   39.589549        2750     14398       967 futex
>  12.45    6.308784        4200      1502           poll
>   7.99    4.048253      224903        18         9 restart_syscall
>   0.65    0.331042         635       521           writev
>   0.34    0.172011       57337         3           SYS_344
>   0.22    0.110395         117       944           close
>   0.08    0.040002         310       129           truncate64
>   0.07    0.036003       12001         3           fsync
>   0.02    0.010611           1     10263           gettimeofday
>   0.02    0.008000        1333         6           pwrite64
>   0.01    0.004941           9       521           fsetxattr
>   0.01    0.004256          33       129           sync_file_range
>   0.01    0.002779           1      3660       814 stat64
>   0.00    0.001775           4       442           sendmsg
>   0.00    0.001266           1      1507           recv
>   0.00    0.001103           1       948         4 open
>   0.00    0.000640           1       979           time
>   0.00    0.000493           1       409           clock_gettime
>   0.00    0.000375           1       522           _llseek
>   0.00    0.000111           1       110           read
>   0.00    0.00               0         1           setxattr
>   0.00    0.00               0         1           getxattr
>   0.00    0.00               0        32         8 fgetxattr
>   0.00    0.00               0         5           statfs64
>   0.00    0.00               0         5         5 fallocate
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   50.672389                 36958      1807 total
>
> Does that look about what you'd expect?
>
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: poor write performance
> > > Where should I start looking for performance problems? I've tried running
> > > some of the benchmark stuff in the documentation but I haven't gotten very
> > > far...
> >
> > Hi James! Sorry to hear about the performance trouble! Is it just
> > sequential 4KB direct IO writes that are giving you troubles? If you
> > are using the kernel version of RBD, we don't have any kind of cache
> > implemented there and since you are bypassing the pagecache on the
> > client, those writes are being sent to the different OSDs in 4KB chunks
> > over the network. RBD stores data in blocks that are represented by 4MB
> > objects on one of the OSDs, so without cache a lot of sequential 4KB
> > writes will be hitting 1 OSD repeatedly and then moving on to the next
> > one. Hopefully those writes would get aggregated at the OSD level, but
> > clearly that's not really happening here given your performance.
>
> Using dd I tried various block sizes. With 4kb I was getting around
> 500kbytes/second rate. With 1MB I was getting a few mbytes/second. Read
> performance seems great though.

I did an strace -c to gather some performance info, if that helps:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.13   39.589549        2750     14398       967 futex
 12.45    6.308784        4200      1502           poll
  7.99    4.048253      224903        18         9 restart_syscall
  0.65    0.331042         635       521           writev
  0.34    0.172011       57337         3           SYS_344
  0.22    0.110395         117       944           close
  0.08    0.040002         310       129           truncate64
  0.07    0.036003       12001         3           fsync
  0.02    0.010611           1     10263           gettimeofday
  0.02    0.008000        1333         6           pwrite64
  0.01    0.004941           9       521           fsetxattr
  0.01    0.004256          33       129           sync_file_range
  0.01    0.002779           1      3660       814 stat64
  0.00    0.001775           4       442           sendmsg
  0.00    0.001266           1      1507           recv
  0.00    0.001103           1       948         4 open
  0.00    0.000640           1       979           time
  0.00    0.000493           1       409           clock_gettime
  0.00    0.000375           1       522           _llseek
  0.00    0.000111           1       110           read
  0.00    0.00               0         1           setxattr
  0.00    0.00               0         1           getxattr
  0.00    0.00               0        32         8 fgetxattr
  0.00    0.00               0         5           statfs64
  0.00    0.00               0         5         5 fallocate
------ ----------- ----------- --------- --------- ----------------
100.00   50.672389                 36958      1807 total

Does that look about what you'd expect?

James
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
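A summary like the one above is typically collected roughly like this (if more than one ceph-osd is running, pick a single pid; Ctrl-C makes strace print the table):

    strace -c -f -p $(pidof ceph-osd)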
RE: poor write performance
> > Where should I start looking for performance problems? I've tried running > > some of the benchmark stuff in the documentation but I haven't gotten very > > far... > > Hi James! Sorry to hear about the performance trouble! Is it just > sequential 4KB direct IO writes that are giving you troubles? If you > are using the kernel version of RBD, we don't have any kind of cache > implemented there and since you are bypassing the pagecache on the > client, those writes are being sent to the different OSDs in 4KB chunks > over the network. RBD stores data in blocks that are represented by 4MB > objects on one of the OSDs, so without cache a lot of sequential 4KB > writes will be hitting 1 OSD repeatedly and then moving on to the next > one. Hopefully those writes would get aggregated at the OSD level, but > clearly that's not really happening here given your performance. Using dd I tried various block sizes. With 4kb I was getting around 500kbytes/second rate. With 1MB I was getting a few mbytes/second. Read performance seems great though. > Here's a couple of thoughts: > > 1) If you are working with VMs, using the QEMU/KVM interface with virtio > drivers and RBD cache enabled will give you a huge jump in small > sequential write performance relative to what you are seeing now. I'm using Xen so that won't work for me right now, although I did notice someone posted some blktap code to support ceph. I'm trying a windows restore of a physical machine into a VM under Xen and performance matches what I am seeing with dd - very very slow. > 2) You may want to try upgrading to 0.60. We made a change to how the > pg_log works that causes fewer disk seeks during small IO, especially > with XFS. Do packages for this exist for Debian? At the moment my sources.list contains "ceph.com/debian-bobtail wheezy main". > 3) If you are still having trouble, testing your network, disk speeds, > and using rados bench to test the object store all may be helpful. > I tried that and while the write worked the seq test always said I had to do a write test first. While running my Xen restore, /var/log/ceph/ceph.log looks like: pgmap v18316: 832 pgs: 832 active+clean; 61443 MB data, 119 GB used, 1742 GB / 1862 GB avail; 824KB/s wr, 12op/s pgmap v18317: 832 pgs: 832 active+clean; 61446 MB data, 119 GB used, 1742 GB / 1862 GB avail; 649KB/s wr, 10op/s pgmap v18318: 832 pgs: 832 active+clean; 61449 MB data, 119 GB used, 1742 GB / 1862 GB avail; 652KB/s wr, 10op/s pgmap v18319: 832 pgs: 832 active+clean; 61452 MB data, 119 GB used, 1742 GB / 1862 GB avail; 614KB/s wr, 9op/s pgmap v18320: 832 pgs: 832 active+clean; 61454 MB data, 119 GB used, 1742 GB / 1862 GB avail; 537KB/s wr, 8op/s pgmap v18321: 832 pgs: 832 active+clean; 61457 MB data, 119 GB used, 1742 GB / 1862 GB avail; 511KB/s wr, 7op/s James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
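The pgmap throughput lines quoted above can also be followed live while a restore is running:

    ceph -w
    # or: tail -f /var/log/ceph/ceph.log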
RE: poor write performance
> > Hi James, > > This is just pure speculation, but can you assure that the bonding works > correctly? Maybe you have issues there. I have seen a lot of incorrectly > configured bonding throughout my life as unix admin. > The bonding gives me iperf performance consistent with 2 x 1GB links so I think it's okay. James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
On 04/18/2013 11:46 AM, Andrey Korolyov wrote: On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson wrote: On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. Can you point into related commits, if possible? here you go: http://tracker.ceph.com/projects/ceph/repository/revisions/188f3ea6867eeb6e950f6efed18d53ff17522bbc 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
On Thu, Apr 18, 2013 at 5:43 PM, Mark Nelson wrote: > On 04/18/2013 06:46 AM, James Harper wrote: >> >> I'm doing some basic testing so I'm not really fussed about poor >> performance, but my write performance appears to be so bad I think I'm doing >> something wrong. >> >> Using dd to test gives me kbytes/second for write performance for 4kb >> block sizes, while read performance is acceptable (for testing at least). >> For dd I'm using iflag=direct for read and oflag=direct for write testing. >> >> My setup, approximately, is: >> >> Two OSD's >> . 1 x 7200RPM SATA disk each >> . 2 x gigabit cluster network interfaces each in a bonded configuration >> directly attached (osd to osd, no switch) >> . 1 x gigabit public network >> . journal on another spindle >> >> Three MON's >> . 1 each on the OSD's >> . 1 on another server, which is also the one used for testing performance >> >> I'm using debian packages from ceph which are version 0.56.4 >> >> For comparison, my existing production storage is 2 servers running DRBD >> with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top >> of the iSCSI. Performance not spectacular but acceptable. The servers in >> question are the same specs as the servers I'm testing on. >> >> Where should I start looking for performance problems? I've tried running >> some of the benchmark stuff in the documentation but I haven't gotten very >> far... > > > Hi James! Sorry to hear about the performance trouble! Is it just > sequential 4KB direct IO writes that are giving you troubles? If you are > using the kernel version of RBD, we don't have any kind of cache implemented > there and since you are bypassing the pagecache on the client, those writes > are being sent to the different OSDs in 4KB chunks over the network. RBD > stores data in blocks that are represented by 4MB objects on one of the > OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD > repeatedly and then moving on to the next one. Hopefully those writes would > get aggregated at the OSD level, but clearly that's not really happening > here given your performance. > > Here's a couple of thoughts: > > 1) If you are working with VMs, using the QEMU/KVM interface with virtio > drivers and RBD cache enabled will give you a huge jump in small sequential > write performance relative to what you are seeing now. > > 2) You may want to try upgrading to 0.60. We made a change to how the > pg_log works that causes fewer disk seeks during small IO, especially with > XFS. Can you point into related commits, if possible? > > 3) If you are still having trouble, testing your network, disk speeds, and > using rados bench to test the object store all may be helpful. > >> >> Thanks >> >> James > > > Good luck! > > >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
On 04/18/2013 06:46 AM, James Harper wrote: I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Hi James! Sorry to hear about the performance trouble! Is it just sequential 4KB direct IO writes that are giving you troubles? If you are using the kernel version of RBD, we don't have any kind of cache implemented there and since you are bypassing the pagecache on the client, those writes are being sent to the different OSDs in 4KB chunks over the network. RBD stores data in blocks that are represented by 4MB objects on one of the OSDs, so without cache a lot of sequential 4KB writes will be hitting 1 OSD repeatedly and then moving on to the next one. Hopefully those writes would get aggregated at the OSD level, but clearly that's not really happening here given your performance. Here's a couple of thoughts: 1) If you are working with VMs, using the QEMU/KVM interface with virtio drivers and RBD cache enabled will give you a huge jump in small sequential write performance relative to what you are seeing now. 2) You may want to try upgrading to 0.60. We made a change to how the pg_log works that causes fewer disk seeks during small IO, especially with XFS. 3) If you are still having trouble, testing your network, disk speeds, and using rados bench to test the object store all may be helpful. Thanks James Good luck! -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: poor write performance
Hi James, This is just pure speculation, but can you assure that the bonding works correctly? Maybe you have issues there. I have seen a lot of incorrectly configured bonding throughout my life as unix admin. Maybe this could help you a little: http://www.wogri.at/Port-Channeling-802-3ad.338.0.html On 04/18/2013 01:46 PM, James Harper wrote: > I'm doing some basic testing so I'm not really fussed about poor performance, > but my write performance appears to be so bad I think I'm doing something > wrong. > > Using dd to test gives me kbytes/second for write performance for 4kb block > sizes, while read performance is acceptable (for testing at least). For dd > I'm using iflag=direct for read and oflag=direct for write testing. > > My setup, approximately, is: > > Two OSD's > . 1 x 7200RPM SATA disk each > . 2 x gigabit cluster network interfaces each in a bonded configuration > directly attached (osd to osd, no switch) > . 1 x gigabit public network > . journal on another spindle > > Three MON's > . 1 each on the OSD's > . 1 on another server, which is also the one used for testing performance > > I'm using debian packages from ceph which are version 0.56.4 > > For comparison, my existing production storage is 2 servers running DRBD with > iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of > the iSCSI. Performance not spectacular but acceptable. The servers in > question are the same specs as the servers I'm testing on. > > Where should I start looking for performance problems? I've tried running > some of the benchmark stuff in the documentation but I haven't gotten very > far... > > Thanks > > James > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- DI (FH) Wolfgang Hennerbichler Software Development Unit Advanced Computing Technologies RISC Software GmbH A company of the Johannes Kepler University Linz IT-Center Softwarepark 35 4232 Hagenberg Austria Phone: +43 7236 3343 245 Fax: +43 7236 3343 250 wolfgang.hennerbich...@risc-software.at http://www.risc-software.at -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
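Two quick checks for the bonding question (the peer address is a placeholder; start iperf -s on the other host first). Note that depending on the bond's hash policy, two parallel streams may still land on one link:

    cat /proc/net/bonding/bond0      # mode, MII status, both slaves up?
    iperf -c <peer-ip> -P 2 -t 30    # two parallel streams to exercise both links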
poor write performance
I'm doing some basic testing so I'm not really fussed about poor performance, but my write performance appears to be so bad I think I'm doing something wrong. Using dd to test gives me kbytes/second for write performance for 4kb block sizes, while read performance is acceptable (for testing at least). For dd I'm using iflag=direct for read and oflag=direct for write testing. My setup, approximately, is: Two OSD's . 1 x 7200RPM SATA disk each . 2 x gigabit cluster network interfaces each in a bonded configuration directly attached (osd to osd, no switch) . 1 x gigabit public network . journal on another spindle Three MON's . 1 each on the OSD's . 1 on another server, which is also the one used for testing performance I'm using debian packages from ceph which are version 0.56.4 For comparison, my existing production storage is 2 servers running DRBD with iSCSI to the initiators which run Xen on top of a (C)LVM volumes on top of the iSCSI. Performance not spectacular but acceptable. The servers in question are the same specs as the servers I'm testing on. Where should I start looking for performance problems? I've tried running some of the benchmark stuff in the documentation but I haven't gotten very far... Thanks James -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
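A reasonable first pass that takes the RBD client out of the picture entirely is rados bench against the pool (pool name is a placeholder; --no-cleanup keeps the written objects around so the seq pass has something to read):

    rados -p <pool> bench 60 write --no-cleanup
    rados -p <pool> bench 60 seq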
Re: Mysteriously poor write performance
Sorry for the delayed reply... I've been tracking some issues which cause high latency on our test machines, and it may be responsible for your problems as well. Could you retry those runs with the same debugging and 'journal dio' set to false? Thanks for your patience, -Sam On Sat, Mar 24, 2012 at 12:09 PM, Andrey Korolyov wrote: > http://xdel.ru/downloads/ceph-logs-dbg/ > > On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just wrote: >> (CCing the list) >> >> Actually, can you could re-do the rados bench run with 'debug journal >> = 20' along with the other debugging? That should give us better >> information. >> >> -Sam >> >> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov wrote: >>> Hi Sam, >>> >>> Can you please suggest on where to start profiling osd? If the >>> bottleneck has related to such non-complex things as directio speed, >>> I`m sure that I was able to catch it long ago, even crossing around by >>> results of other types of benchmarks at host system. I`ve just tried >>> tmpfs under both journals, it has a small boost effect, as expected >>> because of near-zero i/o delay. May be chunk distribution mechanism >>> does not work well on such small amount of nodes but right now I don`t >>> have necessary amount of hardware nodes to prove or disprove that. >>> >>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov wrote: random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 Starting 1 process Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s] random-rw: (groupid=0, jobs=1): err= 0: pid=9647 write: io=163840KB, bw=37760KB/s, iops=9439, runt= 4339msec clat (usec): min=70, max=39801, avg=104.19, stdev=324.29 bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28 cpu : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued r/w: total=0/40960, short=0/0 lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11% lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just wrote: > Our journal writes are actually sequential. Could you send FIO > results for sequential 4k writes osd.0's journal and osd.1's journal? > -Sam > > On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov wrote: >> FIO output for journal partition, directio enabled, seems good(same >> results for ext4 on other single sata disks). 
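The two settings Sam asks about would normally go in the [osd] section of ceph.conf (the same section as the fragment quoted at the end of this message) and typically need an OSD restart to take effect; exact placement is an assumption here:

    [osd]
        journal dio = false
        debug journal = 20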
>> >> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 >> Starting 1 process >> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s] >> random-rw: (groupid=0, jobs=1): err= 0: pid=21926 >> write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec >> clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04 >> bw (KB/s) : min= 552, max= 3880, per=100.61%, avg=2341.23, >> stdev=480.05 >> cpu : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42 >> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >> >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> >=64=0.0% >> issued r/w: total=0/40960, short=0/0 >> lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63% >> lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07% >> lat (msec): 500=0.04% >> >> >> >> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just >> wrote: >>> (CCing the list) >>> >>> So, the problem isn't the bandwidth. Before we respond to the client, >>> we write the operation to the journal. In this case, that operation >>> is taking >1s per operation on osd.1. Both rbd and rados bench will >>> only allow a limited number of ops in flight at a time, so this >>> latency is killing your throughput. For comparison, the latency for >>> writing to the journal on osd.0 is < .3s. Can you measure direct io >>> latency for writes to your osd.1 journal file? >>> -Sam >>> >>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov wrote: Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, not Megabits. On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov wrote: > [global] > log dir = /ceph/out > log_file = "" > logger dir = /ceph/log > pid file = /ceph/out/$type$id.pid > [mds] > pid file = /ceph/out/$name.pid > lockdep = 1
Re: Mysteriously poor write performance
http://xdel.ru/downloads/ceph-logs-dbg/ On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just wrote: > (CCing the list) > > Actually, can you could re-do the rados bench run with 'debug journal > = 20' along with the other debugging? That should give us better > information. > > -Sam > > On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov wrote: >> Hi Sam, >> >> Can you please suggest on where to start profiling osd? If the >> bottleneck has related to such non-complex things as directio speed, >> I`m sure that I was able to catch it long ago, even crossing around by >> results of other types of benchmarks at host system. I`ve just tried >> tmpfs under both journals, it has a small boost effect, as expected >> because of near-zero i/o delay. May be chunk distribution mechanism >> does not work well on such small amount of nodes but right now I don`t >> have necessary amount of hardware nodes to prove or disprove that. >> >> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov wrote: >>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 >>> Starting 1 process >>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s] >>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647 >>> write: io=163840KB, bw=37760KB/s, iops=9439, runt= 4339msec >>> clat (usec): min=70, max=39801, avg=104.19, stdev=324.29 >>> bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28 >>> cpu : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26 >>> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >>> >=64=0.0% >>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >>> >=64=0.0% >>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >>> >=64=0.0% >>> issued r/w: total=0/40960, short=0/0 >>> lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11% >>> lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% >>> >>> >>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just wrote: Our journal writes are actually sequential. Could you send FIO results for sequential 4k writes osd.0's journal and osd.1's journal? -Sam On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov wrote: > FIO output for journal partition, directio enabled, seems good(same > results for ext4 on other single sata disks). > > random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 > Starting 1 process > Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s] > random-rw: (groupid=0, jobs=1): err= 0: pid=21926 > write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec > clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04 > bw (KB/s) : min= 552, max= 3880, per=100.61%, avg=2341.23, > stdev=480.05 > cpu : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued r/w: total=0/40960, short=0/0 > lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63% > lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07% > lat (msec): 500=0.04% > > > > On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just > wrote: >> (CCing the list) >> >> So, the problem isn't the bandwidth. Before we respond to the client, >> we write the operation to the journal. In this case, that operation >> is taking >1s per operation on osd.1. 
Both rbd and rados bench will >> only allow a limited number of ops in flight at a time, so this >> latency is killing your throughput. For comparison, the latency for >> writing to the journal on osd.0 is < .3s. Can you measure direct io >> latency for writes to your osd.1 journal file? >> -Sam >> >> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov wrote: >>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, >>> not Megabits. >>> >>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov >>> wrote: [global] log dir = /ceph/out log_file = "" logger dir = /ceph/log pid file = /ceph/out/$type$id.pid [mds] pid file = /ceph/out/$name.pid lockdep = 1 mds log max segments = 2 [osd] lockdep = 1 filestore_xattr_use_omap = 1 osd data = /ceph/dev/osd$id osd journal = /ceph/meta/journal osd journal size = 100 [mon] lockdep = 1 mon data = /ceph/dev/mon$id [mon.0] host = 172.20.1.32 mon addr = 172.20.1.32:6789 [mon.1]
Re: Mysteriously poor write performance
(CCing the list) Actually, can you could re-do the rados bench run with 'debug journal = 20' along with the other debugging? That should give us better information. -Sam On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov wrote: > Hi Sam, > > Can you please suggest on where to start profiling osd? If the > bottleneck has related to such non-complex things as directio speed, > I`m sure that I was able to catch it long ago, even crossing around by > results of other types of benchmarks at host system. I`ve just tried > tmpfs under both journals, it has a small boost effect, as expected > because of near-zero i/o delay. May be chunk distribution mechanism > does not work well on such small amount of nodes but right now I don`t > have necessary amount of hardware nodes to prove or disprove that. > > On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov wrote: >> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 >> Starting 1 process >> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s] >> random-rw: (groupid=0, jobs=1): err= 0: pid=9647 >> write: io=163840KB, bw=37760KB/s, iops=9439, runt= 4339msec >> clat (usec): min=70, max=39801, avg=104.19, stdev=324.29 >> bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28 >> cpu : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26 >> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> >=64=0.0% >> issued r/w: total=0/40960, short=0/0 >> lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11% >> lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% >> >> >> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just wrote: >>> Our journal writes are actually sequential. Could you send FIO >>> results for sequential 4k writes osd.0's journal and osd.1's journal? >>> -Sam >>> >>> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov wrote: FIO output for journal partition, directio enabled, seems good(same results for ext4 on other single sata disks). random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 Starting 1 process Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s] random-rw: (groupid=0, jobs=1): err= 0: pid=21926 write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04 bw (KB/s) : min= 552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05 cpu : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued r/w: total=0/40960, short=0/0 lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63% lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07% lat (msec): 500=0.04% On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just wrote: > (CCing the list) > > So, the problem isn't the bandwidth. Before we respond to the client, > we write the operation to the journal. In this case, that operation > is taking >1s per operation on osd.1. Both rbd and rados bench will > only allow a limited number of ops in flight at a time, so this > latency is killing your throughput. For comparison, the latency for > writing to the journal on osd.0 is < .3s. Can you measure direct io > latency for writes to your osd.1 journal file? 
> -Sam > > On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov wrote: >> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, >> not Megabits. >> >> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov wrote: >>> [global] >>> log dir = /ceph/out >>> log_file = "" >>> logger dir = /ceph/log >>> pid file = /ceph/out/$type$id.pid >>> [mds] >>> pid file = /ceph/out/$name.pid >>> lockdep = 1 >>> mds log max segments = 2 >>> [osd] >>> lockdep = 1 >>> filestore_xattr_use_omap = 1 >>> osd data = /ceph/dev/osd$id >>> osd journal = /ceph/meta/journal >>> osd journal size = 100 >>> [mon] >>> lockdep = 1 >>> mon data = /ceph/dev/mon$id >>> [mon.0] >>> host = 172.20.1.32 >>> mon addr = 172.20.1.32:6789 >>> [mon.1] >>> host = 172.20.1.33 >>> mon addr = 172.20.1.33:6789 >>> [mon.2] >>> host = 172.20.1.35 >>> mon addr = 172.20.1.35:6789 >>> [osd.0] >>> host = 172.20.1.32 >>> [osd.1] >>>
Re: Mysteriously poor write performance
random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 Starting 1 process Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s] random-rw: (groupid=0, jobs=1): err= 0: pid=9647 write: io=163840KB, bw=37760KB/s, iops=9439, runt= 4339msec clat (usec): min=70, max=39801, avg=104.19, stdev=324.29 bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28 cpu : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued r/w: total=0/40960, short=0/0 lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11% lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just wrote: > Our journal writes are actually sequential. Could you send FIO > results for sequential 4k writes osd.0's journal and osd.1's journal? > -Sam > > On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov wrote: >> FIO output for journal partition, directio enabled, seems good(same >> results for ext4 on other single sata disks). >> >> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 >> Starting 1 process >> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s] >> random-rw: (groupid=0, jobs=1): err= 0: pid=21926 >> write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec >> clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04 >> bw (KB/s) : min= 552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05 >> cpu : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42 >> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> >=64=0.0% >> issued r/w: total=0/40960, short=0/0 >> lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63% >> lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07% >> lat (msec): 500=0.04% >> >> >> >> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just wrote: >>> (CCing the list) >>> >>> So, the problem isn't the bandwidth. Before we respond to the client, >>> we write the operation to the journal. In this case, that operation >>> is taking >1s per operation on osd.1. Both rbd and rados bench will >>> only allow a limited number of ops in flight at a time, so this >>> latency is killing your throughput. For comparison, the latency for >>> writing to the journal on osd.0 is < .3s. Can you measure direct io >>> latency for writes to your osd.1 journal file? >>> -Sam >>> >>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov wrote: Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, not Megabits. 
On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov wrote: > [global] > log dir = /ceph/out > log_file = "" > logger dir = /ceph/log > pid file = /ceph/out/$type$id.pid > [mds] > pid file = /ceph/out/$name.pid > lockdep = 1 > mds log max segments = 2 > [osd] > lockdep = 1 > filestore_xattr_use_omap = 1 > osd data = /ceph/dev/osd$id > osd journal = /ceph/meta/journal > osd journal size = 100 > [mon] > lockdep = 1 > mon data = /ceph/dev/mon$id > [mon.0] > host = 172.20.1.32 > mon addr = 172.20.1.32:6789 > [mon.1] > host = 172.20.1.33 > mon addr = 172.20.1.33:6789 > [mon.2] > host = 172.20.1.35 > mon addr = 172.20.1.35:6789 > [osd.0] > host = 172.20.1.32 > [osd.1] > host = 172.20.1.33 > [mds.a] > host = 172.20.1.32 > > /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr) > /dev/mapper/system-cephmeta on /ceph/meta type ext4 > (rw,barrier=0,user_xattr) > Simple performance tests on those fs shows ~133Mb/s for /ceph and > metadata/. Also both machines do not hold anything else which may > impact osd. > > Also please note of following: > > http://i.imgur.com/ZgFdO.png > > First two peaks are related to running rados bench, then goes cluster > recreation, automated debian install and final peaks are dd test. > Surely I can have more precise graphs, but current one probably enough > to state a situation - rbd utilizing about a quarter of possible > bandwidth(if we can count rados bench as 100%). > > On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just > wrote: >> Hmm, there seem to be writes taking as long as 1.5s to hit journal on >> osd.1... Could you post your ceph.conf? Might there be a problem >> with the osd.1 jo
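The sequential-write numbers above can be reproduced with a single fio invocation. A sketch, assuming fio's standard command-line options and the journal path from the ceph.conf quoted in this thread; run it only with the osd stopped, since it overwrites the journal contents:

  fio --name=journal-seq --filename=/ceph/meta/journal --rw=write --bs=4k \
      --size=160m --direct=1 --ioengine=sync --iodepth=2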
Re: Mysteriously poor write performance
Our journal writes are actually sequential. Could you send FIO results for sequential 4k writes osd.0's journal and osd.1's journal? -Sam On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov wrote: > FIO output for journal partition, directio enabled, seems good(same > results for ext4 on other single sata disks). > > random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 > Starting 1 process > Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s] > random-rw: (groupid=0, jobs=1): err= 0: pid=21926 > write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec > clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04 > bw (KB/s) : min= 552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05 > cpu : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w: total=0/40960, short=0/0 > lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63% > lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07% > lat (msec): 500=0.04% > > > > On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just wrote: >> (CCing the list) >> >> So, the problem isn't the bandwidth. Before we respond to the client, >> we write the operation to the journal. In this case, that operation >> is taking >1s per operation on osd.1. Both rbd and rados bench will >> only allow a limited number of ops in flight at a time, so this >> latency is killing your throughput. For comparison, the latency for >> writing to the journal on osd.0 is < .3s. Can you measure direct io >> latency for writes to your osd.1 journal file? >> -Sam >> >> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov wrote: >>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, >>> not Megabits. >>> >>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov wrote: [global] log dir = /ceph/out log_file = "" logger dir = /ceph/log pid file = /ceph/out/$type$id.pid [mds] pid file = /ceph/out/$name.pid lockdep = 1 mds log max segments = 2 [osd] lockdep = 1 filestore_xattr_use_omap = 1 osd data = /ceph/dev/osd$id osd journal = /ceph/meta/journal osd journal size = 100 [mon] lockdep = 1 mon data = /ceph/dev/mon$id [mon.0] host = 172.20.1.32 mon addr = 172.20.1.32:6789 [mon.1] host = 172.20.1.33 mon addr = 172.20.1.33:6789 [mon.2] host = 172.20.1.35 mon addr = 172.20.1.35:6789 [osd.0] host = 172.20.1.32 [osd.1] host = 172.20.1.33 [mds.a] host = 172.20.1.32 /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr) /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr) Simple performance tests on those fs shows ~133Mb/s for /ceph and metadata/. Also both machines do not hold anything else which may impact osd. Also please note of following: http://i.imgur.com/ZgFdO.png First two peaks are related to running rados bench, then goes cluster recreation, automated debian install and final peaks are dd test. Surely I can have more precise graphs, but current one probably enough to state a situation - rbd utilizing about a quarter of possible bandwidth(if we can count rados bench as 100%). On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just wrote: > Hmm, there seem to be writes taking as long as 1.5s to hit journal on > osd.1... Could you post your ceph.conf? Might there be a problem > with the osd.1 journal disk? 
> -Sam > > On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov wrote: >> Oh, sorry - they probably inherited rights from log files, fixed. >> >> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just >> wrote: >>> I get 403 Forbidden when I try to download any of the files. >>> -Sam >>> >>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov >>> wrote: http://xdel.ru/downloads/ceph-logs/ 1/ contains logs related to bench initiated at the osd0 machine and 2/ - at osd1. On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just wrote: > Hmm, I'm seeing some very high latency on ops sent to osd.1. Can you > post osd.1's logs? > -Sam > > On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov > wrote: >> Here, please: http://xdel.ru/downloads/ceph.log.gz >> >> Sometimes 'cur MB/s ' shows zero during rados bench, even if any >> debug >> output disabled and log_file set to
Re: Mysteriously poor write performance
(CCing the list) So, the problem isn't the bandwidth. Before we respond to the client, we write the operation to the journal. In this case, that operation is taking >1s per operation on osd.1. Both rbd and rados bench will only allow a limited number of ops in flight at a time, so this latency is killing your throughput. For comparison, the latency for writing to the journal on osd.0 is < .3s. Can you measure direct io latency for writes to your osd.1 journal file? -Sam On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov wrote: > Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, > not Megabits. > > On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov wrote: >> [global] >> log dir = /ceph/out >> log_file = "" >> logger dir = /ceph/log >> pid file = /ceph/out/$type$id.pid >> [mds] >> pid file = /ceph/out/$name.pid >> lockdep = 1 >> mds log max segments = 2 >> [osd] >> lockdep = 1 >> filestore_xattr_use_omap = 1 >> osd data = /ceph/dev/osd$id >> osd journal = /ceph/meta/journal >> osd journal size = 100 >> [mon] >> lockdep = 1 >> mon data = /ceph/dev/mon$id >> [mon.0] >> host = 172.20.1.32 >> mon addr = 172.20.1.32:6789 >> [mon.1] >> host = 172.20.1.33 >> mon addr = 172.20.1.33:6789 >> [mon.2] >> host = 172.20.1.35 >> mon addr = 172.20.1.35:6789 >> [osd.0] >> host = 172.20.1.32 >> [osd.1] >> host = 172.20.1.33 >> [mds.a] >> host = 172.20.1.32 >> >> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr) >> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr) >> Simple performance tests on those fs shows ~133Mb/s for /ceph and >> metadata/. Also both machines do not hold anything else which may >> impact osd. >> >> Also please note of following: >> >> http://i.imgur.com/ZgFdO.png >> >> First two peaks are related to running rados bench, then goes cluster >> recreation, automated debian install and final peaks are dd test. >> Surely I can have more precise graphs, but current one probably enough >> to state a situation - rbd utilizing about a quarter of possible >> bandwidth(if we can count rados bench as 100%). >> >> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just wrote: >>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on >>> osd.1... Could you post your ceph.conf? Might there be a problem >>> with the osd.1 journal disk? >>> -Sam >>> >>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov wrote: Oh, sorry - they probably inherited rights from log files, fixed. On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just wrote: > I get 403 Forbidden when I try to download any of the files. > -Sam > > On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov wrote: >> http://xdel.ru/downloads/ceph-logs/ >> >> 1/ contains logs related to bench initiated at the osd0 machine and 2/ >> - at osd1. >> >> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just >> wrote: >>> Hmm, I'm seeing some very high latency on ops sent to osd.1. Can you >>> post osd.1's logs? >>> -Sam >>> >>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov wrote: Here, please: http://xdel.ru/downloads/ceph.log.gz Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug output disabled and log_file set to the empty value, hope it`s okay. On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just wrote: > Can you set osd and filestore debugging to 20, restart the osds, run > rados bench as before, and post the logs? 
> -Sam Just > > On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov > wrote: >> rados bench 60 write -p data >> >> Total time run: 61.217676 >> Total writes made: 989 >> Write size: 4194304 >> Bandwidth (MB/sec): 64.622 >> >> Average Latency: 0.989608 >> Max latency: 2.21701 >> Min latency: 0.255315 >> >> Here a snip from osd log, seems write size is okay. >> >> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 >> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 >> active+clean] removing repgather(0x31b5360 applying 10'83 >> rep_tid=597 >> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 >> [write >> 1220608~4096] 0.17eb9fd8) v4) >> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 >> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 >> active+clean] q front is repgather(0x31b5360 applying 10'83 >> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 >> rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4) >> >> So
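Sam's point about the limited number of in-flight ops can be put into numbers: with N operations in flight, each of size S, and per-op journal latency L, throughput is roughly N*S/L. Assuming rados bench's default of 16 concurrent 4 MB writes, the ~1 s latency seen on osd.1 predicts almost exactly the bandwidth reported elsewhere in this thread:

  # rough model: throughput = ops_in_flight * op_size / latency
  echo "16 * 4 / 0.99" | bc -l    # ~64.6 MB/s, versus the 64.622 MB/s rados bench result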
Re: Mysteriously poor write performance
Can you set osd and filestore debugging to 20, restart the osds, run rados bench as before, and post the logs? -Sam Just On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov wrote: > rados bench 60 write -p data > > Total time run: 61.217676 > Total writes made: 989 > Write size: 4194304 > Bandwidth (MB/sec): 64.622 > > Average Latency: 0.989608 > Max latency: 2.21701 > Min latency: 0.255315 > > Here a snip from osd log, seems write size is okay. > > 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 > (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 > active+clean] removing repgather(0x31b5360 applying 10'83 rep_tid=597 > wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write > 1220608~4096] 0.17eb9fd8) v4) > 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 > (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 > active+clean] q front is repgather(0x31b5360 applying 10'83 > rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 > rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4) > > Sorry for my previous question about rbd chunks, it was really stupid :) > > On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin > wrote: >> On 03/19/2012 11:13 AM, Andrey Korolyov wrote: >>> >>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage >>> mentioned too small value and I`ve changed it to 64M before posting >>> previous message with no success - both 8M and this value cause a >>> performance drop. When I tried to wrote small amount of data that can >>> be compared to writeback cache size(both on raw device and ext3 with >>> sync option), following results were made: >> >> >> I just want to clarify that the writeback window isn't a full writeback >> cache - it doesn't affect reads, and does not help with request merging etc. >> It simply allows a bunch of writes to be in flight while acking the write to >> the guest immediately. We're working on a full-fledged writeback cache that >> to replace the writeback window. >> >> >>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost >>> same without oflag there and in the following samples) >>> 10+0 records in >>> 10+0 records out >>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s >>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct >>> 20+0 records in >>> 20+0 records out >>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s >>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct >>> 30+0 records in >>> 30+0 records out >>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s >>> >>> and so on. Reference test with bs=1M and count=2000 has slightly worse >>> results _with_ writeback cache than without, as I`ve mentioned before. >>> Here the bench results, they`re almost equal on both nodes: >>> >>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec >> >> >> One thing to check is the size of the writes that are actually being sent by >> rbd. The guest is probably splitting them into relatively small (128 or >> 256k) writes. Ideally it would be sending 4k writes, and this should be a >> lot faster. >> >> You can see the writes being sent by adding debug_ms=1 to the client or osd. >> The format is osd_op(.*[write OFFSET~LENGTH]). >> >> >>> Also, because I`ve not mentioned it before, network performance is >>> enough to hold fair gigabit connectivity with MTU 1500. 
Seems that it >>> is not interrupt problem or something like it - even if ceph-osd, >>> ethernet card queues and kvm instance pinned to different sets of >>> cores, nothing changes. >>> >>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum >>> wrote: It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM). If you are using KVM, you probably want 8192 (~80MB) rather than 8192000 (~8MB). What options are you running dd with? If you run a rados bench from both machines, what do the results look like? Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) -Greg On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: > More strangely, writing speed drops down by fifteen percent when this > option was set in vm` config(instead of result from > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). > As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been > recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and > 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes > under heavy load. > > On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net)> wrote: >> >> On Sat, 17 Mar 2012, Andrey Korolyov wrote: >>> >>> Hi, >>> >>> I`ve did some performance test
Re: Mysteriously poor write performance
rados bench 60 write -p data Total time run:61.217676 Total writes made: 989 Write size:4194304 Bandwidth (MB/sec):64.622 Average Latency: 0.989608 Max latency: 2.21701 Min latency: 0.255315 Here a snip from osd log, seems write size is okay. 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 active+clean] removing repgather(0x31b5360 applying 10'83 rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4) 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 active+clean]q front is repgather(0x31b5360 applying 10'83 rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.0040 [write 1220608~4096] 0.17eb9fd8) v4) Sorry for my previous question about rbd chunks, it was really stupid :) On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin wrote: > On 03/19/2012 11:13 AM, Andrey Korolyov wrote: >> >> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage >> mentioned too small value and I`ve changed it to 64M before posting >> previous message with no success - both 8M and this value cause a >> performance drop. When I tried to wrote small amount of data that can >> be compared to writeback cache size(both on raw device and ext3 with >> sync option), following results were made: > > > I just want to clarify that the writeback window isn't a full writeback > cache - it doesn't affect reads, and does not help with request merging etc. > It simply allows a bunch of writes to be in flight while acking the write to > the guest immediately. We're working on a full-fledged writeback cache that > to replace the writeback window. > > >> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost >> same without oflag there and in the following samples) >> 10+0 records in >> 10+0 records out >> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s >> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct >> 20+0 records in >> 20+0 records out >> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s >> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct >> 30+0 records in >> 30+0 records out >> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s >> >> and so on. Reference test with bs=1M and count=2000 has slightly worse >> results _with_ writeback cache than without, as I`ve mentioned before. >> Here the bench results, they`re almost equal on both nodes: >> >> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec > > > One thing to check is the size of the writes that are actually being sent by > rbd. The guest is probably splitting them into relatively small (128 or > 256k) writes. Ideally it would be sending 4k writes, and this should be a > lot faster. > > You can see the writes being sent by adding debug_ms=1 to the client or osd. > The format is osd_op(.*[write OFFSET~LENGTH]). > > >> Also, because I`ve not mentioned it before, network performance is >> enough to hold fair gigabit connectivity with MTU 1500. Seems that it >> is not interrupt problem or something like it - even if ceph-osd, >> ethernet card queues and kvm instance pinned to different sets of >> cores, nothing changes. >> >> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum >> wrote: >>> >>> It sounds like maybe you're using Xen? The "rbd writeback window" option >>> only works for userspace rbd implementations (eg, KVM). 
>>> If you are using KVM, you probably want 8192 (~80MB) rather than >>> 8192000 (~8MB). >>> >>> What options are you running dd with? If you run a rados bench from both >>> machines, what do the results look like? >>> Also, can you do the ceph osd bench on each of your OSDs, please? >>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) >>> -Greg >>> >>> >>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: >>> More strangely, writing speed drops down by fifteen percent when this option was set in vm` config(instead of result from http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes under heavy load. On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil>>> (mailto:s...@newdream.net)> wrote: > > On Sat, 17 Mar 2012, Andrey Korolyov wrote: >> >> Hi, >> >> I`ve did some performance tests at the following configuration: >> >> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - >> dom0 with three dedicated cores and 1.5G, mostly idle. First three >> disks on each r410 arranged into raid0 and holds osd data when fourth >> holds os and osd` journal partition, all
Re: Mysteriously poor write performance
Thanks to Greg, I have noticed very strange thing - data pool filled with a bunch of objects like rb.0.0.04db with typical size 4194304 when original pool for guest os has size only 112(created as 40g). Seems that something went wrong, because on 0.42 I had more impressive performance on cheaper hardware. For first time, I blamed recent crash and recreated cluster from scratch about a hour ago, but those objects created in a bare data/ pool with only one vm. On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin wrote: > On 03/19/2012 11:13 AM, Andrey Korolyov wrote: >> >> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage >> mentioned too small value and I`ve changed it to 64M before posting >> previous message with no success - both 8M and this value cause a >> performance drop. When I tried to wrote small amount of data that can >> be compared to writeback cache size(both on raw device and ext3 with >> sync option), following results were made: > > > I just want to clarify that the writeback window isn't a full writeback > cache - it doesn't affect reads, and does not help with request merging etc. > It simply allows a bunch of writes to be in flight while acking the write to > the guest immediately. We're working on a full-fledged writeback cache that > to replace the writeback window. > > >> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost >> same without oflag there and in the following samples) >> 10+0 records in >> 10+0 records out >> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s >> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct >> 20+0 records in >> 20+0 records out >> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s >> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct >> 30+0 records in >> 30+0 records out >> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s >> >> and so on. Reference test with bs=1M and count=2000 has slightly worse >> results _with_ writeback cache than without, as I`ve mentioned before. >> Here the bench results, they`re almost equal on both nodes: >> >> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec > > > One thing to check is the size of the writes that are actually being sent by > rbd. The guest is probably splitting them into relatively small (128 or > 256k) writes. Ideally it would be sending 4k writes, and this should be a > lot faster. > > You can see the writes being sent by adding debug_ms=1 to the client or osd. > The format is osd_op(.*[write OFFSET~LENGTH]). > > >> Also, because I`ve not mentioned it before, network performance is >> enough to hold fair gigabit connectivity with MTU 1500. Seems that it >> is not interrupt problem or something like it - even if ceph-osd, >> ethernet card queues and kvm instance pinned to different sets of >> cores, nothing changes. >> >> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum >> wrote: >>> >>> It sounds like maybe you're using Xen? The "rbd writeback window" option >>> only works for userspace rbd implementations (eg, KVM). >>> If you are using KVM, you probably want 8192 (~80MB) rather than >>> 8192000 (~8MB). >>> >>> What options are you running dd with? If you run a rados bench from both >>> machines, what do the results look like? >>> Also, can you do the ceph osd bench on each of your OSDs, please? 
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) >>> -Greg >>> >>> >>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: >>> More strangely, writing speed drops down by fifteen percent when this option was set in vm` config(instead of result from http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes under heavy load. On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil>>> (mailto:s...@newdream.net)> wrote: > > On Sat, 17 Mar 2012, Andrey Korolyov wrote: >> >> Hi, >> >> I`ve did some performance tests at the following configuration: >> >> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - >> dom0 with three dedicated cores and 1.5G, mostly idle. First three >> disks on each r410 arranged into raid0 and holds osd data when fourth >> holds os and osd` journal partition, all ceph-related stuff mounted on >> the ext4 without barriers. >> >> Firstly, I`ve noticed about a difference of benchmark performance and >> write speed through rbd from small kvm instance running on one of >> first two machines - when bench gave me about 110Mb/s, writing zeros >> to raw block device inside vm with dd was at top speed about 45 mb/s, >> for vm`fs (ext4 with default options) performance drops to ~23Mb/s. >> Things get worse, when I`ve started second vm at second hos
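Those rb.0.0.* objects are expected rather than a sign of a problem: rbd stripes an image across 4 MB RADOS objects (hence the 4194304 sizes), and the objects are created lazily as the guest writes. A quick sanity check, sketched with the pool name and object prefix taken from the message above:

  rados -p data ls | grep -c '^rb\.0\.0\.'    # objects allocated so far for the image
  echo "40 * 1024 / 4" | bc                   # 10240 objects if the whole 40G image were written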
Re: Mysteriously poor write performance
On 03/19/2012 11:13 AM, Andrey Korolyov wrote: Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage mentioned too small value and I`ve changed it to 64M before posting previous message with no success - both 8M and this value cause a performance drop. When I tried to wrote small amount of data that can be compared to writeback cache size(both on raw device and ext3 with sync option), following results were made: I just want to clarify that the writeback window isn't a full writeback cache - it doesn't affect reads, and does not help with request merging etc. It simply allows a bunch of writes to be in flight while acking the write to the guest immediately. We're working on a full-fledged writeback cache that to replace the writeback window. dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost same without oflag there and in the following samples) 10+0 records in 10+0 records out 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct 20+0 records in 20+0 records out 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct 30+0 records in 30+0 records out 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s and so on. Reference test with bs=1M and count=2000 has slightly worse results _with_ writeback cache than without, as I`ve mentioned before. Here the bench results, they`re almost equal on both nodes: bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec One thing to check is the size of the writes that are actually being sent by rbd. The guest is probably splitting them into relatively small (128 or 256k) writes. Ideally it would be sending 4k writes, and this should be a lot faster. You can see the writes being sent by adding debug_ms=1 to the client or osd. The format is osd_op(.*[write OFFSET~LENGTH]). Also, because I`ve not mentioned it before, network performance is enough to hold fair gigabit connectivity with MTU 1500. Seems that it is not interrupt problem or something like it - even if ceph-osd, ethernet card queues and kvm instance pinned to different sets of cores, nothing changes. On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum wrote: It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM). If you are using KVM, you probably want 8192 (~80MB) rather than 8192000 (~8MB). What options are you running dd with? If you run a rados bench from both machines, what do the results look like? Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) -Greg On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: More strangely, writing speed drops down by fifteen percent when this option was set in vm` config(instead of result from http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes under heavy load. On Sun, Mar 18, 2012 at 10:22 PM, Sage Weilmailto:s...@newdream.net)> wrote: On Sat, 17 Mar 2012, Andrey Korolyov wrote: Hi, I`ve did some performance tests at the following configuration: mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - dom0 with three dedicated cores and 1.5G, mostly idle. 
First three disks on each r410 arranged into raid0 and holds osd data when fourth holds os and osd` journal partition, all ceph-related stuff mounted on the ext4 without barriers. Firstly, I`ve noticed about a difference of benchmark performance and write speed through rbd from small kvm instance running on one of first two machines - when bench gave me about 110Mb/s, writing zeros to raw block device inside vm with dd was at top speed about 45 mb/s, for vm`fs (ext4 with default options) performance drops to ~23Mb/s. Things get worse, when I`ve started second vm at second host and tried to continue same dd tests simultaneously - performance fairly divided by half for each instance :). Enabling jumbo frames, playing with cpu affinity for ceph and vm instances and trying different TCP congestion protocols gave no effect at all - with DCTCP I have slightly smoother network load graph and that`s all. Can ml please suggest anything to try to improve performance? Can you try setting rbd writeback window = 8192000 or similar, and see what kind of effect that has? I suspect it'll speed up dd; I'm less sure about ext3. Thanks! sage ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org (mailto:majord...@vger.kernel.org) More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe fr
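Josh's suggestion is easy to act on once debug_ms = 1 is enabled: the write sizes can be pulled straight out of the osd (or client) log and histogrammed. A sketch; the exact log filename is a guess based on the 'log dir = /ceph/out' setting in the posted ceph.conf:

  grep -oE '\[write [0-9]+~[0-9]+\]' /ceph/out/osd.0.log \
    | awk -F'~' '{ sub(/\]/, "", $2); print $2 }' \
    | sort -n | uniq -c    # count of writes per length in bytes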
Re: Mysteriously poor write performance
On Monday, March 19, 2012 at 11:13 AM, Andrey Korolyov wrote: > Nope, I`m using KVM for rbd guests. Ah, okay — I'm not sure what your reference to dom0 and mon2 meant, then? > Surely I`ve been noticed that Sage > mentioned too small value and I`ve changed it to 64M before posting > previous message with no success - both 8M and this value cause a > performance drop. When I tried to wrote small amount of data that can > be compared to writeback cache size(both on raw device and ext3 with > sync option), following results were made: > dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost > same without oflag there and in the following samples) > 10+0 records in > 10+0 records out > 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s > dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct > 20+0 records in > 20+0 records out > 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s > dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct > 30+0 records in > 30+0 records out > 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s > > and so on. Reference test with bs=1M and count=2000 has slightly worse > results _with_ writeback cache than without, as I`ve mentioned before. > Here the bench results, they`re almost equal on both nodes: > > bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec Okay, this is all a little odd to me. Can you send along your ceph.conf (along with any other pool config changes you've made) and the output from a rados bench (60 seconds or so)? -Greg > > Also, because I`ve not mentioned it before, network performance is > enough to hold fair gigabit connectivity with MTU 1500. Seems that it > is not interrupt problem or something like it - even if ceph-osd, > ethernet card queues and kvm instance pinned to different sets of > cores, nothing changes. > > On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum > mailto:gregory.far...@dreamhost.com)> wrote: > > It sounds like maybe you're using Xen? The "rbd writeback window" option > > only works for userspace rbd implementations (eg, KVM). > > If you are using KVM, you probably want 8192 (~80MB) rather than > > 8192000 (~8MB). > > > > What options are you running dd with? If you run a rados bench from both > > machines, what do the results look like? > > Also, can you do the ceph osd bench on each of your OSDs, please? > > (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) > > -Greg > > > > > > On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: > > > > > More strangely, writing speed drops down by fifteen percent when this > > > option was set in vm` config(instead of result from > > > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). > > > As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been > > > recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and > > > 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes > > > under heavy load. > > > > > > On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil > > (mailto:s...@newdream.net)> wrote: > > > > On Sat, 17 Mar 2012, Andrey Korolyov wrote: > > > > > Hi, > > > > > > > > > > I`ve did some performance tests at the following configuration: > > > > > > > > > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - > > > > > dom0 with three dedicated cores and 1.5G, mostly idle. 
First three > > > > > disks on each r410 arranged into raid0 and holds osd data when fourth > > > > > holds os and osd` journal partition, all ceph-related stuff mounted on > > > > > the ext4 without barriers. > > > > > > > > > > Firstly, I`ve noticed about a difference of benchmark performance and > > > > > write speed through rbd from small kvm instance running on one of > > > > > first two machines - when bench gave me about 110Mb/s, writing zeros > > > > > to raw block device inside vm with dd was at top speed about 45 mb/s, > > > > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s. > > > > > Things get worse, when I`ve started second vm at second host and tried > > > > > to continue same dd tests simultaneously - performance fairly divided > > > > > by half for each instance :). Enabling jumbo frames, playing with cpu > > > > > affinity for ceph and vm instances and trying different TCP congestion > > > > > protocols gave no effect at all - with DCTCP I have slightly smoother > > > > > network load graph and that`s all. > > > > > > > > > > Can ml please suggest anything to try to improve performance? > > > > > > > > Can you try setting > > > > > > > > rbd writeback window = 8192000 > > > > > > > > or similar, and see what kind of effect that has? I suspect it'll speed > > > > up dd; I'm less sure about ext3. > > > > > > > > Thanks! > > > > sage > > > > > > > > > > > > > > > > > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" >
Re: Mysteriously poor write performance
Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage mentioned too small value and I`ve changed it to 64M before posting previous message with no success - both 8M and this value cause a performance drop. When I tried to wrote small amount of data that can be compared to writeback cache size(both on raw device and ext3 with sync option), following results were made: dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost same without oflag there and in the following samples) 10+0 records in 10+0 records out 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct 20+0 records in 20+0 records out 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct 30+0 records in 30+0 records out 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s and so on. Reference test with bs=1M and count=2000 has slightly worse results _with_ writeback cache than without, as I`ve mentioned before. Here the bench results, they`re almost equal on both nodes: bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec Also, because I`ve not mentioned it before, network performance is enough to hold fair gigabit connectivity with MTU 1500. Seems that it is not interrupt problem or something like it - even if ceph-osd, ethernet card queues and kvm instance pinned to different sets of cores, nothing changes. On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum wrote: > It sounds like maybe you're using Xen? The "rbd writeback window" option only > works for userspace rbd implementations (eg, KVM). > If you are using KVM, you probably want 8192 (~80MB) rather than 8192000 > (~8MB). > > What options are you running dd with? If you run a rados bench from both > machines, what do the results look like? > Also, can you do the ceph osd bench on each of your OSDs, please? > (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) > -Greg > > > On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: > >> More strangely, writing speed drops down by fifteen percent when this >> option was set in vm` config(instead of result from >> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). >> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been >> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and >> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes >> under heavy load. >> >> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil > (mailto:s...@newdream.net)> wrote: >> > On Sat, 17 Mar 2012, Andrey Korolyov wrote: >> > > Hi, >> > > >> > > I`ve did some performance tests at the following configuration: >> > > >> > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - >> > > dom0 with three dedicated cores and 1.5G, mostly idle. First three >> > > disks on each r410 arranged into raid0 and holds osd data when fourth >> > > holds os and osd` journal partition, all ceph-related stuff mounted on >> > > the ext4 without barriers. >> > > >> > > Firstly, I`ve noticed about a difference of benchmark performance and >> > > write speed through rbd from small kvm instance running on one of >> > > first two machines - when bench gave me about 110Mb/s, writing zeros >> > > to raw block device inside vm with dd was at top speed about 45 mb/s, >> > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s. 
>> > > Things get worse, when I`ve started second vm at second host and tried >> > > to continue same dd tests simultaneously - performance fairly divided >> > > by half for each instance :). Enabling jumbo frames, playing with cpu >> > > affinity for ceph and vm instances and trying different TCP congestion >> > > protocols gave no effect at all - with DCTCP I have slightly smoother >> > > network load graph and that`s all. >> > > >> > > Can ml please suggest anything to try to improve performance? >> > >> > Can you try setting >> > >> > rbd writeback window = 8192000 >> > >> > or similar, and see what kind of effect that has? I suspect it'll speed >> > up dd; I'm less sure about ext3. >> > >> > Thanks! >> > sage >> > >> > >> > > >> > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 >> > > -- >> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> > > the body of a message to majord...@vger.kernel.org >> > > (mailto:majord...@vger.kernel.org) >> > > More majordomo info at http://vger.kernel.org/majordomo-info.html >> > >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> (mailto:majord...@vger.kernel.org) >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mysteriously poor write performance
It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM). If you are using KVM, you probably want 8192 (~80MB) rather than 8192000 (~8MB). What options are you running dd with? If you run a rados bench from both machines, what do the results look like? Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) -Greg On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: > More strangely, writing speed drops down by fifteen percent when this > option was set in vm` config(instead of result from > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). > As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been > recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and > 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes > under heavy load. > > On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil (mailto:s...@newdream.net)> wrote: > > On Sat, 17 Mar 2012, Andrey Korolyov wrote: > > > Hi, > > > > > > I`ve did some performance tests at the following configuration: > > > > > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - > > > dom0 with three dedicated cores and 1.5G, mostly idle. First three > > > disks on each r410 arranged into raid0 and holds osd data when fourth > > > holds os and osd` journal partition, all ceph-related stuff mounted on > > > the ext4 without barriers. > > > > > > Firstly, I`ve noticed about a difference of benchmark performance and > > > write speed through rbd from small kvm instance running on one of > > > first two machines - when bench gave me about 110Mb/s, writing zeros > > > to raw block device inside vm with dd was at top speed about 45 mb/s, > > > for vm`fs (ext4 with default options) performance drops to ~23Mb/s. > > > Things get worse, when I`ve started second vm at second host and tried > > > to continue same dd tests simultaneously - performance fairly divided > > > by half for each instance :). Enabling jumbo frames, playing with cpu > > > affinity for ceph and vm instances and trying different TCP congestion > > > protocols gave no effect at all - with DCTCP I have slightly smoother > > > network load graph and that`s all. > > > > > > Can ml please suggest anything to try to improve performance? > > > > Can you try setting > > > > rbd writeback window = 8192000 > > > > or similar, and see what kind of effect that has? I suspect it'll speed > > up dd; I'm less sure about ext3. > > > > Thanks! > > sage > > > > > > > > > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 > > > -- > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > > the body of a message to majord...@vger.kernel.org > > > (mailto:majord...@vger.kernel.org) > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > (mailto:majord...@vger.kernel.org) > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
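Two concrete follow-ups are implied here. A sketch with a couple of assumptions: the writeback window is read on the client side, so it is placed in a [client] section that the qemu process reads (Greg's ~80 MB works out to 81920000 bytes), and the per-OSD bench from the linked wiki page used the old 'tell' syntax of that era:

  [client]
          rbd writeback window = 81920000

  ceph osd tell 0 bench    # repeat for osd.1; results should show up in 'ceph -w'
  ceph osd tell 1 bench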
Re: Mysteriously poor write performance
More strangely, writing speed drops down by fifteen percent when this option was set in vm` config(instead of result from http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html). As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes under heavy load. On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil wrote: > On Sat, 17 Mar 2012, Andrey Korolyov wrote: >> Hi, >> >> I`ve did some performance tests at the following configuration: >> >> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - >> dom0 with three dedicated cores and 1.5G, mostly idle. First three >> disks on each r410 arranged into raid0 and holds osd data when fourth >> holds os and osd` journal partition, all ceph-related stuff mounted on >> the ext4 without barriers. >> >> Firstly, I`ve noticed about a difference of benchmark performance and >> write speed through rbd from small kvm instance running on one of >> first two machines - when bench gave me about 110Mb/s, writing zeros >> to raw block device inside vm with dd was at top speed about 45 mb/s, >> for vm`fs (ext4 with default options) performance drops to ~23Mb/s. >> Things get worse, when I`ve started second vm at second host and tried >> to continue same dd tests simultaneously - performance fairly divided >> by half for each instance :). Enabling jumbo frames, playing with cpu >> affinity for ceph and vm instances and trying different TCP congestion >> protocols gave no effect at all - with DCTCP I have slightly smoother >> network load graph and that`s all. >> >> Can ml please suggest anything to try to improve performance? > > Can you try setting > > rbd writeback window = 8192000 > > or similar, and see what kind of effect that has? I suspect it'll speed > up dd; I'm less sure about ext3. > > Thanks! > sage > > >> >> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mysteriously poor write performance
On Sat, 17 Mar 2012, Andrey Korolyov wrote: > Hi, > > I`ve did some performance tests at the following configuration: > > mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - > dom0 with three dedicated cores and 1.5G, mostly idle. First three > disks on each r410 arranged into raid0 and holds osd data when fourth > holds os and osd` journal partition, all ceph-related stuff mounted on > the ext4 without barriers. > > Firstly, I`ve noticed about a difference of benchmark performance and > write speed through rbd from small kvm instance running on one of > first two machines - when bench gave me about 110Mb/s, writing zeros > to raw block device inside vm with dd was at top speed about 45 mb/s, > for vm`fs (ext4 with default options) performance drops to ~23Mb/s. > Things get worse, when I`ve started second vm at second host and tried > to continue same dd tests simultaneously - performance fairly divided > by half for each instance :). Enabling jumbo frames, playing with cpu > affinity for ceph and vm instances and trying different TCP congestion > protocols gave no effect at all - with DCTCP I have slightly smoother > network load graph and that`s all. > > Can ml please suggest anything to try to improve performance? Can you try setting rbd writeback window = 8192000 or similar, and see what kind of effect that has? I suspect it'll speed up dd; I'm less sure about ext3. Thanks! sage > > ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Mysteriously poor write performance
Hi, I`ve did some performance tests at the following configuration: mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - dom0 with three dedicated cores and 1.5G, mostly idle. First three disks on each r410 arranged into raid0 and holds osd data when fourth holds os and osd` journal partition, all ceph-related stuff mounted on the ext4 without barriers. Firstly, I`ve noticed about a difference of benchmark performance and write speed through rbd from small kvm instance running on one of first two machines - when bench gave me about 110Mb/s, writing zeros to raw block device inside vm with dd was at top speed about 45 mb/s, for vm`fs (ext4 with default options) performance drops to ~23Mb/s. Things get worse, when I`ve started second vm at second host and tried to continue same dd tests simultaneously - performance fairly divided by half for each instance :). Enabling jumbo frames, playing with cpu affinity for ceph and vm instances and trying different TCP congestion protocols gave no effect at all - with DCTCP I have slightly smoother network load graph and that`s all. Can ml please suggest anything to try to improve performance? ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html