Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Thu, Aug 22, 2013 at 5:23 PM, Greg Poirier greg.poir...@opower.com wrote:
> On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum g...@inktank.com wrote:
>> You don't appear to have accounted for the 2x replication (where all
>> writes go to two OSDs) in these calculations. I assume your pools have
>> size 2 (or 3?) for these tests. 3 would explain the performance
>> difference entirely; 2x replication leaves it still a bit low but takes
>> the difference down to ~350/600 instead of ~350/1200. :)
>
> Ah. Right. So I should then be looking at:
>
>     # OSDs * throughput per disk / 2 / repl factor?
>
> Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.
>
> Yeah. We're doing 2x repl now, and haven't yet made the decision if we're
> going to move to 3x repl or not.
>
>> You mentioned that your average osd bench throughput was ~50MB/s;
>> what's the range?
>
> 41.9 - 54.7 MB/s
> The actual average is 47.1 MB/s

Okay. It's important to realize that because Ceph distributes data
pseudorandomly, each OSD is going to end up with about the same amount of
data going to it. If one of your drives is slower than the others, the
fast ones can get backed up waiting on the slow one to acknowledge writes,
so they end up impacting the cluster throughput a disproportionate
amount. :(

Anyway, I'm guessing you have 24 OSDs from your math earlier?
47 MB/s * 24 / 2 = 564 MB/s
41 MB/s * 24 / 2 = 492 MB/s

So taking out or reducing the weight on the slow ones might improve things
a little. But that's still quite a ways off from what you're seeing — there
are a lot of things that could be impacting this, but there's probably
something fairly obvious with that much of a gap. What is the exact
benchmark you're running? What do your nodes look like?
-Greg
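[For readers wanting to find the slow drives Greg describes, a minimal
sketch that benches every OSD and lists them slowest-first. It assumes OSD
IDs 0-32 and the Dumpling-era `ceph tell osd.N bench' output shown
elsewhere in this thread; adjust the parsing for other versions:]

    # Bench each OSD and list throughput slowest-first; slow outliers drag
    # down the whole cluster because writes are distributed pseudorandomly.
    for i in $(seq 0 32); do
      bps=$(ceph tell osd.$i bench | grep bytes_per_sec | tr -dc '0-9.')
      echo "osd.$i $(echo "$bps / 1048576" | bc) MB/s"
    done | sort -n -k2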
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
Ah, thanks, Brian. I will do that. I was going off the wiki instructions
on performing rados benchmarks. If I have the time later, I will change it
there.

On Fri, Aug 23, 2013 at 9:37 AM, Brian Andrus brian.and...@inktank.com wrote:
> Hi Greg,
>
>> I haven't had any luck with the seq bench. It just errors every time.
>
> Can you confirm you are using the --no-cleanup flag with rados write?
> This will ensure there is actually data to read for subsequent seq tests.
>
> ~Brian
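[A minimal example of the sequence Brian describes; the pool name "rbd" and
the 60-second duration are assumptions, substitute your test pool:]

    # Write phase: --no-cleanup leaves the benchmark objects in place
    # so the read test has something to read.
    rados bench -p rbd 60 write --no-cleanup
    # Read phase: seq reads back the objects the write phase left behind.
    rados bench -p rbd 60 seq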
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Fri, Aug 23, 2013 at 9:53 AM, Gregory Farnum g...@inktank.com wrote:
> Okay. It's important to realize that because Ceph distributes data
> pseudorandomly, each OSD is going to end up with about the same amount of
> data going to it. If one of your drives is slower than the others, the
> fast ones can get backed up waiting on the slow one to acknowledge
> writes, so they end up impacting the cluster throughput a
> disproportionate amount. :(
>
> Anyway, I'm guessing you have 24 OSDs from your math earlier?
> 47 MB/s * 24 / 2 = 564 MB/s
> 41 MB/s * 24 / 2 = 492 MB/s

33 OSDs and 3 hosts in the cluster.

> So taking out or reducing the weight on the slow ones might improve
> things a little. But that's still quite a ways off from what you're
> seeing — there are a lot of things that could be impacting this, but
> there's probably something fairly obvious with that much of a gap. What
> is the exact benchmark you're running? What do your nodes look like?

The write benchmark I am running is fio with the following configuration:

    ioengine: libaio
    iodepth: 16
    runtime: 180
    numjobs: 16
    - name: 128k-500M-write
      description: 128K block 500M write
      bs: 128K
      size: 500M
      rw: write

Sorry for the weird YAML formatting, but I'm copying it from the config
file of my automation stuff.

I run that on powers of 2 VMs up to 32. Each VM is qemu-kvm with a 50 GB
RBD-backed Cinder volume attached. They are 2 VCPU, 4 GB RAM VMs.

The host machines are Dell C6220s: 16-core (hyperthreaded), 128 GB RAM,
with bonded 10 Gbps NICs (mode 4, 20 Gbps throughput -- tested and
verified that's working correctly). There are 2 host machines with 16 VMs
each.

The Ceph cluster is made up of Dell C6220s, same NIC setup, 256 GB RAM,
same CPU, 12 disks each (one for the OS, 11 for OSDs).
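[That YAML translates to roughly the following fio invocation -- a sketch
reconstructed from the config above, not the exact command the automation
runs:]

    fio --name=128k-500M-write --ioengine=libaio --iodepth=16 \
        --runtime=180 --numjobs=16 --bs=128k --size=500M --rw=write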
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
Hey Greg,

I encountered a similar problem and we're just in the process of tracking
it down here on the list. Try downgrading your OSD-binaries to 0.61.8
Cuttlefish and re-test. If it's significantly faster on RBD, you're
probably experiencing the same problem I have with Dumpling.

PS: Only downgrade your OSDs. Cuttlefish-monitors don't seem to want to
start with a database that has been touched by a Dumpling-monitor, and
don't talk to them, either.

PPS: I've also had OSDs no longer start with an assert while processing
the journal during these upgrade/downgrade-tests, mostly when coming down
from Dumpling to Cuttlefish. If you encounter those, delete your journal
and re-create it with `ceph-osd -i OSD-ID --mkjournal'. Your data-store
will be OK, as far as I can tell.

Regards,
  Oliver

On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
> I have been benchmarking our Ceph installation for the last week or so,
> and I've come across an issue that I'm having some difficulty with.
>
> Ceph bench reports reasonable write throughput at the OSD level:
>
> ceph tell osd.0 bench
> { "bytes_written": 1073741824,
>   "blocksize": 4194304,
>   "bytes_per_sec": 47288267.00 }
>
> Running this across all OSDs produces on average 50-55 MB/s, which is
> fine with us. We were expecting around 100 MB/s / 2 (journal and OSD on
> the same disk, separate partitions).
>
> What I wasn't expecting was the following: I tested 1, 2, 4, 8, 16, 24,
> and 32 VMs simultaneously writing against 33 OSDs. Aggregate write
> throughput peaked under 400 MB/s:
>
>  1  196.013671875
>  2  285.8759765625
>  4  351.9169921875
>  8  386.455078125
> 16  363.8583984375
> 24  353.6298828125
> 32  348.9697265625
>
> I was hoping to see something closer to # OSDs * the average value for
> ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
> We're seeing excellent read and randread performance, but writes are a
> bit of a bother.
>
> Does anyone have any suggestions?
>
> We have a 20 Gb/s network.
> I used fio with 16-thread concurrency.
> We're running Scientific Linux 6.4, 2.6.32 kernel
> Ceph Dumpling 0.67.1-0.el6
> OpenStack Grizzly
> libvirt 0.10.2
> qemu-kvm 0.12.1.2-2.355.el6.2.cuttlefish (I'm using qemu-kvm from the
> ceph-extras repository, which doesn't appear to have a .dumpling version
> yet).
>
> Thanks very much for any assistance.
>
> Greg
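[For reference, the journal re-creation Oliver describes looks roughly
like this -- sysvinit syntax shown and osd.12 is a hypothetical ID, adjust
both for your setup:]

    # Only with the OSD stopped -- and note Greg's warning below about
    # flushing first: wiping an unflushed journal can discard writes.
    service ceph stop osd.12
    ceph-osd -i 12 --mkjournal
    service ceph start osd.12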
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey oli...@xs4all.nl wrote:
> Hey Greg,
>
> I encountered a similar problem and we're just in the process of tracking
> it down here on the list. Try downgrading your OSD-binaries to 0.61.8
> Cuttlefish and re-test. If it's significantly faster on RBD, you're
> probably experiencing the same problem I have with Dumpling.
>
> PS: Only downgrade your OSDs. Cuttlefish-monitors don't seem to want to
> start with a database that has been touched by a Dumpling-monitor, and
> don't talk to them, either.
>
> PPS: I've also had OSDs no longer start with an assert while processing
> the journal during these upgrade/downgrade-tests, mostly when coming down
> from Dumpling to Cuttlefish. If you encounter those, delete your journal
> and re-create it with `ceph-osd -i OSD-ID --mkjournal'. Your data-store
> will be OK, as far as I can tell.

Careful — deleting the journal is potentially throwing away updates to
your data store! If this is a problem, you should flush the journal with
the dumpling binary before downgrading.

> Regards,
>   Oliver
>
> On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
>> I have been benchmarking our Ceph installation for the last week or so,
>> and I've come across an issue that I'm having some difficulty with.
>>
>> Ceph bench reports reasonable write throughput at the OSD level:
>>
>> ceph tell osd.0 bench
>> { "bytes_written": 1073741824,
>>   "blocksize": 4194304,
>>   "bytes_per_sec": 47288267.00 }
>>
>> Running this across all OSDs produces on average 50-55 MB/s, which is
>> fine with us. We were expecting around 100 MB/s / 2 (journal and OSD on
>> the same disk, separate partitions).
>>
>> What I wasn't expecting was the following: I tested 1, 2, 4, 8, 16, 24,
>> and 32 VMs simultaneously writing against 33 OSDs. Aggregate write
>> throughput peaked under 400 MB/s:
>>
>>  1  196.013671875
>>  2  285.8759765625
>>  4  351.9169921875
>>  8  386.455078125
>> 16  363.8583984375
>> 24  353.6298828125
>> 32  348.9697265625
>>
>> I was hoping to see something closer to # OSDs * the average value for
>> ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
>> We're seeing excellent read and randread performance, but writes are a
>> bit of a bother.
>>
>> Does anyone have any suggestions?

You don't appear to have accounted for the 2x replication (where all
writes go to two OSDs) in these calculations. I assume your pools have
size 2 (or 3?) for these tests. 3 would explain the performance difference
entirely; 2x replication leaves it still a bit low but takes the
difference down to ~350/600 instead of ~350/1200. :)

You mentioned that your average osd bench throughput was ~50MB/s; what's
the range? Have you run any rados bench tests? What is your PG count
across the cluster?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
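[A sketch of the flush-before-downgrade sequence Greg describes; osd.12 is
again a hypothetical ID and the init syntax an assumption:]

    service ceph stop osd.12
    # Flush with the *Dumpling* ceph-osd before swapping in the old
    # binary; the OSD won't join the cluster for this, it only reads and
    # updates local disk state.
    ceph-osd -i 12 --flush-journal
    # ...downgrade the ceph-osd binary to 0.61.8 Cuttlefish here...
    service ceph start osd.12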
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
Hey Greg,

Thanks for the tip! I was assuming a clean shutdown of the OSD should
flush the journal for you and have the OSD try to exit with its data-store
in a clean state? Otherwise, I would first have to stop updates to that
particular OSD, then flush the journal, then stop it?

Regards,
  Oliver

On do, 2013-08-22 at 14:34 -0700, Gregory Farnum wrote:
> On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey oli...@xs4all.nl wrote:
>> [snip]
>>
>> PPS: I've also had OSDs no longer start with an assert while processing
>> the journal during these upgrade/downgrade-tests, mostly when coming
>> down from Dumpling to Cuttlefish. If you encounter those, delete your
>> journal and re-create it with `ceph-osd -i OSD-ID --mkjournal'. Your
>> data-store will be OK, as far as I can tell.
>
> Careful — deleting the journal is potentially throwing away updates to
> your data store! If this is a problem, you should flush the journal with
> the dumpling binary before downgrading.
>
> [snip]
>
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
Hey Greg,

I didn't know that option, but I'm always careful to downgrade and upgrade
the OSDs one by one and wait for the cluster to report healthy again
before proceeding to the next, so, as you said, chances of losing data
should have been minimal. Will flush the journals too next time. Thanks!

Regards,
  Oliver

On do, 2013-08-22 at 14:52 -0700, Gregory Farnum wrote:
> On Thu, Aug 22, 2013 at 2:47 PM, Oliver Daudey oli...@xs4all.nl wrote:
>> Hey Greg,
>>
>> Thanks for the tip! I was assuming a clean shutdown of the OSD should
>> flush the journal for you and have the OSD try to exit with its
>> data-store in a clean state? Otherwise, I would first have to stop
>> updates to that particular OSD, then flush the journal, then stop it?
>
> Nope, clean shutdown doesn't force a flush as it could potentially block
> on the filesystem. --flush-journal is a CLI option, so you would turn
> off the OSD, then run it with that option (it won't join the cluster or
> anything, just look at and update local disk state), then downgrade the
> binary.
>
> In all likelihood this won't have caused you to lose any data, because
> in many/most situations the OSD actually will have written out
> everything in the journal to the local FS before you tell it to shut
> down, and as long as one of the other OSDs either did that or turned
> back on without crashing, then it will propagate the newer updates to
> everybody. But wiping the journal without flushing is certainly not the
> sort of thing you should get in the habit of doing.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
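[Putting the two together, a one-OSD-at-a-time downgrade along the lines
Oliver describes might look like the following sketch; the OSD IDs, the
init syntax, and the polling interval are all assumptions:]

    for i in 0 1 2; do
      service ceph stop osd.$i
      ceph-osd -i $i --flush-journal
      # ...swap the ceph-osd binary here...
      service ceph start osd.$i
      # Wait for recovery before touching the next OSD.
      until ceph health | grep -q HEALTH_OK; do sleep 10; done
    done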
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum g...@inktank.com wrote:
> You don't appear to have accounted for the 2x replication (where all
> writes go to two OSDs) in these calculations. I assume your pools have
> size 2 (or 3?) for these tests. 3 would explain the performance
> difference entirely; 2x replication leaves it still a bit low but takes
> the difference down to ~350/600 instead of ~350/1200. :)

Ah. Right. So I should then be looking at:

    # OSDs * throughput per disk / 2 / repl factor?

Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.

Yeah. We're doing 2x repl now, and haven't yet made the decision if we're
going to move to 3x repl or not.

> You mentioned that your average osd bench throughput was ~50MB/s;
> what's the range?

41.9 - 54.7 MB/s
The actual average is 47.1 MB/s

> Have you run any rados bench tests?

Yessir. rados bench write:

2013-08-23 00:18:51.933594 min lat: 0.071682 max lat: 1.77006 avg lat: 0.196411
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   900      14     73322     73308   325.764       316   0.13978  0.196411
 Total time run:         900.239317
Total writes made:       73322
Write size:              4194304
Bandwidth (MB/sec):      325.789
Stddev Bandwidth:        35.102
Max bandwidth (MB/sec):  440
Min bandwidth (MB/sec):  0
Average Latency:         0.196436
Stddev Latency:          0.121463
Max latency:             1.77006
Min latency:             0.071682

I haven't had any luck with the seq bench. It just errors every time.

> What is your PG count across the cluster?

pgmap v18263: 1650 pgs: 1650 active+clean; 946 GB data, 1894 GB used,
28523 GB / 30417 GB avail; 498 MB/s wr, 124 op/s

Thanks again.
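[Worth noting: 1650 PGs matches the commonly cited rule of thumb of
(OSD count * 100) / replica count, so the PG count itself looks sane for
this cluster. A quick check:]

    # (33 OSDs * 100) / 2 replicas = 1650 placement groups
    echo $(( 33 * 100 / 2 ))    # prints 1650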