Re: how to debug slow rbd block device
On 05/23/2012 02:03 AM, Andrey Korolyov wrote:
> Hi Josh,
>
> Can you please answer this on the list? It is important when someone
> wants to build an HA KVM cluster on the rbd backend and needs to flush
> the writeback cache. Thanks!
>
> On Wed, May 23, 2012 at 10:30 AM, Josh Durgin wrote:
>> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi,
>>>>> So try enabling RBD writeback caching — see
>>>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>>>> will test tomorrow. Thanks.
>>> Can we pass this to the qemu -drive option?
>>
>> Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>>
>> The normal qemu cache=writeback/writethrough/none option will work in
>> qemu 1.2.
>>
>> Josh
>
> By the way, is it possible to flush the cache from outside? I may need
> that for VM live migration, and such a hook would be helpful.

Qemu will do that for you in many cases, but it looks like we need to
implement bdrv_invalidate_cache to make live migration work.
http://tracker.newdream.net/issues/2467

librbd itself flushes the cache when a snapshot is created or the image
is closed, but there's no way to trigger it directly right now.
http://tracker.newdream.net/issues/2468
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how to debug slow rbd block device
On 5/23/12 2:22 AM, Andrey Korolyov wrote:
> Hi,
>
> For Stefan:
>
> Increasing socket memory gave me a few percent on fio tests inside the
> VM (I measured the 'max-iops-until-ceph-throws-message-about-delayed-write'
> parameter). More importantly, the osd process should, if possible, be
> pinned to a dedicated core or two, and all other processes should be
> kept off those cores (you can do it via cgroups or manually), because
> even one non-pinned four-core VM process during the test cuts osd
> throughput almost in half, and the same goes for any other heavy
> process on the host.

Very interesting! Thanks for sharing.

> net.core.rmem_max = 16777216
> net.core.wmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216
>
> On Wed, May 23, 2012 at 10:30 AM, Josh Durgin wrote:
>> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi,
>>>>> So try enabling RBD writeback caching — see
>>>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>>>> will test tomorrow. Thanks.
>>> Can we pass this to the qemu -drive option?
>>
>> Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>>
>> The normal qemu cache=writeback/writethrough/none option will work in
>> qemu 1.2.
>>
>> Josh
>
> By the way, is it possible to flush the cache from outside? I may need
> that for VM live migration, and such a hook would be helpful.
Re: how to debug slow rbd block device
Am 23.05.2012 10:30, schrieb Stefan Priebe - Profihost AG:
> Am 22.05.2012 23:11, schrieb Greg Farnum:
>> On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
>>> Am 22.05.2012 22:49, schrieb Greg Farnum:
>>>> Anyway, it looks like you're just paying a synchronous write penalty
>>>
>>> What does that exactly mean? Shouldn't one threaded write to four
>>> 260MB/s devices give at least 100MB/s?
>>
>> Well, with dd you've got a single thread issuing synchronous IO
>> requests to the kernel. We could have it set up so that those
>> synchronous requests get split up, but they aren't, and between the
>> kernel and KVM it looks like when it needs to make a write out to
>> disk it sends one request at a time to the Ceph backend. So you
>> aren't writing to four 260MB/s devices; you are writing to one
>> 260MB/s device without any pipelining — meaning you send off a 4MB
>> write, then wait until it's done, then send off a second 4MB write,
>> then wait until it's done, etc.
>> Frankly I'm surprised you aren't getting a bit more throughput than
>> you're seeing (I remember other people getting much more out of less
>> beefy boxes), but it doesn't much matter because what you really want
>> to do is enable the client-side writeback cache in RBD, which will
>> dispatch multiple requests at once and not force writes to be
>> committed before reporting back to the kernel. Then you should indeed
>> be writing to four 260MB/s devices at once. :)
>
> OK, I understand that, but the question remains where the bottleneck
> is in this case. I mean I see no more than 40% network load, no more
> than 10% cpu load and only 40MB/s to the SSD. I would still expect a
> network load of 70-90%.
*gr* i found a broken SATA cable ;-(

This is now with the replaced SATA cable and with rbd cache turned on:

systembootimage:/mnt# dd if=/dev/zero of=test bs=4M count=1000
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 57,9194 s, 72,4 MB/s

systembootimage:/mnt# dd if=test of=/dev/null bs=4M count=1000
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 46,3499 s, 90,5 MB/s

rados write bench, 8 threads:
Total time run:        60.222947
Total writes made:     1519
Write size:            4194304
Bandwidth (MB/sec):    100.892
Average Latency:       0.317098
Max latency:           1.88908
Min latency:           0.089681

Stefan
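As a sanity check, the bandwidth figure in the rados bench summary is just total data over elapsed time; the numbers above are consistent:

```shell
# 1519 writes of 4 MB each over 60.222947 s, from the bench summary above
awk 'BEGIN { printf "%.3f MB/sec\n", 1519 * 4 / 60.222947 }'
```

This reproduces the reported 100.892 MB/sec.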
Re: how to debug slow rbd block device
Am 22.05.2012 23:11, schrieb Greg Farnum:
> On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
>> Am 22.05.2012 22:49, schrieb Greg Farnum:
>>> Anyway, it looks like you're just paying a synchronous write penalty
>>
>> What does that exactly mean? Shouldn't one threaded write to four
>> 260MB/s devices give at least 100MB/s?
>
> Well, with dd you've got a single thread issuing synchronous IO
> requests to the kernel. We could have it set up so that those
> synchronous requests get split up, but they aren't, and between the
> kernel and KVM it looks like when it needs to make a write out to disk
> it sends one request at a time to the Ceph backend. So you aren't
> writing to four 260MB/s devices; you are writing to one 260MB/s device
> without any pipelining — meaning you send off a 4MB write, then wait
> until it's done, then send off a second 4MB write, then wait until
> it's done, etc.
> Frankly I'm surprised you aren't getting a bit more throughput than
> you're seeing (I remember other people getting much more out of less
> beefy boxes), but it doesn't much matter because what you really want
> to do is enable the client-side writeback cache in RBD, which will
> dispatch multiple requests at once and not force writes to be
> committed before reporting back to the kernel. Then you should indeed
> be writing to four 260MB/s devices at once. :)

OK, I understand that, but the question remains where the bottleneck is
in this case. I mean I see no more than 40% network load, no more than
10% cpu load and only 40MB/s to the SSD. I would still expect a network
load of 70-90%.

Greets and thanks,
Stefan
Re: how to debug slow rbd block device
On 05/23/2012 01:20 AM, Stefan Priebe - Profihost AG wrote:
> Am 23.05.2012 09:19, schrieb Josh Durgin:
>> On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
>>> Am 23.05.2012 08:30, schrieb Josh Durgin:
>>>> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi,
>>>>>>> So try enabling RBD writeback caching — see
>>>>>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>>>>>> will test tomorrow. Thanks.
>>>>> Can we pass this to the qemu -drive option?
>>>>
>>>> Yup, see
>>>> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>>>
>>> I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
>>> Can I add it just to the ceph.conf? Even with qemu 1.0?
>>
>> You can use any of the rbd-specific options (like rbd_cache_max_dirty)
>> with qemu >= 0.15.
>>
>> You can set them in a global ceph.conf file, or specify them on the
>> qemu command line like:
>>
>> qemu -m 512 -drive
>> file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio
>
> So this is enough for testing on kvm host?
>
> /etc/ceph/ceph.conf
>
> [global]
>         auth supported = cephx
>         keyring = /etc/ceph/$name.keyring
>         rbd_cache = true
>         rbd_cache_size = 32M

This should be a number in bytes - M/G/k/etc aren't parsed.

Assuming you have the monitors listed below, this is fine. If you're not
using the admin user, you'll need to add :id=name to the -drive string -
it can't be set in the config file.

> rbd_cache_max_age = 2.0
> ...
>
> Stefan
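Since rbd_cache_size doesn't parse the M suffix, the intended 32M has to be spelled out in bytes; the value for the config line can be computed like so:

```shell
# 32 MiB in bytes, for the ceph.conf line: rbd_cache_size = 33554432
echo $((32 * 1024 * 1024))
```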
Re: how to debug slow rbd block device
Am 23.05.2012 09:19, schrieb Josh Durgin:
> On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
>> Am 23.05.2012 08:30, schrieb Josh Durgin:
>>> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>>>> Hi,
>>>>>> So try enabling RBD writeback caching — see
>>>>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>>>>> will test tomorrow. Thanks.
>>>> Can we pass this to the qemu -drive option?
>>>
>>> Yup, see
>>> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>>
>> I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
>> Can I add it just to the ceph.conf? Even with qemu 1.0?
>
> You can use any of the rbd-specific options (like rbd_cache_max_dirty)
> with qemu >= 0.15.
>
> You can set them in a global ceph.conf file, or specify them on the
> qemu command line like:
>
> qemu -m 512 -drive
> file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio

So this is enough for testing on kvm host?

/etc/ceph/ceph.conf

[global]
        auth supported = cephx
        keyring = /etc/ceph/$name.keyring
        rbd_cache = true
        rbd_cache_size = 32M
        rbd_cache_max_age = 2.0
        ...

Stefan
Re: how to debug slow rbd block device
Am 23.05.2012 09:22, schrieb Andrey Korolyov:
> Hi,
>
> For Stefan:
>
> Increasing socket memory gave me a few percent on fio tests inside the
> VM (I measured the
> 'max-iops-until-ceph-throws-message-about-delayed-write' parameter).
> More importantly, the osd process should, if possible, be pinned to a
> dedicated core or two, and all other processes should be kept off
> those cores (you can do it via cgroups or manually), because even one
> non-pinned four-core VM process during the test cuts osd throughput
> almost in half, and the same goes for any other heavy process on the
> host.

I tried that using taskset but didn't get any noticeable boost. Also the
kernel already avoids jumping from core to core whenever possible. As
these machines are dedicated to the osds, there is no other load.

> net.core.rmem_max = 16777216
> net.core.wmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216

This gave me around 3-4 MB/s.

Stefan
Re: how to debug slow rbd block device
On 05/23/2012 12:22 AM, Stefan Priebe - Profihost AG wrote:
> Am 23.05.2012 09:19, schrieb Josh Durgin:
>> On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
>>
>> You can use any of the rbd-specific options (like rbd_cache_max_dirty)
>> with qemu >= 0.15.
>>
>> You can set them in a global ceph.conf file, or specify them on the
>> qemu command line like:
>>
>> qemu -m 512 -drive
>> file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio
>
> ah thanks and sorry. Is there a way to verify which options are active
> / working on a specific rbd block device?

There's no way to ask which options it's using while it's running. That
would probably be a good thing to add (maybe as an admin socket
command). Until then, if you want to know the exact settings of all your
rbd disks, you can specify all the necessary options on the qemu command
line, and not have a ceph.conf file.

Josh
Re: how to debug slow rbd block device
Hi,

For Stefan:

Increasing socket memory gave me a few percent on fio tests inside the
VM (I measured the 'max-iops-until-ceph-throws-message-about-delayed-write'
parameter). More importantly, the osd process should, if possible, be
pinned to a dedicated core or two, and all other processes should be
kept off those cores (you can do it via cgroups or manually), because
even one non-pinned four-core VM process during the test cuts osd
throughput almost in half, and the same goes for any other heavy process
on the host.

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

On Wed, May 23, 2012 at 10:30 AM, Josh Durgin wrote:
> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>>
>> Hi,
>>>> So try enabling RBD writeback caching — see
>>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>>> will test tomorrow. Thanks.
>>
>> Can we pass this to the qemu -drive option?
>
> Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>
> The normal qemu cache=writeback/writethrough/none option will work in
> qemu 1.2.
>
> Josh

By the way, is it possible to flush the cache from outside? I may need
that for VM live migration, and such a hook would be helpful.
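The settings above can be persisted across reboots as a sysctl drop-in; the file name below is illustrative:

```
# /etc/sysctl.d/90-ceph-net.conf -- larger socket buffers, per the values above
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

Apply with `sysctl --system` or verify a single key with `sysctl net.core.rmem_max`.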
Re: how to debug slow rbd block device
Am 23.05.2012 09:19, schrieb Josh Durgin:
> On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
>
> You can use any of the rbd-specific options (like rbd_cache_max_dirty)
> with qemu >= 0.15.
>
> You can set them in a global ceph.conf file, or specify them on the
> qemu command line like:
>
> qemu -m 512 -drive
> file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio

ah thanks and sorry. Is there a way to verify which options are active /
working on a specific rbd block device?

Stefan
Re: how to debug slow rbd block device
On 05/23/2012 12:01 AM, Stefan Priebe - Profihost AG wrote:
> Am 23.05.2012 08:30, schrieb Josh Durgin:
>> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>>> Hi,
>>>>> So try enabling RBD writeback caching — see
>>>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>>>> will test tomorrow. Thanks.
>>> Can we pass this to the qemu -drive option?
>>
>> Yup, see
>> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400
>
> I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
> Can I add it just to the ceph.conf? Even with qemu 1.0?

You can use any of the rbd-specific options (like rbd_cache_max_dirty)
with qemu >= 0.15.

You can set them in a global ceph.conf file, or specify them on the qemu
command line like:

qemu -m 512 -drive
file=rbd:pool/image:rbd_cache_max_dirty=0:rbd_cache=true,format=raw,if=virtio

Josh
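A sketch of assembling that -drive argument in a launch script, following the syntax Josh shows: rbd-specific options are appended to the file spec with ':'. The pool and image names here are placeholders.

```shell
# Build the rbd -drive string for qemu (pool/image names are hypothetical)
pool="rbd"
image="vm-disk-1"
opts="rbd_cache=true:rbd_cache_max_dirty=0"
drive="file=rbd:${pool}/${image}:${opts},format=raw,if=virtio"
echo "$drive"
```

This prints the string to pass after `-drive`; per the thread, `:id=name` would also go in this string when not using the admin user.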
Re: how to debug slow rbd block device
Am 23.05.2012 08:30, schrieb Josh Durgin:
> On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
>> Hi,
>>>> So try enabling RBD writeback caching — see
>>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>>> will test tomorrow. Thanks.
>> Can we pass this to the qemu -drive option?
>
> Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

I'm sorry, I still don't get where to pass rbd_cache_max_dirty, ...
Can I add it just to the ceph.conf? Even with qemu 1.0?

Stefan
Re: how to debug slow rbd block device
On 05/22/2012 11:18 PM, Stefan Priebe - Profihost AG wrote:
> Hi,
>>> So try enabling RBD writeback caching — see
>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>> will test tomorrow. Thanks.
> Can we pass this to the qemu -drive option?

Yup, see http://article.gmane.org/gmane.comp.file-systems.ceph.devel/6400

The normal qemu cache=writeback/writethrough/none option will work in
qemu 1.2.

Josh
Re: how to debug slow rbd block device
Hi,

>> So try enabling RBD writeback caching — see
>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2

will test tomorrow. Thanks. Can we pass this to the qemu -drive option?

Stefan

Am 22.05.2012 23:11, schrieb Greg Farnum:
> On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
>> Am 22.05.2012 22:49, schrieb Greg Farnum:
>>> Anyway, it looks like you're just paying a synchronous write penalty
>>
>> What does that exactly mean? Shouldn't one threaded write to four
>> 260MB/s devices give at least 100MB/s?
>
> Well, with dd you've got a single thread issuing synchronous IO
> requests to the kernel. We could have it set up so that those
> synchronous requests get split up, but they aren't, and between the
> kernel and KVM it looks like when it needs to make a write out to disk
> it sends one request at a time to the Ceph backend. So you aren't
> writing to four 260MB/s devices; you are writing to one 260MB/s device
> without any pipelining — meaning you send off a 4MB write, then wait
> until it's done, then send off a second 4MB write, then wait until
> it's done, etc.
> Frankly I'm surprised you aren't getting a bit more throughput than
> you're seeing (I remember other people getting much more out of less
> beefy boxes), but it doesn't much matter because what you really want
> to do is enable the client-side writeback cache in RBD, which will
> dispatch multiple requests at once and not force writes to be
> committed before reporting back to the kernel. Then you should indeed
> be writing to four 260MB/s devices at once. :)
>
>>> since with 1 write at a time you're getting 30-40MB/s out of rados
>>> bench, but with 16 you're getting >100MB/s.
>>> (If you bump up past 16 or increase the size of each with -b you may
>>> find yourself getting even more.)
>> yep noticed that.
>>
>>> So try enabling RBD writeback caching — see
>>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>> will test tomorrow. Thanks.
>>
>> Stefan
Re: how to debug slow rbd block device
On Tuesday, May 22, 2012 at 2:00 PM, Stefan Priebe wrote:
> Am 22.05.2012 22:49, schrieb Greg Farnum:
>> Anyway, it looks like you're just paying a synchronous write penalty
>
> What does that exactly mean? Shouldn't one threaded write to four
> 260MB/s devices give at least 100MB/s?

Well, with dd you've got a single thread issuing synchronous IO requests
to the kernel. We could have it set up so that those synchronous
requests get split up, but they aren't, and between the kernel and KVM
it looks like when it needs to make a write out to disk it sends one
request at a time to the Ceph backend. So you aren't writing to four
260MB/s devices; you are writing to one 260MB/s device without any
pipelining — meaning you send off a 4MB write, then wait until it's
done, then send off a second 4MB write, then wait until it's done, etc.

Frankly I'm surprised you aren't getting a bit more throughput than
you're seeing (I remember other people getting much more out of less
beefy boxes), but it doesn't much matter because what you really want to
do is enable the client-side writeback cache in RBD, which will dispatch
multiple requests at once and not force writes to be committed before
reporting back to the kernel. Then you should indeed be writing to four
260MB/s devices at once. :)

>> since with 1 write at a time you're getting 30-40MB/s out of rados
>> bench, but with 16 you're getting >100MB/s.
>> (If you bump up past 16 or increase the size of each with -b you may
>> find yourself getting even more.)
>
> yep noticed that.
>
>> So try enabling RBD writeback caching — see
>> http://marc.info/?l=ceph-devel&m=133758599712768&w=2
>
> will test tomorrow. Thanks.
>
> Stefan
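Greg's point can be put into rough numbers: with synchronous writes, throughput is capped at block size divided by per-request round-trip time, and pipelining multiplies that cap by the number of requests in flight. The 100 ms latency below is purely illustrative, not a measured value from this cluster:

```shell
# Throughput cap for serialized 4 MB writes at an assumed 100 ms per request,
# versus 16 requests kept in flight at the same per-request latency
# (before hitting device or network limits)
awk 'BEGIN {
    rtt = 0.1    # assumed seconds per 4 MB round trip (hypothetical)
    printf "1 in flight: %.0f MB/s\n", 4 / rtt
    printf "16 in flight: %.0f MB/s\n", 16 * 4 / rtt
}'
```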
Re: how to debug slow rbd block device
Am 22.05.2012 22:49, schrieb Greg Farnum:
> Anyway, it looks like you're just paying a synchronous write penalty

What does that exactly mean? Shouldn't one threaded write to four
260MB/s devices give at least 100MB/s?

> since with 1 write at a time you're getting 30-40MB/s out of rados
> bench, but with 16 you're getting >100MB/s.
> (If you bump up past 16 or increase the size of each with -b you may
> find yourself getting even more.)

yep noticed that.

> So try enabling RBD writeback caching — see
> http://marc.info/?l=ceph-devel&m=133758599712768&w=2

will test tomorrow. Thanks.

Stefan
Re: how to debug slow rbd block device
Am 22.05.2012 22:48, schrieb Mark Nelson:
> Can you use something like iostat or collectl to check and see if the
> write throughput to each SSD is roughly equal during your tests?

It is, but just around 20-40MB/s. But they can write 260MB/s with
sequential writes.

> Also, what FS are you using and how did you format/mount it?

just: mkfs.xfs /dev/sdb1

mount options: noatime,nodiratime,nobarrier,discard

Stefan
Re: how to debug slow rbd block device
On Tuesday, May 22, 2012 at 1:30 PM, Stefan Priebe wrote:
> Am 22.05.2012 21:52, schrieb Greg Farnum:
>> On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
>> Huh. That's less than I would expect. Especially since it ought to be
>> going through the page cache.
>> What version of RBD is KVM using here?
>
> v0.47.1
>
>> Can you (from the KVM host) run
>> "rados -p data bench seq 60 -t 1"
>> "rados -p data bench seq 60 -t 16"
>> and paste the final output from both?
>
> OK here it is first with write then with seq read.
>
> # rados -p data bench 60 write -t 1
> # rados -p data bench 60 write -t 16
> # rados -p data bench 60 seq -t 1
> # rados -p data bench 60 seq -t 16
>
> Output is here:
> http://pastebin.com/iFy8GS7i

Heh, yep, sorry about the commands — haven't run them personally in a
while. :)

Anyway, it looks like you're just paying a synchronous write penalty,
since with 1 write at a time you're getting 30-40MB/s out of rados
bench, but with 16 you're getting >100MB/s. (If you bump up past 16 or
increase the size of each with -b you may find yourself getting even
more.)

So try enabling RBD writeback caching — see
http://marc.info/?l=ceph-devel&m=133758599712768&w=2
-Greg
Re: how to debug slow rbd block device
On 05/22/2012 03:30 PM, Stefan Priebe wrote:
> Am 22.05.2012 21:52, schrieb Greg Farnum:
>> On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
>> Huh. That's less than I would expect. Especially since it ought to be
>> going through the page cache.
>> What version of RBD is KVM using here?
>
> v0.47.1
>
>> Can you (from the KVM host) run
>> "rados -p data bench seq 60 -t 1"
>> "rados -p data bench seq 60 -t 16"
>> and paste the final output from both?
>
> OK here it is first with write then with seq read.
>
> # rados -p data bench 60 write -t 1
> # rados -p data bench 60 write -t 16
> # rados -p data bench 60 seq -t 1
> # rados -p data bench 60 seq -t 16
>
> Output is here:
> http://pastebin.com/iFy8GS7i
>
> Thanks!
>
> Stefan

Hi Stefan,

Can you use something like iostat or collectl to check and see if the
write throughput to each SSD is roughly equal during your tests? Also,
what FS are you using and how did you format/mount it?

I've been doing some tests internally using 2 nodes with 5 OSDs each
backed by SSDs for both data and journal and am seeing about 600MB/s
from the client (over 10GE) on a fresh ceph fs.

Mark
Re: how to debug slow rbd block device
Am 22.05.2012 21:52, schrieb Greg Farnum:
> On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
> Huh. That's less than I would expect. Especially since it ought to be
> going through the page cache.
> What version of RBD is KVM using here?

v0.47.1

> Can you (from the KVM host) run
> "rados -p data bench seq 60 -t 1"
> "rados -p data bench seq 60 -t 16"
> and paste the final output from both?

OK here it is first with write then with seq read.

# rados -p data bench 60 write -t 1
# rados -p data bench 60 write -t 16
# rados -p data bench 60 seq -t 1
# rados -p data bench 60 seq -t 16

Output is here:
http://pastebin.com/iFy8GS7i

Thanks!

Stefan
Re: how to debug slow rbd block device
Am 22.05.2012 21:52, schrieb Greg Farnum:
> On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
> Huh. That's less than I would expect. Especially since it ought to be
> going through the page cache.
> What version of RBD is KVM using here?

v0.47.1

> Can you (from the KVM host) run
> "rados -p data bench seq 60 -t 1"
> "rados -p data bench seq 60 -t 16"
> and paste the final output from both?

I think you meant: rados -p data bench 60 seq -t 1 ?

But even that results in:

~# rados -p data bench 60 seq -t 1
Must write data before running a read benchmark!
error during benchmark: -2
error 2: (2) No such file or directory

Stefan
Re: how to debug slow rbd block device
On Tuesday, May 22, 2012 at 12:40 PM, Stefan Priebe wrote:
> Am 22.05.2012 21:35, schrieb Greg Farnum:
>> What does your test look like? With multiple large IOs in flight we
>> can regularly fill up a 1GbE link on our test clusters. With smaller
>> or fewer IOs in flight performance degrades accordingly.
>
> iperf shows 950Mbit/s so this is OK (from KVM host to OSDs)
>
> sorry:
> dd if=/dev/zero of=test bs=4M count=1000; dd if=test of=/dev/null bs=4M count=1000;
> 1000+0 records in
> 1000+0 records out
> 4194304000 bytes (4,2 GB) copied, 99,7352 s, 42,1 MB/s
>
> 1000+0 records in
> 1000+0 records out
> 4194304000 bytes (4,2 GB) copied, 47,4493 s, 88,4 MB/s

Huh. That's less than I would expect. Especially since it ought to be
going through the page cache. What version of RBD is KVM using here?

Can you (from the KVM host) run
"rados -p data bench seq 60 -t 1"
"rados -p data bench seq 60 -t 16"
and paste the final output from both?
Re: how to debug slow rbd block device
Am 22.05.2012 21:35, schrieb Greg Farnum:
> What does your test look like? With multiple large IOs in flight we
> can regularly fill up a 1GbE link on our test clusters. With smaller
> or fewer IOs in flight performance degrades accordingly.

iperf shows 950Mbit/s so this is OK (from KVM host to OSDs)

sorry:
dd if=/dev/zero of=test bs=4M count=1000; dd if=test of=/dev/null bs=4M count=1000;
1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 99,7352 s, 42,1 MB/s

1000+0 records in
1000+0 records out
4194304000 bytes (4,2 GB) copied, 47,4493 s, 88,4 MB/s

Greets
Stefan

> On Tuesday, May 22, 2012 at 5:45 AM, Stefan Priebe - Profihost AG wrote:
>> Hi list,
>>
>> my ceph block testcluster is now running fine.
>>
>> Setup:
>> 4x ceph servers
>> - 3x mon with /mon on local os SATA disk
>> - 4x OSD with /journal on tmpfs and /srv on intel ssd
>>
>> all of them use 2x 1Gbit/s lacp trunk.
>>
>> 1x KVM Host system (2x 1Gbit/s lacp trunk)
>>
>> With one KVM i do not get more than 40MB/s and my network link is
>> just at 40% of 1Gbit/s.
>>
>> Is this expected? If not where can i start searching / debugging?
>>
>> Thanks,
>> Stefan
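One caveat with this dd pattern: without a sync flag, the reported write time can include data still sitting in the guest page cache rather than on the rbd device. A scaled-down variant that forces a flush before dd reports its timing (temp file and sizes chosen only so it runs anywhere):

```shell
# dd write test with conv=fdatasync so the reported time includes flushing
# to the backing device, not just the page cache (small size for illustration)
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=4M count=4 conv=fdatasync 2>&1 | tail -n 1
rm -f "$tmp"
```

For the real test, bs=4M count=1000 as above, with conv=fdatasync appended.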
Re: how to debug slow rbd block device
What does your test look like? With multiple large IOs in flight we can
regularly fill up a 1GbE link on our test clusters. With smaller or
fewer IOs in flight performance degrades accordingly.

On Tuesday, May 22, 2012 at 5:45 AM, Stefan Priebe - Profihost AG wrote:
> Hi list,
>
> my ceph block testcluster is now running fine.
>
> Setup:
> 4x ceph servers
> - 3x mon with /mon on local os SATA disk
> - 4x OSD with /journal on tmpfs and /srv on intel ssd
>
> all of them use 2x 1Gbit/s lacp trunk.
>
> 1x KVM Host system (2x 1Gbit/s lacp trunk)
>
> With one KVM i do not get more than 40MB/s and my network link is just
> at 40% of 1Gbit/s.
>
> Is this expected? If not where can i start searching / debugging?
>
> Thanks,
> Stefan
Re: how to debug slow rbd block device
Am 22.05.2012 16:52, schrieb Andrey Korolyov:
> Hi,
>
> I've run into almost the same problem about two months ago, and there
> are a couple of corner cases: near-default tcp parameters, small
> journal size, disks that are not backed by a controller with NVRAM
> cache, and high load on the osd's cpu caused by side processes.
> Finally, I was able to achieve 115MB/s for large linear writes on a
> raw rbd block inside a VM, with the journal on tmpfs and osds on RAID0
> built on top of three sata disks.

Which tcp parameters could you recommend? The journal size is 1GB on
tmpfs right now. Instead of 3 sata disks I'm using one intel ssd. The
CPU is loaded by 10% on each osd max. A "ceph osd tell X bench" shows me
260MB/s write on each OSD (intel ssd).

Greets
Stefan
Re: how to debug slow rbd block device
Hi,

I've run into almost the same problem about two months ago, and there
are a couple of corner cases: near-default tcp parameters, small journal
size, disks that are not backed by a controller with NVRAM cache, and
high load on the osd's cpu caused by side processes. Finally, I was able
to achieve 115MB/s for large linear writes on a raw rbd block inside a
VM, with the journal on tmpfs and osds on RAID0 built on top of three
sata disks.

On Tue, May 22, 2012 at 4:45 PM, Stefan Priebe - Profihost AG wrote:
> Hi list,
>
> my ceph block testcluster is now running fine.
>
> Setup:
> 4x ceph servers
> - 3x mon with /mon on local os SATA disk
> - 4x OSD with /journal on tmpfs and /srv on intel ssd
>
> all of them use 2x 1Gbit/s lacp trunk.
>
> 1x KVM Host system (2x 1Gbit/s lacp trunk)
>
> With one KVM i do not get more than 40MB/s and my network link is just
> at 40% of 1Gbit/s.
>
> Is this expected? If not where can i start searching / debugging?
>
> Thanks,
> Stefan
how to debug slow rbd block device
Hi list,

my ceph block testcluster is now running fine.

Setup:
4x ceph servers
- 3x mon with /mon on local os SATA disk
- 4x OSD with /journal on tmpfs and /srv on intel ssd

all of them use 2x 1Gbit/s lacp trunk.

1x KVM Host system (2x 1Gbit/s lacp trunk)

With one KVM i do not get more than 40MB/s and my network link is just
at 40% of 1Gbit/s.

Is this expected? If not where can i start searching / debugging?

Thanks,
Stefan