Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
The only conclusion I can draw from these numbers and the network results below is that the latency happens within the OSD processes.

Regards,

Christian

When I suggested other tests, I meant with and without Ceph. One particular one is OSD bench. That should be interesting to try at a variety of block sizes. You could also try running RADOS bench and smalliobench at a few different sizes.
-Greg

On Wednesday, May 7, 2014, Alexandre DERUMIER wrote:

Hi Christian,

Have you tried without RAID6, to have more OSDs? (How many disks do you have behind the RAID6?)

Also, I know that direct IOs can be quite slow with Ceph; maybe you can try without --direct=1 and also enable rbd_cache:

ceph.conf
[client]
rbd cache = true

----- Mail original -----
De: "Christian Balzer"
À: "Gregory Farnum", ceph-users@lists.ceph.com
Envoyé: Jeudi 8 Mai 2014 04:49:16
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:

On Wed, May 7, 2014 at 5:57 PM, Christian Balzer wrote:

Hello,

ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882 with 4GB of cache.

Running this fio:

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128

results in:

30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD

When running the fio from the VM RBD the utilization of the journals is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some obvious merging). The OSD processes are quite busy, reading well over 200% on atop, but the system is not CPU or otherwise resource starved at that moment.

Running multiple instances of this test from several VMs on different hosts changes nothing, as in the aggregated IOPS for the whole cluster will still be around 3200 IOPS.

Now clearly RBD has to deal with latency here, but the network is IPoIB with the associated low latency and the journal SSDs are the (consistently) fastest ones around.

I guess what I am wondering about is if this is normal and to be expected, or if not, where all that potential performance got lost.

Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)

Yes, but going down to 32 doesn't change things one iota. Also note the multiple instances I mention up there, so that would be 256 IOs at a time, coming from different hosts over different links, and nothing changes.

that's about 40ms of latency per op (for userspace RBD), which seems awfully long. You should check what your client-side objecter settings are; it might be limiting you to fewer outstanding ops than that.

Googling for client-side objecter gives a few hits on ceph-devel and the bug tracker and nothing at all as far as configuration options are concerned. Care to enlighten me where one can find those?

Also note the kernelspace (3.13 if it matters) speed, which is very much in the same (junior league) ballpark.

If it's available to you, testing with Firefly or even master would be interesting — there's some performance work that should reduce latencies.

Not an option, this is going into production next week.

But a well-tuned (or even default-tuned, I thought) Ceph cluster certainly doesn't require 40ms/op, so you should probably run a wider array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison, at about 512KB it reaches maximum throughput and still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---

So with the network performing as well as my lengthy experience with IPoIB led me to believe, what else is there to look at? The storage nodes perform just as expected, indicated by the local fio tests.

That pretty much leaves only Ceph/RBD to look at, and I'm not really sure what experiments I should run on that. ^o^

Regards,

Christian

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
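For anyone hunting for the "client-side objecter settings" Greg mentions: they are ordinary config options set on the client side, e.g. in the [client] section of ceph.conf on the VM host. The names and default values below are from memory for the dumpling/firefly era and should be verified against the release actually in use; a rough sketch:

    [client]
        objecter inflight ops = 1024             # max ops the client keeps in flight (assumed default)
        objecter inflight op bytes = 104857600   # ~100 MB of in-flight data (assumed default)

    # check what the client side actually resolves these to
    ceph --show-config | grep objecter_inflight

If these limits are already well above 128, the objecter is unlikely to be what caps the outstanding IOs here.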
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Is there anybody on this ML who's running a Ceph cluster with a fast network and FAST filestore, so like me with a big HW cache in front of a RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which is of course vastly faster than the normal individual HDDs could do.

So I'm wondering if I'm hitting some inherent limitation of how fast a single OSD (as in the software) can handle IOPS, given that everything else has been ruled out from where I stand.

This would also explain why none of the option changes or the use of RBD caching has any measurable effect in the test case below. As in, a slow OSD aka single HDD with journal on the same disk would clearly benefit from even the small 32MB standard RBD cache, while in my test case the only time the caching becomes noticeable is if I increase the cache size to something larger than the test data size. ^o^

On the other hand, if people here regularly get thousands or tens of thousands of IOPS per OSD with the appropriate HW, I'm stumped.

Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:

On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:

Oh, I didn't notice that. I bet you aren't getting the expected throughput on the RAID array with OSD access patterns, and that's applying back pressure on the journal.

In the "a picture being worth a thousand words" tradition, I give you this iostat -x output taken during a fio run:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          50.82    0.00   19.43    0.17    0.00   29.58

Device: rrqm/s wrqm/s   r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda       0.00  51.50  0.00 1633.50    0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
sdb       0.00   0.00  0.00 1240.50    0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
sdc       0.00   5.00  0.00 2468.50    0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
sdd       0.00   6.50  0.00 1913.00    0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes; note the nearly complete absence of iowait.

sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs. Look at these numbers, the lack of queues, the low wait and service times (this is in ms) plus overall utilization.

The only conclusion I can draw from these numbers and the network results below is that the latency happens within the OSD processes.

Regards,

Christian
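To put per-OSD numbers on Greg's OSD bench / RADOS bench suggestion, something along these lines should work (a sketch from memory; the two arguments to bench are total bytes and bytes per write, and the pool name and thread counts are only examples):

    ceph tell osd.0 bench 268435456 4096                       # one OSD: write 256 MB in 4 KB chunks
    rados -p rbd bench 30 write -b 4096 -t 64 --no-cleanup     # cluster-wide 4 KB writes, 64 in flight
    rados -p rbd bench 30 seq -t 64                            # read the benchmark objects back

Comparing the per-OSD result against the ~800 IOPS/OSD seen through RBD would show whether the limit sits in the OSD itself or in the client path.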
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
>>>> That would be interesting indeed.
>>>> Given what I've seen (with the journal at 20% utilization and the actual
>>>> filestore at around 5%) I'd expect Ceph to be the culprit.
>>>>
>>>>> I'll get back to you with the results, hopefully I'll manage to get them
>>>>> done during this night.
>>>>>
>>>> Looking forward to that. ^^
>>>>
>>>> Christian
>>>>> Cheers,
>>>>> Josef
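Since the finger keeps pointing at latency inside the OSD process itself, the OSD admin socket is one way to look at that directly on the storage node. A rough sketch (socket path and counter names from memory; worth double-checking against the release in use):

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump           # latency counters (op_w_latency, journal_latency, ...)
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops   # slowest recent ops with per-step timestamps

The per-step timestamps in dump_historic_ops should show whether the time goes into queuing, journaling or the filestore.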
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Thu, May 8, 2014 at 9:37 AM, Gregory Farnum wrote:
> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
> that's about 40ms of latency per op (for userspace RBD), which seems
> awfully long.

Maybe this is off topic, but AFAIK "--iodepth=128" doesn't submit 128 IOs at a time. There is an option in fio, "iodepth_batch_submit=int", which defaults to 1 and makes fio submit each IO as soon as it is available.

See more: http://www.bluestop.org/fio/HOWTO.txt
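If one wanted fio to actually submit IOs in batches of 128 rather than one by one, the run from this thread could be extended roughly like this (untested sketch; see the HOWTO above for the exact semantics of the batch options):

    fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
        --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 \
        --iodepth_batch_submit=128 --iodepth_batch_complete=1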
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hi Christian,

I missed this thread, haven't been reading the list that well the last weeks.

You already know my setup, since we discussed it in an earlier thread. I don't have a fast backing store, but I see the slow IOPS when doing randwrite inside the VM, with rbd cache. Still running dumpling here though.

A thought struck me that I could test with a pool that consists of OSDs that have tmpfs-based disks. I think I have a bit more latency than your IPoIB, but I've pushed 100k IOPS with the same network devices before. This would verify whether the problem is with the journal disks.

I'll also try to run the journal devices in tmpfs as well, as that would test purely Ceph itself.

I'll get back to you with the results, hopefully I'll manage to get them done during this night.

Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:
> I'm clearly talking to myself, but whatever.
>
> For Greg, I've played with all the pertinent journal and filestore options
> and TCP nodelay, no changes at all.
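For anyone wanting to reproduce Josef's tmpfs experiment on the journal side, a rough sketch (assumes a test cluster you can afford to break, osd.0 as the victim and enough RAM; the flush/mkjournal steps are from memory and worth double-checking for the release in use):

    service ceph stop osd.0                               # or: stop ceph-osd id=0
    ceph-osd -i 0 --flush-journal                         # drain the existing journal
    mount -t tmpfs -o size=2G tmpfs /mnt/journal-tmpfs    # hypothetical mount point
    ln -sf /mnt/journal-tmpfs/journal /var/lib/ceph/osd/ceph-0/journal
    ceph-osd -i 0 --mkjournal
    service ceph start osd.0

If RBD write IOPS stay at ~3200 with the journal in RAM, the journal devices are off the hook and the bottleneck really is in the OSD/client code path.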
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
>>You didn't specify what you did, but i assume you did read test.

yes, indeed

>>Those scale, as in running fio in multiple VMs in parallel gives me about
>>6200 IOPS each, so much better than the 7200 for a single one.
>>And yes, the client CPU is quite busy.

oh ok !

>>However my real, original question is about writes. And they are stuck at
>>3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...

Sorry, I can't test writes; I don't have an SSD journal for now. I'll try to send results when I have my SSD cluster.

(But I remember some talk from Sage saying that indeed small direct writes can be pretty slow; that is why rbd_cache is recommended, to aggregate small writes into bigger ones.)
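The rbd_cache write aggregation Alexandre mentions is configured on the client. A rough sketch of the usual knobs (option names as in the firefly-era docs, values here are just examples; QEMU additionally needs cache=writeback on the drive for the cache to actually be used):

    [client]
        rbd cache = true
        rbd cache size = 67108864        # e.g. 64 MB instead of the 32 MB default
        rbd cache max dirty = 50331648   # bytes allowed dirty before writeback
        rbd cache max dirty age = 2      # seconds a dirty buffer may age before flush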
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Tue, 13 May 2014 18:10:25 +0200 (CEST) Alexandre DERUMIER wrote:

> I have just done some tests with fio-rbd
> (http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
> directly from the kvm host (not from the vm).
>
> 1 fio job: around 8000 iops
> 2 different parallel fio jobs (on different rbd volumes): around 8000 iops per fio job!
>
> cpu on client is at 100%
> cpu of osd are around 70% of 1 core now.
>
> So, there seems to be a bottleneck client-side somewhere.
>
You didn't specify what you did, but I assume you did a read test.

Those scale, as in running fio in multiple VMs in parallel gives me about 6200 IOPS each, so much better than the 7200 for a single one. And yes, the client CPU is quite busy.

However my real, original question is about writes. And they are stuck at 3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...

Christian
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I have just done some tests with fio-rbd
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
directly from the kvm host (not from the vm).

1 fio job: around 8000 iops
2 different parallel fio jobs (on different rbd volumes): around 8000 iops per fio job!

cpu on client is at 100%
cpu of osd are around 70% of 1 core now.

So, there seems to be a bottleneck client-side somewhere.

(I remember some tests from Stefan Priebe on this mailing list, with a full SSD cluster, having almost the same results.)
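For reference, a fio invocation against the rbd ioengine of the kind the linked telekomcloud post describes would look roughly like this (sketch only; the pool and image names are placeholders, and the rbd engine needs a fio build with librbd support):

    fio --name=fiorbd --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=testimg --direct=1 --rw=randwrite --bs=4k \
        --iodepth=64 --numjobs=1 --size=400m

Running this on the KVM host takes QEMU and the guest kernel out of the picture, which is what makes it useful for locating a client-side bottleneck.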
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
>>Actually check your random read output again, you gave it the wrong
>>parameter, it needs to be randread, not rand-read.

oops, sorry. I got around 7500 iops with randread.

>>Your cluster isn't that old (the CPUs are in the same ballpark)

Yes, these are 6-7 year old servers (these Xeons were released in 2007...).

So they miss some features like crc32 and sse4, for example, which can help Ceph a lot.

(I'll try to do some osd tuning (threads, ...) to see if I can improve performance.)

----- Mail original -----
De: "Christian Balzer"
À: "Alexandre DERUMIER"
Cc: ceph-users@lists.ceph.com
Envoyé: Mardi 13 Mai 2014 16:39:58
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

On Tue, 13 May 2014 16:09:28 +0200 (CEST) Alexandre DERUMIER wrote:

> >>For what it's worth, my cluster gives me 4100 IOPS with the sequential
> >>fio run below and 7200 when doing random reads (go figure). Of course
> >>I made sure these came from the pagecache of the storage nodes, no
> >>disk I/O reported at all and the CPUs used just 1 core per OSD.
> >>---
> >>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> >>--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
> >>---
>
> This seems pretty low,
>
> I can get around 6000 iops seq or rand read,

Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.

> with a pretty old cluster

Your cluster isn't that old (the CPUs are in the same ballpark) and has
12 OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't that much faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD
either, yours is about 500, mine about 1000...

Christian

> 3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning
> in ceph.conf
>
> each node:
> --
> -2x quad xeon E5430 @ 2.66GHz
> -4 osd, seagate 7.2k sas (with 512MB cache on controller).
(journal > on same disk than osd, no dedicated ssd) -2 gigabit link (lacp) > -switch cisco 2960 > > > > each osd process are around 30% 1core during benchmark > no disk access (pagecache on ceph nodes) > > > > sequential > -- > # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 > --filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, > ioengine=libaio, iodepth=64 2.0.8 Starting 1 process > Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta > 00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158 > read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec > slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72 > clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10 > lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10 > clat percentiles (msec): > | 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], > | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], > | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15], > | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404], > | 99.99th=[ 404] > bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, > stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, > 50=0.23% lat (msec) : 250=0.13%, 500=0.06% > cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, > >=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 > > > Run status group 0 (all jobs): > READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, > mint=18404msec, maxt=18404msec > > > Disk stats (read/write): > vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, > util=99.58% > > > random read > --- > # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > --numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64 > --filename=/dev/vdb valid values: read Sequential read : write > Sequential write : randread Random read > : randwrite Random write > : rw Sequential read and write mix > : readwrite Sequential read and write mix > : randrw Random read and write mix > > > fio: failed parsing rw=rand-read > fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengin
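For reference, the "random read" numbers above actually came from a sequential run, since fio rejected rw=rand-read and fell back to rw=read. The corrected invocation, with only the --rw value changed from the failed command above (same device name as used there), would be:
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb
---
Presumably this is the run behind the ~7500 iops randread figure Alexandre gives earlier in this message.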
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
, per=100.00%, avg=22967.77, > stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%, > 50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06% > cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88 > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, > >=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 > > > Run status group 0 (all jobs): > READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s, > mint=17897msec, maxt=17897msec > > > Disk stats (read/write): > vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, > util=99.57% > > > > > > > MonSiteEstLent.com - Blog dédié à la webperformance et la gestion de > pics de trafic > > - Mail original - > > De: "Christian Balzer" > À: "Alexandre DERUMIER" > Cc: ceph-users@lists.ceph.com > Envoyé: Mardi 13 Mai 2014 14:38:57 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > > Hello, > > On Tue, 13 May 2014 13:36:49 +0200 (CEST) Alexandre DERUMIER wrote: > > > >>It might, but at the IOPS I'm seeing anybody using SSD for file > > >>storage should have screamed out already. > > >>Also given the CPU usage I'm seeing during that test run such a > > >>setup would probably require 32+ cores. > > > > Just found this: > > > > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf > > > > > That's and interesting find indeed. > > The CPU to OSD chart clearly assumes the OSD to be backed by spinning > rust or doing 4MB block transactions. > As stated before, at the 4KB blocksize below one OSD eats up slightly > over 2 cores on the 4332HE at full speed. > > > page12: > > > > " Note: As of Ceph Dumpling release (10/2013), a per-OSD read > > performance is approximately 4,000 IOPS and a per node limit of around > > 35,000 IOPS when doing reads directly from pagecache. This appears to > > indicate that Ceph can make good use of spinning disks for data > > storage and may benefit from SSD backed OSDs, though may also be > > limited on high performance SSDs." > > > Node that this a read test and like nearly all IOPS statements utterly > worthless unless qualified by things as block size, working set size, > type of I/O (random or sequential). > > For what it's worth, my cluster gives me 4100 IOPS with the sequential > fio run below and 7200 when doing random reads (go figure). Of course I > made sure these came come the pagecache of the storage nodes, no disk > I/O reported at all and the CPUs used just 1 core per OSD. > --- > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 > --rw=read --name=fiojob --blocksize=4k --iodepth=64 --- > > > Christian > > > > > Maybe Intank could comment about the 4000iops by osd ? 
> > > > > > > Alexandre Derumier Ingénieur système et stockage Fixe : 03 20 68 90 88 > Fax : 03 20 68 90 81 45 Bvd du Général Leclerc 59100 Roubaix 12 rue > Marivaux 75002 Paris MonSiteEstLent.com - Blog dédié à la webperformance > et la gestion de pics de trafic - Mail original - > > > > De: "Christian Balzer" > > À: ceph-users@lists.ceph.com > > Cc: "Alexandre DERUMIER" > > Envoyé: Mardi 13 Mai 2014 11:51:37 > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > backing devices > > > > > > Hello, > > > > On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote: > > > > > Hi Christian, > > > > > > I'm going to test a full ssd cluster in coming months, > > > I'll send result on the mailing. > > > > > Looking forward to that. > > > > > > > > Do you have tried to use 1 osd by physical disk ? (without raid6) > > > > > No, if you look back to the last year December "Sanity check..." > > thread by me, it gives the reasons. > > In short, highest density (thus replication of 2 and to make that safe > > based on RAID6) and operational maintainability (it is a remote data > > center, so replacing broken disks is a pain). > > > > That cluster is fast enough for my purposes and that fio test isn't a > > typical load for it when it goes into production. > > But for designing a gener
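To make the per-OSD comparison behind Christian's "yours is about 500, mine about 1000" remark explicit, here is a rough back-of-the-envelope sketch using only figures quoted in this thread (all reads served from pagecache):
---
Alexandre:             ~6000 IOPS / (3 nodes x 4 OSDs) ~= 500 IOPS per OSD
Christian (sequential): ~4100 IOPS / (2 nodes x 2 OSDs) ~= 1025 IOPS per OSD
Christian (random):     ~7200 IOPS / 4 OSDs             ~= 1800 IOPS per OSD
---
All of these sit well below the roughly 4,000 read IOPS per OSD that the Inktank guide quotes for Dumpling reading from pagecache.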
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
>>For what it's worth, my cluster gives me 4100 IOPS with the sequential fio >>run below and 7200 when doing random reads (go figure). Of course I made >>sure these came come the pagecache of the storage nodes, no disk I/O >>reported at all and the CPUs used just 1 core per OSD. >>--- >>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 >>--rw=read --name=fiojob --blocksize=4k --iodepth=64 >>--- This seem pretty low, I can get around 6000iops seq or rand read, with a pretty old cluster 3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning in ceph.conf each node: -- -2x quad xeon E5430 @ 2.66GHz -4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal on same disk than osd, no dedicated ssd) -2 gigabit link (lacp) -switch cisco 2960 each osd process are around 30% 1core during benchmark no disk access (pagecache on ceph nodes) sequential -- # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64 2.0.8 Starting 1 process Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158 read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72 clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10 lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10 clat percentiles (msec): | 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15], | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404], | 99.99th=[ 404] bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 50=0.23% lat (msec) : 250=0.13%, 500=0.06% cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, mint=18404msec, maxt=18404msec Disk stats (read/write): vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, util=99.58% random read --- # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb valid values: read Sequential read : write Sequential write : randread Random read : randwrite Random write : rw Sequential read and write mix : readwrite Sequential read and write mix : randrw Random read and write mix fio: failed parsing rw=rand-read fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64 2.0.8 Starting 1 process Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta 00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172 read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38 clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24 lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24 clat percentiles (msec): | 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 
95.00th=[ 14], | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359], | 99.99th=[ 404] bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77, stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%, 50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06% cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s, mint=17897msec, maxt=17897msec Disk stats (read/write): vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, util=99.57% MonSiteEstLent.com - Blog dédié à la webperformance et la gestion de pics de trafic - Mail original - De: "Christian Balzer" À: "Alexandre DERUMIER" Cc: ceph-users@lists.ceph.com Envoyé: Mardi 13 Mai 2014 14:38:57 Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices Hello, On Tue, 13 May 2014 13:36:49 +0200 (CEST) Alex
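A quick way to back the "no disk access (pagecache on ceph nodes)" claim above is to watch the OSD data disks on each storage node while the benchmark runs; the device names below are placeholders for whatever the OSD disks are on a given node:
---
# on each ceph node, for the duration of the fio run
iostat -x 2 sda sdb sdc sdd
---
If the reads really are served from the pagecache, r/s and rkB/s on the OSD data devices should stay at or very near zero throughout, matching the "no disk access" observation.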
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello, On Tue, 13 May 2014 13:36:49 +0200 (CEST) Alexandre DERUMIER wrote: > >>It might, but at the IOPS I'm seeing anybody using SSD for file > >>storage should have screamed out already. > >>Also given the CPU usage I'm seeing during that test run such a setup > >>would probably require 32+ cores. > > Just found this: > > https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf > That's an interesting find indeed. The CPU to OSD chart clearly assumes the OSD to be backed by spinning rust or doing 4MB block transactions. As stated before, at the 4KB blocksize below one OSD eats up slightly over 2 cores on the 4332HE at full speed. > page12: > > " Note: As of Ceph Dumpling release (10/2013), a per-OSD read > performance is approximately 4,000 IOPS and a per node limit of around > 35,000 IOPS when doing reads directly from pagecache. This appears to > indicate that Ceph can make good use of spinning disks for data storage > and may benefit from SSD backed OSDs, though may also be limited on high > performance SSDs." > Note that this is a read test and, like nearly all IOPS statements, utterly worthless unless qualified by things such as block size, working set size, and type of I/O (random or sequential). For what it's worth, my cluster gives me 4100 IOPS with the sequential fio run below and 7200 when doing random reads (go figure). Of course I made sure these came from the pagecache of the storage nodes, no disk I/O reported at all and the CPUs used just 1 core per OSD. --- fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --- Christian > > Maybe Intank could comment about the 4000iops by osd ? > > > - Mail original - > > De: "Christian Balzer" > À: ceph-users@lists.ceph.com > Cc: "Alexandre DERUMIER" > Envoyé: Mardi 13 Mai 2014 11:51:37 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > > Hello, > > On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote: > > > Hi Christian, > > > > I'm going to test a full ssd cluster in coming months, > > I'll send result on the mailing. > > > Looking forward to that. > > > > > Do you have tried to use 1 osd by physical disk ? (without raid6) > > > No, if you look back to the last year December "Sanity check..." thread > by me, it gives the reasons. > In short, highest density (thus replication of 2 and to make that safe > based on RAID6) and operational maintainability (it is a remote data > center, so replacing broken disks is a pain). > > That cluster is fast enough for my purposes and that fio test isn't a > typical load for it when it goes into production. > But for designing a general purpose or high performance Ceph cluster in > the future I'd really love to have this mystery solved. > > > Maybe they are bottleneck in osd daemon, > > and using osd daemon by disk could help. > > > It might, but at the IOPS I'm seeing anybody using SSD for file storage > should have screamed out already. > Also given the CPU usage I'm seeing during that test run such a setup > would probably require 32+ cores. > > Christian > > > > > > > > > - Mail original - > > > > De: "Christian Balzer" > > À: ceph-users@lists.ceph.com > > Envoyé: Mardi 13 Mai 2014 11:03:47 > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > backing devices > > > > > > I'm clearly talking to myself, but whatever. > > > > For Greg, I've played with all the pertinent journal and filestore > > options and TCP nodelay, no changes at all.
> > > > Is there anybody on this ML who's running a Ceph cluster with a fast > > network and FAST filestore, so like me with a big HW cache in front of > > a RAID/JBODs or using SSDs for final storage? > > > > If so, what results do you get out of the fio statement below per OSD? > > In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, > > which is of course vastly faster than the normal indvidual HDDs could > > do. > > > > So I'm wondering if I'm hitting some inherent limitation of how fast a > > single OSD (as in the software) can handle IOPS, given that everything > > else has been ruled out from where I stand. > > > > This would also explain why none of the option changes or the use of > > RBD caching has any measurable effect
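The cache-larger-than-working-set case Christian describes at the end corresponds to a client-side setting roughly like the following; the 512MB value is only an illustration picked to exceed the 400m fio working set (the default is the 32MB he mentions):
---
[client]
rbd cache = true
# bytes; anything comfortably larger than the 400m test size
rbd cache size = 536870912
---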
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
>>It might, but at the IOPS I'm seeing anybody using SSD for file storage >>should have screamed out already. >>Also given the CPU usage I'm seeing during that test run such a setup >>would probably require 32+ cores. Just found this: https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf page12: " Note: As of Ceph Dumpling release (10/2013), a per-OSD read performance is approximately 4,000 IOPS and a per node limit of around 35,000 IOPS when doing reads directly from pagecache. This appears to indicate that Ceph can make good use of spinning disks for data storage and may benefit from SSD backed OSDs, though may also be limited on high performance SSDs." Maybe Intank could comment about the 4000iops by osd ? - Mail original - De: "Christian Balzer" À: ceph-users@lists.ceph.com Cc: "Alexandre DERUMIER" Envoyé: Mardi 13 Mai 2014 11:51:37 Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices Hello, On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote: > Hi Christian, > > I'm going to test a full ssd cluster in coming months, > I'll send result on the mailing. > Looking forward to that. > > Do you have tried to use 1 osd by physical disk ? (without raid6) > No, if you look back to the last year December "Sanity check..." thread by me, it gives the reasons. In short, highest density (thus replication of 2 and to make that safe based on RAID6) and operational maintainability (it is a remote data center, so replacing broken disks is a pain). That cluster is fast enough for my purposes and that fio test isn't a typical load for it when it goes into production. But for designing a general purpose or high performance Ceph cluster in the future I'd really love to have this mystery solved. > Maybe they are bottleneck in osd daemon, > and using osd daemon by disk could help. > It might, but at the IOPS I'm seeing anybody using SSD for file storage should have screamed out already. Also given the CPU usage I'm seeing during that test run such a setup would probably require 32+ cores. Christian > > > > ----- Mail original - > > De: "Christian Balzer" > À: ceph-users@lists.ceph.com > Envoyé: Mardi 13 Mai 2014 11:03:47 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > > I'm clearly talking to myself, but whatever. > > For Greg, I've played with all the pertinent journal and filestore > options and TCP nodelay, no changes at all. > > Is there anybody on this ML who's running a Ceph cluster with a fast > network and FAST filestore, so like me with a big HW cache in front of a > RAID/JBODs or using SSDs for final storage? > > If so, what results do you get out of the fio statement below per OSD? > In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, > which is of course vastly faster than the normal indvidual HDDs could > do. > > So I'm wondering if I'm hitting some inherent limitation of how fast a > single OSD (as in the software) can handle IOPS, given that everything > else has been ruled out from where I stand. > > This would also explain why none of the option changes or the use of > RBD caching has any measurable effect in the test case below. > As in, a slow OSD aka single HDD with journal on the same disk would > clearly benefit from even the small 32MB standard RBD cache, while in my > test case the only time the caching becomes noticeable is if I increase > the cache size to something larger than the test data size. 
^o^ > > On the other hand if people here regularly get thousands or tens of > thousands IOPS per OSD with the appropriate HW I'm stumped. > > Christian > > On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: > > > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > > > > > Oh, I didn't notice that. I bet you aren't getting the expected > > > throughput on the RAID array with OSD access patterns, and that's > > > applying back pressure on the journal. > > > > > > > In the a "picture" being worth a thousand words tradition, I give you > > this iostat -x output taken during a fio run: > > > > avg-cpu: %user %nice %system %iowait %steal %idle > > 50.82 0.00 19.43 0.17 0.00 29.58 > > > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > > avgrq-sz avgqu-sz await r_await w_await svctm %util > > sda 0.00 51.50 0.00 1633.50 0.00 7460.00 > > 9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb > > 0.00 0.00 0.
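On the CPU side, the crc32/sse4 point Alexandre raises elsewhere in this thread is easy to check per node; the CRC32 instruction he refers to is provided by SSE4.2, so the sse4_2 flag is the one to look for (plain /proc inspection, nothing ceph-specific):
---
grep -o -E 'sse4_[12]' /proc/cpuinfo | sort | uniq -c
---
On the 2007-era E5430s this should report sse4_1 only, while newer CPUs should also list sse4_2.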
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello, On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote: > Hi Christian, > > I'm going to test a full ssd cluster in coming months, > I'll send result on the mailing. > Looking forward to that. > > Do you have tried to use 1 osd by physical disk ? (without raid6) > No, if you look back to the last year December "Sanity check..." thread by me, it gives the reasons. In short, highest density (thus replication of 2 and to make that safe based on RAID6) and operational maintainability (it is a remote data center, so replacing broken disks is a pain). That cluster is fast enough for my purposes and that fio test isn't a typical load for it when it goes into production. But for designing a general purpose or high performance Ceph cluster in the future I'd really love to have this mystery solved. > Maybe they are bottleneck in osd daemon, > and using osd daemon by disk could help. > It might, but at the IOPS I'm seeing anybody using SSD for file storage should have screamed out already. Also given the CPU usage I'm seeing during that test run such a setup would probably require 32+ cores. Christian > > > > - Mail original - > > De: "Christian Balzer" > À: ceph-users@lists.ceph.com > Envoyé: Mardi 13 Mai 2014 11:03:47 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > > I'm clearly talking to myself, but whatever. > > For Greg, I've played with all the pertinent journal and filestore > options and TCP nodelay, no changes at all. > > Is there anybody on this ML who's running a Ceph cluster with a fast > network and FAST filestore, so like me with a big HW cache in front of a > RAID/JBODs or using SSDs for final storage? > > If so, what results do you get out of the fio statement below per OSD? > In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, > which is of course vastly faster than the normal indvidual HDDs could > do. > > So I'm wondering if I'm hitting some inherent limitation of how fast a > single OSD (as in the software) can handle IOPS, given that everything > else has been ruled out from where I stand. > > This would also explain why none of the option changes or the use of > RBD caching has any measurable effect in the test case below. > As in, a slow OSD aka single HDD with journal on the same disk would > clearly benefit from even the small 32MB standard RBD cache, while in my > test case the only time the caching becomes noticeable is if I increase > the cache size to something larger than the test data size. ^o^ > > On the other hand if people here regularly get thousands or tens of > thousands IOPS per OSD with the appropriate HW I'm stumped. > > Christian > > On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: > > > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > > > > > Oh, I didn't notice that. I bet you aren't getting the expected > > > throughput on the RAID array with OSD access patterns, and that's > > > applying back pressure on the journal. 
> > > > > > > In the a "picture" being worth a thousand words tradition, I give you > > this iostat -x output taken during a fio run: > > > > avg-cpu: %user %nice %system %iowait %steal %idle > > 50.82 0.00 19.43 0.17 0.00 29.58 > > > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > > avgrq-sz avgqu-sz await r_await w_await svctm %util > > sda 0.00 51.50 0.00 1633.50 0.00 7460.00 > > 9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb > > 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 > > 0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00 > > 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 > > 0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00 > > 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60 > > > > The %user CPU utilization is pretty much entirely the 2 OSD processes, > > note the nearly complete absence of iowait. > > > > sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs. > > Look at these numbers, the lack of queues, the low wait and service > > times (this is in ms) plus overall utilization. > > > > The only conclusion I can draw from these numbers and the network > > results below is that the latency happens within the OSD processes. > > > > Regards, > > > > Christian > > > When I suggested other tests, I meant with and without Ceph. One > > > particular one is OSD bench. That should be interesting to try at a > > > variety of block sizes. You could also
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hi Christian, I'm going to test a full ssd cluster in coming months, I'll send result on the mailing. Do you have tried to use 1 osd by physical disk ? (without raid6) Maybe they are bottleneck in osd daemon, and using osd daemon by disk could help. - Mail original - De: "Christian Balzer" À: ceph-users@lists.ceph.com Envoyé: Mardi 13 Mai 2014 11:03:47 Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices I'm clearly talking to myself, but whatever. For Greg, I've played with all the pertinent journal and filestore options and TCP nodelay, no changes at all. Is there anybody on this ML who's running a Ceph cluster with a fast network and FAST filestore, so like me with a big HW cache in front of a RAID/JBODs or using SSDs for final storage? If so, what results do you get out of the fio statement below per OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which is of course vastly faster than the normal indvidual HDDs could do. So I'm wondering if I'm hitting some inherent limitation of how fast a single OSD (as in the software) can handle IOPS, given that everything else has been ruled out from where I stand. This would also explain why none of the option changes or the use of RBD caching has any measurable effect in the test case below. As in, a slow OSD aka single HDD with journal on the same disk would clearly benefit from even the small 32MB standard RBD cache, while in my test case the only time the caching becomes noticeable is if I increase the cache size to something larger than the test data size. ^o^ On the other hand if people here regularly get thousands or tens of thousands IOPS per OSD with the appropriate HW I'm stumped. Christian On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > > > Oh, I didn't notice that. I bet you aren't getting the expected > > throughput on the RAID array with OSD access patterns, and that's > > applying back pressure on the journal. > > > > In the a "picture" being worth a thousand words tradition, I give you > this iostat -x output taken during a fio run: > > avg-cpu: %user %nice %system %iowait %steal %idle > 50.82 0.00 19.43 0.17 0.00 29.58 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > avgrq-sz avgqu-sz await r_await w_await svctm %util > sda 0.00 51.50 0.00 1633.50 0.00 7460.00 > 9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb > 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 > 0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00 > 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 > 0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00 > 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60 > > The %user CPU utilization is pretty much entirely the 2 OSD processes, > note the nearly complete absence of iowait. > > sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs. > Look at these numbers, the lack of queues, the low wait and service > times (this is in ms) plus overall utilization. > > The only conclusion I can draw from these numbers and the network results > below is that the latency happens within the OSD processes. > > Regards, > > Christian > > When I suggested other tests, I meant with and without Ceph. One > > particular one is OSD bench. That should be interesting to try at a > > variety of block sizes. You could also try runnin RADOS bench and > > smalliobench at a few different sizes. > > -Greg > > > > On Wednesday, May 7, 2014, Alexandre DERUMIER > > wrote: > > > > > Hi Christian, > > > > > > Do you have tried without raid6, to have more osd ? 
> > > (how many disks do you have begin the raid6 ?) > > > > > > > > > Aslo, I known that direct ios can be quite slow with ceph, > > > > > > maybe can you try without --direct=1 > > > > > > and also enable rbd_cache > > > > > > ceph.conf > > > [client] > > > rbd cache = true > > > > > > > > > > > > > > > - Mail original - > > > > > > De: "Christian Balzer" > > > > À: "Gregory Farnum" >, > > > ceph-users@lists.ceph.com > > > Envoyé: Jeudi 8 Mai 2014 04:49:16 > > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > > backing devices > > > > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > > > > > > > wrote: &g
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
I'm clearly talking to myself, but whatever. For Greg, I've played with all the pertinent journal and filestore options and TCP nodelay, no changes at all. Is there anybody on this ML who's running a Ceph cluster with a fast network and FAST filestore, so like me with a big HW cache in front of a RAID/JBODs or using SSDs for final storage? If so, what results do you get out of the fio statement below per OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which is of course vastly faster than the normal indvidual HDDs could do. So I'm wondering if I'm hitting some inherent limitation of how fast a single OSD (as in the software) can handle IOPS, given that everything else has been ruled out from where I stand. This would also explain why none of the option changes or the use of RBD caching has any measurable effect in the test case below. As in, a slow OSD aka single HDD with journal on the same disk would clearly benefit from even the small 32MB standard RBD cache, while in my test case the only time the caching becomes noticeable is if I increase the cache size to something larger than the test data size. ^o^ On the other hand if people here regularly get thousands or tens of thousands IOPS per OSD with the appropriate HW I'm stumped. Christian On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > > > Oh, I didn't notice that. I bet you aren't getting the expected > > throughput on the RAID array with OSD access patterns, and that's > > applying back pressure on the journal. > > > > In the a "picture" being worth a thousand words tradition, I give you > this iostat -x output taken during a fio run: > > avg-cpu: %user %nice %system %iowait %steal %idle > 50.820.00 19.430.170.00 29.58 > > Device: rrqm/s wrqm/s r/s w/srkB/swkB/s > avgrq-sz avgqu-sz await r_await w_await svctm %util > sda 0.0051.500.00 1633.50 0.00 7460.00 > 9.13 0.180.110.000.11 0.01 1.40 sdb > 0.00 0.000.00 1240.50 0.00 5244.00 8.45 0.30 > 0.250.000.25 0.02 2.00 sdc 0.00 5.00 > 0.00 2468.50 0.00 13419.0010.87 0.240.100.00 > 0.10 0.09 22.00 sdd 0.00 6.500.00 1913.00 > 0.00 10313.0010.78 0.200.100.000.10 0.09 16.60 > > The %user CPU utilization is pretty much entirely the 2 OSD processes, > note the nearly complete absence of iowait. > > sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs. > Look at these numbers, the lack of queues, the low wait and service > times (this is in ms) plus overall utilization. > > The only conclusion I can draw from these numbers and the network results > below is that the latency happens within the OSD processes. > > Regards, > > Christian > > When I suggested other tests, I meant with and without Ceph. One > > particular one is OSD bench. That should be interesting to try at a > > variety of block sizes. You could also try runnin RADOS bench and > > smalliobench at a few different sizes. > > -Greg > > > > On Wednesday, May 7, 2014, Alexandre DERUMIER > > wrote: > > > > > Hi Christian, > > > > > > Do you have tried without raid6, to have more osd ? > > > (how many disks do you have begin the raid6 ?) 
> > > > > > > > > Aslo, I known that direct ios can be quite slow with ceph, > > > > > > maybe can you try without --direct=1 > > > > > > and also enable rbd_cache > > > > > > ceph.conf > > > [client] > > > rbd cache = true > > > > > > > > > > > > > > > - Mail original - > > > > > > De: "Christian Balzer" > > > > À: "Gregory Farnum" >, > > > ceph-users@lists.ceph.com > > > Envoyé: Jeudi 8 Mai 2014 04:49:16 > > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > > backing devices > > > > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > > > > > > > wrote: > > > > > > > > > > Hello, > > > > > > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 > > > > > behind an Areca 1882 with 4GB of cache. > > > > > > >
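Greg's suggestion above to try OSD bench "at a variety of block sizes" maps onto the admin command Christian uses later in the thread; a small sweep, keeping the total at the default 1GB and only varying the block size, might look like this (the 64KB step is just an illustrative intermediate value):
---
# default: 1GB total in 4MB blocks
ceph tell osd.0 bench
# same 1GB total, 64KB and 4KB blocks
ceph tell osd.0 bench 1073741824 65536
ceph tell osd.0 bench 1073741824 4096
---
The 4MB and 4KB variants are the ones whose output appears further down in this thread.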
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > Oh, I didn't notice that. I bet you aren't getting the expected > throughput on the RAID array with OSD access patterns, and that's > applying back pressure on the journal. > In the "a picture is worth a thousand words" tradition, I give you this iostat -x output taken during a fio run: avg-cpu: %user %nice %system %iowait %steal %idle 50.82 0.00 19.43 0.17 0.00 29.58 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40 sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00 sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00 sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60 The %user CPU utilization is pretty much entirely the 2 OSD processes, note the nearly complete absence of iowait. sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs. Look at these numbers, the lack of queues, the low wait and service times (this is in ms) plus overall utilization. The only conclusion I can draw from these numbers and the network results below is that the latency happens within the OSD processes. Regards, Christian > When I suggested other tests, I meant with and without Ceph. One > particular one is OSD bench. That should be interesting to try at a > variety of block sizes. You could also try runnin RADOS bench and > smalliobench at a few different sizes. > -Greg > > On Wednesday, May 7, 2014, Alexandre DERUMIER > wrote: > > > Hi Christian, > > > > Do you have tried without raid6, to have more osd ? > > (how many disks do you have begin the raid6 ?) > > > > > > Aslo, I known that direct ios can be quite slow with ceph, > > > > maybe can you try without --direct=1 > > > > and also enable rbd_cache > > > > ceph.conf > > [client] > > rbd cache = true > > > > > > > > > > ----- Mail original - > > > > De: "Christian Balzer" > > > À: "Gregory Farnum" >, > > ceph-users@lists.ceph.com > > Envoyé: Jeudi 8 Mai 2014 04:49:16 > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > backing devices > > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > > > > > wrote: > > > > > > > > Hello, > > > > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 > > > > behind an Areca 1882 with 4GB of cache. > > > > > > > > Running this fio: > > > > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k > > > > --iodepth=128 > > > > > > > > results in: > > > > > > > > 30k IOPS on the journal SSD (as expected) > > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > > > there) 3200 IOPS from a VM using userspace RBD > > > > 2900 IOPS from a host kernelspace mounted RBD > > > > > > > > When running the fio from the VM RBD the utilization of the > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2% > > > > (1500 IOPS after some obvious merging). > > > > The OSD processes are quite busy, reading well over 200% on atop, > > > > but the system is not CPU or otherwise resource starved at that > > > > moment.
> > > > > > > > Running multiple instances of this test from several VMs on > > > > different hosts changes nothing, as in the aggregated IOPS for the > > > > whole cluster will still be around 3200 IOPS. > > > > > > > > Now clearly RBD has to deal with latency here, but the network is > > > > IPoIB with the associated low latency and the journal SSDs are the > > > > (consistently) fasted ones around. > > > > > > > > I guess what I am wondering about is if this is normal and to be > > > > expected or if not where all that potential performance got lost. > > > > > > Hmm, with 128 IOs at a t
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello, On Thu, 08 May 2014 17:20:59 +0200 Udo Lembke wrote: > Hi, > I think not that's related, but how full is your ceph-cluster? Perhaps > it's has something to do with the fragmentation on the xfs-filesystem > (xfs_db -c frag -r device)? > As I wrote, this cluster will go into production next week, so it's neither full nor fragmented. I'd also think any severe fragmentation would show up in high device utilization, something I stated that's not present. In fact after all the initial testing I did defrag the OSDs a few days ago, not that they actually needed it. Because for starters it is ext4, not xfs, see: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg08619.html For what it's worth, I never got an answer to the actual question in that mail. Christian > Udo > > Am 08.05.2014 02:57, schrieb Christian Balzer: > > > > Hello, > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind > > an Areca 1882 with 4GB of cache. > > > > Running this fio: > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 > > > > results in: > > > > 30k IOPS on the journal SSD (as expected) > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > there) 3200 IOPS from a VM using userspace RBD > > 2900 IOPS from a host kernelspace mounted RBD > > > > When running the fio from the VM RBD the utilization of the journals is > > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after > > some obvious merging). > > The OSD processes are quite busy, reading well over 200% on atop, but > > the system is not CPU or otherwise resource starved at that moment. > > > > Running multiple instances of this test from several VMs on different > > hosts changes nothing, as in the aggregated IOPS for the whole cluster > > will still be around 3200 IOPS. > > > > Now clearly RBD has to deal with latency here, but the network is IPoIB > > with the associated low latency and the journal SSDs are the > > (consistently) fasted ones around. > > > > I guess what I am wondering about is if this is normal and to be > > expected or if not where all that potential performance got lost. > > > > Regards, > > > > Christian > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
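Since these OSDs are on ext4 rather than xfs, the closest equivalent of Udo's xfs_db check would be e4defrag in report-only mode (part of e2fsprogs; the mount point below is a placeholder for wherever the OSD filesystems are mounted):
---
e4defrag -c /var/lib/ceph/osd/ceph-0
---
That prints per-file extent counts and an overall fragmentation score for the filesystem, which should confirm the "neither full nor fragmented" statement above.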
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hi again, sorry, too fast - but this can't be a problem due to your 4GB cache... Udo Am 08.05.2014 17:20, schrieb Udo Lembke: > Hi, > I think not that's related, but how full is your ceph-cluster? Perhaps > it's has something to do with the fragmentation on the xfs-filesystem > (xfs_db -c frag -r device)? > > Udo > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hi, I think not that's related, but how full is your ceph-cluster? Perhaps it's has something to do with the fragmentation on the xfs-filesystem (xfs_db -c frag -r device)? Udo Am 08.05.2014 02:57, schrieb Christian Balzer: > > Hello, > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals > are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882 > with 4GB of cache. > > Running this fio: > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 > --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 > > results in: > > 30k IOPS on the journal SSD (as expected) > 110k IOPS on the OSD (it fits neatly into the cache, no surprise there) > 3200 IOPS from a VM using userspace RBD > 2900 IOPS from a host kernelspace mounted RBD > > When running the fio from the VM RBD the utilization of the journals is > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some > obvious merging). > The OSD processes are quite busy, reading well over 200% on atop, but > the system is not CPU or otherwise resource starved at that moment. > > Running multiple instances of this test from several VMs on different hosts > changes nothing, as in the aggregated IOPS for the whole cluster will > still be around 3200 IOPS. > > Now clearly RBD has to deal with latency here, but the network is IPoIB > with the associated low latency and the journal SSDs are the > (consistently) fasted ones around. > > I guess what I am wondering about is if this is normal and to be expected > or if not where all that potential performance got lost. > > Regards, > > Christian > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello, On Thu, 08 May 2014 11:31:54 +0200 (CEST) Alexandre DERUMIER wrote: > > The OSD processes are quite busy, reading well over 200% on atop, but > > the system is not CPU or otherwise resource starved at that moment. > > osd use 2 threads by default (could explain the 200%) > > maybe can you try to put in ceph.conf > > osd op threads = 8 > Already at 10 (for some weeks now). ^o^ How that setting relates to the actual 220 threads per OSD process is a mystery for another day. > > (don't known how many cores you have) > 6. The OSDs get busy (CPU, not IOWAIT), but there still are 1-2 cores idle at that point. > > > - Mail original - > > De: "Christian Balzer" > À: ceph-users@lists.ceph.com > Cc: "Alexandre DERUMIER" > Envoyé: Jeudi 8 Mai 2014 08:52:15 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > On Thu, 08 May 2014 08:41:54 +0200 (CEST) Alexandre DERUMIER wrote: > > > Stupid question : Is your areca 4GB cache shared between ssd journal > > and osd ? > > > Not a stupid question. > I made that mistake about 3 years ago in a DRBD setup, OS and activity > log SSDs on the same controller as the storage disks. > > > or only use by osds ? > > > Only used by the OSDs (2 in total, 11x3TB HDD in RAID6). > I keep repeating myself, neither the journal devices nor the OSDs seem > to be under any particular load or pressure (utilization) according > iostat and atop during the tests. > > Christian > > > > > > > ----- Mail original - > > > > De: "Christian Balzer" > > À: ceph-users@lists.ceph.com > > Envoyé: Jeudi 8 Mai 2014 08:26:33 > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > backing devices > > > > > > Hello, > > > > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > > > > > Oh, I didn't notice that. I bet you aren't getting the expected > > > throughput on the RAID array with OSD access patterns, and that's > > > applying back pressure on the journal. > > > > > I doubt that based on what I see in terms of local performance and > > actual utilization figures according to iostat and atop during the > > tests. > > > > But if that were to be true, how would one see if that's the case, as > > in where in the plethora of data from: > > > > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump > > > > is the data I'd be looking for? > > > > > When I suggested other tests, I meant with and without Ceph. One > > > particular one is OSD bench. That should be interesting to try at a > > > variety of block sizes. You could also try runnin RADOS bench and > > > smalliobench at a few different sizes. > > > > > I already did the local tests, as in w/o Ceph, see the original mail > > below. > > > > And you might recall me doing rados benches as well in another thread > > 2 weeks ago or so. > > > > In either case, osd benching gives me: > > --- > > # time ceph tell osd.0 bench > > { "bytes_written": 1073741824, > > "blocksize": 4194304, > > "bytes_per_sec": "247102026.00"} > > > > > > real 0m4.483s > > --- > > This is quite a bit slower than this particular SSD (200GB DC 3700) > > should be able to write, but I will let that slide. > > Note that it is the journal SSD that gets under pressure here (nearly > > 900% util) while the OSD is bored at around 15%. Which is no surprise, > > as it can write data at up to 1600MB/s. 
> > > > at 4k blocks we see: > > --- > > # time ceph tell osd.0 bench 1073741824 4096 > > { "bytes_written": 1073741824, > > "blocksize": 4096, > > "bytes_per_sec": "9004316.00"} > > > > > > real 1m59.368s > > --- > > Here we get a more balanced picture between journal and storage > > utilization, hovering around 40-50%. > > So clearly not overtaxing either component. > > But yet, this looks like 2100 IOPS to me, if my math is half right. > > > > Rados at 4k gives us this: > > --- > > Total time run: 30.912786 > > Total writes made: 44490 > > Write size: 4096 > > Bandwidth (MB/sec): 5.622 > > > > Stddev Bandwidth: 3.31452 > > Max bandwidth (MB/sec): 9.92578 > > Min bandwidth (MB/sec): 0 > > Average
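Since the on-disk ceph.conf and a long-running daemon can disagree, the "Already at 10" claim is easy to verify through the same admin socket used for perf dump earlier in the thread; a sketch, assuming the default socket path shown there:
---
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_op_threads
---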
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
> The OSD processes are quite busy, reading well over 200% on atop, but > the system is not CPU or otherwise resource starved at that moment. osd use 2 threads by default (could explain the 200%) maybe can you try to put in ceph.conf osd op threads = 8 (don't known how many cores you have) - Mail original - De: "Christian Balzer" À: ceph-users@lists.ceph.com Cc: "Alexandre DERUMIER" Envoyé: Jeudi 8 Mai 2014 08:52:15 Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices On Thu, 08 May 2014 08:41:54 +0200 (CEST) Alexandre DERUMIER wrote: > Stupid question : Is your areca 4GB cache shared between ssd journal and > osd ? > Not a stupid question. I made that mistake about 3 years ago in a DRBD setup, OS and activity log SSDs on the same controller as the storage disks. > or only use by osds ? > Only used by the OSDs (2 in total, 11x3TB HDD in RAID6). I keep repeating myself, neither the journal devices nor the OSDs seem to be under any particular load or pressure (utilization) according iostat and atop during the tests. Christian > > > - Mail original - > > De: "Christian Balzer" > À: ceph-users@lists.ceph.com > Envoyé: Jeudi 8 Mai 2014 08:26:33 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > > Hello, > > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > > > Oh, I didn't notice that. I bet you aren't getting the expected > > throughput on the RAID array with OSD access patterns, and that's > > applying back pressure on the journal. > > > I doubt that based on what I see in terms of local performance and > actual utilization figures according to iostat and atop during the > tests. > > But if that were to be true, how would one see if that's the case, as in > where in the plethora of data from: > > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump > > is the data I'd be looking for? > > > When I suggested other tests, I meant with and without Ceph. One > > particular one is OSD bench. That should be interesting to try at a > > variety of block sizes. You could also try runnin RADOS bench and > > smalliobench at a few different sizes. > > > I already did the local tests, as in w/o Ceph, see the original mail > below. > > And you might recall me doing rados benches as well in another thread 2 > weeks ago or so. > > In either case, osd benching gives me: > --- > # time ceph tell osd.0 bench > { "bytes_written": 1073741824, > "blocksize": 4194304, > "bytes_per_sec": "247102026.00"} > > > real 0m4.483s > --- > This is quite a bit slower than this particular SSD (200GB DC 3700) > should be able to write, but I will let that slide. > Note that it is the journal SSD that gets under pressure here (nearly > 900% util) while the OSD is bored at around 15%. Which is no surprise, > as it can write data at up to 1600MB/s. > > at 4k blocks we see: > --- > # time ceph tell osd.0 bench 1073741824 4096 > { "bytes_written": 1073741824, > "blocksize": 4096, > "bytes_per_sec": "9004316.00"} > > > real 1m59.368s > --- > Here we get a more balanced picture between journal and storage > utilization, hovering around 40-50%. > So clearly not overtaxing either component. > But yet, this looks like 2100 IOPS to me, if my math is half right. 
> > Rados at 4k gives us this: > --- > Total time run: 30.912786 > Total writes made: 44490 > Write size: 4096 > Bandwidth (MB/sec): 5.622 > > Stddev Bandwidth: 3.31452 > Max bandwidth (MB/sec): 9.92578 > Min bandwidth (MB/sec): 0 > Average Latency: 0.0444653 > Stddev Latency: 0.121887 > Max latency: 2.80917 > Min latency: 0.001958 > --- > So this is even worse, just about 1500 IOPS. > > Regards, > > Christian > > > -Greg > > > > On Wednesday, May 7, 2014, Alexandre DERUMIER > > wrote: > > > > > Hi Christian, > > > > > > Do you have tried without raid6, to have more osd ? > > > (how many disks do you have begin the raid6 ?) > > > > > > > > > Aslo, I known that direct ios can be quite slow with ceph, > > > > > > maybe can you try without --direct=1 > > > > > > and also enable rbd_cache > > > > > > ceph.conf > > > [client] > > > rbd cache = true > > > > > > > > > > > > >
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Thu, 08 May 2014 08:41:54 +0200 (CEST) Alexandre DERUMIER wrote: > Stupid question : Is your areca 4GB cache shared between ssd journal and > osd ? > Not a stupid question. I made that mistake about 3 years ago in a DRBD setup, OS and activity log SSDs on the same controller as the storage disks. > or only use by osds ? > Only used by the OSDs (2 in total, 11x3TB HDD in RAID6). I keep repeating myself, neither the journal devices nor the OSDs seem to be under any particular load or pressure (utilization) according iostat and atop during the tests. Christian > > > - Mail original - > > De: "Christian Balzer" > À: ceph-users@lists.ceph.com > Envoyé: Jeudi 8 Mai 2014 08:26:33 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > > Hello, > > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > > > Oh, I didn't notice that. I bet you aren't getting the expected > > throughput on the RAID array with OSD access patterns, and that's > > applying back pressure on the journal. > > > I doubt that based on what I see in terms of local performance and > actual utilization figures according to iostat and atop during the > tests. > > But if that were to be true, how would one see if that's the case, as in > where in the plethora of data from: > > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump > > is the data I'd be looking for? > > > When I suggested other tests, I meant with and without Ceph. One > > particular one is OSD bench. That should be interesting to try at a > > variety of block sizes. You could also try runnin RADOS bench and > > smalliobench at a few different sizes. > > > I already did the local tests, as in w/o Ceph, see the original mail > below. > > And you might recall me doing rados benches as well in another thread 2 > weeks ago or so. > > In either case, osd benching gives me: > --- > # time ceph tell osd.0 bench > { "bytes_written": 1073741824, > "blocksize": 4194304, > "bytes_per_sec": "247102026.00"} > > > real 0m4.483s > --- > This is quite a bit slower than this particular SSD (200GB DC 3700) > should be able to write, but I will let that slide. > Note that it is the journal SSD that gets under pressure here (nearly > 900% util) while the OSD is bored at around 15%. Which is no surprise, > as it can write data at up to 1600MB/s. > > at 4k blocks we see: > --- > # time ceph tell osd.0 bench 1073741824 4096 > { "bytes_written": 1073741824, > "blocksize": 4096, > "bytes_per_sec": "9004316.00"} > > > real 1m59.368s > --- > Here we get a more balanced picture between journal and storage > utilization, hovering around 40-50%. > So clearly not overtaxing either component. > But yet, this looks like 2100 IOPS to me, if my math is half right. > > Rados at 4k gives us this: > --- > Total time run: 30.912786 > Total writes made: 44490 > Write size: 4096 > Bandwidth (MB/sec): 5.622 > > Stddev Bandwidth: 3.31452 > Max bandwidth (MB/sec): 9.92578 > Min bandwidth (MB/sec): 0 > Average Latency: 0.0444653 > Stddev Latency: 0.121887 > Max latency: 2.80917 > Min latency: 0.001958 > --- > So this is even worse, just about 1500 IOPS. > > Regards, > > Christian > > > -Greg > > > > On Wednesday, May 7, 2014, Alexandre DERUMIER > > wrote: > > > > > Hi Christian, > > > > > > Do you have tried without raid6, to have more osd ? > > > (how many disks do you have begin the raid6 ?) 
> > > > > > > > > Aslo, I known that direct ios can be quite slow with ceph, > > > > > > maybe can you try without --direct=1 > > > > > > and also enable rbd_cache > > > > > > ceph.conf > > > [client] > > > rbd cache = true > > > > > > > > > > > > > > > - Mail original - > > > > > > De: "Christian Balzer" > > > > À: "Gregory Farnum" >, > > > ceph-users@lists.ceph.com > > > Envoyé: Jeudi 8 Mai 2014 04:49:16 > > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > > backing devices > > > > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > > > > > > > wrote: &g
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Stupid question : Is your areca 4GB cache shared between ssd journal and osd ? or only use by osds ? - Mail original - De: "Christian Balzer" À: ceph-users@lists.ceph.com Envoyé: Jeudi 8 Mai 2014 08:26:33 Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices Hello, On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > Oh, I didn't notice that. I bet you aren't getting the expected > throughput on the RAID array with OSD access patterns, and that's > applying back pressure on the journal. > I doubt that based on what I see in terms of local performance and actual utilization figures according to iostat and atop during the tests. But if that were to be true, how would one see if that's the case, as in where in the plethora of data from: ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump is the data I'd be looking for? > When I suggested other tests, I meant with and without Ceph. One > particular one is OSD bench. That should be interesting to try at a > variety of block sizes. You could also try runnin RADOS bench and > smalliobench at a few different sizes. > I already did the local tests, as in w/o Ceph, see the original mail below. And you might recall me doing rados benches as well in another thread 2 weeks ago or so. In either case, osd benching gives me: --- # time ceph tell osd.0 bench { "bytes_written": 1073741824, "blocksize": 4194304, "bytes_per_sec": "247102026.00"} real 0m4.483s --- This is quite a bit slower than this particular SSD (200GB DC 3700) should be able to write, but I will let that slide. Note that it is the journal SSD that gets under pressure here (nearly 900% util) while the OSD is bored at around 15%. Which is no surprise, as it can write data at up to 1600MB/s. at 4k blocks we see: --- # time ceph tell osd.0 bench 1073741824 4096 { "bytes_written": 1073741824, "blocksize": 4096, "bytes_per_sec": "9004316.00"} real 1m59.368s --- Here we get a more balanced picture between journal and storage utilization, hovering around 40-50%. So clearly not overtaxing either component. But yet, this looks like 2100 IOPS to me, if my math is half right. Rados at 4k gives us this: --- Total time run: 30.912786 Total writes made: 44490 Write size: 4096 Bandwidth (MB/sec): 5.622 Stddev Bandwidth: 3.31452 Max bandwidth (MB/sec): 9.92578 Min bandwidth (MB/sec): 0 Average Latency: 0.0444653 Stddev Latency: 0.121887 Max latency: 2.80917 Min latency: 0.001958 --- So this is even worse, just about 1500 IOPS. Regards, Christian > -Greg > > On Wednesday, May 7, 2014, Alexandre DERUMIER > wrote: > > > Hi Christian, > > > > Do you have tried without raid6, to have more osd ? > > (how many disks do you have begin the raid6 ?) > > > > > > Aslo, I known that direct ios can be quite slow with ceph, > > > > maybe can you try without --direct=1 > > > > and also enable rbd_cache > > > > ceph.conf > > [client] > > rbd cache = true > > > > > > > > > > - Mail original - > > > > De: "Christian Balzer" > > > À: "Gregory Farnum" >, > > ceph-users@lists.ceph.com > > Envoyé: Jeudi 8 Mai 2014 04:49:16 > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > backing devices > > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > > > > > wrote: > > > > > > > > Hello, > > > > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 > > > > behind an Areca 1882 with 4GB of cache. 
> > > > > > > > Running this fio: > > > > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k > > > > --iodepth=128 > > > > > > > > results in: > > > > > > > > 30k IOPS on the journal SSD (as expected) > > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > > > there) 3200 IOPS from a VM using userspace RBD > > > > 2900 IOPS from a host kernelspace mounted RBD > > > > > > > > When running the fio from the VM RBD the utilization of the > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2% > > > > (1500 IOPS after some obvious merging).
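For the perf dump question above, a minimal way to pull just the write-path latency counters out of that wall of JSON would be something along these lines. The counter names (journal_latency, apply_latency, op_w_latency) are an assumption about what a filestore OSD of this vintage exposes rather than something confirmed in this thread, so check them against your own dump; for each counter, sum divided by avgcount gives the average latency in seconds.

---
# sketch: pretty-print the admin socket dump and grep the latency counters of interest
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
  | python -m json.tool \
  | grep -A 3 -E '"(journal_latency|apply_latency|op_w_latency)"'
---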
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,

On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: > Oh, I didn't notice that. I bet you aren't getting the expected > throughput on the RAID array with OSD access patterns, and that's > applying back pressure on the journal. >

I doubt that, based on what I see in terms of local performance and the actual utilization figures according to iostat and atop during the tests. But if that were true, how would one see it? As in, where in the plethora of data from:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?

> When I suggested other tests, I meant with and without Ceph. One > particular one is OSD bench. That should be interesting to try at a > variety of block sizes. You could also try running RADOS bench and > smalliobench at a few different sizes. >

I already did the local tests, as in w/o Ceph, see the original mail below. And you might recall me doing rados benches as well in another thread 2 weeks ago or so. In either case, OSD benching gives me:

---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824, "blocksize": 4194304, "bytes_per_sec": "247102026.00"}
real 0m4.483s
---

This is quite a bit slower than this particular SSD (200GB DC 3700) should be able to write, but I will let that slide. Note that it is the journal SSD that gets under pressure here (nearly 900% util) while the OSD is bored at around 15%. Which is no surprise, as it can write data at up to 1600MB/s.

At 4k blocks we see:

---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824, "blocksize": 4096, "bytes_per_sec": "9004316.00"}
real 1m59.368s
---

Here we get a more balanced picture between journal and storage utilization, hovering around 40-50%, so clearly not overtaxing either component. And yet this looks like about 2200 IOPS to me, if my math is half right.

Rados bench at 4k gives us this:

---
Total time run:         30.912786
Total writes made:      44490
Write size:             4096
Bandwidth (MB/sec):     5.622
Stddev Bandwidth:       3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency:        0.0444653
Stddev Latency:         0.121887
Max latency:            2.80917
Min latency:            0.001958
---

So this is even worse, just about 1500 IOPS.

Regards, Christian

> -Greg > > On Wednesday, May 7, 2014, Alexandre DERUMIER > wrote: > > > Hi Christian, > > > > Have you tried without raid6, to have more OSDs? > > (how many disks do you have behind the raid6?) > > > > > > Also, I know that direct I/Os can be quite slow with ceph, > > > > maybe you can try without --direct=1 > > > > and also enable rbd_cache > > > > ceph.conf > > [client] > > rbd cache = true > > > > > > > > > > - Mail original - > > > > De: "Christian Balzer" > > > À: "Gregory Farnum" >, > > ceph-users@lists.ceph.com > > Envoyé: Jeudi 8 Mai 2014 04:49:16 > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and > > backing devices > > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > > > > > wrote: > > > > > > > > Hello, > > > > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 > > > > behind an Areca 1882 with 4GB of cache.
> > > > > > > > Running this fio: > > > > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k > > > > --iodepth=128 > > > > > > > > results in: > > > > > > > > 30k IOPS on the journal SSD (as expected) > > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > > > there) 3200 IOPS from a VM using userspace RBD > > > > 2900 IOPS from a host kernelspace mounted RBD > > > > > > > > When running the fio from the VM RBD the utilization of the > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2% > > > > (1500 IOPS after some obvious merging). > > > > The OSD processes are quite busy, reading well over 200% on atop, > > > > but the system is not CPU or otherwise resource starved at that > > > > moment. > > > > > > > > Running multiple instances of this test from several VMs on > > > > different hosts changes nothing, as in the aggregated IOPS for the > > > > whole cluster will still be around 3200 IOPS.
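To put numbers on the "if my math is half right" hedge above, working purely from the figures quoted in that message:

---
osd bench at 4k:    9004316 B/s / 4096 B   = ~2200 IOPS
rados bench at 4k:  44490 writes / 30.9 s  = ~1440 IOPS
                    (equivalently: 5.622 MB/s * 1048576 / 4096 = ~1440)
---

The rados figures were presumably produced by something like "rados bench -p <pool> 30 write -b 4096" with the default of 16 concurrent ops; the pool name and concurrency are guesses, as neither is stated in the thread.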
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hello,

On Thu, 08 May 2014 06:33:51 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi Christian, > > Have you tried without raid6, to have more OSDs?

No, and that is neither an option nor the reason for any performance issues here. If you re-read my original mail it clearly states that the same fio can achieve 110k IOPS on that raid and that it is not busy at all during the test.

> (how many disks do you have behind the raid6?) >

11 per OSD. This will affect the amount of sustainable IOPS of course, but in this test case every last bit should (and does) fit into the caches. From the RBD client the transaction should be finished once the primary and secondary OSD for the PG in question have ACK'ed things.

> > Also, I know that direct I/Os can be quite slow with ceph, > > maybe you can try without --direct=1 >

I can, but that is not the test case here. For the record that pushes it to 12k IOPS, with the journal SSDs reaching about 30% utilization and the actual OSDs up to 5%. So much better, but still quite some capacity for improvement.

> and also enable rbd_cache > > ceph.conf > [client] > rbd cache = true >

I have that set of course, as well as specifically "writeback" for the KVM instance in question. Interestingly I see no difference at all with a KVM instance that is set explicitly to "none", but that's not part of this particular inquiry either.

Christian

> > > > - Mail original - > > De: "Christian Balzer" > À: "Gregory Farnum" , ceph-users@lists.ceph.com > Envoyé: Jeudi 8 Mai 2014 04:49:16 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > wrote: > > > > > > Hello, > > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 > > > behind an Areca 1882 with 4GB of cache. > > > > > > Running this fio: > > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k > > > --iodepth=128 > > > > > > results in: > > > > > > 30k IOPS on the journal SSD (as expected) > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > > there) 3200 IOPS from a VM using userspace RBD > > > 2900 IOPS from a host kernelspace mounted RBD > > > > > > When running the fio from the VM RBD the utilization of the journals > > > is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS > > > after some obvious merging). > > > The OSD processes are quite busy, reading well over 200% on atop, > > > but the system is not CPU or otherwise resource starved at that > > > moment. > > > > > > Running multiple instances of this test from several VMs on > > > different hosts changes nothing, as in the aggregated IOPS for the > > > whole cluster will still be around 3200 IOPS. > > > > > > Now clearly RBD has to deal with latency here, but the network is > > > IPoIB with the associated low latency and the journal SSDs are the > > > (consistently) fastest ones around. > > > > > > I guess what I am wondering about is if this is normal and to be > > > expected or if not where all that potential performance got lost. > > > > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?) > Yes, but going down to 32 doesn't change things one iota.
> Also note the multiple instances I mention up there, so that would be > 256 IOs at a time, coming from different hosts over different links and > nothing changes. > > > that's about 40ms of latency per op (for userspace RBD), which seems > > awfully long. You should check what your client-side objecter settings > > are; it might be limiting you to fewer outstanding ops than that. > > Googling for client-side objecter gives a few hits on ceph devel and > bugs and nothing at all as far as configuration options are concerned. > Care to enlighten me where one can find those? > > Also note the kernelspace (3.13 if it matters) speed, which is very much > in the same (junior league) ballpark. > > > If > > it's available to you, testing with Firefly or even master would be > > interesting — there's some performance work that should reduce > > latencies. > > > Not an option, this is going into production next week.
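For reference on the "writeback" vs "none" comparison above: that is the cache mode set on the qemu drive itself, separate from the rbd cache switch in ceph.conf, and with RBD it is usually the drive-level setting that decides whether the librbd cache is actually used. A minimal sketch of the relevant drive definition, with the pool and image names as placeholders rather than values taken from this cluster:

---
# qemu drive definition for an RBD-backed disk (rbd/vm-disk-1 is a placeholder);
# cache=writeback enables the librbd write cache, cache=none is meant to bypass it
-drive format=raw,file=rbd:rbd/vm-disk-1,cache=writeback,if=virtio
---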
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Oh, I didn't notice that. I bet you aren't getting the expected throughput on the RAID array with OSD access patterns, and that's applying back pressure on the journal.

When I suggested other tests, I meant with and without Ceph. One particular one is OSD bench. That should be interesting to try at a variety of block sizes. You could also try running RADOS bench and smalliobench at a few different sizes. -Greg

On Wednesday, May 7, 2014, Alexandre DERUMIER wrote: > Hi Christian, > > Have you tried without raid6, to have more OSDs? > (how many disks do you have behind the raid6?) > > > Also, I know that direct I/Os can be quite slow with ceph, > > maybe you can try without --direct=1 > > and also enable rbd_cache > > ceph.conf > [client] > rbd cache = true > > > > > - Mail original - > > De: "Christian Balzer" > > À: "Gregory Farnum" >, > ceph-users@lists.ceph.com > Envoyé: Jeudi 8 Mai 2014 04:49:16 > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing > devices > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer > > > > wrote: > > > > > > Hello, > > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind > > > an Areca 1882 with 4GB of cache. > > > > > > Running this fio: > > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 > > > > > > results in: > > > > > > 30k IOPS on the journal SSD (as expected) > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > > there) 3200 IOPS from a VM using userspace RBD > > > 2900 IOPS from a host kernelspace mounted RBD > > > > > > When running the fio from the VM RBD the utilization of the journals is > > > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after > > > some obvious merging). > > > The OSD processes are quite busy, reading well over 200% on atop, but > > > the system is not CPU or otherwise resource starved at that moment. > > > > > > Running multiple instances of this test from several VMs on different > > > hosts changes nothing, as in the aggregated IOPS for the whole cluster > > > will still be around 3200 IOPS. > > > > > > Now clearly RBD has to deal with latency here, but the network is IPoIB > > > with the associated low latency and the journal SSDs are the > > > (consistently) fastest ones around. > > > > > > I guess what I am wondering about is if this is normal and to be > > > expected or if not where all that potential performance got lost. > > > > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?) > Yes, but going down to 32 doesn't change things one iota. > Also note the multiple instances I mention up there, so that would be 256 > IOs at a time, coming from different hosts over different links and > nothing changes. > > > that's about 40ms of latency per op (for userspace RBD), which seems > > awfully long. You should check what your client-side objecter settings > > are; it might be limiting you to fewer outstanding ops than that. > > Googling for client-side objecter gives a few hits on ceph devel and bugs > and nothing at all as far as configuration options are concerned. > Care to enlighten me where one can find those? > > Also note the kernelspace (3.13 if it matters) speed, which is very much > in the same (junior league) ballpark.
> > > If > > it's available to you, testing with Firefly or even master would be > > interesting — there's some performance work that should reduce > > latencies. > > > Not an option, this is going into production next week. > > > But a well-tuned (or even default-tuned, I thought) Ceph cluster > > certainly doesn't require 40ms/op, so you should probably run a wider > > array of experiments to try and figure out where it's coming from. > > I think we can rule out the network, NPtcp gives me: > --- > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec > --- > > For comparison at about 512KB it reaches maximum throughput and still > isn't that laggy: > --- > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec > --- > > So with the network performing as well as my lengthy experience with IPoIB > led me to believe, what else is there to look at?
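On the "client-side objecter settings" question raised above: the knobs Greg is most likely referring to are the objecter throttles, which cap how many operations and bytes a single client keeps in flight. To the best of my recollection the option names and defaults in this era are objecter_inflight_ops = 1024 and objecter_inflight_op_bytes = 104857600 (100 MB); treat both the names and the defaults as assumptions to verify against your release. Raising them would look something like:

---
[client]
    rbd cache = true
    # defaults believed to be 1024 ops / 104857600 bytes in flight
    objecter inflight ops = 2048
    objecter inflight op bytes = 209715200
---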
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
Hi Christian, Have you tried without raid6, to have more OSDs? (how many disks do you have behind the raid6?) Also, I know that direct I/Os can be quite slow with ceph, maybe you can try without --direct=1 and also enable rbd_cache ceph.conf [client] rbd cache = true

- Mail original - De: "Christian Balzer" À: "Gregory Farnum" , ceph-users@lists.ceph.com Envoyé: Jeudi 8 Mai 2014 04:49:16 Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer wrote: > > > > Hello, > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind > > an Areca 1882 with 4GB of cache. > > > > Running this fio: > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 > > > > results in: > > > > 30k IOPS on the journal SSD (as expected) > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > there) 3200 IOPS from a VM using userspace RBD > > 2900 IOPS from a host kernelspace mounted RBD > > > > When running the fio from the VM RBD the utilization of the journals is > > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after > > some obvious merging). > > The OSD processes are quite busy, reading well over 200% on atop, but > > the system is not CPU or otherwise resource starved at that moment. > > > > Running multiple instances of this test from several VMs on different > > hosts changes nothing, as in the aggregated IOPS for the whole cluster > > will still be around 3200 IOPS. > > > > Now clearly RBD has to deal with latency here, but the network is IPoIB > > with the associated low latency and the journal SSDs are the > > (consistently) fastest ones around. > > > > I guess what I am wondering about is if this is normal and to be > > expected or if not where all that potential performance got lost. > > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?) Yes, but going down to 32 doesn't change things one iota. Also note the multiple instances I mention up there, so that would be 256 IOs at a time, coming from different hosts over different links and nothing changes. > that's about 40ms of latency per op (for userspace RBD), which seems > awfully long. You should check what your client-side objecter settings > are; it might be limiting you to fewer outstanding ops than that. Googling for client-side objecter gives a few hits on ceph devel and bugs and nothing at all as far as configuration options are concerned. Care to enlighten me where one can find those? Also note the kernelspace (3.13 if it matters) speed, which is very much in the same (junior league) ballpark. > If > it's available to you, testing with Firefly or even master would be > interesting — there's some performance work that should reduce > latencies. > Not an option, this is going into production next week. > But a well-tuned (or even default-tuned, I thought) Ceph cluster > certainly doesn't require 40ms/op, so you should probably run a wider > array of experiments to try and figure out where it's coming from.
I think we can rule out the network, NPtcp gives me: --- 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec --- For comparison at about 512KB it reaches maximum throughput and still isn't that laggy: --- 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec --- So with the network performing as well as my lengthy experience with IPoIB led me to believe, what else is there to look at? The storage nodes perform just as expected, indicated by the local fio tests. That pretty much leaves only Ceph/RBD to look at and I'm not really sure what experiments I should run on that. ^o^ Regards, Christian > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
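For anyone wanting to reproduce the network sweep quoted above: those lines are NetPIPE output. A typical invocation is sketched below; the host name is a placeholder and the flags are from memory, so double-check NPtcp's own usage output.

---
# on the receiving node:
NPtcp
# on the transmitting node (host name is a placeholder):
NPtcp -h osd-node-1
---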
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer wrote: > > > > Hello, > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind > > an Areca 1882 with 4GB of cache. > > > > Running this fio: > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 > > > > results in: > > > > 30k IOPS on the journal SSD (as expected) > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise > > there) 3200 IOPS from a VM using userspace RBD > > 2900 IOPS from a host kernelspace mounted RBD > > > > When running the fio from the VM RBD the utilization of the journals is > > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after > > some obvious merging). > > The OSD processes are quite busy, reading well over 200% on atop, but > > the system is not CPU or otherwise resource starved at that moment. > > > > Running multiple instances of this test from several VMs on different > > hosts changes nothing, as in the aggregated IOPS for the whole cluster > > will still be around 3200 IOPS. > > > > Now clearly RBD has to deal with latency here, but the network is IPoIB > > with the associated low latency and the journal SSDs are the > > (consistently) fastest ones around. > > > > I guess what I am wondering about is if this is normal and to be > > expected or if not where all that potential performance got lost. > > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?) Yes, but going down to 32 doesn't change things one iota. Also note the multiple instances I mention up there, so that would be 256 IOs at a time, coming from different hosts over different links and nothing changes. > that's about 40ms of latency per op (for userspace RBD), which seems > awfully long. You should check what your client-side objecter settings > are; it might be limiting you to fewer outstanding ops than that. Googling for client-side objecter gives a few hits on ceph devel and bugs and nothing at all as far as configuration options are concerned. Care to enlighten me where one can find those? Also note the kernelspace (3.13 if it matters) speed, which is very much in the same (junior league) ballpark. > If > it's available to you, testing with Firefly or even master would be > interesting — there's some performance work that should reduce > latencies. > Not an option, this is going into production next week. > But a well-tuned (or even default-tuned, I thought) Ceph cluster > certainly doesn't require 40ms/op, so you should probably run a wider > array of experiments to try and figure out where it's coming from. I think we can rule out the network, NPtcp gives me: --- 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec --- For comparison at about 512KB it reaches maximum throughput and still isn't that laggy: --- 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec --- So with the network performing as well as my lengthy experience with IPoIB led me to believe, what else is there to look at? The storage nodes perform just as expected, indicated by the local fio tests. That pretty much leaves only Ceph/RBD to look at and I'm not really sure what experiments I should run on that.
^o^ Regards, Christian > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
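As an aside, the 40 ms figure Greg quotes falls straight out of the fio numbers in this mail via Little's law (IOs in flight = IOPS x latency):

---
128 in flight / 3200 IOPS  = 0.040 s = ~40 ms average per op through RBD
128 in flight / 30000 IOPS = ~4.3 ms per op against the raw journal SSD
---

which is why the discussion keeps circling back to where the extra per-op latency is spent rather than to raw device throughput.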
Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer wrote: > > Hello, > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals > are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882 > with 4GB of cache. > > Running this fio: > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 > --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 > > results in: > > 30k IOPS on the journal SSD (as expected) > 110k IOPS on the OSD (it fits neatly into the cache, no surprise there) > 3200 IOPS from a VM using userspace RBD > 2900 IOPS from a host kernelspace mounted RBD > > When running the fio from the VM RBD the utilization of the journals is > about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some > obvious merging). > The OSD processes are quite busy, reading well over 200% on atop, but > the system is not CPU or otherwise resource starved at that moment. > > Running multiple instances of this test from several VMs on different hosts > changes nothing, as in the aggregated IOPS for the whole cluster will > still be around 3200 IOPS. > > Now clearly RBD has to deal with latency here, but the network is IPoIB > with the associated low latency and the journal SSDs are the > (consistently) fastest ones around. > > I guess what I am wondering about is if this is normal and to be expected > or if not where all that potential performance got lost. Hmm, with 128 IOs at a time (I believe I'm reading that correctly?) that's about 40ms of latency per op (for userspace RBD), which seems awfully long. You should check what your client-side objecter settings are; it might be limiting you to fewer outstanding ops than that. If it's available to you, testing with Firefly or even master would be interesting — there's some performance work that should reduce latencies. But a well-tuned (or even default-tuned, I thought) Ceph cluster certainly doesn't require 40ms/op, so you should probably run a wider array of experiments to try and figure out where it's coming from. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
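The "OSD bench" suggested here is the admin command exercised earlier in this archive; the general form, with the defaults evident from that output (1 GB total in 4 MB blocks), is:

---
# write total_bytes to this OSD in block_size chunks (a local OSD benchmark, not an RBD client test)
ceph tell osd.<id> bench [total_bytes] [block_size_bytes]
# e.g. the 4k run quoted elsewhere in the thread:
ceph tell osd.0 bench 1073741824 4096
---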