Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

German Anders Wed, 14 May 2014 06:24:14 -0700

Hi Josef,
Thanks a lot for the quick answer.

Yes, 32M and random writes.

Also, do you get those values with an MTU of 9000, I guess, or with the traditional and beloved MTU 1500?




German Anders
Field Storage Support Engineer
Despegar.com - IT Team

--- Original message ---
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
From: Josef Johansson <jo...@oderland.se>
To: <ceph-users@lists.ceph.com>
Date: Wednesday, 14/05/2014 10:10


Hi,

On 14/05/14 14:45, German Anders wrote:
I forgot to mention, of course on a 10GbE network



German Anders
Field Storage Support Engineer
Despegar.com - IT Team
--- Original message ---
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
From: German Anders <gand...@despegar.com>
To: Christian Balzer <ch...@gol.com>
Cc: <ceph-users@lists.ceph.com>
Date: Wednesday, 14/05/2014 09:41
Has anyone been able to get RBD throughput of 600MB/s or more on (rw) with a block size of 32768k?
    Is that 32M then?
Sequential or randwrite?
I get about those speeds when doing (1M block size) buffered writes from within a VM on 20GbE. The cluster maxes out at about 900MB/s.
Cheers,
Josef
German Anders
Field Storage Support Engineer
Despegar.com - IT Team
--- Original message ---
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
From: Christian Balzer <ch...@gol.com>
To: Josef Johansson <jo...@oderland.se>
Cc: <ceph-users@lists.ceph.com>
Date: Wednesday, 14/05/2014 09:33


Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
Hi Christian,
I missed this thread, haven't been reading the list that well the last
weeks.
You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here
though.

 Nods, I do recall that thread.
A thought struck me that I could test with a pool that consists of OSDs
that have tmpfs-based disks. I think I have a bit more latency than your
IPoIB, but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as it would test
purely Ceph itself.

 That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%) I'd expect Ceph to be the culprit.
I'll get back to you with the results, hopefully I'll manage to get them
done during this night.

 Looking forward to that. ^^
Christian
Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?
If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal individual HDDs could
do.
So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.
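
The per-OSD figure above is a plain division of the aggregate client IOPS by the OSD count. A minimal sketch of that arithmetic, assuming writes are spread evenly and ignoring replication (the thread does not state the pool's replica count); the function name is made up for illustration:

```python
def iops_per_osd(cluster_iops: float, num_osds: int) -> float:
    """Average client IOPS per OSD, assuming an even spread.

    Replication is deliberately ignored here; with N replicas each
    client write would land on N OSDs, so real per-OSD load is higher.
    """
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    return cluster_iops / num_osds


# Numbers from the thread: 3200 IOPS from the VM, 4 OSDs in total.
print(iops_per_osd(3200, 4))  # 800.0
```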
This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^
On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.
Christian
On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.
In the "a picture is worth a thousand words" tradition, I give you
this iostat -x output taken during a fio run:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          50.82    0.00   19.43    0.17    0.00   29.58

Device:  rrqm/s  wrqm/s   r/s      w/s  rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00   51.50  0.00  1633.50   0.00   7460.00      9.13      0.18   0.11     0.00     0.11   0.01   1.40
sdb        0.00    0.00  0.00  1240.50   0.00   5244.00      8.45      0.30   0.25     0.00     0.25   0.02   2.00
sdc        0.00    5.00  0.00  2468.50   0.00  13419.00     10.87      0.24   0.10     0.00     0.10   0.09  22.00
sdd        0.00    6.50  0.00  1913.00   0.00  10313.00     10.78      0.20   0.10     0.00     0.10   0.09  16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.
sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.
The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.
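
As a sanity check on the iostat figures above, the average request size can be recomputed from the write rate and write throughput. This is a sketch, assuming avgrq-sz is reported in 512-byte sectors and rkB/s, wkB/s in units of 1024 bytes (the usual sysstat convention); since r/s is zero on all four devices, only writes matter here. The helper name is hypothetical.

```python
SECTOR_BYTES = 512  # assumption: avgrq-sz is counted in 512-byte sectors

# device: (w/s, wkB/s, reported avgrq-sz), copied from the iostat output
rows = {
    "sda": (1633.50, 7460.00, 9.13),
    "sdb": (1240.50, 5244.00, 8.45),
    "sdc": (2468.50, 13419.00, 10.87),
    "sdd": (1913.00, 10313.00, 10.78),
}


def avg_request_sectors(writes_per_s: float, wkb_per_s: float) -> float:
    """Average write request size in sectors (valid when r/s is 0)."""
    return wkb_per_s * 1024 / SECTOR_BYTES / writes_per_s


for dev, (w_s, wkb_s, reported) in rows.items():
    sectors = avg_request_sectors(w_s, wkb_s)
    kib = sectors * SECTOR_BYTES / 1024
    print(f"{dev}: {sectors:.2f} sectors (reported {reported}), {kib:.2f} KiB/request")
```

The recomputed values agree with the reported column to two decimals, and every device sees requests only slightly larger than the 4k fio block size, which fits the "some obvious merging" remark elsewhere in the thread.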
Regards,

Christian
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg

On Wednesday, May 7, 2014, Alexandre DERUMIER <aderum...@odiso.com>
wrote:
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)
Also, I know that direct IOs can be quite slow with Ceph,
maybe you can try without --direct=1

and also enable rbd_cache

ceph.conf
[client]
rbd cache = true




----- Original message -----

From: "Christian Balzer" <ch...@gol.com>
To: "Gregory Farnum" <g...@inktank.com>, ceph-users@lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices
On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <ch...@gol.com> wrote:
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.

Running this fio:
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 \
    --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k \
    --iodepth=128

results in:

30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD

When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).

The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.

Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.

Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency and the journal SSDs are
the (consistently) fastest ones around.

I guess what I am wondering about is if this is normal and to be
expected or if not where all that potential performance got lost.

Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)

Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.

that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
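
The 40ms figure follows from Little's law: with a fixed number of IOs kept in flight, mean latency per op is the queue depth divided by the completion rate. A small sketch of that calculation, assuming fio really does keep all 128 IOs outstanding (which is exactly what the objecter question above is probing); the function name is illustrative only:

```python
def mean_latency_ms(outstanding_ios: int, iops: float) -> float:
    """Little's law: latency = concurrency / throughput, in milliseconds."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_ios / iops * 1000.0


# --iodepth=128 at 3200 IOPS from the userspace RBD test
print(f"{mean_latency_ms(128, 3200):.1f} ms per op")

# If the IOPS stay at 3200 with only 32 IOs in flight, as reported,
# the implied per-op latency drops to a quarter of that.
print(f"{mean_latency_ms(32, 3200):.1f} ms per op")
```

Read this way, unchanged IOPS at iodepth 32 suggests the extra queue depth was only adding waiting time rather than throughput.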
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.

If it's available to you, testing with Firefly or even master would be
interesting; there's some performance work that should reduce
latencies.

Not an option, this is going into production next week.

But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.

I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---
For comparison at about 512KB it reaches maximum throughput and
still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---
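
The two NPtcp lines above can be cross-checked by recomputing throughput from message size and time. This is a sketch under one assumption inferred from the numbers themselves, not taken from NetPIPE documentation: "Mbps" here appears to mean 2^20 bits per second. The function name is made up for illustration.

```python
def netpipe_mbps(message_bytes: int, usec: float) -> float:
    """Throughput for one message, in units of 2**20 bits per second.

    The 2**20 divisor is an assumption that happens to reproduce the
    figures quoted in the thread; it is not taken from NetPIPE docs.
    """
    bits = message_bytes * 8
    seconds = usec * 1e-6
    return bits / seconds / 2**20


for size, usec, reported in [(4096, 31.91, 979.22), (524288, 412.35, 9700.57)]:
    print(f"{size} bytes: {netpipe_mbps(size, usec):.2f} Mbps (reported {reported})")

# Network time for a 4k message versus the ~40 ms per op seen through RBD
print(f"ratio: {40_000 / 31.91:.0f}x")
```

Both results land within a fraction of a Mbps of the reported values, and the 4k transfer time is more than a thousand times smaller than the per-op latency seen through RBD, which supports ruling out the network.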
So with the network performing as well as my lengthy experience with
IPoIB led me to believe, what else is there to look at?
The storage nodes perform just as expected, indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^

Regards,

Christian
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
Christian Balzer Network/Systems Engineer
ch...@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
