Hi Josef,
Thanks a lot for the quick answer.
Yes, 32M and random writes.
Also, do you get those values with an MTU of 9000, or with the
traditional and beloved MTU of 1500?
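(For reference, I check ours with something along these lines; the
interface name is just an example:)

ip link show dev eth0 | grep -o 'mtu [0-9]*'   # show the current MTU
ip link set dev eth0 mtu 9000                  # temporarily switch to jumbo frames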
German Anders
Field Storage Support Engineer
Despegar.com - IT Team
--- Original message ---
Subject: Re: [ceph-users] Slow IOPS on RBD compared to
journal and backing devices
From: Josef Johansson <jo...@oderland.se>
To: <ceph-users@lists.ceph.com>
Date: Wednesday, 14/05/2014 10:10
Hi,
On 14/05/14 14:45, German Anders wrote:
I forgot to mention, of course on a 10GbE network
German Anders
Field Storage Support Engineer
Despegar.com - IT Team
--- Original message ---
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal
and backing devices
From: German Anders <gand...@despegar.com>
To: Christian Balzer <ch...@gol.com>
Cc: <ceph-users@lists.ceph.com>
Date: Wednesday, 14/05/2014 09:41
Has anyone been able to get a throughput on RBD of
600MB/s or more on (rw) with a block size of 32768k?
Is that 32M then?
Sequential or randwrite?
I get about those speeds when doing (1M block size) buffered writes
from within a VM on 20GbE. The cluster maxes out at about 900MB/s.
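(Roughly like this from inside the VM; the target file and size here
are only placeholders for what I actually use:)

fio --name=bufwrite --filename=/data/fiotest --size=8g \
    --rw=write --bs=1M --numjobs=1 --direct=0   # buffered, so the page cache is in play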
Cheers,
Josef
German Anders
Field Storage Support Engineer
Despegar.com - IT Team
--- Original message ---
Subject: Re: [ceph-users] Slow IOPS on RBD compared to
journal and backing devices
From: Christian Balzer <ch...@gol.com>
To: Josef Johansson <jo...@oderland.se>
Cc: <ceph-users@lists.ceph.com>
Date: Wednesday, 14/05/2014 09:33
Hello!
On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
Hi Christian,
I missed this thread; I haven't been reading the list that closely
the last few weeks.
You already know my setup, since we discussed it in an
earlier thread. I
don't have a fast backing store, but I see the slow IOPS
when doing
randwrite inside the VM, with rbd cache. Still running
dumpling here
though.
Nods, I do recall that thread.
A thought struck me that I could test with a pool that consists of
OSDs that have tmpfs-based disks. I think I have a bit more latency
than your IPoIB, but I've pushed 100k IOPS with the same network
devices before. This would verify whether the problem is with the
journal disks. I'll also try to run the journal devices in tmpfs as
well, as that would test purely Ceph itself.
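Roughly what I have in mind for the journal side (paths and sizes are
placeholders, and the OSD gets stopped first):

ceph-osd -i 0 --flush-journal                         # drain the current journal
mkdir -p /srv/tmpfs-journal
mount -t tmpfs -o size=8G tmpfs /srv/tmpfs-journal    # RAM-backed journal location
# point "osd journal" for osd.0 at a file on that tmpfs in ceph.conf, then:
ceph-osd -i 0 --mkjournal
service ceph start osd.0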
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the
actual filestore at around 5%) I'd expect Ceph to be the culprit.
I'll get back to you with the results; hopefully I'll manage to get
them done tonight.
Looking forward to that. ^^
Christian
Cheers,
Josef
On 13/05/14 11:03, Christian Balzer wrote:
I'm clearly talking to myself, but whatever.
For Greg, I've played with all the pertinent journal and
filestore
options and TCP nodelay, no changes at all.
Is there anybody on this ML who's running a Ceph cluster
with a fast
network and a FAST filestore, i.e. like me with a big HW cache in
front of RAIDs/JBODs, or using SSDs for final storage?
If so, what results do you get out of the fio statement
below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal individual HDDs
could do.
So I'm wondering if I'm hitting some inherent limitation
of how fast a
single OSD (as in the software) can handle IOPS, given
that everything
else has been ruled out from where I stand.
This would also explain why none of the option changes
or the use of
RBD caching has any measurable effect in the test case
below.
As in, a slow OSD aka single HDD with journal on the
same disk would
clearly benefit from even the small 32MB standard RBD
cache, while in
my test case the only time the caching becomes
noticeable is if I
increase the cache size to something larger than the
test data size.
^o^
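(Concretely, caching only becomes noticeable once the client gets
something along these lines; the numbers are just an example sized
above the 400MB fio test file:)

[client]
rbd cache = true
rbd cache size = 536870912        # 512MB, larger than the 400m test file
rbd cache max dirty = 402653184   # let most of it stay dirty/unwritten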
On the other hand, if people here regularly get thousands or tens of
thousands of IOPS per OSD with the appropriate HW, I'm stumped.
Christian
On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer
wrote:
On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum
wrote:
Oh, I didn't notice that. I bet you aren't getting
the expected
throughput on the RAID array with OSD access
patterns, and that's
applying back pressure on the journal.
In the a "picture" being worth a thousand words
tradition, I give you
this iostat -x output taken during a fio run:
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          50.82   0.00   19.43    0.17    0.00   29.58

Device: rrqm/s wrqm/s   r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda       0.00  51.50  0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
sdb       0.00   0.00  0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
sdc       0.00   5.00  0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
sdd       0.00   6.50  0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
The %user CPU utilization is pretty much entirely the 2 OSD processes;
note the nearly complete absence of iowait.
sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers: the lack of queues, the low wait and service
times (these are in ms), plus the overall utilization.
The only conclusion I can draw from these numbers and
the network
results below is that the latency happens within the
OSD processes.
Regards,
Christian
When I suggested other tests, I meant with and
without Ceph. One
particular one is OSD bench. That should be
interesting to try at a
variety of block sizes. You could also try running
RADOS bench and
smalliobench at a few different sizes.
-Greg
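(For reference, those would look roughly like this; pool name, sizes
and run times are just placeholders:)

ceph tell osd.0 bench 1073741824 4096           # 1GB of 4KB writes against a single OSD
rados bench -p rbd 60 write -b 4096 -t 32       # 60s of 4KB writes via librados
rados bench -p rbd 60 write -b 4194304 -t 32    # the same with 4MB writes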
On Wednesday, May 7, 2014, Alexandre DERUMIER <aderum...@odiso.com>
wrote:
Hi Christian,
Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)
Also, I know that direct I/O can be quite slow with Ceph;
maybe you can try without --direct=1
and also enable rbd_cache in ceph.conf:

[client]
rbd cache = true
----- Original message -----
From: "Christian Balzer" <ch...@gol.com>
To: "Gregory Farnum" <g...@inktank.com>, ceph-users@lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices
On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum
wrote:
On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <ch...@gol.com> wrote:
Hello,
ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
The journals are on (separate) DC 3700s, the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.
Running this fio:

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 \
    --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
results in:
30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace mounted RBD
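(The kernelspace case means mapping the image on the host and pointing
the same fio at the resulting block device; pool and image names here
are placeholders:)

rbd map rbd/fiotest      # the image shows up as e.g. /dev/rbd0
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --invalidate=1 \
    --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128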
When running the fio from the VM RBD the utilization of the journals
is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop, but
the system is not CPU or otherwise resource starved at that moment.
Running multiple instances of this test from several VMs on different
hosts changes nothing, as in the aggregated IOPS for the whole cluster
will still be around 3200 IOPS.
Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency, and the journal SSDs are the
(consistently) fastest ones around.
I guess what I am wondering about is if this is normal and to be
expected, or if not, where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things
one iota.
Also note the multiple instances I mention up
there, so that would
be 256 IOs at a time, coming from different hosts
over different
links and nothing changes.
that's about 40ms of latency per op (for
userspace RBD), which
seems awfully long. You should check what your
client-side objecter
settings are; it might be limiting you to fewer
outstanding ops
than that.
Googling for client-side objecter gives a few hits
on ceph devel and
bugs and nothing at all as far as configuration
options are
concerned. Care to enlighten me where one can find
those?
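(Presumably something like the objecter throttles, i.e. roughly the
following on the client side, with what I believe are the defaults;
corrections welcome:)

[client]
objecter inflight ops = 1024             # max outstanding ops before throttling
objecter inflight op bytes = 104857600   # ~100MB of in-flight data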
Also note the kernelspace (3.13 if it matters)
speed, which is very
much in the same (junior league) ballpark.
If
it's available to you, testing with Firefly or
even master would be
interesting — there's some performance work that
should reduce
latencies.
Not an option, this is going into production next
week.
But a well-tuned (or even default-tuned, I
thought) Ceph cluster
certainly doesn't require 40ms/op, so you should
probably run a
wider array of experiments to try and figure out
where it's coming
from.
I think we can rule out the network, NPtcp gives
me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in
31.91 usec
---
For comparison at about 512KB it reaches maximum
throughput and
still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in
412.35 usec
---
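(NPtcp is NetPIPE's TCP test, run as a pair roughly like this; the
hostname is a placeholder:)

NPtcp                      # on the receiving node
NPtcp -h storage-node-1    # on the sending node; prints the table above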
So with the network performing as well as my
lengthy experience with
IPoIB led me to believe, what else is there to
look at?
The storage nodes perform just as expected,
indicated by the local
fio tests.
That pretty much leaves only Ceph/RBD to look at
and I'm not really
sure what experiments I should run on that. ^o^
Regards,
Christian
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
--
Christian Balzer Network/Systems Engineer
ch...@gol.com Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com