Hey,

I did try this, but it didn't work, so I think I still have to patch the kernel, as user_xattr is not allowed on tmpfs.
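(For reference, a quick way to check this on any tmpfs mount, assuming the setfattr/getfattr tools from the attr package are installed; the path and size are just placeholders:

mount -t tmpfs -o size=64M tmpfs /mnt/xattrtest
touch /mnt/xattrtest/probe
setfattr -n user.test -v 1 /mnt/xattrtest/probe   # "Operation not supported" here means user xattrs are unavailable
getfattr -d -m - /mnt/xattrtest/probe)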

Thanks for the description though.

I think the next step is to do it all virtual, maybe on the same hardware to avoid the network. Any problems with doing it all virtual? If it's just memory and the same machine, we should see the pure Ceph performance, right?

Anyone done this?

Cheers,
Josef

Stefan Priebe - Profihost AG wrote on 2014-05-15 09:58:
On 15.05.2014 09:56, Josef Johansson wrote:
On 15/05/14 09:11, Stefan Priebe - Profihost AG wrote:
On 15.05.2014 00:26, Josef Johansson wrote:
Hi,

So, apparently tmpfs does not support non-root xattrs due to a possible
DoS vector. This is the configuration set for enabling it, as far as I can see:

CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y

Anyone know a way around it? I saw that there's a patch for enabling it,
but recompiling my kernel is out of reach right now ;)
I would create an empty file in tmpfs and then format that file as a
block device.
How do you mean exactly? Creating with dd and mounting with losetup?
mount -t tmpfs -o size=4G tmpfs /mnt
dd if=/dev/zero of=/mnt/blockdev_a bs=1M count=4000
mkfs.xfs -f /mnt/blockdev_a
mount -o loop /mnt/blockdev_a /ceph/osd.X

Then use /mnt/blockdev_a as the OSD device.

Cheers,
Josef
Created the OSD with the following:

root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1
filestore(/var/lib/ceph/osd/ceph-50) could not find
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid
c51a2683-55dc-4634-9d9d-f0fec9a6f389
2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file:
/var/lib/ceph/osd/ceph-50/keyring: can't open
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring
/var/lib/ceph/osd/ceph-50/keyring
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open:
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 !=
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1  ** ERROR: error creating
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
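(The "invalid (someone else's?) journal" message in the first run and the second run failing against an already-created object store both look like leftovers from earlier attempts. If a clean retry is needed, something along these lines should give a blank slate; treat it as a sketch and double-check the device names, /dev/sdc7 in particular, before wiping anything:

umount /var/lib/ceph/osd/ceph-50
mkfs.xfs -f /dev/loop0
dd if=/dev/zero of=/dev/sdc7 bs=1M count=10    # wipe the old journal header
mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
ceph-osd -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal)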

Cheers,
Josef

Christian Balzer wrote on 2014-05-14 14:33:
Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:

Hi Christian,

I missed this thread; I haven't been following the list that closely the
last few weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the same slow IOPS when doing
randwrite inside the VM, with the RBD cache enabled. Still running
Dumpling here, though.

Nods, I do recall that thread.

A thought struck me that I could test with a pool consisting of OSDs
that have tmpfs-based disks; I think I have a bit more latency than your
IPoIB, but I've pushed 100k IOPS with the same network devices before.
This would verify whether the problem is with the journal disks. I'll also
try to run the journal devices in tmpfs as well, as that would test
Ceph itself in isolation.
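(For the journal-in-tmpfs part, the rough plan, assuming a file-backed journal on tmpfs needs directio turned off, would be something like this per test OSD in ceph.conf; the OSD id, path and size are placeholders:

[osd.50]
osd journal = /dev/shm/test-osd/journal
osd journal size = 1024
journal dio = false)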

That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore at around 5%), I'd expect Ceph to be the culprit.
I'll get back to you with the results; hopefully I'll manage to get them
done overnight.

Looking forward to that. ^^


Christian
Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:
I'm clearly talking to myself, but whatever.

For Greg: I've played with all the pertinent journal and filestore
options and TCP nodelay; no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and a FAST filestore, i.e. like me with a big HW cache in front of
RAIDs/JBODs, or using SSDs for final storage?

If so, what results do you get out of the fio statement below, per OSD?
In my case, with 4 OSDs and 3200 IOPS, that's about 800 IOPS per OSD,
which is of course vastly faster than the individual HDDs could normally
do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
A slow OSD, i.e. a single HDD with the journal on the same disk, would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^

On the other hand, if people here regularly get thousands or tens of
thousands of IOPS per OSD with the appropriate HW, I'm stumped.

Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:

On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:

Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.

In the "a picture is worth a thousand words" tradition, I give you
this iostat -x output taken during a fio run:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            50.82    0.00   19.43    0.17    0.00   29.58

Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    51.50    0.00  1633.50     0.00   7460.00     9.13     0.18    0.11    0.00    0.11   0.01   1.40
sdb               0.00     0.00    0.00  1240.50     0.00   5244.00     8.45     0.30    0.25    0.00    0.25   0.02   2.00
sdc               0.00     5.00    0.00  2468.50     0.00  13419.00    10.87     0.24    0.10    0.00    0.10   0.09  22.00
sdd               0.00     6.50    0.00  1913.00     0.00  10313.00    10.78     0.20    0.10    0.00    0.10   0.09  16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes;
note the nearly complete absence of iowait.

sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers: the lack of queues, the low wait and service
times (in ms), plus the overall utilization.

The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.

Regards,

Christian
When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench; that should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg
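(For reference, OSD bench can be driven per OSD through the admin interface and RADOS bench per pool; the OSD id, pool name and sizes below are only examples:

ceph tell osd.0 bench 1073741824 4096          # write 1 GB in 4 KB ops
rados bench -p rbd 30 write -b 4096 -t 64      # 30 s of 4 KB writes
rados bench -p rbd 30 write -b 4194304 -t 64   # 30 s of 4 MB writes)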

On Wednesday, May 7, 2014, Alexandre DERUMIER <aderum...@odiso.com>
wrote:

Hi Christian,

Have you tried without RAID6, to have more OSDs?
(How many disks do you have behind the RAID6?)


Also, I know that direct I/O can be quite slow with Ceph,

so maybe you can try without --direct=1

and also enable rbd_cache:

ceph.conf
[client]
rbd cache = true
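(The related cache knobs, if needed, look roughly like this; the sizes are only examples, the defaults being a 32 MB cache with 24 MB max dirty:

[client]
rbd cache = true
rbd cache size = 67108864                    # 64 MB, example value
rbd cache max dirty = 50331648               # 48 MB, example value
rbd cache writethrough until flush = true)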




----- Original Message -----

From: "Christian Balzer" <ch...@gol.com>
To: "Gregory Farnum" <g...@inktank.com>, ceph-users@lists.ceph.com
Sent: Thursday, 8 May 2014 04:49:16
Subject: Re: [ceph-users] Slow IOPS on RBD compared to journal and
backing devices

On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:

On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
<ch...@gol.com>
wrote:
Hello,

Ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
journals are on (separate) DC 3700s; the actual OSDs are RAID6
behind an Areca 1882 with 4GB of cache.

Running this fio:

fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
--iodepth=128

results in:

30k IOPS on the journal SSD (as expected)
110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
3200 IOPS from a VM using userspace RBD
2900 IOPS from a host kernelspace-mounted RBD

When running the fio from the VM RBD the utilization of the
journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
(1500 IOPS after some obvious merging).
The OSD processes are quite busy, reading well over 200% on atop,
but the system is not CPU or otherwise resource starved at that
moment.

Running multiple instances of this test from several VMs on
different hosts changes nothing, as in the aggregated IOPS for
the whole cluster will still be around 3200 IOPS.

Now clearly RBD has to deal with latency here, but the network is
IPoIB with the associated low latency, and the journal SSDs are
the (consistently) fastest ones around.

I guess what I am wondering about is whether this is normal and to be
expected or, if not, where all that potential performance got lost.
Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
Yes, but going down to 32 doesn't change things one iota.
Also note the multiple instances I mention up there, so that would
be 256 IOs at a time, coming from different hosts over different
links and nothing changes.

that's about 40ms of latency per op (for userspace RBD), which
seems awfully long. You should check what your client-side objecter
settings are; it might be limiting you to fewer outstanding ops
than that.
Googling for client-side objecter gives a few hits on ceph devel and
bugs and nothing at all as far as configuration options are
concerned. Care to enlighten me where one can find those?
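(For what it's worth, the client-side throttles in question appear to be the objecter ones; they go in the [client] section, and the values below should be the defaults, so raising them only helps if they really are the limit:

[client]
objecter inflight ops = 1024
objecter inflight op bytes = 104857600       # 100 MB)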

Also note the kernelspace (3.13 if it matters) speed, which is very
much in the same (junior league) ballpark.

If
it's available to you, testing with Firefly or even master would be
interesting — there's some performance work that should reduce
latencies.

Not an option, this is going into production next week.

But a well-tuned (or even default-tuned, I thought) Ceph cluster
certainly doesn't require 40ms/op, so you should probably run a
wider array of experiments to try and figure out where it's coming
from.
I think we can rule out the network, NPtcp gives me:
---
56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
---

For comparison at about 512KB it reaches maximum throughput and
still isn't that laggy:
---
98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
---

So with the network performing as well as my lengthy experience with
IPoIB led me to expect, what else is there to look at?
The storage nodes perform just as expected, as indicated by the local
fio tests.

That pretty much leaves only Ceph/RBD to look at and I'm not really
sure what experiments I should run on that. ^o^

Regards,

Christian

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
