Hello,

On Mon, 21 Apr 2014 20:47:21 +0800 Indra Pramana wrote:

> Dear all,
> 
> I have a Ceph RBD cluster with around 31 OSDs running SSD drives, and I
> tried to use the benchmark tools recommended by Sebastien on his blog
> here:
>
How many OSDs per storage node and what is in those storage nodes in terms
of controller, CPU, RAM?
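In case it's easier than typing it all up, this will at least show how the
OSDs are spread over your hosts:
---
ceph osd tree
---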
 
> http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
> 
Sebastien has done a great job with those; however, with Ceph being such a
fast-moving target, quite a bit of that information is somewhat dated.

> Our configuration:
> 
> - Ceph version 0.67.7
That's also a bit dated.

> - 31 OSDs of 500 GB SSD drives each
> - Journal for each OSD is configured on the same SSD drive itself
> - Journal size 10 GB
> 
> After doing some tests recommended on the article, I find out that
> generally:
> 
> - Local disk benchmark tests using dd is fast, around 245 MB/s since we
> are using SSDs.
> - Network benchmark tests using iperf and netcat is also fast, I can get
> around 9.9 Mbit/sec since we are using 10G network.

I think you mean 9.9Gb/s there. ^o^

How many network ports per node, and do you run a separate cluster network
or not?
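For reference, a dedicated cluster network would typically be set up in
ceph.conf roughly like this (the subnets here are just placeholders):
---
[global]
    public network  = 10.0.0.0/24    # client traffic
    cluster network = 10.0.1.0/24    # OSD replication/heartbeat traffic
---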

> 
> However:
> 
> - RADOS bench test (rados bench -p my_pool 300 write) on the whole
> cluster is slow, averaging around 112 MB/s for write.

By default that command runs with fairly low concurrency, which is unlikely
to be able to saturate things.

Try that with a "-t 32" before the time (300), and if that improves
things, keep increasing that value until it stops helping (probably
somewhere around 128).
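For example (pool name as in your test, the thread counts are just
starting points):
---
rados bench -p my_pool -t 32 300 write
rados bench -p my_pool -t 64 300 write
---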

Are you testing this from just one client?
How is that client connected to the Ceph network?  

Another thing that comes to mind: what are pg_num and pgp_num set to for
your "my_pool"?
If those are on the low side, you could end up with quite unevenly
distributed data.
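You can check both with (assuming the pool really is named my_pool):
---
ceph osd pool get my_pool pg_num
ceph osd pool get my_pool pgp_num
---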

> - Individual test using "ceph tell osd.X bench" gives different results
> per OSD but also averaging around 110-130 MB/s only.
> 
That at least is easily explained by what I mention further down about the
remaining performance of your SSDs once the journal and the OSD data share
the same device.
> Anyone can advise what could be the reason of why our RADOS/Ceph
> benchmark test result is slow compared to a direct physical drive test
> on the OSDs directly? Anything on Ceph configuration that we need to
> optimise further?
> 
For starters, since your journals (I frequently wonder if journals ought to
be something that can be turned off) are on the same device as the OSD
data, every write hits that SSD twice, so its usable throughput and IOPS
are effectively halved. With roughly 245 MB/s of raw speed, that leaves you
with about 120 MB/s per OSD, which lines up nicely with the 110-130 MB/s
you see from "ceph tell osd.X bench".

And what replication level are you using? Every client write gets written
that many times across the cluster, so that again will cut into your
cluster-wide throughput and IOPS.
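If in doubt, this should show the pool's replication size:
---
ceph osd pool get my_pool size
---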

I've read a number of times that Ceph will in general be about half as fast
as what you'd expect from the raw cluster hardware you're deploying, but
that of course depends on many factors and needs verification in each
specific case.

For comparison, I have OSDs (11-disk RAID6 on an Areca 1882 with 1GB of
cache, 2 OSDs per node on 2 nodes in total) that can handle the fio run
below at 37k IOPS when run directly against the OSD file system (since it
fits nicely into the controller cache).
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randwrite --name=fiojob --blocksize_range=4k-4k --iodepth=16
---
The journal SSD delivers about the same.

However, that same fio run delivers a mere 3100 IOPS when run from inside a
VM (userspace RBD with caching enabled, see the snippet further down,
although that makes no difference at all), and the journal SSDs are busier
(25%) than the actual OSDs (5%), but still nowhere near their capacity.
This leads me to believe that, aside from network latencies (4x QDR
Infiniband here, which has lower latency than 10GbE), there is a lot of
room for improvement in how Ceph handles things (bottlenecks in the code)
and in tuning in general.
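For reference, "caching enabled" above means the client-side RBD cache, set
roughly like this in ceph.conf on the client (plus cache=writeback in the
VM's disk definition):
---
[client]
    rbd cache = true
---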


Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/