Hi Christian,

Good day to you, and thank you for your reply. :)  See my reply inline.

On Mon, Apr 21, 2014 at 10:20 PM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Mon, 21 Apr 2014 20:47:21 +0800 Indra Pramana wrote:
>
> > Dear all,
> >
> > I have a Ceph RBD cluster with around 31 OSDs running SSD drives, and I
> > tried to use the benchmark tools recommended by Sebastien on his blog
> > here:
> >
> How many OSDs per storage node and what is in those storage nodes in terms
> of controller, CPU, RAM?
>

Most storage nodes have 4 OSDs each, although one node has 6. Each OSD is a
480 GB or 500 GB SSD drive (depending on the brand).

Most nodes have SATA 2.0 controllers (the newer ones use SATA 3.0), a 4-core
3.3 GHz CPU and 16 GB of RAM.
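
(For completeness, the figures above were taken with the usual tools, roughly
like this; nothing Ceph-specific:)

---
# negotiated SATA link speed per drive: "3.0 Gbps" = SATA 2.0, "6.0 Gbps" = SATA 3.0
dmesg | grep -i 'SATA link up'

# CPU model / core count and total RAM
grep 'model name' /proc/cpuinfo | sort -u
nproc
free -m
---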


> > http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
> >
> Sebastien has done a great job with those, however with Ceph being such a
> fast moving target quite a bit of that information is somewhat dated.
>
> > Our configuration:
> >
> > - Ceph version 0.67.7
> That's also a bit dated.
>

Yes, I decided to stick with the latest stable version of Dumpling. Do you
think upgrading to Emperor might help improve performance?

> > - 31 OSDs of 500 GB SSD drives each
> > - Journal for each OSD is configured on the same SSD drive itself
> > - Journal size 10 GB
> >
> > After doing some tests recommended on the article, I find out that
> > generally:
> >
> > - Local disk benchmark tests using dd is fast, around 245 MB/s since we
> > are using SSDs.
> > - Network benchmark tests using iperf and netcat is also fast, I can get
> > around 9.9 Mbit/sec since we are using 10G network.
>
> I think you mean 9.9Gb/s there. ^o^
>

Yes, I meant 9.9 Gbit/sec. Sorry for the typo.
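
For reference, the disk and network tests were along these lines; the file
path and IP below are just placeholders, not our real ones:

---
# local write throughput on one OSD SSD, bypassing the page cache
# (any file on the SSD's filesystem will do; remove it afterwards)
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/ddtest bs=1M count=1024 oflag=direct

# network throughput between the client and an OSD node
iperf -s                      # on the OSD node
iperf -c 10.0.0.11 -t 30      # on the client, pointing at the OSD node's IP
---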

> How many network ports per node, cluster network or not?
>

Each OSD node has 2 x 10 Gbps connections to our 10-gigabit switch, one for
the client network and the other for the replication network between the OSDs.
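
The split is done with the usual public/cluster network settings in ceph.conf,
roughly like this (the subnets shown are placeholders, not our real ones):

---
[global]
    # client (VM / RBD) traffic
    public network = 10.10.10.0/24
    # OSD replication and recovery traffic
    cluster network = 10.10.20.0/24
---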


> > However:
> >
> > - RADOS bench test (rados bench -p my_pool 300 write) on the whole
> > cluster is slow, averaging around 112 MB/s for write.
>
> That command fires off a single thread, which is unlikely to be able to
> saturate things.
>
> Try that with a "-t 32" before the time (300) and if that improves
> things increase that value until it doesn't (probably around 128).
>

Using 32 concurrent writes, the result is below. The speed fluctuates
quite a bit.

Total time run:         64.317049
Total writes made:      1095
Write size:             4194304
Bandwidth (MB/sec):     68.100

Stddev Bandwidth:       44.6773
Max bandwidth (MB/sec): 184
Min bandwidth (MB/sec): 0
Average Latency:        1.87761
Stddev Latency:         1.90906
Max latency:            9.99347
Min latency:            0.075849
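
To follow your suggestion of increasing the concurrency until it stops
helping, I will sweep the thread count with something like this (shorter
runs, same pool; just a sketch):

---
for t in 16 32 64 128; do
    echo "=== rados bench write, $t concurrent ops ==="
    rados bench -p my_pool 60 write -t $t
done
---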

> Are you testing this from just one client?
>

Yes. One KVM hypervisor host.

> How is that client connected to the Ceph network?
>

It's connected through the same 10 Gbps network. iperf results show no
bandwidth issues between the client and the MONs/OSDs.


> Another thing comes to mind, how many pg_num and pgp_num are in your
> "my_pool"?
> You could have some quite unevenly distributed data.
>

pg_num/pgp_num for the pool are both currently set to 850.
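
For reference, this is how I check the values, and how I understand they
would be raised if 850 turns out to be too low for 31 OSDs; the 2048 below is
just the usual power-of-two rule of thumb (OSDs x 100 / replicas, rounded
up), not something we have applied yet:

---
ceph osd pool get my_pool pg_num
ceph osd pool get my_pool pgp_num

# pg_num can only be increased, never decreased; pgp_num should follow it
#ceph osd pool set my_pool pg_num 2048
#ceph osd pool set my_pool pgp_num 2048
---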


> > - Individual test using "ceph tell osd.X bench" gives different results
> > per OSD but also averaging around 110-130 MB/s only.
> >
> That at least is easily explained by what I'm mentioning below about the
> remaining performance of your SSD when journal and OSD data are on it at
> the same time.
> > Can anyone advise why our RADOS/Ceph benchmark results are slow compared
> > to a direct physical drive test on the OSDs? Is there anything in the
> > Ceph configuration that we need to optimise further?
> >
> For starters, since your journals (I frequently wonder if journals ought
> to be something that can be turned off) are on the same device as the OSD
> data, your total throughput and IOPS of that device have now been halved.
>
> And what replication level are you using? That again will cut into your
> cluster wide throughput and IOPS.
>

I maintain 2 replicas on the pool.
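
(Confirmed with something like the commands below:)

---
ceph osd pool get my_pool size
# or, for all pools at once:
ceph osd dump | grep '^pool'
---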


>
> I've read a number of times that Ceph will be in general half as fast as
> your expected speed from the cluster hardware you're deploying, but that of
> course is something based on many factors and needs verification in each
> specific case.
>
> For me, I have OSDs (11 disk RAID6 on an Areca 1882 with 1GB cache, 2
> OSDs each on 2 nodes total) that can handle the fio run below directly on
> the OSD at 37k IOPS (since it fits into the cache nicely).
> ---
> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randwrite --name=fiojob --blocksize_range=4k-4K --iodepth=16
> ---
> The journal SSD is about the same.
>
> However that same benchmark just delivers a mere 3100 IOPS when run from a
> VM (userspace RBD, caching enabled but that makes no difference at all) and
> the journal SSDs are busier (25%) than the actual OSDs (5%), but still
> nowhere near their capacity.
> This leads me to believe that aside from network latencies (4xQDDR
> Infiniband here, which has less latency than 10GbE) there is a lot of
> space for improvement when it comes to how Ceph handles things
> (bottlenecks in the code) and tuning in general.
>

Thanks for sharing.
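
For comparison, I may try the same fio job from inside one of our VMs,
roughly like this (the --directory path is just a placeholder for an
existing directory inside the guest):

---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 \
    --rw=randwrite --name=fiojob --blocksize_range=4k-4K --iodepth=16 \
    --directory=/root/fiotest
---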

Any further tuning suggestions would be greatly appreciated.

Cheers.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
