On Tue, 22 Apr 2014 02:45:24 +0800 Indra Pramana wrote:

> Hi Christian,
> 
> Good day to you, and thank you for your reply. :)  See my reply inline.
> 
> On Mon, Apr 21, 2014 at 10:20 PM, Christian Balzer <ch...@gol.com> wrote:
> 
> >
> > Hello,
> >
> > On Mon, 21 Apr 2014 20:47:21 +0800 Indra Pramana wrote:
> >
> > > Dear all,
> > >
> > > I have a Ceph RBD cluster with around 31 OSDs running SSD drives,
> > > and I tried to use the benchmark tools recommended by Sebastien on
> > > his blog here:
> > >
> > How many OSDs per storage node and what is in those storage nodes in
> > terms of controller, CPU, RAM?
> >
> 
> Each storage node has mainly 4 OSDs, although I have one node with 6.
> Each OSD consists of a 480 GB / 500 GB SSD drive (depending on the brand).
> 
So I make that 7 or 8 nodes then?

> Each node has mainly SATA 2.0 controllers (the newer one uses SATA 3.0),
> a 4-core 3.3 GHz CPU and 16 GB of RAM.
>
That sounds good enough as far as memory and CPU are concerned. 
The SATA-2 speed will limit you, though: I have some journal SSDs hanging
off SATA-2 that can't get past 250MB/s, while they reach 350MB/s on
SATA-3.
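
For reference, a quick way to confirm the negotiated link speed per drive
(assuming smartmontools is installed; the device name is just an example):
---
# link speed as reported by the kernel
dmesg | grep -i 'SATA link up'
# or per drive, e.g. /dev/sda
smartctl -i /dev/sda | grep -i 'SATA Version'
---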
 
> 
> > > http://www.sebastien-han.fr/blog/2012/08/26/ceph-benchmarks/
> > >
> > Sebastien has done a great job with those; however, with Ceph being
> > such a fast-moving target, quite a bit of that information is somewhat
> > dated.
> >
> > > Our configuration:
> > >
> > > - Ceph version 0.67.7
> > That's also a bit dated.
> >
> 
> Yes, decided to stick with the latest stable version of dumpling. Do you
> think upgrading to Emperor might help to improve performance?
> 
Given that older versions of Ceph tend to get little support (bug fixes
backported) and that Firefly is around the corner, I would suggest moving
to Emperor: it rules out any problems specific to Dumpling, gives you
experience with the inevitable cluster upgrades, and leaves you a smoother
path to Firefly when it comes out.
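
If you do go that route, the usual order is mons first, then OSDs, one
node at a time. A rough sketch (the package/service commands depend on
your distro, so treat those as placeholders):
---
ceph osd set noout    # keep the cluster from rebalancing while daemons restart
# upgrade the packages and restart ceph-mon on each monitor, one at a time,
# then do the same for the ceph-osd daemons, node by node
ceph osd unset noout  # once everything reports the new version
---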

> > > - 31 OSDs of 500 GB SSD drives each
> > > - Journal for each OSD is configured on the same SSD drive itself
> > > - Journal size 10 GB
> > >
> > > After doing some tests recommended on the article, I find out that
> > > generally:
> > >
> > > - Local disk benchmark tests using dd is fast, around 245 MB/s since
> > > we are using SSDs.
> > > - Network benchmark tests using iperf and netcat is also fast, I can
> > > get around 9.9 Mbit/sec since we are using 10G network.
> >
> > I think you mean 9.9Gb/s there. ^o^
> >
> 
> Yes, I meant 9.9 Gbit/sec. Sorry for the typo.
> 
> > How many network ports per node, cluster network or not?
> >
> 
> Each OSD node has 2 x 10 Gbps connections to our 10 gigabit switch, one
> for the client network and one for the replication network between OSDs.
> 
All very good and by the book.
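
For anyone following along, that split is the standard public/cluster
network setup in ceph.conf, something along these lines (the subnets here
are made up):
---
[global]
    ; client-facing traffic
    public network  = 10.0.0.0/24
    ; replication/backfill between OSDs
    cluster network = 10.0.1.0/24
---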

> 
> > > However:
> > >
> > > - RADOS bench test (rados bench -p my_pool 300 write) on the whole
> > > cluster is slow, averaging around 112 MB/s for write.
> >
> > That command fires off a single thread, which is unlikely to be able
> > to saturate things.
> >
> > Try that with a "-t 32" before the time (300) and if that improves
> > things increase that value until it doesn't (probably around 128).
> >
> 
> Using 32 concurrent writes, the result is below. The speed really fluctuates.
> 
>  Total time run:         64.317049
> Total writes made:      1095
> Write size:             4194304
> Bandwidth (MB/sec):     68.100
> 
> Stddev Bandwidth:       44.6773
> Max bandwidth (MB/sec): 184
> Min bandwidth (MB/sec): 0
> Average Latency:        1.87761
> Stddev Latency:         1.90906
> Max latency:            9.99347
> Min latency:            0.075849
> 
That is really weird, it should get faster, not slower. ^o^
I assume you've run this a number of times?
 
Also, my apologies: the default is 16 threads, not 1, but even that isn't
enough to get my cluster to full speed:
---
Bandwidth (MB/sec):     349.044 

Stddev Bandwidth:       107.582
Max bandwidth (MB/sec): 408
---
at 64 threads it will ramp up from a slow start to:
---
Bandwidth (MB/sec):     406.967 

Stddev Bandwidth:       114.015
Max bandwidth (MB/sec): 452
---
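
For completeness, the command shape behind those numbers is the same as
yours, just with more threads, e.g.:
---
rados bench -p my_pool 300 write -t 16
rados bench -p my_pool 300 write -t 64
---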

But what stands out is your latency. I don't have a 10GbE network to
compare with, but my Infiniband-based cluster (going through at least one
switch) gives me values like this:
---
Average Latency:        0.335519
Stddev Latency:         0.177663
Max latency:            1.37517
Min latency:            0.1017
---

Of course that latency is not just the network.

I would suggest running atop (it gives you more information at a glance)
or "iostat -x 3" on all your storage nodes during these tests, to identify
any node or OSD that is overloaded in some way.
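
Something as simple as this, run while the benchmark is going, is usually
enough to spot the odd one out (hostnames are placeholders):
---
# on each storage node during the rados bench run:
iostat -x 3
# or, with passwordless ssh, from a single box:
for n in node1 node2 node3; do ssh $n 'iostat -x 3 10' > iostat-$n.log & done; wait
---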

> > Are you testing this from just one client?
> >
> 
> Yes. One KVM hypervisor host.
> 
> > How is that client connected to the Ceph network?
> >
> 
> It's connected through the same 10Gb network. iperf results show no
> bandwidth issues between the client and the MONs/OSDs.
> 
> 
> > Another thing comes to mind, how many pg_num and pgp_num are in your
> > "my_pool"?
> > You could have some quite unevenly distributed data.
> >
> 
> pg_num/pgp_num for the pool is currently set to 850.
> 
If this isn't production yet, I would strongly suggest upping that to 2048
for a much smoother distribution; that also matches the usual
recommendation of (100 x OSDs) / replicas, rounded up to the next power of
two, which for 31 OSDs and 2 replicas works out to 2048.
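
If you go that route, raise pg_num first and then pgp_num (pgp_num cannot
exceed pg_num), and expect some data movement while the PGs split and
rebalance:
---
ceph osd pool set my_pool pg_num 2048
ceph osd pool set my_pool pgp_num 2048
---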

> 
> > > - Individual test using "ceph tell osd.X bench" gives different
> > > results per OSD, but also averaging around 110-130 MB/s only.
> > >
> > That at least is easily explained by what I mention below about the
> > remaining performance of your SSDs once journal and OSD data share the
> > same device.
> > > Can anyone advise on why our RADOS/Ceph benchmark results are slow
> > > compared to tests run directly against the OSDs' physical drives?
> > > Anything in the Ceph configuration that we need to optimise further?
> > >
> > For starters, since your journals (I frequently wonder if journals
> > ought to be something that can be turned off) are on the same device
> > as the OSD data, the total throughput and IOPS of that device are
> > effectively halved.
> >
> > And what replication level are you using? That again will cut into your
> > cluster wide throughput and IOPS.
> >
> 
> I maintain 2 replicas on the pool.
> 

So to simplify things I will assume 8 nodes with 4 OSDs each and all SSDs
on SATA-2, giving a raw speed of 250MB/s per SSD.
The speed per OSD will be just half that, though, since each SSD has to
share its bandwidth with the journal.
So that's just 500MB/s of potential speed per node, or 4GB/s for the whole
cluster.
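
Spelled out:
---
250 MB/s per SSD  / 2 (journal + data on the same device)  = 125 MB/s per OSD
125 MB/s per OSD  x 4 OSDs per node                        = 500 MB/s per node
500 MB/s per node x 8 nodes                                = ~4 GB/s cluster-wide, raw
---
and with 2 replicas the client-visible ceiling is roughly half of that
again, about 2 GB/s aggregate.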

Now here is where it gets tricky.
With just one thread and one client you will write to one PG: first to the
journal of the primary OSD, then that will be written to the journal of
the secondary OSD (on another node), and your transaction will be ACK'ed.
This of course doesn't take any advantage of the parallelism of Ceph and
will never get close to achieving maximum bandwidth per client. But it
also won't be impacted by which OSDs the PGs reside on, as there is no
competition from other clients/threads.

With 16 threads (and more) the PG distribution becomes crucial.
Ideally each thread would be writing to a different primary OSD, and all
the secondary OSDs would be ones that aren't primaries (16 of each, with
the assumed 32 OSDs).
But if the PGs are clumpy and, for example, osd.0 happens to be the
primary for one PG being written to by one thread and the secondary for
another thread at the same time, its bandwidth just dropped again.
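
One way to eyeball how clumpy things are is to count how many PGs each OSD
is primary for (the exact column differs between Ceph versions, so treat
the awk field as an assumption and adjust it to your output):
---
ceph pg dump pgs_brief | awk '/^[0-9]/ {print $NF}' | sort -n | uniq -c | sort -rn
---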

Regards,

Christian
> 
> >
> > I've read a number of times that Ceph will in general be about half as
> > fast as you would expect from the cluster hardware you're deploying,
> > but that of course depends on many factors and needs verification in
> > each specific case.
> >
> > For me, I have OSDs (11-disk RAID6 on an Areca 1882 with 1GB cache,
> > 2 nodes with 2 OSDs each) that can handle the fio run below directly
> > on the OSD at 37k IOPS (since it fits into the cache nicely).
> > ---
> > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> > --rw=randwrite --name=fiojob --blocksize_range=4k-4K --iodepth=16
> > ---
> > The journal SSD is about the same.
> >
> > However, that same benchmark delivers a mere 3100 IOPS when run from
> > a VM (userspace RBD, caching enabled, but that makes no difference at
> > all), and the journal SSDs are busier (25%) than the actual OSDs (5%),
> > though still nowhere near their capacity.
> > This leads me to believe that, aside from network latencies (4x QDR
> > Infiniband here, which has less latency than 10GbE), there is a lot of
> > room for improvement in how Ceph handles things (bottlenecks in the
> > code) and in tuning in general.
> >
> 
> Thanks for sharing.
> 
> Any further tuning suggestions would be greatly appreciated.
> 
> Cheers.


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/