Hello,

On Wed, 1 Oct 2014 23:08:53 -0700 Jakes John wrote:

> Thanks Christian. You saved me a lot of time! I mistakenly assumed the -b
> value to be in KB.
> 
> Now, when I ran the same benchmarks, I am getting ~106 MB/s for writes and
> ~1050 MB/s for reads with a replica size of 2.
> 
> I am slightly confused about the read and write bandwidth terminology.
> What is the theoretical maximum read and write bandwidth for a 4-node
> cluster? How is it defined and how is it calculated?

That totally depends on the hardware you're using, of course.
And a big factor in that is the number of OSDs.
Think about it (and watch it with atop): your disks are not only 100% busy
when they are writing sequentially at full speed, but also at much lower
speeds when they have to seek. So a write to a PG that is primary on OSD-0
and secondary on OSD-3 may have to compete with one that is primary on
OSD-3, slowing down OSD-3 (at least) by about half for the duration.
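
You can see which OSDs serve a given object (the object name here is
arbitrary, just for illustration) with something like:

ceph osd map test-pool some_object

which prints the PG and the up/acting OSD set that object maps to.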

> Why are both
> different by a huge factor? 

If you monitor things, you will probably find that all the data still fits
into the pagecache of the storage nodes, so at the read speed you're seeing
you were really measuring the speed limit of your network connection. ^.^
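
If you want to see what the disks themselves can deliver for reads, one
option (a sketch; it assumes you don't mind briefly disturbing the page
cache on the storage nodes) is to drop it on every OSD node before the
read bench:

sync; echo 3 | sudo tee /proc/sys/vm/drop_caches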

> My OSDs and journals are on two different
> disks.
> 
I missed that in your first mail. And while a separate journal disk will
help, it isn't the ideal (very fast) setup either. In addition, from what I
gather you're using a journal file and not a partition.
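
A quick way to verify (this assumes the default ceph-deploy layout; adjust
the OSD id and path if yours differ):

ls -l /var/lib/ceph/osd/ceph-0/journal

If that is a regular file you have a file-based journal; if it is a symlink
to a block device, it is a partition journal.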

> 
> I thought, for a replica size of 2, the maximum write bandwidth must be
> 120MB/s * 4 / 2 = 240 MB/s
> and for a replica size of 3, the maximum write bandwidth must be
> 120MB/s * 4 / 3 = 160 MB/s
> 
Too optimistic; this ignores the fact that these are not sequential local
writes, but segmented ones on top of a filesystem, distributed amongst
network nodes.
Monitor your nodes with atop and see where the bottlenecks are (I still
bet on the disks).
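For example (the interval is arbitrary), run it with a short interval on
every OSD node while the bench is going:

atop 2

and keep an eye on the busy percentage in the DSK lines and on CPU spikes.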
Re-read my mail below.

Christian
> 
> 
> Thanks
> 
> 
> On Wed, Oct 1, 2014 at 7:24 PM, Christian Balzer <ch...@gol.com> wrote:
> 
> >
> > Hello,
> >
> > On Wed, 1 Oct 2014 14:43:49 -0700 Jakes John wrote:
> >
> > > Hi Ceph users,
> > >                         I am stuck with the benchmark results that I
> > > obtained from the ceph cluster.
> > >
> > > Ceph Cluster:
> > >
> > > 1 mon node, 4 OSD nodes with a 1 TB disk each. I have one journal for
> > > each OSD.
> > >
> > > All disks are identical and the nodes are connected by 10G. Below are
> > > the dd results:
> > >
> > >
> > > dd if=/dev/zero of=/home/ubuntu/deleteme bs=10G count=1 oflag=direct
> > > 0+1 records in
> > > 0+1 records out
> > > 2147479552 bytes (2.1 GB) copied, 17.0705 s, 126 MB/s
> > >
> > That's for one disk, done locally, I presume?
> > Note that with a bs of 10G you're really comparing apples to oranges
> > later on, of course.
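> >
> > A more apples-to-apples local baseline (just a sketch; adjust the path
> > for your setup) would use the same 4MB block size rados bench defaults to:
> >
> > dd if=/dev/zero of=/home/ubuntu/deleteme bs=4M count=512 oflag=direct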
> >
> > >
> > > I created 1 OSD (xfs) on each node as below.
> > >
> > > mkfs.xfs /dev/sdo1
> > > mount /dev/sdo1 /node/nodeo
> > >
> > > sudo mkfs.xfs /dev/sdp1
> > >
> > > ceph-deploy osd prepare mynode:/node/nodeo:/dev/sdp1
> > > ceph-deploy osd activate mynode:/node/nodeo:/dev/sdp1
> > >
> > > Now, when I run rados benchmarks, I am just getting ~4 MB/s for
> > > writes and ~40 MB/s for reads. What am I doing wrong?
> > Nothing really.
> >
> > > I have seen Christian's post regarding the block sizes and
> > > parallelism. My benchmark arguments seem to be right.
> > >
> > You're testing with 4k blocks, which are still quite small in the Ceph
> > world; the default (with no -b parameter) is 4MB!
> >
> > If I use your parameters, I can get about 8MB/s from my cluster with 8
> > OSDs per node and 4 SSDs for journals, connected by Infiniband.
> > So don't feel bad. ^o^
> > Using the default 4MB block size, I get 600MB/s.
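> >
> > For comparison, something along these lines (same pool and options as
> > your run, just without -b) exercises that 4MB default:
> >
> > rados -p test-pool bench 120 write -t 16 --no-cleanup
> > rados -p test-pool bench 120 seq -t 16 --no-cleanup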
> >
> > > Replica size of test-pool - 2
> > > No of pgs: 256
> > >
> > > rados -p test-pool bench 120 write -b 4096 -t 16 --no-cleanup
> > >
> > > Total writes made:      245616
> > > Write size:             4096
> > > Bandwidth (MB/sec):     3.997
> > >
> > > Stddev Bandwidth:       2.19989
> > > Max bandwidth (MB/sec): 8.46094
> > > Min bandwidth (MB/sec): 0
> > > Average Latency:        0.0156332
> > > Stddev Latency:         0.0460168
> > > Max latency:            2.94882
> >
> > This suggests to me that at one point your disks were the bottlenecks,
> > probably due to the journals being on the same device.
> >
> > Always run atop (as it covers nearly all the bases) on all your OSD
> > nodes when doing tests; you will see when the disks are the bottleneck,
> > and you might find that with certain operations CPU usage spikes so much
> > that it becomes the culprit.
> >
> > > Min latency:            0.001725
> > >
> > >
> > > rados -p test-pool bench 120 seq -t 16 --no-cleanup
> > >
> > >
> > > Total reads made:     245616
> > > Read size:            4096
> > > Bandwidth (MB/sec):    40.276
> > >
> > > Average Latency:       0.00155048
> > > Max latency:           3.25052
> > > Min latency:           0.000515
> > >
> >
> > I don't know the intimate inner details of Ceph, but I assume this is
> > because things were written with 4KB blocks, and I can certainly
> > reproduce this behavior and these results on my "fast" cluster. Looking
> > at atop, it gets VERY busy CPU-wise at that time, also suggesting it
> > has to deal with lots of little transactions.
> >
> > Doing the rados bench with the default 4MB block size (no -b
> > parameter) I also get 600MB/s read performance.
> >
> >
> > Some general observations about what to expect for writes:
> >
> > Let's do some very simplified calculations here:
> > 1. Your disks can write about 120MB/s individually. But those were
> > sequential writes you tested; Ceph writes 4MB blobs into a filesystem
> > and thus has far more overhead and will be significantly slower.
> > 2. You have on-disk journals, thus halving your base disk speed,
> > meaning a drive can now at best write about 60MB/s.
> > 3. And a replication of 2, potentially halving speeds again.
> >
> > So the base speed of your cluster is about 120MB/s, about the same as a
> > single drive. And these are non-sequential writes spread over a network
> > (which IS slower than local writes).
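> >
> > As a rough back-of-the-envelope summary (sequential figures, so an upper
> > bound at best):
> >
> > 120MB/s / 2 (on-disk journal) =  60MB/s per OSD
> >  60MB/s * 4 OSDs              = 240MB/s aggregate
> > 240MB/s / 2 (replication)     = 120MB/s of client writes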
> >
> > On my crappy test cluster I can't get much over 40MB/s, and it
> > incidentally also has 4 OSDs with on-disk journals.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
