Thanks Christian, you saved me a lot of time! I had mistakenly assumed the -b
value to be in KB.

Now, when I run the same benchmarks, I get ~106 MB/s for writes and
~1050 MB/s for reads with a replica size of 2.

I am slightly confused by the read and write bandwidth terminology. What is
the theoretical maximum read and write bandwidth for a 4-node cluster? How is
it defined, and how is it calculated? Why do the two differ by such a huge
factor? My OSDs and journals are on two different disks.


I thought that for a replica size of 2, the maximum write bandwidth should be
120 MB/s * 4 / 2 = 240 MB/s, and for a replica size of 3, the maximum write
bandwidth should be 120 MB/s * 4 / 3 = 160 MB/s.
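
Spelled out as a quick shell calculation (DISK_MBS and OSDS are just
placeholder names of mine; I am assuming ~120 MB/s per data disk, 4 OSDs, and
journals on separate disks, so no journal penalty):

DISK_MBS=120; OSDS=4
echo "replica 2: $(( DISK_MBS * OSDS / 2 )) MB/s"   # 240 MB/s
echo "replica 3: $(( DISK_MBS * OSDS / 3 )) MB/s"   # 160 MB/s

Is that the right way to reason about it, or is there another factor I am
missing?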



Thanks


On Wed, Oct 1, 2014 at 7:24 PM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Wed, 1 Oct 2014 14:43:49 -0700 Jakes John wrote:
>
> > Hi Ceph users,
> >                         I am stuck with the benchmark results that I
> > obtained from the ceph cluster.
> >
> > Ceph Cluster:
> >
> > 1 mon node and 4 OSD nodes with 1 TB disks. I have one journal for each
> > OSD.
> >
> > All disks are identical and the nodes are connected by 10G. Below are the
> > dd results:
> >
> >
> > dd if=/dev/zero of=/home/ubuntu/deleteme bs=10G count=1 oflag=direct
> > 0+1 records in
> > 0+1 records out
> > 2147479552 bytes (2.1 GB) copied, 17.0705 s, 126 MB/s
> >
> That's for one disk done locally I presume?
> Note that with a bs of 10G you're really comparing apples to oranges later
> on of course.
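>
> A closer apples-to-apples local test would use the same 4MB block size that
> rados bench defaults to; an untested sketch, reusing your target file, with
> the count picked to write ~2GB:
>
> dd if=/dev/zero of=/home/ubuntu/deleteme bs=4M count=512 oflag=direct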
>
> >
> > I created 1 OSD (xfs) on each node as below.
> >
> > mkfs.xfs /dev/sdo1
> > mount /dev/sdo1 /node/nodeo
> >
> > sudo mkfs.xfs /dev/sdp1
> >
> > ceph-deploy osd prepare mynode:/node/nodeo:/dev/sdp1
> > ceph-deploy osd activate mynode:/node/nodeo:/dev/sdp1
> >
> > Now, when I run rados benchmarks, I am just getting ~4 MB/s for writes and
> > ~40 MB/s for reads. What am I doing wrong?
> Nothing really.
>
> > I have seen Christian's post regarding the block sizes and parallelism.
> > My benchmark arguments seem to be right.
> >
> You're testing with 4KB blocks, which are still quite small in the Ceph
> world; the default (with no -b parameter) is 4MB!
>
> If I use your parameters, I can get about 8MB/s from my cluster with 8
> OSDs per node and 4 SSDs for journals, connected by Infiniband.
> So don't feel bad. ^o^
> Using the default 4MB block size, I get 600MB/s.
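>
> For reference, that default run is just your command below without the -b
> flag, i.e. something like:
>
> rados -p test-pool bench 120 write -t 16 --no-cleanup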
>
> > Replica size of test-pool - 2
> > No of pgs: 256
> >
> > rados -p test-pool bench 120 write -b 4096 -t 16 --no-cleanup
> >
> > Total writes made:      245616
> > Write size:             4096
> > Bandwidth (MB/sec):     3.997
> >
> > Stddev Bandwidth:       2.19989
> > Max bandwidth (MB/sec): 8.46094
> > Min bandwidth (MB/sec): 0
> > Average Latency:        0.0156332
> > Stddev Latency:         0.0460168
> > Max latency:            2.94882
>
> This suggests to me that at one point your disks were the bottlenecks,
> probably due to the journals being on the same device.
>
> Always run atop (as it covers nearly all the bases) on all your OSD nodes
> when doing tests. You will see when the disks are the bottleneck, and you
> might find that with certain operations CPU usage spikes so much that it
> becomes the culprit.
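>
> Something as simple as this in a second terminal on each OSD node while the
> bench runs will do (the argument is just the refresh interval in seconds):
>
> atop 2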
>
> > Min latency:            0.001725
> >
> >
> > rados -p test-pool bench 120 seq -t 16 --no-cleanup
> >
> >
> > Total reads made:     245616
> > Read size:            4096
> > Bandwidth (MB/sec):    40.276
> >
> > Average Latency:       0.00155048
> > Max latency:           3.25052
> > Min latency:           0.000515
> >
>
> I don't know the intimate inner details of Ceph, but I assume this is
> because the data was written as 4KB blocks, and I can certainly reproduce
> this behavior and these results on my "fast" cluster. Looking at atop, it
> also gets VERY busy CPU-wise at that time, suggesting it has to deal with
> lots of little transactions.
>
> Doing the rados bench with the default 4MB block size (no -b parameter) I
> also get 600MB/s read performance.
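>
> The seq bench reads back the objects the previous write bench left behind
> (which is why --no-cleanup matters), so after a default 4MB write run the
> matching read test is simply your earlier command again:
>
> rados -p test-pool bench 120 seq -t 16 --no-cleanup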
>
>
> Some general observations about what to expect for writes:
>
> Let's do some very simplified calculations here:
> 1. Your disks can write about 120MB/s individually. But those were
> sequential writes you tested; Ceph writes 4MB blobs into a filesystem and
> thus has way more overhead and will be significantly slower.
> 2. You have on-disk journals, thus halving your base disk speed, meaning a
> drive can now at best write about 60MB/s.
> 3. And a replication of 2, potentially halving speeds again.
>
> So the base speed of your cluster is about 120MB/s, about the same as a
> single drive. And these are non-sequential writes spread over a network
> (which IS slower than local writes).
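>
> Or, written out as the same back-of-the-envelope arithmetic (the numbers are
> just the assumptions above: 4 OSDs at ~120MB/s each, on-disk journals
> halving that, replica 2 halving it again):
>
> echo $(( 4 * 120 / 2 / 2 ))   # 120 (MB/s ceiling, before Ceph overhead)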
>
> On my crappy test cluster I can't get much over 40MB/s; incidentally, it
> also has 4 OSDs with on-disk journals.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>