Thanks, Christian. You saved me a lot of time! I had mistakenly assumed the -b value was in KB.
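For anyone else who trips over this, here are the two invocations side by side (pool name as in my tests quoted below; -b takes plain bytes):

    rados -p test-pool bench 120 write -b 4096 -t 16 --no-cleanup   # 4 KB objects (what I ran)
    rados -p test-pool bench 120 write -t 16 --no-cleanup           # default 4 MB objects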
Now, running the same benchmarks, I am getting ~106 MB/s for writes and
~1050 MB/s for reads with a replica size of 2. I am slightly confused by the
read and write bandwidth terminology. What is the theoretical maximum read
and write bandwidth for a 4-node cluster? How is it defined, and how is it
calculated? Why do the two differ by such a huge factor? My OSDs and
journals are on two different disks. I thought:

for a replica size of 2, maximum write bandwidth must be 120 MB/s * 4 / 2 = 240 MB/s
for a replica size of 3, maximum write bandwidth must be 120 MB/s * 4 / 3 = 160 MB/s
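In case it helps, this is the back-of-the-envelope model I am assuming,
written out as a small shell sketch (the 120 MB/s per-disk figure is from my
dd test; the journal factor is my guess at how the on-disk-journal penalty
mentioned below fits in):

    DISK_MBS=120      # streaming write speed of one disk (from dd)
    OSDS=4            # one OSD per node
    REPLICAS=2        # pool replica size
    JOURNAL_FACTOR=1  # 1 = journal on a separate disk, 2 = journal sharing the OSD disk
    echo "theoretical max write: ~$(( DISK_MBS * OSDS / REPLICAS / JOURNAL_FACTOR )) MB/s"
    # replica 2 -> ~240 MB/s, replica 3 -> ~160 MB/s with separate journals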
Thanks

On Wed, Oct 1, 2014 at 7:24 PM, Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> On Wed, 1 Oct 2014 14:43:49 -0700 Jakes John wrote:
>
> > Hi Ceph users,
> >             I am stuck with the benchmark results that I
> > obtained from the ceph cluster.
> >
> > Ceph Cluster:
> >
> > 1 Mon node, 4 OSD nodes of 1 TB. I have one journal for each OSD.
> >
> > All disks are identical and the nodes are connected by 10G. Below are
> > the dd results:
> >
> > dd if=/dev/zero of=/home/ubuntu/deleteme bs=10G count=1 oflag=direct
> > 0+1 records in
> > 0+1 records out
> > 2147479552 bytes (2.1 GB) copied, 17.0705 s, 126 MB/s
> >
> That's for one disk, done locally, I presume?
> Note that with a bs of 10G you're really comparing apples to oranges
> later on, of course.
>
> > I created 1 OSD (xfs) on each node as below:
> >
> > mkfs.xfs /dev/sdo1
> > mount /dev/sdo1 /node/nodeo
> >
> > sudo mkfs.xfs /dev/sdp1
> >
> > ceph-deploy osd prepare mynode:/node/nodeo:/dev/sdp1
> > ceph-deploy osd activate mynode:/node/nodeo:/dev/sdp1
> >
> > Now, when I run rados benchmarks, I am just getting ~4 MB/s for writes
> > and ~40 MB/s for reads. What am I doing wrong?
> Nothing, really.
>
> > I have seen Christian's post regarding block sizes and parallelism.
> > My benchmark arguments seem to be right.
> >
> You're testing with 4k blocks, which are still quite small in the Ceph
> world; the default (with no -b parameter) is 4MB!
>
> If I use your parameters, I can get about 8MB/s from my cluster with 8
> OSDs per node and 4 SSDs for journals, connected by Infiniband.
> So don't feel bad. ^o^
> Using the default 4MB block size, I get 600MB/s.
>
> > Replica size of test-pool: 2
> > No. of PGs: 256
> >
> > rados -p test-pool bench 120 write -b 4096 -t 16 --no-cleanup
> >
> > Total writes made:      245616
> > Write size:             4096
> > Bandwidth (MB/sec):     3.997
> >
> > Stddev Bandwidth:       2.19989
> > Max bandwidth (MB/sec): 8.46094
> > Min bandwidth (MB/sec): 0
> > Average Latency:        0.0156332
> > Stddev Latency:         0.0460168
> > Max latency:            2.94882
>
> This suggests to me that at one point your disks were the bottleneck,
> probably due to the journals being on the same device.
>
> Always run atop (as it covers nearly all the bases) on all your OSD nodes
> when doing tests; you will see when disks are the bottleneck, and you
> might find that with certain operations CPU usage spikes so much that it
> becomes the culprit.
>
> > Min latency: 0.001725
> >
> > rados -p test-pool bench 120 seq -t 16 --no-cleanup
> >
> > Total reads made:   245616
> > Read size:          4096
> > Bandwidth (MB/sec): 40.276
> >
> > Average Latency:    0.00155048
> > Max latency:        3.25052
> > Min latency:        0.000515
> >
> I don't know the intimate inner details of Ceph, but I assume this is
> because things were written with 4KB blocks, and I can certainly
> reproduce this behavior and these results on my "fast" cluster. Also,
> looking at atop, it gets VERY busy CPU-wise at that time, also suggesting
> it has to deal with lots of little transactions.
>
> Doing the rados bench with the default 4MB block size (no -b parameter) I
> also get 600MB/s read performance.
>
> Some general observations about what to expect for writes.
>
> Let's do some very simplified calculations here:
> 1. Your disks can write about 120MB/s individually. But those were
>    sequential writes you tested; Ceph writes 4MB blobs into a filesystem
>    and thus has way more overhead and will be significantly slower.
> 2. You have on-disk journals, thus halving your base disk speed, meaning
>    a drive can now at best write about 60MB/s.
> 3. And a replication of 2, potentially halving speeds again.
>
> So the base speed of your cluster is about 120MB/s, about the same as a
> single drive. And these are non-sequential writes spread over a network
> (which IS slower than local writes).
>
> On my crappy test cluster I can't get much over 40MB/s, and it
> incidentally also has 4 OSDs with on-disk journals.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
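PS: taking up the atop suggestion, this is roughly how I plan to watch the
next run (atop on every OSD node, the benchmark driven from a client):

    atop 2                                                  # on each OSD node, 2-second refresh
    rados -p test-pool bench 120 write -t 16 --no-cleanup   # from the client, default 4 MB objects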
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com