random write performance

Mark Nelson Mon, 19 Aug 2013 05:49:33 -0700

On 08/19/2013 06:28 AM, Da Chun Ng wrote:

I have a 3-node, 15-OSD ceph cluster set up:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
* 64G Ram for each node.
I deployed the cluster with ceph-deploy, and created a new data pool for cephfs.
Both the data and metadata pools are set with replica size 3.
Then mounted the cephfs on one of the three nodes, and tested the performance with fio.
The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec

Sounds like readahead and/or caching is helping out a lot here. Btw, you might want to make sure this is actually coming from the disks with iostat or collectl or something.

But the sequential write/random read/random write performance is very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec

One thing to keep in mind is that unless you have SSDs in this system, you will be doing 2 writes for every client write to the spinning disks (since data and journals will both be on the same disk).


So let's do the math:

6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive

If there is no write coalescing going on, this isn't terrible. If there is, this is terrible. Have you tried buffered writes with the sync engine at the same IO size?
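
For anyone who wants to redo this arithmetic with their own numbers, here is a minimal Python sketch of the same estimate. The function name and parameters are made up for illustration; it assumes load is spread evenly across all drives and that no coalescing happens anywhere in the stack.

```python
def per_drive_iops(client_kb_per_s, io_size_bytes, drives,
                   replication=1, writes_per_io=1):
    """Estimate the IOPS each drive services for a given client throughput.

    Assumes IO is spread evenly over all drives and nothing is coalesced.
    """
    client_iops = client_kb_per_s * 1024 / io_size_bytes
    backend_iops = client_iops * replication * writes_per_io
    return backend_iops / drives


# Sequential write result quoted above: 6618.2 KB/s at 16K IOs,
# 3x replication, journal + data on the same disk (2 writes), 15 drives.
seq_write = per_drive_iops(6618.2, 16384, 15, replication=3, writes_per_io=2)
print(f"sequential write: ~{seq_write:.0f} IOPS per drive")
```

With journals on SSDs, writes_per_io for the spinning disks would drop to 1 under the same assumptions.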

fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec


In this case:

11087KB/s * 1024 (KB->bytes) / 16384 (read size in bytes) / 15 drives = ~46 IOPS / drive

Definitely not great! You might want to try fiddling with read ahead both on the CephFS client and on the block devices under the OSDs themselves.
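
As a starting point for the block device side, here is a small Python sketch that lists the current read ahead setting of each block device through the Linux sysfs file queue/read_ahead_kb. The function name is made up for illustration, it only reads values (it changes nothing), and it does not cover the CephFS client side.

```python
from pathlib import Path


def read_ahead_settings(sys_block="/sys/block"):
    """Return {device_name: read_ahead_kb} for each block device found.

    Returns an empty dict if the sysfs directory does not exist.
    """
    root = Path(sys_block)
    if not root.is_dir():
        return {}
    settings = {}
    for dev in sorted(root.iterdir()):
        ra_file = dev / "queue" / "read_ahead_kb"
        if ra_file.is_file():
            settings[dev.name] = int(ra_file.read_text().strip())
    return settings


for name, kb in read_ahead_settings().items():
    print(f"{name}: read_ahead_kb={kb}")
```

Run it on each OSD node, then compare the values for the OSD data disks before and after any tuning.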

One thing I did notice back during bobtail is that increasing the number of osd op threads seemed to help small object read performance. It might be worth looking at too.


http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread

Other than that, if you really want to dig into this, you can use tools like iostat, collectl, blktrace, and seekwatcher to try and get a feel for what the IO going to the OSDs looks like. That can help when diagnosing this sort of thing.

fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec

6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive

I am mostly surprised by the seq write performance compared to the raw SATA disk performance (it can get 4127 IOPS when mounted with ext4). My cephfs only gets 1/10 the performance of the raw disk.

7200 RPM spinning disks typically top out at something like 150 IOPS (and some are lower). With 15 disks, to hit 4127 IOPS you were probably seeing some write coalescing effects (or if these were random reads, some benefit to read ahead).
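
To see where a figure like 150 IOPS comes from, here is a rough Python sketch that bounds random IOPS for one spinning disk by average seek time plus average rotational latency (half a revolution). The two seek times are assumptions picked for illustration, not measurements of these drives, and the model ignores transfer time, caching, and command queueing.

```python
def random_iops_ceiling(rpm, avg_seek_ms):
    """Rough upper bound on random IOPS for one spinning disk."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a revolution on average
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000 / service_time_ms


# Assumed seek times: 2.5 ms (short seeks) and 8.5 ms (full-stroke average).
for seek_ms in (2.5, 8.5):
    iops = random_iops_ceiling(7200, seek_ms)
    print(f"7200 RPM, {seek_ms} ms seek: ~{iops:.0f} IOPS")
```

Under these assumptions a single disk lands somewhere between roughly 79 and 150 random IOPS, so a result in the thousands almost certainly involves coalescing, caching, or read ahead.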

How can I tune my cluster to improve the sequential write/random read/random write performance?

I don't know what kind of controller you have, but in cases where journals are on the same disks as the data, using writeback cache helps a lot because the controller can coalesce the direct IO journal writes in cache and just do big periodic dumps to the drives. That really reduces seek overhead for the writes. Using SSDs for the journals accomplishes much of the same effect, and lets you get faster large IO writes too, but in many chassis there is a density (and cost) trade-off.


Hope this helps!

Mark






_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor write/random read/random write performance
