OK, I ran the two tests again with direct=1, a smaller block size (4k) and a smaller total I/O size (100m), and disabled the rbd cache on the client side in ceph.conf by adding:

[client]
rbd cache = false
rbd cache max dirty = 0
rbd cache size = 0
rbd cache target dirty = 0

The results seem to have swapped around: the librbd job now runs ~50% faster than the krbd job!

####### krbd job:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
    clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
    lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
    clat percentiles (usec):
    | 1.00th=[ 2896], 5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
    | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
    | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
    | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
    | 99.99th=[19328]
    bw (KB /s): min= 506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
    lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
  cpu : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued : total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
  latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s, mint=162033msec, maxt=162033msec

Disk stats (read/write):
  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745, util=99.11%
[root@rcprsdc1r72-01-ac rafaell]#

###### librbd job:

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
    slat (usec): min=70, max=992, avg=115.05, stdev=30.07
    clat (msec): min=13, max=117, avg=67.91, stdev=24.93
    lat (msec): min=13, max=117, avg=68.03, stdev=24.93
    clat percentiles (msec):
    | 1.00th=[ 19], 5.00th=[ 26], 10.00th=[ 38], 20.00th=[ 40],
    | 30.00th=[ 46], 40.00th=[ 62], 50.00th=[ 77], 60.00th=[ 85],
    | 70.00th=[ 88], 80.00th=[ 91], 90.00th=[ 95], 95.00th=[ 99],
    | 99.00th=[ 105], 99.50th=[ 110], 99.90th=[ 116], 99.95th=[ 117],
    | 99.99th=[ 118]
    bw (KB /s): min= 565, max= 3174, per=100.00%, avg=935.74, stdev=407.67
    lat (msec) : 20=2.41%, 50=29.85%, 100=64.46%, 250=3.29%
  cpu : usr=2.43%, sys=0.29%, ctx=7847, majf=0, minf=2750
  IO depths : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
  submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=94.1%, 8=0.0%, 16=5.9%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued : total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
  latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=927KB/s, minb=927KB/s, maxb=927KB/s, mint=110360msec, maxt=110360msec

Disk stats (read/write):
  dm-1: ios=240/369, merge=0/0, ticks=742/40, in_queue=782, util=0.38%, aggrios=240/379, aggrmerge=0/19, aggrticks=742/41, aggrin_queue=783, aggrutil=0.39%
  sda: ios=240/379, merge=0/19, ticks=742/41, in_queue=783, util=0.39%
[root@rcprsdc1r72-01-ac rafaell]#

Confirmed the speed (at least for krbd) using dd:

[root@rcprsdc1r72-01-ac rafaell]# dd if=/mnt/ssd/random100g of=/mnt/rbd/dd_io_test bs=4k count=10000 oflag=direct
10000+0 records in
10000+0 records out
40960000 bytes (41 MB) copied, 64.9799 s, 630 kB/s
[root@rcprsdc1r72-01-ac rafaell]#

Back to FIO: the gap is even bigger at 1M block size, where librbd is more than 2x faster than krbd.

1M librbd:

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=112641KB/s, minb=112641KB/s, maxb=112641KB/s, mint=9309msec, maxt=9309msec

1M krbd:

Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=49939KB/s, minb=49939KB/s, maxb=49939KB/s, mint=20997msec, maxt=20997msec

Raf

On 11 September 2015 at 14:33, Somnath Roy <somnath....@sandisk.com> wrote:

> Only changing client side ceph.conf and rerunning the tests is sufficient.
>
> Thanks & Regards
> Somnath
>
> *From:* Rafael Lopez [mailto:rafael.lo...@monash.edu]
> *Sent:* Thursday, September 10, 2015 8:58 PM
> *To:* Somnath Roy
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] bad perf for librbd vs krbd using FIO
>
> Thanks for the quick reply Somnath, will give this a try.
>
> In order to set the rbd cache settings, is it a matter of updating the
> ceph.conf file on the client only prior to running the test, or do I need
> to inject args to all OSDs ?
>
> Raf
>
> On 11 September 2015 at 13:39, Somnath Roy <somnath....@sandisk.com>
> wrote:
>
> It may be due to rbd cache effect..
> Try the following..
>
> Run your test with direct = 1 both the cases and rbd_cache = false
> (disable all other rbd cache option as well). This should give you similar
> result like krbd.
>
> In direct =1 case, we saw ~10-20% degradation if we make rbd_cache = true.
> But, direct = 0 case, it could be more as you are seeing..
>
> I think there is a delta (or need to tune properly) if you want to use rbd
> cache.
>
> Thanks & Regards
> Somnath
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Rafael Lopez
> *Sent:* Thursday, September 10, 2015 8:24 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] bad perf for librbd vs krbd using FIO
>
> Hi all,
>
> I am seeing a big discrepancy between librbd and kRBD/ext4 performance
> using FIO with single RBD image.
> RBD images are coming from same RBD pool,
> same size and settings for both. The librbd results are quite bad by
> comparison, and in addition if I scale up the kRBD FIO job with more
> jobs/threads it increases up to 3-4x results below, but librbd doesn't seem
> to scale much at all. I figured that it should be close to the kRBD result
> for a single job/thread before parallelism comes into play though. RBD
> cache settings are all default.
>
> I can see some obvious differences in FIO output, but not being well
> versed with FIO I'm not sure what to make of it or where to start
> diagnosing the discrepancy. Hunted around but haven't found anything
> useful, any suggestions/insights would be appreciated.
>
> RBD cache settings:
>
> [root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon /var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
> "rbd_cache": "true",
> "rbd_cache_writethrough_until_flush": "true",
> "rbd_cache_size": "33554432",
> "rbd_cache_max_dirty": "25165824",
> "rbd_cache_target_dirty": "16777216",
> "rbd_cache_max_dirty_age": "1",
> "rbd_cache_max_dirty_object": "0",
> "rbd_cache_block_writes_upfront": "false",
> [root@rcmktdc1r72-09-ac rafaell]#
>
> This is the FIO job file for the kRBD job:
>
> [root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
> ; -- start job file --
> [global]
> rw=rw
> size=100g
> filename=/mnt/rbd/fio_test_file_ext4
> rwmixread=0
> rwmixwrite=100
> percentage_random=0
> bs=1024k
> direct=0
> iodepth=16
> thread=1
> numjobs=1
> [job1]
> ; -- end job file --
>
> [root@rcprsdc1r72-01-ac rafaell]#
>
> This is the FIO job file for the librbd job:
>
> [root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
> ; -- start job file --
> [global]
> rw=rw
> size=100g
> rwmixread=0
> rwmixwrite=100
> percentage_random=0
> bs=1024k
> direct=0
> iodepth=16
> thread=1
> numjobs=1
> ioengine=rbd
> rbdname=nas1-rds-stg31
> pool=rbd
> [job1]
> ; -- end job file --
>
> Here are the results:
>
> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
> job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
> fio-2.2.8
> Starting 1 thread
> job1: Laying out IO file(s) (1 file(s) / 102400MB)
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops] [eta 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 2015
>   write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
>     clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
>     lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
>     clat percentiles (usec):
>     | 1.00th=[ 446], 5.00th=[ 458], 10.00th=[ 474], 20.00th=[ 510],
>     | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
>     | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
>     | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
>     | 99.99th=[464896]
>     bw (KB /s): min= 264, max=2156544, per=100.00%, avg=412986.27, stdev=375092.66
>     lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
>     lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
>     lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
>   cpu : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>   complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>   issued : total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>   latency : target=0, window=0, percentile=100.00%, depth=16
>
> Run status group 0 (all jobs):
>   WRITE: io=102400MB, aggrb=399740KB/s, minb=399740KB/s, maxb=399740KB/s, mint=262314msec, maxt=262314msec
>
> Disk stats (read/write):
>   rbd0: ios=0/150890, merge=0/49, ticks=0/36117700, in_queue=36145277, util=96.97%
> [root@rcprsdc1r72-01-ac rafaell]#
>
> [root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
> job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=rbd, iodepth=16
> fio-2.2.8
> Starting 1 thread
> rbd engine: RBD version: 0.1.9
> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/65405KB/0KB /s] [0/63/0 iops] [eta 00m:00s]
> job1: (groupid=0, jobs=1): err= 0: pid=43960: Fri Sep 11 12:54:25 2015
>   write: io=102400MB, bw=121882KB/s, iops=119, runt=860318msec
>     slat (usec): min=355, max=7300, avg=908.97, stdev=361.02
>     clat (msec): min=11, max=1468, avg=129.59, stdev=130.68
>     lat (msec): min=12, max=1468, avg=130.50, stdev=130.69
>     clat percentiles (msec):
>     | 1.00th=[ 21], 5.00th=[ 26], 10.00th=[ 29], 20.00th=[ 34],
>     | 30.00th=[ 37], 40.00th=[ 40], 50.00th=[ 44], 60.00th=[ 63],
>     | 70.00th=[ 233], 80.00th=[ 241], 90.00th=[ 269], 95.00th=[ 367],
>     | 99.00th=[ 553], 99.50th=[ 652], 99.90th=[ 832], 99.95th=[ 848],
>     | 99.99th=[ 1369]
>     bw (KB /s): min=20363, max=248543, per=100.00%, avg=124381.19, stdev=42313.29
>     lat (msec) : 20=0.95%, 50=55.27%, 100=5.55%, 250=24.83%, 500=12.28%
>     lat (msec) : 750=0.89%, 1000=0.21%, 2000=0.01%
>   cpu : usr=9.58%, sys=1.15%, ctx=23883, majf=0, minf=2751023
>   IO depths : 1=1.2%, 2=3.0%, 4=9.7%, 8=68.3%, 16=17.8%, 32=0.0%, >=64=0.0%
>   submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>   complete : 0=0.0%, 4=92.5%, 8=4.3%, 16=3.2%, 32=0.0%, 64=0.0%, >=64=0.0%
>   issued : total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>   latency : target=0, window=0, percentile=100.00%, depth=16
>
> Run status group 0 (all jobs):
>   WRITE: io=102400MB, aggrb=121882KB/s, minb=121882KB/s, maxb=121882KB/s, mint=860318msec, maxt=860318msec
>
> Disk stats (read/write):
>   dm-1: ios=0/2072, merge=0/0, ticks=0/233, in_queue=233, util=0.01%, aggrios=1/2249, aggrmerge=7/559, aggrticks=9/254, aggrin_queue=261, aggrutil=0.01%
>   sda: ios=1/2249, merge=7/559, ticks=9/254, in_queue=261, util=0.01%
> [root@rcprsdc1r72-01-ac rafaell]#
>
> Cheers,
> Raf
>
> --
> Rafael Lopez
> Data Storage Administrator
> Servers & Storage (eSolutions)

--
Rafael Lopez
Data Storage Administrator
Servers & Storage (eSolutions)
+61 3 990 59118
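
As a sanity check on the percentages quoted at the top of this message, here is a minimal Python sketch of the arithmetic. The figures are copied from the fio and dd output in this thread; the helper names (`percent_faster`, `summarize`, `dd_rate_kb_s`) and the run labels are made up for illustration and are not part of fio or Ceph.

```python
# Throughput figures copied from the fio summaries in this thread.
RESULTS = {
    "4k rerun, direct=1, rbd cache off (IOPS)": {"krbd": 157, "librbd": 231},
    "1M rerun (KB/s)": {"krbd": 49939, "librbd": 112641},
    "1M original run, direct=0, default cache (KB/s)": {"krbd": 399740, "librbd": 121882},
}


def percent_faster(fast, slow):
    """Return how much larger `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


def summarize(results):
    """Build one line per run naming the faster client and its lead."""
    lines = []
    for name, rates in results.items():
        winner = max(rates, key=rates.get)
        loser = min(rates, key=rates.get)
        gain = percent_faster(rates[winner], rates[loser])
        lines.append(f"{name}: {winner} ahead of {loser} by {gain:.0f}%")
    return lines


def dd_rate_kb_s(total_bytes, seconds):
    """Average dd throughput in kB/s (1 kB = 1000 bytes, as dd reports it)."""
    return total_bytes / seconds / 1000.0


if __name__ == "__main__":
    for line in summarize(RESULTS):
        print(line)
    # Cross-check of the krbd dd run: 40960000 bytes in 64.9799 s.
    print(f"dd on krbd: {dd_rate_kb_s(40960000, 64.9799):.0f} kB/s")
```

On these numbers the 4k lead for librbd comes out near the "~50%" quoted above, and the dd rate matches the 630 kB/s that dd itself printed, which is also in line with the 631KB/s aggregate fio reported for krbd.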
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com