Hi all,

We are seeing poor IO performance for block sizes < 64k on our new Bluestore test cluster.

The context:
- Test cluster set up alongside the production one
- Fresh install of Luminous
- choice of Bluestore (coming from Filestore)
- Default config (including wpq queuing)
- 6 SAS12 nodes (14 OSDs, 2 SSDs and 2 x 10Gb each), far more bandwidth at each switch uplink...
- R3 pool, 2 nodes per site
- separate DB (25GB) and WAL (600MB) partitions on SSD for each OSD, so that each kind of IO can be observed with iostat
- RBD client benchmark: fio --ioengine=libaio --iodepth=128 --direct=1 (full command sketched after this list)
- RBD client mapping: rbd map rbd/test_rbd -o queue_depth=1024
- Just to point out: this is not a thread about SSD performance or the ratio of SSDs to OSDs. These 12Gb SAS 10DWPD SSDs perform perfectly, with lots of headroom, on the production cluster, even with XFS Filestore and journals on SSD. This thread is about a possible bottleneck on small block sizes with RocksDB/WAL/Bluestore.
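
For reference, the full fio command line looks roughly like this (a sketch only: the device path assumes the default udev symlink for the mapped image, the --rw mode shown is illustrative, bs was varied from 4k to 64k, ~30s per run):

  # run against the mapped RBD device
  fio --name=bstest --filename=/dev/rbd/rbd/test_rbd \
      --ioengine=libaio --iodepth=128 --direct=1 \
      --rw=write --bs=64k --runtime=30 --time_based \
      --numjobs=1 --group_reporting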

To begin with, Bluestore performance is really breathtaking compared to Filestore/XFS: we saturate the client's 20Gb bandwidth on this small test cluster as soon as the IO block size reaches 64k, something we could not achieve with Filestore and journals, even at 256k.

The downside: all small IO block sizes (4k, 8k, 16k, 32k) are considerably slower and appear to be capped.

Just to compare, here are the observed latencies at two consecutive block sizes, 64k and 32k:
64k :
 write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
    lat (msec): min=2, max=867, avg=17.29, stdev=32.31

32k :
 write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
    lat (msec): min=1, max=5111, avg=78.81, stdev=430.50

Whereas the 64k run nearly fills the 20Gb client connection, the 32k run gets a mere 1/10th of the bandwidth, and IO latencies are multiplied by ~4.5 (17.29ms -> 78.81ms average, as if each IO picked up a ~60ms pause?...).

And we see the same constant average latency at 16k, 8k and 4k:
16k :
 write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
    lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08

8k :
 write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
    lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61

4k :
 write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
    lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29

To compare with Filestore: from 4k IO results I have on hand from the previous install, we were getting almost 2x the Bluestore numbers on the exact same cluster:
WRITE: io=1221.4MB, aggrb=41477KB/s, maxt=30152msec

The thing is, during these small block size fio benchmarks, neither the nodes' CPUs, the OSDs, the SSDs, nor of course the network are saturated anywhere (i.e. I think this has nothing to do with write amplification), yet client IOPS starve at low values.
Shouldn't Bluestore IOPS be far higher than Filestore on small IOs too?

To summarize, here is what we can observe:

  blocksize   bandwidth    IOPS     avg latency
  64k         1849 MB/s    29586    17.3 ms
  32k          203 MB/s     6488    78.8 ms
  16k          100 MB/s     6406    79.9 ms
  8k            51 MB/s     6575    77.8 ms
  4k            26 MB/s     6696    76.4 ms

IOPS hit a ceiling around ~6500 as soon as the block size drops below 64k, with a near-constant ~77-80ms average latency.

Looking for counters, I found values in "perf dump" that increment during the slow IO benchmarks; here they are for one run of 4k fio:
       "deferred_write_ops": 7631,
       "deferred_write_bytes": 31457280,

Does this mean that throttling or some other QoS mechanism may be the cause, and that default config values may be artificially limiting small IO performance on our architecture? And does anyone have an idea how to circumvent it?
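
In case it helps, here is the kind of check/experiment we could run; the option names below are our guess at what controls deferred writes and throttling, and the value 0 is an untested idea (it may also require an OSD restart to take effect):

  # dump the deferred-write and BlueStore throttle settings of one OSD
  ceph daemon osd.0 config show | grep -E 'prefer_deferred|bluestore_throttle'

  # possible experiment: disable the deferred path on HDD-backed OSDs
  ceph tell osd.* injectargs '--bluestore_prefer_deferred_size_hdd 0'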

The OSD Config Reference documentation may touch on these aspects in the QoS/mClock/Caveats section, but I'm not sure I understand the whole picture.

Could someone help?

Thanks
Frederic


