Hi all,

We are seeing poor IO performance for block sizes < 64k on our new Bluestore test cluster.

The context:
- Test cluster set up alongside the production one
- Fresh install of Luminous
- choice of Bluestore (coming from Filestore)
- Default config (including wpq queuing)
- 6 SAS12 nodes (14 OSDs, 2 SSDs and 2 x 10Gb each), far more bandwidth at each switch uplink...
- R3 pool, 2 nodes per site
- separate DB (25GB) and WAL (600MB) partitions on SSD for each OSD, so that each kind of IO can be observed with iostat
- RBD client benchmark: fio --ioengine=libaio --iodepth=128 --direct=1 (full command sketched after this list)
- RBD client mapping: rbd map rbd/test_rbd -o queue_depth=1024
- Just to point out: this is not a thread about SSD performance or the ratio of SSDs to OSDs. These 12Gb SAS 10DWPD SSDs perform perfectly, with lots of headroom, on the production cluster, even with XFS Filestore and journals on SSD. This thread is about a possible bottleneck on small block sizes with RocksDB/WAL/Bluestore.
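
For reference, the full fio command line looks roughly like this (a sketch only: the device path assumes the default udev symlink for the mapped image, the --rw mode shown is illustrative, bs was varied from 4k to 64k, ~30s per run):

  # run against the mapped RBD device
  fio --name=bstest --filename=/dev/rbd/rbd/test_rbd \
      --ioengine=libaio --iodepth=128 --direct=1 \
      --rw=write --bs=64k --runtime=30 --time_based \
      --numjobs=1 --group_reporting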

To begin with, Bluestore performance is really breathtaking compared to Filestore/XFS: we saturate the client's 20Gb bandwidth on this small test cluster as soon as the IO block size reaches 64k, something we could not achieve with Filestore and journals, even at 256k.

The downside: all small IO block sizes (4k, 8k, 16k, 32k) are considerably slower and appear to be capped.

Just to compare, here are the observed latencies at two consecutive block sizes, 64k and 32k:
64k :
 write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
    lat (msec): min=2, max=867, avg=17.29, stdev=32.31

32k :
 write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
    lat (msec): min=1, max=5111, avg=78.81, stdev=430.50

Whereas the 64k run nearly fills the 20Gb client connection, the 32k run gets a mere 1/10th of the bandwidth, and IO latencies are multiplied by ~4.5 (17.29ms -> 78.81ms average, as if each IO picked up a ~60ms pause?...).

And we see the same constant average latency at 16k, 8k and 4k:
16k :
 write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
    lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08

8k :
 write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
    lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61

4k :
 write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
    lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29

To compare with Filestore: from 4k IO results I have on hand from the previous install, we were getting almost 2x the Bluestore numbers on the exact same cluster:
WRITE: io=1221.4MB, aggrb=41477KB/s, maxt=30152msec

The thing is, during these small block size fio benchmarks, neither the nodes' CPUs, the OSDs, the SSDs, nor of course the network are saturated anywhere (i.e. I think this has nothing to do with write amplification), yet client IOPS starve at low values.
Shouldn't Bluestore IOPS be far higher than Filestore on small IOs too?

To summarize, here is what we can observe:

  blocksize   bandwidth    IOPS     avg latency
  64k         1849 MB/s    29586    17.3 ms
  32k          203 MB/s     6488    78.8 ms
  16k          100 MB/s     6406    79.9 ms
  8k            51 MB/s     6575    77.8 ms
  4k            26 MB/s     6696    76.4 ms

IOPS hit a ceiling around ~6500 as soon as the block size drops below 64k, with a near-constant ~77-80ms average latency.

Looking for counters, I found values in "perf dump" that increment during the slow IO benchmarks; here they are for one run of 4k fio:
       "deferred_write_ops": 7631,
       "deferred_write_bytes": 31457280,

Does this mean that throttling or some other QoS mechanism may be the cause, and that default config values may be artificially limiting small IO performance on our architecture? And does anyone have an idea how to circumvent it?
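
In case it helps, here is the kind of check/experiment we could run; the option names below are our guess at what controls deferred writes and throttling, and the value 0 is an untested idea (it may also require an OSD restart to take effect):

  # dump the deferred-write and BlueStore throttle settings of one OSD
  ceph daemon osd.0 config show | grep -E 'prefer_deferred|bluestore_throttle'

  # possible experiment: disable the deferred path on HDD-backed OSDs
  ceph tell osd.* injectargs '--bluestore_prefer_deferred_size_hdd 0'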

The OSD Config Reference documentation may touch on these aspects in the QoS/mClock/Caveats section, but I'm not sure I understand the whole picture.

Could someone help?

Thanks
Frederic


