On 2018-03-21 19:50, Frederic BRET wrote:

> Hi all,
> 
> The context:
> - Test cluster aside production one
> - Fresh install on Luminous
> - choice of Bluestore (coming from Filestore)
> - Default config (including wpq queuing)
> - 6 nodes: SAS12, 14 OSDs, 2 SSDs, 2 x 10Gb networking; far more bandwidth 
> at each switch uplink...
> - R3 pool, 2 nodes per site
> - separate db (25GB) and wal (600MB) partitions on SSD for each OSD to be 
> able to observe each kind of IO with iostat
> - RBD client: fio --ioengine=libaio --iodepth=128 --direct=1 (full 
> invocation sketched below)
> - RBD mapping: rbd map rbd/test_rbd -o queue_depth=1024
> - Just to point out, this is not a thread about SSD performance or the 
> match between SSDs and number of OSDs. These 12Gb SAS, 10 DWPD SSDs perform 
> perfectly, with lots of headroom, on the production cluster, even with XFS 
> filestore and journals on SSDs. 
> - This thread is about a possible bottleneck on small block sizes with 
> rocksdb/wal/Bluestore.
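> 
> For reference, a sketch of the full fio invocation we run per test (the 
> write pattern, block size, runtime, job name and target device below are 
> illustrative assumptions; only the engine, iodepth and direct flags are 
> the exact ones listed above):
> 
> fio --ioengine=libaio --iodepth=128 --direct=1 --rw=randwrite \
>     --bs=4k --runtime=30 --time_based --name=test --filename=/dev/rbd0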
> 
> To begin with, Bluestore performance is really breathtaking compared to 
> filestore/XFS: we saturate the 20Gb client bandwidth on this small test 
> cluster as soon as the IO block size reaches 64k, something we couldn't 
> achieve with Filestore and journals, even at 256k.
> 
> The downside: all small IO block sizes (4k, 8k, 16k, 32k) are considerably 
> slower and appear somewhat capped.
> 
> Just to compare, here are the observed latencies at two consecutive block 
> sizes, 64k and 32k:
> 64k:
> write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
> lat (msec): min=2, max=867, avg=17.29, stdev=32.31
> 
> 32k:
> write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
> lat (msec): min=1, max=5111, avg=78.81, stdev=430.50
> 
> Whereas the 64k run almost fills the 20Gb client connection, the 32k run 
> gets a mere 1/10th of the bandwidth, and IO latencies are multiplied 
> by 4.5 (or incur a ~60ms pause? ...)
> 
> And we see the same constant latency at 16k, 8k and 4k:
> 16k:
> write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
> lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08
> 
> 8k:
> write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
> lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61
> 
> 4k:
> write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
> lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29
> 
> To compare with filestore: on the 4k IO results I have on hand from the 
> previous install, we were getting almost 2x the Bluestore numbers on the 
> exact same cluster:
> WRITE: io=1221.4MB, aggrb=41477KB/s, maxt=30152msec
> 
> The thing is, during these small-block-size fio benchmarks, neither node 
> CPU, OSD, SSD nor of course the network is saturated (i.e. I think this has 
> nothing to do with write amplification); nevertheless, client IOPS starve 
> at low values.
> Shouldn't Bluestore IOPS be far higher than Filestore's on small IOs too?
> 
> To summarize, here is what we can observe:
> 
> Looking for counters, I found incrementing values in "perf dump" during 
> the slow IO benchmarks; here for one run of 4k fio:
> "deferred_write_ops": 7631,
> "deferred_write_bytes": 31457280,
> 
> Does this mean that throttling or some other QoS mechanism may be the 
> cause, and that default config values may be artificially limiting small 
> IO performance on our architecture? And does anyone have an idea on how 
> to circumvent it?
> 
> The OSD Config Reference documentation may be covering these aspects in 
> the QoS/mClock/Caveats section, but I'm not sure I understand the whole 
> picture. 
> 
> Could someone help ?
> 
> Thanks
> Frederic 
> 

Hi Frederic, 

I too hope someone from the ceph team will answer this. I believe some
people do see this behavior. 

In the meantime, I would suggest gathering further data: 

1) What are the raw disk IOPS and disk utilization (%busy) on your HDDs?
You do show the SSDs (2800-4000 IOPS), but it is likely the HDD
IOPS/utilization that could be the issue.
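
For example, something like this on each node while the 4k fio is running,
watching the w/s and %util columns for the OSD data disks:

iostat -x 5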

2) Can you try setting 
bluestore_prefer_deferred_size_hdd = 0
(in effect disabling the deferred writes mechanism) and see if this
helps. 
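
A sketch of how to try this at runtime across all OSDs (a ceph.conf change
under [osd] followed by an OSD restart works too):

ceph tell osd.* injectargs '--bluestore_prefer_deferred_size_hdd 0'

If I remember correctly, the Luminous default for this option is 32768
(32k), which would line up with the cliff you see between 64k and 32k.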

3) If you have a controller with a write-back cache, can you enable it? 

Again, I hope someone from the ceph team will weigh in on this. 

Maged
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
