Hi,

It's a bit odd that you benchmark 1024-byte writes -- or is that your
realistic use case?
That is smaller than the minimum allocation unit even for SSDs, so each
update needs a read/modify/write cycle, which slows things down substantially.
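
As a quick sanity check, you can query the configured allocation unit on an
OSD node (this shows the config value only; the on-disk unit is fixed when
the OSD is created):

# ceph daemon osd.<id> config get bluestore_min_alloc_size_ssd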

Anyway, since you didn't mention it, have you disabled the write cache
on your drives? See
https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches
for the latest related docs.
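
If they are enabled, those docs show how to check and turn them off per
drive, along these lines (device names are examples):

# hdparm -W /dev/sdX           # SATA: show write-cache state
# hdparm -W 0 /dev/sdX         # SATA: disable it
# sdparm --get=WCE /dev/sdX    # SAS: show write-cache state
# sdparm --clear=WCE /dev/sdX  # SAS: disable it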

-- Dan

On Mon, Dec 6, 2021 at 5:28 PM <c...@komadev.de> wrote:
>
> Dear List,
>
> Until we upgraded our cluster three weeks ago, we had a small, high-performing
> production Ceph cluster running Nautilus 14.2.22 on Proxmox 6.4 (kernel
> 5.4-143 at the time). Then we started the upgrade to Octopus 15.2.15. Since
> we did an online upgrade, we disabled the automatic conversion with
>
> ceph config set osd bluestore_fsck_quick_fix_on_mount false
>
> but followed up with the OMAP conversion after the complete upgrade, step by
> step, by restarting one OSD after the other.
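>
> The per-OSD conversion was done roughly like this (a sketch; <id> stands for
> each OSD in turn, waiting after each restart for the OSD to convert and
> rejoin before moving on):
>
> # ceph config set osd.<id> bluestore_fsck_quick_fix_on_mount true
> # systemctl restart ceph-osd@<id>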
>
> Our setup is:
> 5 x storage node, each: 16 x 2.3 GHz, 64 GB RAM, 1 x SSD OSD 1.6 TB, 1 x 7.68 TB
> (both WD Enterprise, SAS-12), 3 x HDD OSD (10 TB, SAS-12, with Optane cache)
> 4 x compute node
> 40 GE storage network (Mellanox switch + Mellanox CX354 40GE dual-port cards,
> Linux OSS drivers)
> 10 GE cluster/mgmt network
>
> Our performance before the upgrade, Ceph 14.2.22 (about 36k IOPS on the SSD
> pool):
>
> ### SSD Pool on 40GE Switches
> # rados bench -p SSD 30 -t 256 -b 1024 write
> hints = 1
> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for 
> up to 30 seconds or 0 objects
> ...
> Total time run:         30.004
> Total writes made:      1094320
> Write size:             1024
> Object size:            1024
> Bandwidth (MB/sec):     35.6177
> Stddev Bandwidth:       4.71909
> Max bandwidth (MB/sec): 40.7314
> Min bandwidth (MB/sec): 21.3037
> Average IOPS:           36472
> Stddev IOPS:            4832.35
> Max IOPS:               41709
> Min IOPS:               21815
> Average Latency(s):     0.00701759
> Stddev Latency(s):      0.00854068
> Max latency(s):         0.445397
> Min latency(s):         0.000909089
> Cleaning up (deleting benchmark objects)
>
> Our performance after the upgrade, Ceph 15.2.15 (drops to at most 17k IOPS on
> the SSD pool):
> # rados bench -p SSD 30 -t 256 -b 1024 write
> hints = 1
> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for 
> up to 30 seconds or 0 objects
> ...
> Total time run:         30.0146
> Total writes made:      468513
> Write size:             1024
> Object size:            1024
> Bandwidth (MB/sec):     15.2437
> Stddev Bandwidth:       0.78677
> Max bandwidth (MB/sec): 16.835
> Min bandwidth (MB/sec): 13.3184
> Average IOPS:           15609
> Stddev IOPS:            805.652
> Max IOPS:               17239
> Min IOPS:               13638
> Average Latency(s):     0.016396
> Stddev Latency(s):      0.00777054
> Max latency(s):         0.140793
> Min latency(s):         0.00106735
> Cleaning up (deleting benchmark objects)
> Note: OSD.17 is out on purpose.
> # ceph osd tree
> ID   CLASS  WEIGHT     TYPE NAME            STATUS  REWEIGHT  PRI-AFF
>  -1         208.94525  root default
>  -3          41.43977      host xx-ceph01
>   0    hdd    9.17380          osd.0            up   1.00000  1.00000
>   5    hdd    9.17380          osd.5            up   1.00000  1.00000
>  23    hdd   14.65039          osd.23           up   1.00000  1.00000
>   7    ssd    1.45549          osd.7            up   1.00000  1.00000
>  15    ssd    6.98630          osd.15           up   1.00000  1.00000
>  -5          41.43977      host xx-ceph02
>   1    hdd    9.17380          osd.1            up   1.00000  1.00000
>   4    hdd    9.17380          osd.4            up   1.00000  1.00000
>  24    hdd   14.65039          osd.24           up   1.00000  1.00000
>   9    ssd    1.45549          osd.9            up   1.00000  1.00000
>  20    ssd    6.98630          osd.20           up   1.00000  1.00000
>  -7          41.43977      host xx-ceph03
>   2    hdd    9.17380          osd.2            up   1.00000  1.00000
>   3    hdd    9.17380          osd.3            up   1.00000  1.00000
>  25    hdd   14.65039          osd.25           up   1.00000  1.00000
>   8    ssd    1.45549          osd.8            up   1.00000  1.00000
>  21    ssd    6.98630          osd.21           up   1.00000  1.00000
> -17          41.43977      host xx-ceph04
>  10    hdd    9.17380          osd.10           up   1.00000  1.00000
>  11    hdd    9.17380          osd.11           up   1.00000  1.00000
>  26    hdd   14.65039          osd.26           up   1.00000  1.00000
>   6    ssd    1.45549          osd.6            up   1.00000  1.00000
>  22    ssd    6.98630          osd.22           up   1.00000  1.00000
> -21          43.18616      host xx-ceph05
>  13    hdd    9.17380          osd.13           up   1.00000  1.00000
>  14    hdd    9.17380          osd.14           up   1.00000  1.00000
>  27    hdd   14.65039          osd.27           up   1.00000  1.00000
>  12    ssd    1.45540          osd.12           up   1.00000  1.00000
>  16    ssd    1.74660          osd.16           up   1.00000  1.00000
>  17    ssd    3.49309          osd.17           up         0  1.00000
>  18    ssd    1.74660          osd.18           up   1.00000  1.00000
>  19    ssd    1.74649          osd.19           up   1.00000  1.00000
>
> # ceph osd df
> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>  0    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   28 MiB  5.0 GiB  6.6 TiB  27.56  0.96   88      up
>  5    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   57 MiB  5.1 GiB  6.6 TiB  27.89  0.98   89      up
> 23    hdd  14.65039   1.00000   15 TiB  3.9 TiB  3.8 TiB   40 MiB  7.2 GiB   11 TiB  26.69  0.93  137      up
>  7    ssd   1.45549   1.00000  1.5 TiB  634 GiB  633 GiB   33 MiB  1.8 GiB  856 GiB  42.57  1.49   64      up
> 15    ssd   6.98630   1.00000  7.0 TiB  2.6 TiB  2.6 TiB  118 MiB  5.9 GiB  4.4 TiB  37.70  1.32  272      up
>  1    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   31 MiB  4.7 GiB  6.8 TiB  26.04  0.91   83      up
>  4    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   28 MiB  5.2 GiB  6.6 TiB  28.51  1.00   91      up
> 24    hdd  14.65039   1.00000   15 TiB  4.0 TiB  3.9 TiB   38 MiB  7.2 GiB   11 TiB  27.06  0.95  139      up
>  9    ssd   1.45549   1.00000  1.5 TiB  583 GiB  582 GiB   30 MiB  1.6 GiB  907 GiB  39.13  1.37   59      up
> 20    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   81 MiB  7.4 GiB  4.5 TiB  35.45  1.24  260      up
>  2    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   26 MiB  4.8 GiB  6.8 TiB  26.01  0.91   83      up
>  3    hdd   9.17380   1.00000  9.2 TiB  2.7 TiB  2.6 TiB   29 MiB  5.4 GiB  6.5 TiB  29.38  1.03   94      up
> 25    hdd  14.65039   1.00000   15 TiB  4.2 TiB  4.1 TiB   41 MiB  7.7 GiB   10 TiB  28.79  1.01  149      up
>  8    ssd   1.45549   1.00000  1.5 TiB  637 GiB  635 GiB   34 MiB  1.7 GiB  854 GiB  42.71  1.49   65      up
> 21    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   96 MiB  7.5 GiB  4.5 TiB  35.49  1.24  260      up
> 10    hdd   9.17380   1.00000  9.2 TiB  2.2 TiB  2.1 TiB   26 MiB  4.5 GiB  7.0 TiB  24.21  0.85   77      up
> 11    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   30 MiB  5.0 GiB  6.7 TiB  27.24  0.95   87      up
> 26    hdd  14.65039   1.00000   15 TiB  3.6 TiB  3.5 TiB   37 MiB  6.6 GiB   11 TiB  24.64  0.86  127      up
>  6    ssd   1.45549   1.00000  1.5 TiB  572 GiB  570 GiB   29 MiB  1.5 GiB  918 GiB  38.38  1.34   57      up
> 22    ssd   6.98630   1.00000  7.0 TiB  2.3 TiB  2.3 TiB   77 MiB  7.0 GiB  4.7 TiB  33.23  1.16  243      up
> 13    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   25 MiB  4.8 GiB  6.8 TiB  26.07  0.91   84      up
> 14    hdd   9.17380   1.00000  9.2 TiB  2.3 TiB  2.2 TiB   54 MiB  4.6 GiB  6.9 TiB  25.13  0.88   80      up
> 27    hdd  14.65039   1.00000   15 TiB  3.7 TiB  3.6 TiB   54 MiB  6.9 GiB   11 TiB  25.55  0.89  131      up
> 12    ssd   1.45540   1.00000  1.5 TiB  619 GiB  617 GiB  163 MiB  2.3 GiB  871 GiB  41.53  1.45   63      up
> 16    ssd   1.74660   1.00000  1.7 TiB  671 GiB  669 GiB   23 MiB  2.2 GiB  1.1 TiB  37.51  1.31   69      up
> 17    ssd   3.49309         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0      up
> 18    ssd   1.74660   1.00000  1.7 TiB  512 GiB  509 GiB   18 MiB  2.3 GiB  1.2 TiB  28.62  1.00   52      up
> 19    ssd   1.74649   1.00000  1.7 TiB  709 GiB  707 GiB   64 MiB  2.0 GiB  1.1 TiB  39.64  1.39   72      up
>                         TOTAL  205 TiB   59 TiB   57 TiB  1.3 GiB  128 GiB  147 TiB  28.60
> MIN/MAX VAR: 0.85/1.49  STDDEV: 6.81
>
>
> What we have done so far, without success (example commands for the config
> changes follow this list):
>
> - reformatted two of the SSD OSDs (one was still from Luminous, non-LVM)
> - set bluestore_allocator from hybrid back to bitmap
> - set osd_memory_target to 6442450944 for some of the SSD OSDs
> - ran cpupower idle-set -D 11
> - set bluefs_buffered_io to true
> - disabled the default firewalls between Ceph nodes (for testing only)
> - disabled AppArmor
> - added memory (now 128 GB per node)
> - upgraded the OS, now running kernel 5.13.19-1
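>
> For reference, the config changes above were applied roughly like this (a
> sketch; per-OSD targets were set individually, osd.7 is an example):
>
> # ceph config set osd bluestore_allocator bitmap
> # ceph config set osd.7 osd_memory_target 6442450944
> # ceph config set osd bluefs_buffered_io true
> # cpupower idle-set -D 11    # per node, disables deep C-states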
>
> What we observe:
> - the HDD pool shows similar behaviour
> - load is higher since the update, apparently more CPU consumption (see
> graph1); the migration was on 10 November, around 10 pm
> - latency on the "big" 7 TB SSDs (e.g. OSD.15) is significantly higher than
> on the small 1.6 TB SSDs (e.g. OSD.12), see graph2, though that is
> presumably due to the higher weight
> - the load on OSD.15 is four times that on OSD.12, presumably also due to
> the higher weight
> - starting OSD.15 (the 7 TB SSD) is significantly slower (~10 s) than
> starting the 1.6 TB SSDs
> - increasing the block size in the benchmark to 4k, 8k or even 16k increases
> the throughput while the IOPS stay more or less stable; at 32k the drop is
> minimal, to ~14k IOPS on average (see the example run after this list)
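>
> For example, only the block size changes between runs; a 4 KiB run looks
> like this (same pool and queue depth as above):
>
> # rados bench -p SSD 30 -t 256 -b 4096 write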
>
> We have already checked the Proxmox list without finding a remedy, and we
> are a bit at a loss. Any suggestions, and/or has anyone else had similar
> experiences?
>
> We are a bit hesitant to upgrade to Pacific, given the current situation.
>
> Thanks,
>
> Kai
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io