Hi Dan, Josh,

thanks for the input. We tried bluefs_buffered_io with both true and false; no real 
difference to be seen (hard to say in a production cluster, maybe a few percent).
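
For reference, this is the kind of toggle we mean (set cluster-wide for the OSDs; exact 
runtime behaviour may depend on the release):

ceph config set osd bluefs_buffered_io true
ceph config set osd bluefs_buffered_io false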

We now disabled the write cache on our SSDs and see a noticeable ("felt") increase in 
performance, up to 17k IOPS with 4k blocks, but still far from the original values. 
Anyway, thanks! ;-)
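
In case it helps others, disabling the volatile write cache can be done roughly like 
this (device names are just examples; sdparm for the SAS drives, hdparm for SATA):

sdparm --set WCE=0 --save /dev/sdX   # SAS/SCSI, e.g. the WD SAS SSDs
hdparm -W 0 /dev/sdY                 # SATA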

Regarding the 1024-byte writes: I think at some point we also wanted to test network 
latencies with them, but I agree, 4k is the sensible minimum, matching the min alloc unit.
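
For reference, the 4k numbers above come from the same kind of run, just with a larger 
block size, e.g.:

rados bench -p SSD 30 -t 256 -b 4096 write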

Below you can find the links to the IO graphs of the boxes I referenced in the 
first mail, still new to good old mailing lists ;-)

The IOPS graph is the most significant: the IO wait states of the system dropped by 
nearly 50%, which I think is the reason for our overall drop in IOPS. Just no clue 
why… (the update was on the 10th of Nov). I guess I want those wait states back :-o

https://kai.freshx.de/img/ceph-io.png

The second graph shows the latencies of some disks (OSD.12 is a small SSD (1.5TB), 
OSD.15 a big one (7TB)); here I guess that the big one gets about 4 times more IOPS 
because of its higher weight in Ceph.

https://kai.freshx.de/img/ceph-lat.png

Cheers Kai

> On 6 Dec 2021, at 18:12, Dan van der Ster <d...@vanderster.com> wrote:
> 
> Hi,
> 
> It's a bit weird that you benchmark 1024 bytes -- or is that your
> realistic use-case?
> This is smaller than the min alloc unit for even SSDs, so will need a
> read/modify/write cycle to update, slowing substantially.
> 
> Anyway, since you didn't mention it, have you disabled the write cache
> on your drives? See
> https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches
> for the latest related docs.
> 
> -- Dan
> 
> 
> 
> 
> 
> On Mon, Dec 6, 2021 at 5:28 PM <c...@komadev.de> wrote:
>> 
>> Dear List,
>> 
>> Until we upgraded our cluster 3 weeks ago we had a cute, high-performing small 
>> production Ceph cluster running Nautilus 14.2.22 on Proxmox 6.4 (kernel 5.4-143 
>> at that time). Then we started the upgrade to Octopus 15.2.15. Since we did an 
>> online upgrade, we stopped the automatic OMAP conversion with
>> 
>> ceph config set osd bluestore_fsck_quick_fix_on_mount false
>> 
>> and then performed the OMAP conversion step by step after the complete upgrade, 
>> by restarting one OSD after the other.
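>> 
>> For reference, that per-OSD conversion step was roughly of this form (the OSD id 
>> is just an example):
>> 
>> ceph config set osd bluestore_fsck_quick_fix_on_mount true
>> systemctl restart ceph-osd@0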
>> 
>> Our setup is:
>> 5 x storage node, each: 16 x 2.3GHz, 64GB RAM, 1 x 1.6TB SSD OSD, 1 x 7.68TB 
>> SSD OSD (both WD Enterprise, SAS-12), 3 x HDD OSD (10TB, SAS-12, with Optane 
>> cache)
>> 4 x compute node
>> 40GE storage network (Mellanox switch + Mellanox CX354 40GE dual-port cards, 
>> Linux OSS drivers)
>> 10GE cluster/mgmt network
>> 
>> Our performance before the upgrade, Ceph 14.2.22 (about 36k IOPS on the SSD 
>> Pool)
>> 
>> ### SSD Pool on 40GE Switches
>> # rados bench -p SSD 30 -t 256 -b 1024 write
>> hints = 1
>> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for 
>> up to 30 seconds or 0 objects
>> ...
>> Total time run:         30.004
>> Total writes made:      1094320
>> Write size:             1024
>> Object size:            1024
>> Bandwidth (MB/sec):     35.6177
>> Stddev Bandwidth:       4.71909
>> Max bandwidth (MB/sec): 40.7314
>> Min bandwidth (MB/sec): 21.3037
>> Average IOPS:           36472
>> Stddev IOPS:            4832.35
>> Max IOPS:               41709
>> Min IOPS:               21815
>> Average Latency(s):     0.00701759
>> Stddev Latency(s):      0.00854068
>> Max latency(s):         0.445397
>> Min latency(s):         0.000909089
>> Cleaning up (deleting benchmark objects)
>> 
>> Our performance after the update to Ceph 15.2.15 (drops to max 17k IOPS on the 
>> SSD pool)
>> # rados bench -p SSD 30 -t 256 -b 1024 write
>> hints = 1
>> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for 
>> up to 30 seconds or 0 objects
>> ...
>> Total time run:         30.0146
>> Total writes made:      468513
>> Write size:             1024
>> Object size:            1024
>> Bandwidth (MB/sec):     15.2437
>> Stddev Bandwidth:       0.78677
>> Max bandwidth (MB/sec): 16.835
>> Min bandwidth (MB/sec): 13.3184
>> Average IOPS:           15609
>> Stddev IOPS:            805.652
>> Max IOPS:               17239
>> Min IOPS:               13638
>> Average Latency(s):     0.016396
>> Stddev Latency(s):      0.00777054
>> Max latency(s):         0.140793
>> Min latency(s):         0.00106735
>> Cleaning up (deleting benchmark objects)
>> Note: OSD.17 is out on purpose
>> # ceph osd tree
>> ID   CLASS  WEIGHT     TYPE NAME            STATUS  REWEIGHT  PRI-AFF
>> -1         208.94525  root default
>> -3          41.43977      host xx-ceph01
>>  0    hdd    9.17380          osd.0            up   1.00000  1.00000
>>  5    hdd    9.17380          osd.5            up   1.00000  1.00000
>> 23    hdd   14.65039          osd.23           up   1.00000  1.00000
>>  7    ssd    1.45549          osd.7            up   1.00000  1.00000
>> 15    ssd    6.98630          osd.15           up   1.00000  1.00000
>> -5          41.43977      host xx-ceph02
>>  1    hdd    9.17380          osd.1            up   1.00000  1.00000
>>  4    hdd    9.17380          osd.4            up   1.00000  1.00000
>> 24    hdd   14.65039          osd.24           up   1.00000  1.00000
>>  9    ssd    1.45549          osd.9            up   1.00000  1.00000
>> 20    ssd    6.98630          osd.20           up   1.00000  1.00000
>> -7          41.43977      host xx-ceph03
>>  2    hdd    9.17380          osd.2            up   1.00000  1.00000
>>  3    hdd    9.17380          osd.3            up   1.00000  1.00000
>> 25    hdd   14.65039          osd.25           up   1.00000  1.00000
>>  8    ssd    1.45549          osd.8            up   1.00000  1.00000
>> 21    ssd    6.98630          osd.21           up   1.00000  1.00000
>> -17          41.43977      host xx-ceph04
>> 10    hdd    9.17380          osd.10           up   1.00000  1.00000
>> 11    hdd    9.17380          osd.11           up   1.00000  1.00000
>> 26    hdd   14.65039          osd.26           up   1.00000  1.00000
>>  6    ssd    1.45549          osd.6            up   1.00000  1.00000
>> 22    ssd    6.98630          osd.22           up   1.00000  1.00000
>> -21          43.18616      host xx-ceph05
>> 13    hdd    9.17380          osd.13           up   1.00000  1.00000
>> 14    hdd    9.17380          osd.14           up   1.00000  1.00000
>> 27    hdd   14.65039          osd.27           up   1.00000  1.00000
>> 12    ssd    1.45540          osd.12           up   1.00000  1.00000
>> 16    ssd    1.74660          osd.16           up   1.00000  1.00000
>> 17    ssd    3.49309          osd.17           up         0  1.00000
>> 18    ssd    1.74660          osd.18           up   1.00000  1.00000
>> 19    ssd    1.74649          osd.19           up   1.00000  1.00000
>> 
>> # ceph osd df
>> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>>  0    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   28 MiB  5.0 GiB  6.6 TiB  27.56  0.96   88      up
>>  5    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   57 MiB  5.1 GiB  6.6 TiB  27.89  0.98   89      up
>> 23    hdd  14.65039   1.00000   15 TiB  3.9 TiB  3.8 TiB   40 MiB  7.2 GiB   11 TiB  26.69  0.93  137      up
>>  7    ssd   1.45549   1.00000  1.5 TiB  634 GiB  633 GiB   33 MiB  1.8 GiB  856 GiB  42.57  1.49   64      up
>> 15    ssd   6.98630   1.00000  7.0 TiB  2.6 TiB  2.6 TiB  118 MiB  5.9 GiB  4.4 TiB  37.70  1.32  272      up
>>  1    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   31 MiB  4.7 GiB  6.8 TiB  26.04  0.91   83      up
>>  4    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   28 MiB  5.2 GiB  6.6 TiB  28.51  1.00   91      up
>> 24    hdd  14.65039   1.00000   15 TiB  4.0 TiB  3.9 TiB   38 MiB  7.2 GiB   11 TiB  27.06  0.95  139      up
>>  9    ssd   1.45549   1.00000  1.5 TiB  583 GiB  582 GiB   30 MiB  1.6 GiB  907 GiB  39.13  1.37   59      up
>> 20    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   81 MiB  7.4 GiB  4.5 TiB  35.45  1.24  260      up
>>  2    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   26 MiB  4.8 GiB  6.8 TiB  26.01  0.91   83      up
>>  3    hdd   9.17380   1.00000  9.2 TiB  2.7 TiB  2.6 TiB   29 MiB  5.4 GiB  6.5 TiB  29.38  1.03   94      up
>> 25    hdd  14.65039   1.00000   15 TiB  4.2 TiB  4.1 TiB   41 MiB  7.7 GiB   10 TiB  28.79  1.01  149      up
>>  8    ssd   1.45549   1.00000  1.5 TiB  637 GiB  635 GiB   34 MiB  1.7 GiB  854 GiB  42.71  1.49   65      up
>> 21    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   96 MiB  7.5 GiB  4.5 TiB  35.49  1.24  260      up
>> 10    hdd   9.17380   1.00000  9.2 TiB  2.2 TiB  2.1 TiB   26 MiB  4.5 GiB  7.0 TiB  24.21  0.85   77      up
>> 11    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   30 MiB  5.0 GiB  6.7 TiB  27.24  0.95   87      up
>> 26    hdd  14.65039   1.00000   15 TiB  3.6 TiB  3.5 TiB   37 MiB  6.6 GiB   11 TiB  24.64  0.86  127      up
>>  6    ssd   1.45549   1.00000  1.5 TiB  572 GiB  570 GiB   29 MiB  1.5 GiB  918 GiB  38.38  1.34   57      up
>> 22    ssd   6.98630   1.00000  7.0 TiB  2.3 TiB  2.3 TiB   77 MiB  7.0 GiB  4.7 TiB  33.23  1.16  243      up
>> 13    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   25 MiB  4.8 GiB  6.8 TiB  26.07  0.91   84      up
>> 14    hdd   9.17380   1.00000  9.2 TiB  2.3 TiB  2.2 TiB   54 MiB  4.6 GiB  6.9 TiB  25.13  0.88   80      up
>> 27    hdd  14.65039   1.00000   15 TiB  3.7 TiB  3.6 TiB   54 MiB  6.9 GiB   11 TiB  25.55  0.89  131      up
>> 12    ssd   1.45540   1.00000  1.5 TiB  619 GiB  617 GiB  163 MiB  2.3 GiB  871 GiB  41.53  1.45   63      up
>> 16    ssd   1.74660   1.00000  1.7 TiB  671 GiB  669 GiB   23 MiB  2.2 GiB  1.1 TiB  37.51  1.31   69      up
>> 17    ssd   3.49309         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0      up
>> 18    ssd   1.74660   1.00000  1.7 TiB  512 GiB  509 GiB   18 MiB  2.3 GiB  1.2 TiB  28.62  1.00   52      up
>> 19    ssd   1.74649   1.00000  1.7 TiB  709 GiB  707 GiB   64 MiB  2.0 GiB  1.1 TiB  39.64  1.39   72      up
>>                        TOTAL  205 TiB   59 TiB   57 TiB  1.3 GiB  128 GiB  147 TiB  28.60
>> MIN/MAX VAR: 0.85/1.49  STDDEV: 6.81
>> 
>> 
>> What we have done so far (no success; example commands after the list):
>> 
>> - reformatted two of the SSD OSDs (one was still a non-LVM OSD from Luminous)
>> - set bluestore_allocator from hybrid back to bitmap
>> - set osd_memory_target to 6442450944 for some of the SSD OSDs
>> - cpupower idle-set -D 11
>> - set bluefs_buffered_io to true
>> - disabled default firewalls between Ceph nodes (for testing only)
>> - disabled apparmor
>> - added memory (each node now runs on 128GB)
>> - upgraded the OS, now running kernel 5.13.19-1
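>> 
>> The allocator / memory target / buffered IO changes above were applied with 
>> commands of this form (values as listed; osd.15 just as an example for a 
>> per-OSD setting):
>> 
>> ceph config set osd bluestore_allocator bitmap
>> ceph config set osd.15 osd_memory_target 6442450944
>> ceph config set osd bluefs_buffered_io true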
>> 
>> What we observe
>> - the HDD pool shows similar behaviour
>> - load is higher since the update, seemingly more CPU consumption (see graph 1); 
>> the migration was on 10 Nov, around 10pm
>> - latency on the "big" 7TB SSDs (i.e. OSD.15) is significantly higher than on 
>> the small 1.6TB SSDs (OSD.12), see graph 2; presumably due to the higher weight, 
>> though
>> - load of OSD.15 is about 4 times higher than the load of OSD.12, presumably 
>> also due to the higher weight
>> - startup of OSD.15 (the 7TB SSD) is significantly slower (~10 sec) compared 
>> to the 1.6TB SSDs
>> - increasing the block size in the benchmark to 4k, 8k or even 16k increases 
>> the throughput but keeps the IOPS more or less stable; the drop at 32k is 
>> minimal, to ~14k IOPS on average
>> 
>> We have already checked the Proxmox list without finding any remedies and are a 
>> bit at a loss. Any suggestions, and/or has anyone else had similar experiences?
>> 
>> We are a bit hesitant to upgrade to Pacific, given the current situation.
>> 
>> Thanks,
>> 
>> Kai
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
