Hi Dan, Josh,

thanks for the input. We tried bluefs_buffered_io with both true and false and saw no real difference (hard to tell in a production cluster, maybe a few percent either way).

We have now also disabled the write cache on our SSDs and see a “felt” increase in performance up to 17k IOPS with 4k blocks, but that is still far from the original values. Thanks anyway! ;-)
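In case it helps someone reproducing this, these are roughly the commands we used for the two changes (a sketch from memory, not a recipe; /dev/sdX is just a placeholder for the respective SAS SSD device):

# ceph config set osd bluefs_buffered_io false      (and back to true for the comparison run)
# ceph config get osd bluefs_buffered_io
# sdparm --set WCE=0 --save /dev/sdX                (disable the volatile write cache on a SAS SSD)
# smartctl -s wcache,off /dev/sdX                   (the same via smartctl)

The write-cache change follows the doc Dan linked below; for SATA drives hdparm -W 0 would be the equivalent.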
For the 1024-byte writes: I think at some point we also wanted to test network latencies, but I agree, 4k is the sensible minimum.

Here are the links to the I/O graphs of the boxes I referenced in the first mail (still new to good old mailing lists ;-)).

The IOPS graph is the most telling: the I/O wait states of the system dropped by nearly 50%, which I think is the reason for our overall drop in IOPS, I just have no clue why… (the update was on the 10th of Nov). I guess I want those wait states back :-o
https://kai.freshx.de/img/ceph-io.png

The latencies of some disks (OSD.12 is a small SSD (1.5TB), OSD.15 a big one (7TB)); here I guess the big one gets 4 times more IOPS because of its weight in Ceph:
https://kai.freshx.de/img/ceph-lat.png

Cheers
Kai

> On 6 Dec 2021, at 18:12, Dan van der Ster <d...@vanderster.com> wrote:
>
> Hi,
>
> It's a bit weird that you benchmark 1024 bytes -- or is that your
> realistic use-case?
> This is smaller than the min alloc unit for even SSDs, so will need a
> read/modify/write cycle to update, slowing substantially.
>
> Anyway, since you didn't mention it, have you disabled the write cache
> on your drives? See
> https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches
> for the latest related docs.
>
> -- Dan
>
> On Mon, Dec 6, 2021 at 5:28 PM <c...@komadev.de> wrote:
>>
>> Dear List,
>>
>> Until we upgraded our cluster three weeks ago we had a cute little high-performing
>> production CEPH cluster running Nautilus 14.2.22 on Proxmox 6.4
>> (kernel 5.4-143 at that time). Then we started the upgrade to Octopus
>> 15.2.15. Since we did an online upgrade, we stopped the auto-conversion with
>>
>> ceph config set osd bluestore_fsck_quick_fix_on_mount false
>>
>> but followed up the OMAP conversion after the complete upgrade step by step
>> by restarting one OSD after the other.
>>
>> Our setup is:
>> 5 x storage node, each: 16 x 2.3GHz, 64GB RAM, 1 x SSD OSD 1.6TB, 1 x
>> 7.68TB (both WD Enterprise, SAS-12), 3 x HDD OSD (10TB, SAS-12, with Optane cache)
>> 4 x compute node
>> 40 GE storage network (Mellanox switch + Mellanox CX354 40GE dual-port
>> cards, Linux OSS drivers)
>> 10 GE cluster/mgmt network
>>
>> Our performance before the upgrade, Ceph 14.2.22 (about 36k IOPS on the SSD
>> pool):
>>
>> ### SSD Pool on 40GE Switches
>> # rados bench -p SSD 30 -t 256 -b 1024 write
>> hints = 1
>> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for
>> up to 30 seconds or 0 objects
>> ...
>> Total time run:         30.004
>> Total writes made:      1094320
>> Write size:             1024
>> Object size:            1024
>> Bandwidth (MB/sec):     35.6177
>> Stddev Bandwidth:       4.71909
>> Max bandwidth (MB/sec): 40.7314
>> Min bandwidth (MB/sec): 21.3037
>> Average IOPS:           36472
>> Stddev IOPS:            4832.35
>> Max IOPS:               41709
>> Min IOPS:               21815
>> Average Latency(s):     0.00701759
>> Stddev Latency(s):      0.00854068
>> Max latency(s):         0.445397
>> Min latency(s):         0.000909089
>> Cleaning up (deleting benchmark objects)
>>
>> Our performance after the upgrade, CEPH 15.2.15 (drops to max 17k IOPS on the
>> SSD pool):
>> # rados bench -p SSD 30 -t 256 -b 1024 write
>> hints = 1
>> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for
>> up to 30 seconds or 0 objects
>> ...
>> Total time run:         30.0146
>> Total writes made:      468513
>> Write size:             1024
>> Object size:            1024
>> Bandwidth (MB/sec):     15.2437
>> Stddev Bandwidth:       0.78677
>> Max bandwidth (MB/sec): 16.835
>> Min bandwidth (MB/sec): 13.3184
>> Average IOPS:           15609
>> Stddev IOPS:            805.652
>> Max IOPS:               17239
>> Min IOPS:               13638
>> Average Latency(s):     0.016396
>> Stddev Latency(s):      0.00777054
>> Max latency(s):         0.140793
>> Min latency(s):         0.00106735
>> Cleaning up (deleting benchmark objects)
>>
>> Note: OSD.17 is out on purpose
>>
>> # ceph osd tree
>> ID   CLASS  WEIGHT     TYPE NAME           STATUS  REWEIGHT  PRI-AFF
>>  -1         208.94525  root default
>>  -3          41.43977      host xx-ceph01
>>   0    hdd    9.17380          osd.0           up   1.00000  1.00000
>>   5    hdd    9.17380          osd.5           up   1.00000  1.00000
>>  23    hdd   14.65039          osd.23          up   1.00000  1.00000
>>   7    ssd    1.45549          osd.7           up   1.00000  1.00000
>>  15    ssd    6.98630          osd.15          up   1.00000  1.00000
>>  -5          41.43977      host xx-ceph02
>>   1    hdd    9.17380          osd.1           up   1.00000  1.00000
>>   4    hdd    9.17380          osd.4           up   1.00000  1.00000
>>  24    hdd   14.65039          osd.24          up   1.00000  1.00000
>>   9    ssd    1.45549          osd.9           up   1.00000  1.00000
>>  20    ssd    6.98630          osd.20          up   1.00000  1.00000
>>  -7          41.43977      host xx-ceph03
>>   2    hdd    9.17380          osd.2           up   1.00000  1.00000
>>   3    hdd    9.17380          osd.3           up   1.00000  1.00000
>>  25    hdd   14.65039          osd.25          up   1.00000  1.00000
>>   8    ssd    1.45549          osd.8           up   1.00000  1.00000
>>  21    ssd    6.98630          osd.21          up   1.00000  1.00000
>> -17          41.43977      host xx-ceph04
>>  10    hdd    9.17380          osd.10          up   1.00000  1.00000
>>  11    hdd    9.17380          osd.11          up   1.00000  1.00000
>>  26    hdd   14.65039          osd.26          up   1.00000  1.00000
>>   6    ssd    1.45549          osd.6           up   1.00000  1.00000
>>  22    ssd    6.98630          osd.22          up   1.00000  1.00000
>> -21          43.18616      host xx-ceph05
>>  13    hdd    9.17380          osd.13          up   1.00000  1.00000
>>  14    hdd    9.17380          osd.14          up   1.00000  1.00000
>>  27    hdd   14.65039          osd.27          up   1.00000  1.00000
>>  12    ssd    1.45540          osd.12          up   1.00000  1.00000
>>  16    ssd    1.74660          osd.16          up   1.00000  1.00000
>>  17    ssd    3.49309          osd.17          up         0  1.00000
>>  18    ssd    1.74660          osd.18          up   1.00000  1.00000
>>  19    ssd    1.74649          osd.19          up   1.00000  1.00000
>>
>> # ceph osd df
>> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>>  0    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   28 MiB  5.0 GiB  6.6 TiB  27.56  0.96   88      up
>>  5    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   57 MiB  5.1 GiB  6.6 TiB  27.89  0.98   89      up
>> 23    hdd  14.65039   1.00000   15 TiB  3.9 TiB  3.8 TiB   40 MiB  7.2 GiB   11 TiB  26.69  0.93  137      up
>>  7    ssd   1.45549   1.00000  1.5 TiB  634 GiB  633 GiB   33 MiB  1.8 GiB  856 GiB  42.57  1.49   64      up
>> 15    ssd   6.98630   1.00000  7.0 TiB  2.6 TiB  2.6 TiB  118 MiB  5.9 GiB  4.4 TiB  37.70  1.32  272      up
>>  1    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   31 MiB  4.7 GiB  6.8 TiB  26.04  0.91   83      up
>>  4    hdd   9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   28 MiB  5.2 GiB  6.6 TiB  28.51  1.00   91      up
>> 24    hdd  14.65039   1.00000   15 TiB  4.0 TiB  3.9 TiB   38 MiB  7.2 GiB   11 TiB  27.06  0.95  139      up
>>  9    ssd   1.45549   1.00000  1.5 TiB  583 GiB  582 GiB   30 MiB  1.6 GiB  907 GiB  39.13  1.37   59      up
>> 20    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   81 MiB  7.4 GiB  4.5 TiB  35.45  1.24  260      up
>>  2    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   26 MiB  4.8 GiB  6.8 TiB  26.01  0.91   83      up
>>  3    hdd   9.17380   1.00000  9.2 TiB  2.7 TiB  2.6 TiB   29 MiB  5.4 GiB  6.5 TiB  29.38  1.03   94      up
>> 25    hdd  14.65039   1.00000   15 TiB  4.2 TiB  4.1 TiB   41 MiB  7.7 GiB   10 TiB  28.79  1.01  149      up
>>  8    ssd   1.45549   1.00000  1.5 TiB  637 GiB  635 GiB   34 MiB  1.7 GiB  854 GiB  42.71  1.49   65      up
>> 21    ssd   6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   96 MiB  7.5 GiB  4.5 TiB  35.49  1.24  260      up
>> 10    hdd   9.17380   1.00000  9.2 TiB  2.2 TiB  2.1 TiB   26 MiB  4.5 GiB  7.0 TiB  24.21  0.85   77      up
>> 11    hdd   9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   30 MiB  5.0 GiB  6.7 TiB  27.24  0.95   87      up
>> 26    hdd  14.65039   1.00000   15 TiB  3.6 TiB  3.5 TiB   37 MiB  6.6 GiB   11 TiB  24.64  0.86  127      up
>>  6    ssd   1.45549   1.00000  1.5 TiB  572 GiB  570 GiB   29 MiB  1.5 GiB  918 GiB  38.38  1.34   57      up
>> 22    ssd   6.98630   1.00000  7.0 TiB  2.3 TiB  2.3 TiB   77 MiB  7.0 GiB  4.7 TiB  33.23  1.16  243      up
>> 13    hdd   9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   25 MiB  4.8 GiB  6.8 TiB  26.07  0.91   84      up
>> 14    hdd   9.17380   1.00000  9.2 TiB  2.3 TiB  2.2 TiB   54 MiB  4.6 GiB  6.9 TiB  25.13  0.88   80      up
>> 27    hdd  14.65039   1.00000   15 TiB  3.7 TiB  3.6 TiB   54 MiB  6.9 GiB   11 TiB  25.55  0.89  131      up
>> 12    ssd   1.45540   1.00000  1.5 TiB  619 GiB  617 GiB  163 MiB  2.3 GiB  871 GiB  41.53  1.45   63      up
>> 16    ssd   1.74660   1.00000  1.7 TiB  671 GiB  669 GiB   23 MiB  2.2 GiB  1.1 TiB  37.51  1.31   69      up
>> 17    ssd   3.49309         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0      up
>> 18    ssd   1.74660   1.00000  1.7 TiB  512 GiB  509 GiB   18 MiB  2.3 GiB  1.2 TiB  28.62  1.00   52      up
>> 19    ssd   1.74649   1.00000  1.7 TiB  709 GiB  707 GiB   64 MiB  2.0 GiB  1.1 TiB  39.64  1.39   72      up
>>                        TOTAL   205 TiB   59 TiB   57 TiB  1.3 GiB  128 GiB  147 TiB  28.60
>> MIN/MAX VAR: 0.85/1.49  STDDEV: 6.81
>>
>> What we have done so far (no success):
>>
>> - reformatted two of the SSD OSDs (one was still from Luminous, non-LVM)
>> - set bluestore_allocator from hybrid back to bitmap
>> - set osd_memory_target to 6442450944 for some of the SSD OSDs
>> - cpupower idle-set -D 11
>> - bluefs_buffered_io to true
>> - disabled the default firewalls between the CEPH nodes (for testing only)
>> - disabled AppArmor
>> - added memory (runs now on 128GB per node)
>> - upgraded the OS, runs now on kernel 5.13.19-1
>>
>> What we observe:
>> - the HDD pool shows similar behaviour
>> - load is higher since the update, seemingly more CPU consumption (see graph 1); the migration was on 10 Nov, around 10pm
>> - latency on the "big" 7TB SSDs (e.g. OSD.15) is significantly higher than on the small 1.6TB SSDs (OSD.12), see graph 2, presumably due to the higher weight
>> - load of OSD.15 is 4 times higher than load of OSD.12, presumably also due to the higher weight
>> - start-up of OSD.15 (the 7TB SSD) is significantly slower (~10 sec) compared to the 1.6TB SSDs
>> - increasing the block size in the benchmark to 4k, 8k or even 16k increases the throughput but keeps the IOPS more or less stable; even at 32k the drop is minimal, to ~14k IOPS on average
>>
>> We have already checked the Proxmox list without finding a remedy and are a bit at a loss. Any suggestions, or has anyone else seen something similar?
>>
>> We are a bit hesitant to upgrade to Pacific, given the current situation.
>>
>> Thanks,
>>
>> Kai
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io