Hi,

It's a bit weird that you benchmark with 1024-byte writes -- or is that your realistic use case? That is smaller than the minimum allocation unit even for SSDs, so each write needs a read/modify/write cycle to update, which slows things down substantially.
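To rule that out, it might be worth confirming what allocation unit your SSD OSDs actually use and re-running the bench at or above it. A rough sketch -- osd.7 here is just one of your SSD OSDs as an example, and keep in mind the real value is baked in when an OSD is created, so OSDs built under Nautilus may still be at the old 16384 default even though Octopus defaults to 4096:

# on the node hosting osd.7: the min_alloc_size that applies to SSD OSDs
# (shows the current config default; the on-disk value is fixed at mkfs time)
ceph daemon osd.7 config get bluestore_min_alloc_size_ssd

# the same benchmark at 4 KiB, i.e. at/above the allocation unit,
# which takes the read/modify/write penalty out of the picture
rados bench -p SSD 30 -t 256 -b 4096 write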
Anyway, since you didn't mention it: have you disabled the write cache on your drives? See https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches for the latest related docs.
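A quick way to check and flip it would be something like the following (a sketch -- /dev/sdX is a placeholder for each data device, and note these settings typically don't survive a power cycle; the page above covers making them persistent):

# show the current write-cache state of a drive
smartctl -g wcache /dev/sdX

# ATA/SATA drives: disable the volatile write cache
hdparm -W 0 /dev/sdX

# SAS drives (like your WD SAS-12 SSDs): clear the WCE bit instead
sdparm --clear WCE /dev/sdX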
-- Dan

On Mon, Dec 6, 2021 at 5:28 PM <c...@komadev.de> wrote:
>
> Dear List,
>
> Until we upgraded our cluster three weeks ago, we had a cute, high-performing
> small production Ceph cluster running Nautilus 14.2.22 on Proxmox 6.4 (kernel
> 5.4-143 at the time). Then we started the upgrade to Octopus 15.2.15. Since
> we did an online upgrade, we disabled the automatic conversion with
>
> ceph config set osd bluestore_fsck_quick_fix_on_mount false
>
> and then performed the OMAP conversion step by step after the upgrade was
> complete, by restarting one OSD after the other.
>
> Our setup:
> 5 x storage node, each: 16 x 2.3GHz, 64GB RAM, 1 x 1.6TB SSD OSD, 1 x 7.68TB
> SSD OSD (both WD Enterprise, SAS-12), 3 x HDD OSD (10TB, SAS-12, with Optane
> cache)
> 4 x compute node
> 40 GE storage network (Mellanox switch + Mellanox CX354 40GE dual-port cards,
> Linux OSS drivers)
> 10 GE cluster/mgmt network
>
> Our performance before the upgrade, Ceph 14.2.22 (about 36k IOPS on the SSD
> pool):
>
> ### SSD Pool on 40GE Switches
> # rados bench -p SSD 30 -t 256 -b 1024 write
> hints = 1
> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for
> up to 30 seconds or 0 objects
> ...
> Total time run:         30.004
> Total writes made:      1094320
> Write size:             1024
> Object size:            1024
> Bandwidth (MB/sec):     35.6177
> Stddev Bandwidth:       4.71909
> Max bandwidth (MB/sec): 40.7314
> Min bandwidth (MB/sec): 21.3037
> Average IOPS:           36472
> Stddev IOPS:            4832.35
> Max IOPS:               41709
> Min IOPS:               21815
> Average Latency(s):     0.00701759
> Stddev Latency(s):      0.00854068
> Max latency(s):         0.445397
> Min latency(s):         0.000909089
> Cleaning up (deleting benchmark objects)
>
> Our performance after the upgrade, Ceph 15.2.15 (drops to at most 17k IOPS on
> the SSD pool):
>
> # rados bench -p SSD 30 -t 256 -b 1024 write
> hints = 1
> Maintaining 256 concurrent writes of 1024 bytes to objects of size 1024 for
> up to 30 seconds or 0 objects
> ...
> Total time run:         30.0146
> Total writes made:      468513
> Write size:             1024
> Object size:            1024
> Bandwidth (MB/sec):     15.2437
> Stddev Bandwidth:       0.78677
> Max bandwidth (MB/sec): 16.835
> Min bandwidth (MB/sec): 13.3184
> Average IOPS:           15609
> Stddev IOPS:            805.652
> Max IOPS:               17239
> Min IOPS:               13638
> Average Latency(s):     0.016396
> Stddev Latency(s):      0.00777054
> Max latency(s):         0.140793
> Min latency(s):         0.00106735
> Cleaning up (deleting benchmark objects)
>
> Note: osd.17 is out on purpose.
>
> # ceph osd tree
> ID   CLASS  WEIGHT     TYPE NAME           STATUS  REWEIGHT  PRI-AFF
> -1          208.94525  root default
> -3           41.43977      host xx-ceph01
>  0    hdd     9.17380          osd.0           up   1.00000  1.00000
>  5    hdd     9.17380          osd.5           up   1.00000  1.00000
> 23    hdd    14.65039          osd.23          up   1.00000  1.00000
>  7    ssd     1.45549          osd.7           up   1.00000  1.00000
> 15    ssd     6.98630          osd.15          up   1.00000  1.00000
> -5           41.43977      host xx-ceph02
>  1    hdd     9.17380          osd.1           up   1.00000  1.00000
>  4    hdd     9.17380          osd.4           up   1.00000  1.00000
> 24    hdd    14.65039          osd.24          up   1.00000  1.00000
>  9    ssd     1.45549          osd.9           up   1.00000  1.00000
> 20    ssd     6.98630          osd.20          up   1.00000  1.00000
> -7           41.43977      host xx-ceph03
>  2    hdd     9.17380          osd.2           up   1.00000  1.00000
>  3    hdd     9.17380          osd.3           up   1.00000  1.00000
> 25    hdd    14.65039          osd.25          up   1.00000  1.00000
>  8    ssd     1.45549          osd.8           up   1.00000  1.00000
> 21    ssd     6.98630          osd.21          up   1.00000  1.00000
> -17          41.43977      host xx-ceph04
> 10    hdd     9.17380          osd.10          up   1.00000  1.00000
> 11    hdd     9.17380          osd.11          up   1.00000  1.00000
> 26    hdd    14.65039          osd.26          up   1.00000  1.00000
>  6    ssd     1.45549          osd.6           up   1.00000  1.00000
> 22    ssd     6.98630          osd.22          up   1.00000  1.00000
> -21          43.18616      host xx-ceph05
> 13    hdd     9.17380          osd.13          up   1.00000  1.00000
> 14    hdd     9.17380          osd.14          up   1.00000  1.00000
> 27    hdd    14.65039          osd.27          up   1.00000  1.00000
> 12    ssd     1.45540          osd.12          up   1.00000  1.00000
> 16    ssd     1.74660          osd.16          up   1.00000  1.00000
> 17    ssd     3.49309          osd.17          up         0  1.00000
> 18    ssd     1.74660          osd.18          up   1.00000  1.00000
> 19    ssd     1.74649          osd.19          up   1.00000  1.00000
>
> # ceph osd df
> ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>  0  hdd     9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   28 MiB  5.0 GiB  6.6 TiB  27.56  0.96   88  up
>  5  hdd     9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   57 MiB  5.1 GiB  6.6 TiB  27.89  0.98   89  up
> 23  hdd    14.65039   1.00000   15 TiB  3.9 TiB  3.8 TiB   40 MiB  7.2 GiB   11 TiB  26.69  0.93  137  up
>  7  ssd     1.45549   1.00000  1.5 TiB  634 GiB  633 GiB   33 MiB  1.8 GiB  856 GiB  42.57  1.49   64  up
> 15  ssd     6.98630   1.00000  7.0 TiB  2.6 TiB  2.6 TiB  118 MiB  5.9 GiB  4.4 TiB  37.70  1.32  272  up
>  1  hdd     9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   31 MiB  4.7 GiB  6.8 TiB  26.04  0.91   83  up
>  4  hdd     9.17380   1.00000  9.2 TiB  2.6 TiB  2.5 TiB   28 MiB  5.2 GiB  6.6 TiB  28.51  1.00   91  up
> 24  hdd    14.65039   1.00000   15 TiB  4.0 TiB  3.9 TiB   38 MiB  7.2 GiB   11 TiB  27.06  0.95  139  up
>  9  ssd     1.45549   1.00000  1.5 TiB  583 GiB  582 GiB   30 MiB  1.6 GiB  907 GiB  39.13  1.37   59  up
> 20  ssd     6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   81 MiB  7.4 GiB  4.5 TiB  35.45  1.24  260  up
>  2  hdd     9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   26 MiB  4.8 GiB  6.8 TiB  26.01  0.91   83  up
>  3  hdd     9.17380   1.00000  9.2 TiB  2.7 TiB  2.6 TiB   29 MiB  5.4 GiB  6.5 TiB  29.38  1.03   94  up
> 25  hdd    14.65039   1.00000   15 TiB  4.2 TiB  4.1 TiB   41 MiB  7.7 GiB   10 TiB  28.79  1.01  149  up
>  8  ssd     1.45549   1.00000  1.5 TiB  637 GiB  635 GiB   34 MiB  1.7 GiB  854 GiB  42.71  1.49   65  up
> 21  ssd     6.98630   1.00000  7.0 TiB  2.5 TiB  2.5 TiB   96 MiB  7.5 GiB  4.5 TiB  35.49  1.24  260  up
> 10  hdd     9.17380   1.00000  9.2 TiB  2.2 TiB  2.1 TiB   26 MiB  4.5 GiB  7.0 TiB  24.21  0.85   77  up
> 11  hdd     9.17380   1.00000  9.2 TiB  2.5 TiB  2.4 TiB   30 MiB  5.0 GiB  6.7 TiB  27.24  0.95   87  up
> 26  hdd    14.65039   1.00000   15 TiB  3.6 TiB  3.5 TiB   37 MiB  6.6 GiB   11 TiB  24.64  0.86  127  up
>  6  ssd     1.45549   1.00000  1.5 TiB  572 GiB  570 GiB   29 MiB  1.5 GiB  918 GiB  38.38  1.34   57  up
> 22  ssd     6.98630   1.00000  7.0 TiB  2.3 TiB  2.3 TiB   77 MiB  7.0 GiB  4.7 TiB  33.23  1.16  243  up
> 13  hdd     9.17380   1.00000  9.2 TiB  2.4 TiB  2.3 TiB   25 MiB  4.8 GiB  6.8 TiB  26.07  0.91   84  up
> 14  hdd     9.17380   1.00000  9.2 TiB  2.3 TiB  2.2 TiB   54 MiB  4.6 GiB  6.9 TiB  25.13  0.88   80  up
> 27  hdd    14.65039   1.00000   15 TiB  3.7 TiB  3.6 TiB   54 MiB  6.9 GiB   11 TiB  25.55  0.89  131  up
> 12  ssd     1.45540   1.00000  1.5 TiB  619 GiB  617 GiB  163 MiB  2.3 GiB  871 GiB  41.53  1.45   63  up
> 16  ssd     1.74660   1.00000  1.7 TiB  671 GiB  669 GiB   23 MiB  2.2 GiB  1.1 TiB  37.51  1.31   69  up
> 17  ssd     3.49309   0            0 B      0 B      0 B      0 B      0 B      0 B      0     0     0  up
> 18  ssd     1.74660   1.00000  1.7 TiB  512 GiB  509 GiB   18 MiB  2.3 GiB  1.2 TiB  28.62  1.00   52  up
> 19  ssd     1.74649   1.00000  1.7 TiB  709 GiB  707 GiB   64 MiB  2.0 GiB  1.1 TiB  39.64  1.39   72  up
>                       TOTAL    205 TiB   59 TiB   57 TiB  1.3 GiB  128 GiB  147 TiB  28.60
> MIN/MAX VAR: 0.85/1.49  STDDEV: 6.81
>
> What we have done so far (no success):
>
> - reformatted two of the SSD OSDs (one was still from Luminous, non-LVM)
> - set bluestore_allocator from hybrid back to bitmap
> - set osd_memory_target to 6442450944 for some of the SSD OSDs
> - cpupower idle-set -D 11
> - set bluefs_buffered_io to true
> - disabled the default firewalls between Ceph nodes (for testing only)
> - disabled AppArmor
> - added memory (now 128GB per node)
> - upgraded the OS, now running kernel 5.13.19-1
>
> What we observe:
>
> - the HDD pool shows similar behaviour
> - load is higher since the update, seemingly more CPU consumption (see
>   graph1); the migration was on Nov 10, around 10pm
> - latency on the "big" 7TB SSDs (e.g. osd.15) is significantly higher than on
>   the small 1.6TB SSDs (osd.12), see graph2, though this may simply be due to
>   the higher weight
> - load on osd.15 is 4 times higher than load on osd.12, which may likewise be
>   due to the higher weight
> - startup of osd.15 (one of the 7TB SSDs) is significantly slower (~10 sec)
>   compared to the 1.6TB SSDs
> - increasing the block size in the benchmark to 4k, 8k or even 16k increases
>   the throughput but keeps the IOPS more or less stable; even at 32k the drop
>   is minimal, to ~14k IOPS on average
>
> We have already checked the Proxmox list without finding a remedy and are a
> bit helpless. Any suggestions, and/or has anyone else had similar
> experiences?
>
> We are a bit hesitant to upgrade to Pacific, given the current situation.
>
> Thanks,
>
> Kai
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io