On Wed, 22 Sept 2021 at 05:54, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
>
> Sorry to steal it, so if I have 500GB and 700GB mixed wal+rocksdb on nvme,
> should the level base be 5000000000 and 7000000000? Or does it need to be
> a power of 2?

Generally the sum of all levels (up to the max of your metadata) needs to fit
into the db partition for each OSD. If you have a 500 or 700 GB WAL+DB
partition per OSD then the default settings should carry you to L3 (~333GB
required free space). Do you have more than 300GB of metadata per OSD?
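To put numbers on that, the arithmetic is nothing Ceph-specific - it's just the
level sizes added up. A rough sketch in plain Python (the 256MB base and 10x
multiplier below are simply the defaults discussed in this thread; swap in your
own values):

# Back-of-the-envelope RocksDB level sizing, nothing Ceph-specific.
# Assumed defaults: max_bytes_for_level_base = 256 MB,
# max_bytes_for_level_multiplier = 10.

def level_sizes(base_gb=0.256, multiplier=10, levels=4):
    """Individual level sizes in GB, starting from the level base."""
    return [base_gb * multiplier ** i for i in range(levels)]

def db_space_needed(base_gb=0.256, multiplier=10, levels=4):
    """Sum of all levels that must fit on the DB partition to avoid spillover."""
    return sum(level_sizes(base_gb, multiplier, levels))

if __name__ == "__main__":
    sizes = level_sizes()
    print("levels:", " -> ".join(f"{s:g} GB" for s in sizes))
    print(f"sum of the first {len(sizes)} levels: {db_space_needed():.1f} GB")
    # With the defaults this prints 0.256 -> 2.56 -> 25.6 -> 256 and a sum of
    # roughly 284 GB, so a 500 or 700 GB DB partition has room for everything
    # up to and including the 256 GB level.

The only knobs in play here are max_bytes_for_level_base and
max_bytes_for_level_multiplier - everything else falls out of the
multiplication.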
All the examples I've ever seen show the level base size as a power of 2, but I
don't know if there are any side effects of not doing that. A 5/7 GB level base
is an order of magnitude higher than the default, and it's unclear what
performance effect that has.

I would not advise tinkering with the defaults unless you have a lot of time
and energy to burn on testing and are willing to accept potential future
performance issues on upgrade, because you'd be running a setup that nobody
ever tests for.

What's the size of your OSDs, how much DB space per OSD do you actually have,
and what do the spillover warnings say?
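If it helps to answer that with hard numbers, something along these lines
should pull the bluefs usage per OSD from the admin socket (untested sketch -
the OSD IDs are placeholders, it has to run on the host carrying those OSDs,
and the counter names are worth double-checking on your release):

# Untested sketch: read bluefs usage for a few OSDs via the admin socket.
# Assumes "ceph daemon osd.N perf dump" is available on this host and that
# the bluefs perf counters are named db_used_bytes / db_total_bytes /
# slow_used_bytes - verify on your own Ceph version first.
import json
import subprocess

GiB = 1024 ** 3
osd_ids = [178, 180]  # placeholders - put your own OSD IDs here

for osd in osd_ids:
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd}", "perf", "dump"])
    bluefs = json.loads(out)["bluefs"]
    db_used = bluefs["db_used_bytes"] / GiB
    db_total = bluefs["db_total_bytes"] / GiB
    slow_used = bluefs["slow_used_bytes"] / GiB
    print(f"osd.{osd}: db {db_used:.1f} / {db_total:.1f} GiB used, "
          f"spilled to slow device: {slow_used:.1f} GiB")

As far as I know those counters feed the spillover health warning, so the
output should roughly line up with what the warnings report.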
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---------------------------------------------------
>
> On 2021. Sep 21., at 9:19, Christian Wuerdig <christian.wuer...@gmail.com> wrote:
>
> It's been discussed a few times on the list, but RocksDB levels essentially
> grow by a factor of 10 (max_bytes_for_level_multiplier) by default and you
> need (level-1)*10 space for the next level on your drive to avoid spill-over.
> So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
> since 50GB < 286GB (sum of all levels) you get spill-over going from L2 to
> L3. See also
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>
> Interestingly, your level base seems to be 512MB instead of the default
> 256MB - did you change that? In your case the sequence I would have
> expected is 0.5 -> 5 -> 50, and you should have already seen spillover at
> 5GB (since you only have 50GB partitions but you need 55.5GB at least).
> Not sure what's up with that. I think you need to re-create OSDs after
> changing these RocksDB params.
>
> Overall, since Pacific this no longer holds entirely true because RocksDB
> sharding was added
> (https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
> - it was broken in 16.2.4 but looks like it's fixed in 16.2.6.
>
> 1. Upgrade to Pacific
> 2. Get rid of the NVMe raid
> 3. Make 160GB DB partitions
> 4. Activate RocksDB sharding
> 5. Don't worry about RocksDB params
>
> If you don't feel like upgrading to Pacific any time soon but want to make
> more efficient use of the NVMe and don't mind going out on a limb, I'd still
> do 2+3, plus study
> https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction
> carefully and make adjustments based on that.
> With 160GB partitions, a multiplier of 7 might work well with a base size of
> 350MB: 0.35 -> 2.45 -> 17.15 -> 120.05 (total ~140GB out of 160GB).
>
> You could also try switching to a 9x multiplier and re-create one of the
> OSDs to see how it pans out prior to dissolving the raid1 setup (given your
> settings that should result in 0.5 -> 4.5 -> 40.5 GB usage).
>
> On Tue, 21 Sept 2021 at 13:19, mhnx <morphinwith...@gmail.com> wrote:
> > Hello everyone!
> > I want to understand the concept and tune my RocksDB options on nautilus
> > 14.2.16.
> >
> > osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of
> > 50 GiB) to slow device
> > osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of
> > 50 GiB) to slow device
> >
> > The problem is, I have the spill-over warnings like the rest of the
> > community. I tuned the RocksDB options with the settings below, but the
> > problem still exists and I wonder if I did anything wrong. I still have
> > the spill-overs, and sometimes the index SSDs go down due to compaction
> > problems and cannot be started until I do an offline compaction.
> >
> > Let me tell you about my hardware:
> > Every server in my system has:
> > HDD - 19 x TOSHIBA MG08SCA16TEY 16.0TB for the EC pool
> > SSD - 3 x SAMSUNG MZILS960HEHP/007 GXL0 960GB
> > NVME - 2 x PM1725b 1.6TB
> >
> > I'm using RAID 1 NVMe for the BlueStore DB. I don't have a WAL.
> > 19 * 50GB = 950GB total usage on the NVMe. (I was thinking of using the
> > rest but regret it now.)
> >
> > So! Finally, let's check my RocksDB options:
> > [osd]
> > bluefs_buffered_io = true
> > bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=10
> >
> > "ceph osd df tree" to see SSD and HDD usage, omap and meta:
> >
> > ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
> > -28       280.04810        - 280 TiB 169 TiB 166 TiB 688 GiB 2.4 TiB 111 TiB 60.40 1.00   -        host MHNX1
> > 178   hdd  14.60149  1.00000  15 TiB 8.6 TiB 8.5 TiB  44 KiB 126 GiB 6.0 TiB 59.21 0.98 174     up     osd.178
> > 179   ssd   0.87329  1.00000 894 GiB 415 GiB  89 GiB 321 GiB 5.4 GiB 479 GiB 46.46 0.77 104     up     osd.179
> >
> > I know the size of the NVMe is not suitable for 16TB HDDs. I should have
> > more, but the expense is cutting us to pieces. Because of that I think
> > I'll see the spill-overs no matter what I do. But maybe I can make it
> > better with your help!
> >
> > My questions are:
> > 1- What is the meaning of (33 GiB used of 50 GiB)?
> > 2- Why is it not 50GiB / 50GiB?
> > 3- Do I have 17GiB of unused area on the DB partition?
> > 4- Is there anything wrong with my RocksDB options?
> > 5- How can I be sure, and how do I find good RocksDB options for Ceph?
> > 6- How can I measure the change and test it?
> > 7- Do I need different RocksDB options for HDDs and SSDs?
> > 8- If I stop using the NVMe RAID 1 to double the size and resize the DBs
> > to 160GiB, is it worth the risk of an NVMe failure? I would lose 10 HDDs
> > at the same time, but I have 10 nodes and that's only 5% of the EC data.
> > I use m=8 k=2.
> >
> > P.S: There are so many people asking and searching around this. I hope it
> > will work this time.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io