[ceph-users] why does rgw generate large quantities of orphan objects?
Hi all,

Description of problem: [RGW] bucket/object deletion is leaving behind large quantities of orphan RADOS objects.

The cluster was running a cosbench workload. We first removed part of the data by deleting objects from the cosbench client, then deleted all the buckets with `s3cmd rb --recursive --force`. That removed all the buckets, but the space was not reclaimed.

```
[root@node01 /]# rgw-orphan-list
Available pools:
    device_health_metrics
    .rgw.root
    os-test.rgw.buckets.non-ec
    os-test.rgw.log
    os-test.rgw.control
    os-test.rgw.buckets.index
    os-test.rgw.meta
    os-test.rgw.buckets.data
    deeproute-replica-hdd-pool
    deeproute-replica-ssd-pool
    cephfs-metadata
    cephfs-replicated-pool
    .nfs
Which pool do you want to search for orphans (for multiple, use space-separated list)? os-test.rgw.buckets.data
Pool is "os-test.rgw.buckets.data".
Note: output files produced will be tagged with the current timestamp -- 20221008062356.
running 'rados ls' at Sat Oct 8 06:24:05 UTC 2022
running 'rados ls' on pool os-test.rgw.buckets.data.
running 'radosgw-admin bucket radoslist' at Sat Oct 8 06:43:21 UTC 2022
computing delta at Sat Oct 8 06:47:17 UTC 2022
39662551 potential orphans found out of a possible 39844453 (99%).
The results can be found in './orphan-list-20221008062356.out'.
Intermediate files are './rados-20221008062356.intermediate' and
'./radosgw-admin-20221008062356.intermediate'.
***
*** WARNING: This is EXPERIMENTAL code and the results should be used
***          only with CAUTION!
***
Done at Sat Oct 8 06:48:07 UTC 2022.

[root@node01 /]# radosgw-admin gc list
[]
[root@node01 /]# cat orphan-list-20221008062356.out | wc -l
39662551
[root@node01 /]# rados df
POOL_NAME                     USED      OBJECTS   CLONES  COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD       WR_OPS     WR       USED COMPR  UNDER COMPR
.nfs                          4.3 MiB   4         0       12         0                   0        0         77398      76 MiB   146        79 KiB   0 B         0 B
.rgw.root                     180 KiB   16        0       48         0                   0        0         28749      28 MiB   0          0 B      0 B         0 B
cephfs-metadata               932 MiB   14772     0       44316      0                   0        0         1569690    3.8 GiB  1258651    3.4 GiB  0 B         0 B
cephfs-replicated-pool        738 GiB   300962    0       902886     0                   0        0         794612     470 GiB  770689     245 GiB  0 B         0 B
deeproute-replica-hdd-pool    1016 GiB  104276    0       312828     0                   0        0         18176216   298 GiB  441783780  6.7 TiB  0 B         0 B
deeproute-replica-ssd-pool    30 GiB    3691      0       11073      0                   0        0         2466079    2.1 GiB  8416232    221 GiB  0 B         0 B
device_health_metrics         50 MiB    108       0       324        0                   0        0         1836       1.8 MiB  1944       18 MiB   0 B         0 B
os-test.rgw.buckets.data      5.6 TiB   39844453  0       239066718  0                   0        0         552896177  3.0 TiB  999441015  60 TiB   0 B         0 B
os-test.rgw.buckets.index     1.8 GiB   33        0       99         0                   0        0         153600295  154 GiB  110916573  62 GiB   0 B         0 B
os-test.rgw.buckets.non-ec    2.1 MiB   45        0       135        0                   0        0         574240     349 MiB  153725     139 MiB  0 B         0 B
os-test.rgw.control           0 B       8         0       24         0                   0        0         0          0 B      0          0 B      0 B         0 B
os-test.rgw.log               3.7 MiB   346       0       1038       0                   0        0         83877803   80 GiB   6306730    7.6 GiB  0 B         0 B
os-test.rgw.meta              220 KiB   23        0       69         0                   0        0         640854     506 MiB  108229     53 MiB   0 B         0 B

total_objects    40268737
total_used       7.8 TiB
total_avail      1.1 PiB
total_space      1.1 PiB
```

ceph version:

```
[root@node01 /]# ceph versions
{
    "mon": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 108
    },
    "mds": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 9
    },
    "overall": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 124
    }
}
```

Thanks, best regards
Liang Zheng
___ ceph-users mailing list -- ceph-us
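For what it's worth, the "computing delta" step of rgw-orphan-list is essentially a set difference: any object that `rados ls` saw in the pool but that `radosgw-admin bucket radoslist` did not reference is reported as a potential orphan. A minimal sketch of that comparison, using stand-in object names and hypothetical file names (the real script's intermediate files are far larger):

```shell
# Sketch of the delta computation: pool listing minus bucket listing.
# Object and file names here are stand-ins, not real cluster data.
set -e
work=$(mktemp -d)
# stand-ins for the two .intermediate files rgw-orphan-list writes
printf '%s\n' obj1 obj2 obj3__shadow_x | sort > "$work/rados.intermediate"
printf '%s\n' obj1 obj2 | sort > "$work/radosgw-admin.intermediate"
# comm -23 prints lines unique to the first (pool) listing = potential orphans
comm -23 "$work/rados.intermediate" "$work/radosgw-admin.intermediate" \
    > "$work/orphan-list.out"
cat "$work/orphan-list.out"
```

This also explains why the tool only reports *potential* orphans: an object can legitimately be missing from the bucket listing while an upload is still in flight.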
[ceph-users] Re: Updating Git Submodules -- a documentation question
For untracked files (e.g. src/pybind/cephfs/cephfs.c) all you need is 'git clean -fdx', which you ran last in this case.

Just about everything can be solved by a combination of these commands:

git submodule update --init --recursive
git clean -fdx
git submodule foreach git clean -fdx

If you have files that show up in diff output with unwanted changes, you can also use 'git checkout .' or 'git checkout ./path/to/filename' to revert the changes. If you still have persistent problems with a submodule directory after that, just rm the offending directory and run 'git submodule update --init --recursive' again.

Also, rather than doing 'git checkout main; git pull' on main, I would do 'git checkout main; git fetch origin; git reset --hard origin/main', as it's easy to get into a state where pull will fail.

HTH.

On Wed, Oct 12, 2022 at 12:40 PM John Zachary Dover wrote:
>
> The following console output, which is far too long to include in
> tutorial-style documentation that people are expected to read, shows the
> sequence of commands necessary to diagnose and repair submodules that have
> fallen out of sync with the submodules in the upstream repository.
>
> In this example, my local working copy has fallen out of sync. This will be
> obvious to adepts, but this procedure does not need to be communicated to
> them.
>
> This procedure was given to me by Brad Hubbard.
>
> Untracked files:
>   (use "git add ..." to include in what will be committed)
>         src/pybind/cephfs/build/
>         src/pybind/cephfs/cephfs.c
>         src/pybind/cephfs/cephfs.egg-info/
>         src/pybind/rados/build/
>         src/pybind/rados/rados.c
>         src/pybind/rados/rados.egg-info/
>         src/pybind/rbd/build/
>         src/pybind/rbd/rbd.c
>         src/pybind/rbd/rbd.egg-info/
>         src/pybind/rgw/build/
>         src/pybind/rgw/rgw.c
>         src/pybind/rgw/rgw.egg-info/
>
> nothing added to commit but untracked files present (use "git add" to
> track)
> [zdover@fedora ceph]$ cd src/
> [zdover@fedora src]$ ls
> arch                  cstart.sh           nasm-wrapper
> auth                  dmclock             neorados
> bash_completion       doc                 objclass
> blk                   dokan               objsync
> blkin                 erasure-code        ocf
> btrfs_ioc_test.c      etc-rbdmap          os
> c-ares                fmt                 osd
> cephadm               global              osdc
> ceph-clsinfo          googletest          perfglue
> ceph_common.sh        include             perf_histogram.h
> ceph.conf.twoosds     init-ceph.in        powerdns
> ceph-coverage.in      init-radosgw        ps-ceph.pl
> ceph-crash.in         isa-l               push_to_qemu.pl
> ceph-create-keys      jaegertracing       pybind
> ceph-debugpack.in     java                python-common
> ceph_fuse.cc          journal             rapidjson
> ceph.in               json_spirit         rbd_fuse
> ceph_mds.cc           key_value_store     rbdmap
> ceph_mgr.cc           krbd.cc             rbd_replay
> ceph_mon.cc           kv                  rbd-replay-many
> ceph_osd.cc           libcephfs.cc        README
> ceph-osd-prestart.sh  libcephsqlite.cc    rgw
> ceph-post-file.in     libkmip             rocksdb
> ceph-rbdnamer         librados            s3select
> ceph_release          librados-config.cc  sample.ceph.conf
> ceph-run              libradosstriper     script
> ceph_syn.cc           librbd              seastar
> ceph_ver.c            loadclass.sh        SimpleRADOSStriper.cc
> ceph_ver.h.in.cmake   log                 SimpleRADOSStriper.h
> ceph-volume           logrotate.conf      spawn
> civetweb              mds                 spdk
> ckill.sh              messages            stop.sh
> client                mgr                 telemetry
> cls                   mon                 test
> cls_acl.cc            mount               TODO
> cls_crypto.cc         mount.fuse.ceph     tools
> CMakeLists.txt        mrgw.sh             tracing
> cmonctl               mrun                vnewosd.sh
> common                msg                 vstart.sh
> compressor            mstart.sh           xxHash
> crimson               mstop.sh            zstd
> crush                 multi-dump.sh
> crypto                mypy.ini
> [zdover@fedora src]$ git checkout main
> Switched to branch 'main'
> Your branch is up to date with 'origin/main'.
> [zdover@fedora src]$ git pull
> Already up to date.
> [zdover@fedora src]$ git status
> On branch main
> Your branch is up to date with 'origin/main'.
>
> Untracked files:
>   (use "git add ..." to include in what will be committed)
>         pybind/cephfs/build/
>         pybind/cephfs/cephfs.c
>         pybind/cephfs/cephfs.egg-info/
>         pybind/rados/build/
>         pybind/rados/rados.c
>         pybind/rados/rados.egg-info/
>         pybind/rbd/build/
>         pybind/rbd/rbd.c
>         pybind/rbd/rbd.egg-info/
>         pybind/rgw/build/
>         pybind/rgw/rgw.c
> pybin
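The clean-up step above can be exercised safely in a throwaway repository; the paths here are stand-ins for the untracked build artifacts in the listing:

```shell
# Demonstrate 'git clean -fdx' removing untracked build artifacts
# in a scratch repository (paths are hypothetical stand-ins).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m init
# simulate leftover, untracked build artifacts
mkdir -p src/pybind/cephfs/build
touch src/pybind/cephfs/cephfs.c
# -f force, -d also directories, -x also ignored files
git clean -fdx
git status --porcelain
```

After `git clean -fdx` the working tree is pristine again, so `git status --porcelain` prints nothing.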
[ceph-users] Re: encrypt OSDs after creation
On Wed, 12 Oct 2022 at 00:32, Ali Akil wrote:
>
> Hallo folks,
>
> a couple of months ago I created a quincy ceph cluster with cephadm. I
> didn't encrypt the OSDs at that time.
> What would be the process to encrypt these OSDs afterwards?
> The documentation states only adding `encrypted: true` to the osd
> manifest, which will work only upon creation.

There is no such process. Destroy one OSD, recreate it with the same ID but with encryption on, wait for the cluster to heal itself, then do the same with the next OSD, rinse, repeat. You may want to set the norebalance flag during the operation.

--
Alexander E. Patrakov
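Assuming cephadm-managed OSDs, one iteration of the rolling re-encryption described above could look roughly like this. The OSD id, spec file name, and service id are hypothetical; verify the commands against your own service spec and release before running anything:

```
# avoid unnecessary data movement while an OSD is out
ceph osd set norebalance

# destroy one OSD but keep its ID reserved, wiping the disk
ceph orch osd rm 12 --replace --zap

# make sure the OSD service spec now carries the encryption flag, e.g.:
#   service_type: osd
#   service_id: my_osds        # hypothetical
#   spec:
#     encrypted: true
ceph orch apply -i osd-spec.yml

# wait for the cluster to heal (HEALTH_OK, no degraded PGs),
# then repeat with the next OSD; finally:
ceph osd unset norebalance
```

Done one OSD at a time, the cluster never loses more than one replica's worth of data availability per step.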
[ceph-users] Re: How to force PG merging in one step?
Hi Eugen,

thanks, that was a great hint! I have a strong déjà vu feeling; we discussed this before with increasing pg_num, didn't we? I just set it to 1 and it did exactly what I wanted. It's the same number of PGs backfilling, but pgp_num=1024, so while the rebalancing load is the same, I got rid of any redundant data movements and I can actually see the progress of the merge just with ceph status.

Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs with more than 400 PGs. Still, I don't see the promised health warning in ceph status. Is this a known issue?

Opinion part. Returning to the above setting, I have to say that the assignment of which parameter influences what seems a bit unintuitive, if not inconsistent. The parameter target_max_misplaced_ratio belongs to the balancer module, but merging PGs is clearly a task of the pg_autoscaler module. I'm not balancing, I'm scaling PG numbers. Such cross-dependencies make it really hard to find relevant information in the section of the documentation where one would be looking for it; it ends up distributed all over the place. If it's not possible to have such things separated and specific tasks consistently explained in a single section, there could at least be a hint covering PG merging/splitting in the description of target_max_misplaced_ratio, so that a search for these terms brings up that page. There should also be a cross-reference from "ceph osd pool set pg[p]_num" to target_max_misplaced_ratio. Well, it's now here in this message for google to reveal.

I have to add that, while I understand the motivation behind adding these baby-sitting modules, I would actually appreciate being able to disable them. I personally find them really annoying, especially in emergency situations, but also in normal operations. I would consider them a nice-to-have and not enforce them on people who want to be in charge. For example, in my current situation, I'm halving the PG count of a pool. Doing the merge in one go or letting the target_max_misplaced_ratio "help" me leads to exactly the same number of PGs backfilling at any time, which means both cases, target_max_misplaced_ratio=0.05 and 1, lead to exactly the same interference of rebalancing IO with user IO. The difference is that with target_max_misplaced_ratio=0.05 this phase of reduced performance takes longer, because every time the module decides to change pgp_num it will inevitably rebalance objects that have already been moved before. I find it difficult to consider this an improvement. I prefer to avoid any redundant writes at all cost for the benefit of disk lifetime. If I really need to reduce the impact of recovery IO I can set recovery_sleep. My personal opinion to the user group.

Thanks for your help and have a nice evening!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Eugen Block
Sent: 11 October 2022 14:13:45
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to force PG merging in one step?

Hi Frank,

I don't think it's the autoscaler interfering here but the default 5% target_max_misplaced_ratio. I haven't tested the impacts of increasing that to a much higher value, so be careful.

Regards,
Eugen

Zitat von Frank Schilder :

> Hi all,
>
> I need to reduce the number of PGs in a pool from 2048 to 512 and
> would really like to do that in a single step. I executed the set
> pg_num 512 command, but the PGs are not all merged. Instead I get
> this intermediate state:
>
> pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3
> object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512
> pgp_num_target 512 autoscale_mode off last_change 916710 lfor
> 0/0/618995 flags hashpspool,nodelete,selfmanaged_snaps max_bytes
> 107374182400 stripe_width 0 compression_mode none application cephfs
>
> This is really annoying, because it will not only lead to repeated
> redundant data movements but I also need to rebalance this pool
> onto fewer OSDs, which cannot hold the 1946 PGs it will be merged to
> intermittently. How can I override the autoscaler interfering with
> admin operations in such tight corners?
>
> As you can see, we disabled autoscaler on all pools and also
> globally. Still, it interferes with admin commands in an unsolicited
> way. I would like the PG merge to happen on the fly as the data moves
> to the new OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
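For reference, the two commands discussed in this thread would be combined along these lines (option and pool names as used above; check the documentation for your release before raising the ratio on a production cluster):

```
# let the mgr move everything at once instead of throttling at 5% misplaced
ceph config set mgr target_max_misplaced_ratio 1.0

# then request the merge; pgp_num follows pg_num down in one step
ceph osd pool set con-fs2-meta2 pg_num 512
```

Lowering the ratio back afterwards restores the default throttling behaviour for future pg_num changes.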
[ceph-users] Re: crush hierarchy backwards and upmaps ...
Hi Chris,

Just curious, does this rule make sense and help with the multi-level crush map issue? (Maybe it also results in zero movement, or at least less than the alternative you proposed?)

step choose indep 4 type rack
step chooseleaf indep 2 type chassis

Cheers, Dan

On Tue, Oct 11, 2022, 19:29 Christopher Durham wrote:

> Dan,
>
> Thank you.
>
> I did what you said regarding --test-map-pgs-dump and it wants to move 3
> OSDs in every PG. Yuk.
>
> So before I do that, I tried this rule, after changing all my 'pod' bucket
> definitions to 'chassis', and compiling and
> injecting the new crushmap to an osdmap:
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step choose indep 2 type chassis
>     step chooseleaf indep 1 type host
>     step emit
> }
>
> --test-pg-upmap-entries shows there were NO changes to be done after
> comparing it with the original!!!
>
> However, --upmap-cleanup says:
>
> verify_upmap number of buckets 8 exceeds desired number of 2
> check_pg_upmaps verify_upmap of poolid.pgid returning -22
>
> This is output for every current upmap, but I really do want 8 total
> buckets per PG, as my pool is a 6+2.
>
> The upmap-cleanup output wants me to remove all of my upmaps.
>
> This seems consistent with a bug report that says that there is a problem
> with the balancer on a multi-level rule such as the above, albeit on
> 14.2.x. Any thoughts?
>
> https://tracker.ceph.com/issues/51729
>
> I am leaning towards just eliminating the middle rule and going directly
> from rack to host, even though it wants to move a LARGE amount of data
> according to a diff before and after of --test-pg-upmap-entries.
> In this scenario, I don't see any unexpected errors with --upmap-cleanup
> and I do not want to get stuck.
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step chooseleaf indep 2 type host
>     step emit
> }
>
> -Chris
>
> -----Original Message-----
> From: Dan van der Ster
> To: Christopher Durham
> Cc: Ceph Users
> Sent: Mon, Oct 10, 2022 12:22 pm
> Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...
>
> Hi,
>
> Here's a similar bug: https://tracker.ceph.com/issues/47361
>
> Back then, upmap would generate mappings that invalidate the crush rule. I
> don't know if that is still the case, but indeed you'll want to correct
> your rule.
>
> Something else you can do before applying the new crush map is use
> osdmaptool to compare the PG placement before and after, something like:
>
> osdmaptool --test-map-pgs-dump osdmap.before > before.txt
> osdmaptool --test-map-pgs-dump osdmap.after > after.txt
> diff -u before.txt after.txt
>
> The above will help you estimate how much data will move after injecting
> the fixed crush map, so depending on the impact you can schedule the
> change appropriately.
>
> I also recommend keeping a backup of the previous crushmap so that you can
> quickly restore it if anything goes wrong.
>
> Cheers, Dan
>
> On Mon, Oct 10, 2022, 19:31 Christopher Durham wrote:
>
> > Hello,
> > I am using pacific 16.2.10 on Rocky 8.6 Linux.
> >
> > After setting upmap_max_deviation to 1 on the ceph balancer in
> > ceph-mgr, I achieved a near perfect balance of PGs and space on my
> > OSDs. This is great.
> >
> > However, I started getting the following errors in my ceph-mon logs,
> > every three minutes, for each of the OSDs that had been mapped by the
> > balancer:
> >
> > 2022-10-07T17:10:39.619+ 7f7c2786d700 1 verify_upmap unable to get
> > parent of osd.497, skipping for now
> >
> > After banging my head against the wall for a bit trying to figure this
> > out, I think I have discovered the issue:
> >
> > Currently, I have my EC pool configured with the following crush rule:
> >
> > rule mypoolname {
> >     id -5
> >     type erasure
> >     step take myroot
> >     step choose indep 4 type rack
> >     step choose indep 2 type pod
> >     step chooseleaf indep 1 type host
> >     step emit
> > }
> >
> > Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> > each pod, for a total of 8 chunks. (The pool is a 6+2.) The 4 racks are
> > chosen from the myroot root entry, which is as follows:
> >
> > root myroot {
> >     id -400
> >     item rack1 weight N
> >     item rack2 weight N
> >     item rack3 weight N
> >     item rack4 weight N
> > }
> >
> > This has worked fine since inception, over a year ago. And the PGs are
> > all as I expect, with OSDs from the 4 racks and not on the same host
> > or pod.
> >
> > The errors above, verify_upmap, started after I had the
> > upmap_max_deviation set to 1 in the balancer and having it
> > move things around, creating pg_upmap entries.
> >
> > I then discovered, while trying to figure this out, that the device
> > types are:
> >
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > ...
> > type 6 pod
> >
> >
[ceph-users] Re: crush hierarchy backwards and upmaps ...
Dan,

Thank you.

I did what you said regarding --test-map-pgs-dump and it wants to move 3 OSDs in every PG. Yuk.

So before I do that, I tried this rule, after changing all my 'pod' bucket definitions to 'chassis', and compiling and injecting the new crushmap to an osdmap:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step choose indep 2 type chassis
    step chooseleaf indep 1 type host
    step emit
}

--test-pg-upmap-entries shows there were NO changes to be done after comparing it with the original!!!

However, --upmap-cleanup says:

verify_upmap number of buckets 8 exceeds desired number of 2
check_pg_upmaps verify_upmap of poolid.pgid returning -22

This is output for every current upmap, but I really do want 8 total buckets per PG, as my pool is a 6+2.

The upmap-cleanup output wants me to remove all of my upmaps.

This seems consistent with a bug report that says that there is a problem with the balancer on a multi-level rule such as the above, albeit on 14.2.x. Any thoughts?

https://tracker.ceph.com/issues/51729

I am leaning towards just eliminating the middle rule and going directly from rack to host, even though it wants to move a LARGE amount of data according to a diff before and after of --test-pg-upmap-entries. In this scenario, I don't see any unexpected errors with --upmap-cleanup and I do not want to get stuck.

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type host
    step emit
}

-Chris

-----Original Message-----
From: Dan van der Ster
To: Christopher Durham
Cc: Ceph Users
Sent: Mon, Oct 10, 2022 12:22 pm
Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...

Hi,

Here's a similar bug: https://tracker.ceph.com/issues/47361

Back then, upmap would generate mappings that invalidate the crush rule. I don't know if that is still the case, but indeed you'll want to correct your rule.

Something else you can do before applying the new crush map is use osdmaptool to compare the PG placement before and after, something like:

osdmaptool --test-map-pgs-dump osdmap.before > before.txt
osdmaptool --test-map-pgs-dump osdmap.after > after.txt
diff -u before.txt after.txt

The above will help you estimate how much data will move after injecting the fixed crush map, so depending on the impact you can schedule the change appropriately.

I also recommend keeping a backup of the previous crushmap so that you can quickly restore it if anything goes wrong.

Cheers, Dan

On Mon, Oct 10, 2022, 19:31 Christopher Durham wrote:

> Hello,
> I am using pacific 16.2.10 on Rocky 8.6 Linux.
>
> After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr,
> I achieved a near perfect balance of PGs and space on my OSDs. This is
> great.
>
> However, I started getting the following errors in my ceph-mon logs,
> every three minutes, for each of the OSDs that had been mapped by the
> balancer:
>
> 2022-10-07T17:10:39.619+ 7f7c2786d700 1 verify_upmap unable to get
> parent of osd.497, skipping for now
>
> After banging my head against the wall for a bit trying to figure this
> out, I think I have discovered the issue:
>
> Currently, I have my EC pool configured with the following crush rule:
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step choose indep 2 type pod
>     step chooseleaf indep 1 type host
>     step emit
> }
>
> Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> each pod, for a total of 8 chunks. (The pool is a 6+2.) The 4 racks are
> chosen from the myroot root entry, which is as follows:
>
> root myroot {
>     id -400
>     item rack1 weight N
>     item rack2 weight N
>     item rack3 weight N
>     item rack4 weight N
> }
>
> This has worked fine since inception, over a year ago. And the PGs are
> all as I expect, with OSDs from the 4 racks and not on the same host or
> pod.
>
> The errors above, verify_upmap, started after I had the
> upmap_max_deviation set to 1 in the balancer and having it
> move things around, creating pg_upmap entries.
>
> I then discovered, while trying to figure this out, that the device
> types are:
>
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> ...
> type 6 pod
>
> So pod is HIGHER in the hierarchy than rack. I have it as lower in my
> rule.
>
> What I want to do is remove the pods completely to work around this.
> Something like:
>
> rule mypoolname {
>     id -5
>     type erasure
>     step take myroot
>     step choose indep 4 type rack
>     step chooseleaf indep 2 type host
>     step emit
> }
>
> This will pick 4 racks and then 2 hosts in each rack. Will this cause
> any problems? I can add the pod stuff back later as 'chassis' instead.
> I can live without the 'pod' separation if needed.
>
> To test this, I tried doing something like this:
>
> 1. grab the osdma
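For completeness, the decompile/edit/recompile/inject cycle used throughout this thread is, in sketch (file names hypothetical; always keep the original binary map as a rollback):

```
# export and decompile the current crush map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt, e.g. replace the 'pod' step with
#   step chooseleaf indep 2 type host
crushtool -c crushmap.txt -o crushmap.new

# dry-run against a saved osdmap before touching the cluster
ceph osd getmap -o osdmap.bin
osdmaptool osdmap.bin --test-map-pgs-dump > before.txt
osdmaptool osdmap.bin --import-crush crushmap.new --test-map-pgs-dump > after.txt
diff -u before.txt after.txt

# only then apply it for real
ceph osd setcrushmap -i crushmap.new
```

The diff step is the same before/after comparison Dan suggested, just done against one saved osdmap with the candidate crush map imported.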
[ceph-users] Re: Inherited CEPH nightmare
Hi Janne,

I've changed some elements of the config now and the results are much better, but still quite poor relative to what I would consider normal SSD performance.

The osd_memory_target is now set to 12GB for 3 of the 4 hosts (each of these hosts has 1.5TB RAM, so I can allocate loads if necessary). The other host is a more modern small-form-factor server with only 64GB on board, so each of the 3 OSDs in that device has 4GB per OSD. The number of PGs has been increased from 128 to 256. I have not yet run the JJ Balancer.

In terms of performance, I measured the time it takes for Proxmox to clone a 127GB VM. It now clones in around 18 minutes, rather than the 1 hour 55 minutes before the config changes, so there is progress here.

I also had a play around with enabling and disabling the write cache. I performed a rudimentary `ceph tell osd.x bench` to see what the performance would be with it on/off. The results were surprising, as the disks provided far more IOPS with the cache ENABLED rather than disabled.

To round out your question, we are on Bluestore, with Ceph v16.2.7.

I still think the next steps are to replace the remaining 6 consumer-grade devices with the Seagate IronWolf 125 1TB SSDs, which seem to perform much better according to ceph benchmarks, and after that increase the number of hosts to 6 and spread the 12 OSDs so that each host has 2 OSDs only.

Any other suggestions are welcome. Many thanks.
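The cache on/off comparison mentioned above could be scripted roughly like this. The device name is hypothetical, `hdparm -W` applies to SATA devices only, and `ceph tell osd.N bench` writes real data to the OSD, so avoid running it against busy production disks:

```
# baseline: default write benchmark against one OSD
ceph tell osd.0 bench

# toggle the volatile write cache on the underlying SATA device, then re-test
hdparm -W 0 /dev/sdX    # 0 = disable cache, 1 = enable
ceph tell osd.0 bench
```

Comparing the reported bytes_per_sec between runs gives a rough per-OSD picture of what the cache setting costs or gains.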
Current ceph.conf:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.8.4/24
fsid = 4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
mon_allow_pool_delete = true
mon_host = 192.168.8.5 192.168.8.3 192.168.8.6
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_memory_target = 12884901888
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.8.4/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.cl1-h1-lv]
host = cl1-h1-lv
mds_standby_for_name = pve

[mds.cl1-h2-lv]
host = cl1-h2-lv
mds_standby_for_name = pve

[mds.cl1-h3-lv]
host = cl1-h3-lv
mds_standby_for_name = pve

[mon.cl1-h1-lv]
public_addr = 192.168.8.3

[mon.cl1-h3-lv]
public_addr = 192.168.8.5

[mon.cl1-h4-lv]
public_addr = 192.168.8.6

And crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cl1-h2-lv {
    id -3           # do not change unnecessarily
    id -4 class ssd # do not change unnecessarily
    # weight 2.729
    alg straw2
    hash 0          # rjenkins1
    item osd.0 weight 0.910
    item osd.5 weight 0.910
    item osd.10 weight 0.910
}
host cl1-h3-lv {
    id -5           # do not change unnecessarily
    id -6 class ssd # do not change unnecessarily
    # weight 2.729
    alg straw2
    hash 0          # rjenkins1
    item osd.1 weight 0.910
    item osd.6 weight 0.910
    item osd.11 weight 0.910
}
host cl1-h4-lv {
    id -7           # do not change unnecessarily
    id -8 class ssd # do not change unnecessarily
    # weight 2.729
    alg straw2
    hash 0          # rjenkins1
    item osd.7 weight 0.910
    item osd.2 weight 0.910
    item osd.3 weight 0.910
}
host cl1-h1-lv {
    id -9            # do not change unnecessarily
    id -10 class ssd # do not change unnecessarily
    # weight 2.729
    alg straw2
    hash 0           # rjenkins1
    item osd.4 weight 0.910
    item osd.9 weight 0.910
    item osd.12 weight 0.910
}
root default {
    id -1           # do not change unnecessarily
    id -2 class ssd # do not change unnecessarily
    # weight 10.916
    alg straw2
    hash 0          # rjenkins1
    item cl1-h2-lv weight 2.729
    item cl1-h3-lv weight 2.729
    item cl1-h4-lv weight 2.729
    item cl1-h1-lv weight 2.729
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step
[ceph-users] Autoscaler stopped working after upgrade Octopus -> Pacific
Dear all,

just upgraded our cluster from Octopus to Pacific (16.2.10). This introduced an error in the autoscaler:

2022-10-11T14:47:40.421+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 17 has overlapping roots: {-4, -1}
2022-10-11T14:47:40.423+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 22 has overlapping roots: {-4, -1}
2022-10-11T14:47:40.423+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 23 has overlapping roots: {-4, -1}
2022-10-11T14:47:40.427+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 27 has overlapping roots: {-6, -4, -1}
2022-10-11T14:47:40.428+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 28 has overlapping roots: {-6, -4, -1}

Autoscaler status is empty:

[cephmon1] /root # ceph osd pool autoscale-status
[cephmon1] /root #

On https://forum.proxmox.com/threads/ceph-overlapping-roots.104199/ I found something similar:

---
I assume that you have at least one pool that still has the "replicated_rule" assigned, which does not make a distinction between the device classes of the OSDs. This is why you see this error. The autoscaler cannot decide how many PGs the pools need.

Make sure that all pools are assigned a rule that limits them to a device class and the errors should stop.
---

Indeed, we have a mixed cluster (hdd + ssd) with some pools spanning hdd only, some ssd only, and some (ec & replicated) that don't care about the storage device class (e.g. via the default "replicated_rule"):

[cephmon1] /root # ceph osd crush rule ls
replicated_rule
ssd_only_replicated_rule
hdd_only_replicated_rule
default.rgw.buckets.data.ec42
test.ec42
[cephmon1] /root #

That worked flawlessly until Octopus. Any idea how to make the autoscaler work again with that kind of setup? Must pools in Pacific really be restricted to a single device class in order to get a functional autoscaler?
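The forum advice boils down to pinning every pool to a device class via its crush rule. A sketch of what such a rule could look like in a decompiled crush map — the rule name `replicated_hdd_only` and id 10 are made up for illustration, not taken from this cluster:

```
rule replicated_hdd_only {
	id 10
	type replicated
	min_size 1
	max_size 10
	step take default class hdd
	step chooseleaf firstn 0 type host
	step emit
}
```

The same rule can be created without editing the map via `ceph osd crush rule create-replicated replicated_hdd_only default host hdd`, and assigned with `ceph osd pool set <pool> crush_rule replicated_hdd_only`. Once no pool references a class-agnostic rule any more, the overlapping-roots errors should stop, since each pool then resolves to a single (shadow) root.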
Thanks,
Andreas
--
| Andreas Haupt    | E-Mail: andreas.ha...@desy.de
| DESY Zeuthen     | WWW:    http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6  | Phone:  +49/33762/7-7359
| D-15738 Zeuthen  | Fax:    +49/33762/7-7216
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to force PG merging in one step?
Hi Frank,

I don't think it's the autoscaler interfering here but the default 5% target_max_misplaced_ratio. I haven't tested the impact of increasing that to a much higher value, so be careful.

Regards,
Eugen

Zitat von Frank Schilder:

Hi all,

I need to reduce the number of PGs in a pool from 2048 to 512 and would really like to do that in a single step. I executed the set pg_num 512 command, but the PGs are not all merged. Instead I get this intermediate state:

pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3 object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512 pgp_num_target 512 autoscale_mode off last_change 916710 lfor 0/0/618995 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 107374182400 stripe_width 0 compression_mode none application cephfs

This is really annoying: it will not only lead to repeated redundant data movements, but I also need to rebalance this pool onto fewer OSDs, which cannot intermittently hold the 1946 PGs it will be merged through. How can I override the autoscaler interfering with admin operations in such tight corners? As you can see, we disabled the autoscaler on all pools and also globally. Still, it interferes with admin commands in an unsolicited way. I would like the PG merge to happen on the fly as the data moves to the new OSDs.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
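Eugen's point is visible in Frank's numbers: 2048 - 1946 = 102, which is almost exactly 5% of 2048. A rough back-of-the-envelope model — an illustration of the throttling idea, not Ceph's actual implementation — of pgp_num being stepped down so that at most roughly target_max_misplaced_ratio of the PGs are remapped per adjustment reproduces the observed intermediate state:

```python
# Toy model of the mgr walking pgp_num down during a PG merge while keeping
# the fraction of misplaced data under target_max_misplaced_ratio (default
# 0.05). Hypothetical simplification; Ceph's real logic tracks actual
# misplaced objects, not PG counts.

def pgp_num_steps(pg_num: int, target: int, ratio: float = 0.05) -> list:
    """Return the successive pgp_num values on the way from pg_num to target."""
    step = max(1, int(pg_num * ratio))  # remap at most ~ratio of the PGs at once
    values = []
    current = pg_num
    while current > target:
        current = max(target, current - step)
        values.append(current)
    return values

steps = pgp_num_steps(2048, 512)
# First adjustment lands at 1946 -- matching the pgp_num 1946 in the pool
# dump above -- and the merge finishes after 16 such increments.
print(steps[0], len(steps), steps[-1])
```

Under this model, raising the ratio (the mgr option `target_max_misplaced_ratio` that Eugen mentions) makes the steps larger and fewer, at the cost of more data in flight at once.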
[ceph-users] Re: Invalid crush class
The only way I could reproduce this was by removing the existing class from an OSD and setting it:

---snip---
pacific:~ # ceph osd crush rm-device-class 0
done removing class of osd(s): 0
pacific:~ # ceph osd crush set-device-class jbod.hdd 0
set osd(s) 0 to class 'jbod.hdd'
pacific:~ # ceph osd tree
ID  CLASS     WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1            0.03596  root default
-3            0.03596      host pacific
 1  hdd       0.01199          osd.1         up   1.00000  1.00000
 2  hdd       0.01199          osd.2         up   1.00000  1.00000
 0  jbod.hdd  0.01198          osd.0         up   1.00000  1.00000
pacific:~ # ceph osd crush class ls
[
    "hdd",
    "jbod.hdd"
]
---snip---

But if I remove it from the OSD the device class is gone as well:

---snip---
pacific:~ # ceph osd crush rm-device-class 0
done removing class of osd(s): 0
pacific:~ # ceph osd crush set-device-class hdd 0
set osd(s) 0 to class 'hdd'
pacific:~ # ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         0.03596  root default
-3         0.03596      host pacific
 0  hdd    0.01198          osd.0         up   1.00000  1.00000
 1  hdd    0.01199          osd.1         up   1.00000  1.00000
 2  hdd    0.01199          osd.2         up   1.00000  1.00000
pacific:~ # ceph osd crush class ls
[
    "hdd"
]
---snip---

So I would have expected that there are one or more OSDs with that device class, but you already checked that. Do you still find it in the crushmap? To retrieve it you run:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

Regards,
Eugen

Zitat von Michael Thomas:

In 15.2.7, how can I remove an invalid crush class? I'm surprised that I was able to create it in the first place:

[root@ceph1 bin]# ceph osd crush class ls
[
    "ssd",
    "JBOD.hdd",
    "nvme",
    "hdd"
]
[root@ceph1 bin]# ceph osd crush class ls-osd JBOD.hdd
Invalid command: invalid chars . in JBOD.hdd
osd crush class ls-osd <class> :  list all osds belonging to the specific <class>
Error EINVAL: invalid command

There are no devices mapped to this class:

[root@ceph1 bin]# ceph osd crush tree | grep JBOD | wc -l

--Mike
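The asymmetry Eugen reproduces — `set-device-class` accepts `jbod.hdd`, while `ls-osd` rejects the same name with "invalid chars ." — explains how an unqueryable class can exist at all. Judging from that error message, the query path appears to allow only alphanumerics plus `-` and `_`; the exact allowed set below is an assumption inferred from the error, not taken from Ceph's source:

```python
import re

# Hypothetical re-creation of the class-name check that 'ceph osd crush class
# ls-osd' appears to apply, inferred from the "invalid chars . in JBOD.hdd"
# error above. The allowed character set is an assumption.
ALLOWED = re.compile(r'^[A-Za-z0-9_-]+$')

def cli_accepts_class_name(name: str) -> bool:
    """True if the name passes the assumed CLI validation."""
    return bool(ALLOWED.match(name))

for name in ["hdd", "ssd", "nvme", "jbod-hdd", "JBOD.hdd"]:
    print(f"{name}: {'ok' if cli_accepts_class_name(name) else 'rejected'}")
```

A class name like `jbod-hdd` (no dot) would pass such a check on both paths, which is one way to sidestep the problem when creating classes in the first place.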