[ceph-users] why rgw generates large quantities orphan objects?

2022-10-11 Thread 郑亮
Hi all,
Description of problem: [RGW] Bucket/object deletion is leaving behind large
quantities of orphan rados objects.

The cluster was running a COSBench workload. We first removed part of the data
by deleting objects from the COSBench client, and then deleted all of the
buckets with `s3cmd rb --recursive --force`. That removed the buckets, but it
did not reclaim the space.

```
[root@node01 /]# rgw-orphan-list

Available pools:

device_health_metrics
.rgw.root
os-test.rgw.buckets.non-ec
os-test.rgw.log
os-test.rgw.control
os-test.rgw.buckets.index
os-test.rgw.meta
os-test.rgw.buckets.data
deeproute-replica-hdd-pool
deeproute-replica-ssd-pool
cephfs-metadata
cephfs-replicated-pool
.nfs

Which pool do you want to search for orphans (for multiple, use
space-separated list)? os-test.rgw.buckets.data

Pool is "os-test.rgw.buckets.data".

Note: output files produced will be tagged with the current timestamp --
20221008062356.
running 'rados ls' at Sat Oct  8 06:24:05 UTC 2022

running 'rados ls' on pool os-test.rgw.buckets.data.



running 'radosgw-admin bucket radoslist' at Sat Oct  8 06:43:21 UTC 2022
computing delta at Sat Oct  8 06:47:17 UTC 2022

39662551 potential orphans found out of a possible 39844453 (99%).
The results can be found in './orphan-list-20221008062356.out'.

Intermediate files are './rados-20221008062356.intermediate' and
'./radosgw-admin-20221008062356.intermediate'.
***

*** WARNING: This is EXPERIMENTAL code and the results should be used
***  only with CAUTION!
***
Done at Sat Oct  8 06:48:07 UTC 2022.

[root@node01 /]# radosgw-admin gc list
[]

[root@node01 /]# cat orphan-list-20221008062356.out | wc -l
39662551

[root@node01 /]# rados df
POOL_NAME                     USED      OBJECTS  CLONES     COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED     RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
.nfs                          4.3 MiB         4       0         12                   0        0         0      77398   76 MiB        146   79 KiB         0 B          0 B
.rgw.root                     180 KiB        16       0         48                   0        0         0      28749   28 MiB          0      0 B         0 B          0 B
cephfs-metadata               932 MiB     14772       0      44316                   0        0         0    1569690  3.8 GiB    1258651  3.4 GiB         0 B          0 B
cephfs-replicated-pool        738 GiB    300962       0     902886                   0        0         0     794612  470 GiB     770689  245 GiB         0 B          0 B
deeproute-replica-hdd-pool   1016 GiB    104276       0     312828                   0        0         0   18176216  298 GiB  441783780  6.7 TiB         0 B          0 B
deeproute-replica-ssd-pool     30 GiB      3691       0      11073                   0        0         0    2466079  2.1 GiB    8416232  221 GiB         0 B          0 B
device_health_metrics          50 MiB       108       0        324                   0        0         0       1836  1.8 MiB       1944   18 MiB         0 B          0 B
os-test.rgw.buckets.data      5.6 TiB  39844453       0  239066718                   0        0         0  552896177  3.0 TiB  999441015   60 TiB         0 B          0 B
os-test.rgw.buckets.index     1.8 GiB        33       0         99                   0        0         0  153600295  154 GiB  110916573   62 GiB         0 B          0 B
os-test.rgw.buckets.non-ec    2.1 MiB        45       0        135                   0        0         0     574240  349 MiB     153725  139 MiB         0 B          0 B
os-test.rgw.control               0 B         8       0         24                   0        0         0          0      0 B          0      0 B         0 B          0 B
os-test.rgw.log               3.7 MiB       346       0       1038                   0        0         0   83877803   80 GiB    6306730  7.6 GiB         0 B          0 B
os-test.rgw.meta              220 KiB        23       0         69                   0        0         0     640854  506 MiB     108229   53 MiB         0 B          0 B

total_objects40268737
total_used   7.8 TiB
total_avail  1.1 PiB
total_space  1.1 PiB
```
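
For reference, the usual next step (not part of the original report, so treat
it as a hedged sketch only) would be to spot-check a few of the listed objects
and, only once verified, remove them directly from the data pool, reusing the
output file above:

```
# Stat a few of the listed objects to confirm they still exist in the pool:
head -n 5 orphan-list-20221008062356.out | while read obj; do
    rados -p os-test.rgw.buckets.data stat "$obj"
done

# Only after manual verification, delete the listed objects:
while read obj; do
    rados -p os-test.rgw.buckets.data rm "$obj"
done < orphan-list-20221008062356.out
```
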
ceph version:
```
[root@node01 /]# ceph versions
{
"mon": {
"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
pacific (stable)": 3
},
"mgr": {
"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
pacific (stable)": 2
},
"osd": {
"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
pacific (stable)": 108
},
"mds": {
"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
pacific (stable)": 2
},
"rgw": {
"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
pacific (stable)": 9
},
"overall": {
"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17)
pacific (stable)": 124
}
}
```

Thanks,
Best regards
Liang Zheng
___
ceph-users mailing list -- ceph-us

[ceph-users] Re: Updating Git Submodules -- a documentation question

2022-10-11 Thread Brad Hubbard
For untracked files (e.g. src/pybind/cephfs/cephfs.c) all you need is
'git clean -fdx', which you ran last in this case.

Just about everything can be solved by a combination of these commands.

git submodule update --init --recursive
git clean -fdx
git submodule foreach git clean -fdx

If you have files that show up in diff output that have unwanted
changes you can also use 'git checkout .' or 'git checkout
./path/to/filename' to revert the changes.

If you still have persistent problems with a submodule directory after
that just rm the offending directory and run 'git submodule update
--init --recursive' again.

Also, rather than doing 'git checkout main; git pull' on main I would
do 'git checkout main; git fetch origin; git reset --hard origin/main',
as it's easy to get into a state where pull will fail.

HTH.
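
Putting that together, a minimal sketch of the full reset sequence (assuming
you are at the top of a ceph clone and are happy to throw away all local
changes and untracked files):

# WARNING: this discards every local change and untracked file.
git checkout main
git fetch origin
git reset --hard origin/main
git submodule update --init --recursive
git clean -fdx
git submodule foreach git clean -fdx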

On Wed, Oct 12, 2022 at 12:40 PM John Zachary Dover  wrote:
>
> The following console output, which is far too long to include in
> tutorial-style documentation that people are expected to read, shows the
> sequence of commands necessary to diagnose and repair submodules that have
> fallen out of sync with the submodules in the upstream repository.
>
> In this example, my local working copy has fallen out of sync. This will be
> obvious to adepts, but this procedure does not need to be communicated to
> them.
>
> This procedure was given to me by Brad Hubbard.
>
> Untracked files:
> >   (use "git add ..." to include in what will be committed)
> > src/pybind/cephfs/build/
> > src/pybind/cephfs/cephfs.c
> > src/pybind/cephfs/cephfs.egg-info/
> > src/pybind/rados/build/
> > src/pybind/rados/rados.c
> > src/pybind/rados/rados.egg-info/
> > src/pybind/rbd/build/
> > src/pybind/rbd/rbd.c
> > src/pybind/rbd/rbd.egg-info/
> > src/pybind/rgw/build/
> > src/pybind/rgw/rgw.c
> > src/pybind/rgw/rgw.egg-info/
> >
> > nothing added to commit but untracked files present (use "git add" to
> > track)
> > [zdover@fedora ceph]$ cd src/
> > [zdover@fedora src]$ ls
> > arch                  cstart.sh           nasm-wrapper
> > auth                  dmclock             neorados
> > bash_completion       doc                 objclass
> > blk                   dokan               objsync
> > blkin                 erasure-code        ocf
> > btrfs_ioc_test.c      etc-rbdmap          os
> > c-ares                fmt                 osd
> > cephadm               global              osdc
> > ceph-clsinfo          googletest          perfglue
> > ceph_common.sh        include             perf_histogram.h
> > ceph.conf.twoosds     init-ceph.in        powerdns
> > ceph-coverage.in      init-radosgw        ps-ceph.pl
> > ceph-crash.in         isa-l               push_to_qemu.pl
> > ceph-create-keys      jaegertracing       pybind
> > ceph-debugpack.in     java                python-common
> > ceph_fuse.cc          journal             rapidjson
> > ceph.in               json_spirit         rbd_fuse
> > ceph_mds.cc           key_value_store     rbdmap
> > ceph_mgr.cc           krbd.cc             rbd_replay
> > ceph_mon.cc           kv                  rbd-replay-many
> > ceph_osd.cc           libcephfs.cc        README
> > ceph-osd-prestart.sh  libcephsqlite.cc    rgw
> > ceph-post-file.in     libkmip             rocksdb
> > ceph-rbdnamer         librados            s3select
> > ceph_release          librados-config.cc  sample.ceph.conf
> > ceph-run              libradosstriper     script
> > ceph_syn.cc           librbd              seastar
> > ceph_ver.c            loadclass.sh        SimpleRADOSStriper.cc
> > ceph_ver.h.in.cmake   log                 SimpleRADOSStriper.h
> > ceph-volume           logrotate.conf      spawn
> > civetweb              mds                 spdk
> > ckill.sh              messages            stop.sh
> > client                mgr                 telemetry
> > cls                   mon                 test
> > cls_acl.cc            mount               TODO
> > cls_crypto.cc         mount.fuse.ceph     tools
> > CMakeLists.txt        mrgw.sh             tracing
> > cmonctl               mrun                vnewosd.sh
> > common                msg                 vstart.sh
> > compressor            mstart.sh           xxHash
> > crimson               mstop.sh            zstd
> > crush                 multi-dump.sh
> > crypto                mypy.ini
> > [zdover@fedora src]$ git checkout main
> > Switched to branch 'main'
> > Your branch is up to date with 'origin/main'.
> > [zdover@fedora src]$ git pull
> > Already up to date.
> > [zdover@fedora src]$ git status
> > On branch main
> > Your branch is up to date with 'origin/main'.
> >
> > Untracked files:
> >   (use "git add ..." to include in what will be committed)
> > pybind/cephfs/build/
> > pybind/cephfs/cephfs.c
> > pybind/cephfs/cephfs.egg-info/
> > pybind/rados/build/
> > pybind/rados/rados.c
> > pybind/rados/rados.egg-info/
> > pybind/rbd/build/
> > pybind/rbd/rbd.c
> > pybind/rbd/rbd.egg-info/
> > pybind/rgw/build/
> > pybind/rgw/rgw.c
> > pybin

[ceph-users] Re: Updating Git Submodules -- a documentation question

2022-10-11 Thread John Zachary Dover
The following console output, which is far too long to include in
tutorial-style documentation that people are expected to read, shows the
sequence of commands necessary to diagnose and repair submodules that have
fallen out of sync with the submodules in the upstream repository.

In this example, my local working copy has fallen out of sync. This will be
obvious to adepts, but this procedure does not need to be communicated to
them.

This procedure was given to me by Brad Hubbard.

Untracked files:
>   (use "git add ..." to include in what will be committed)
> src/pybind/cephfs/build/
> src/pybind/cephfs/cephfs.c
> src/pybind/cephfs/cephfs.egg-info/
> src/pybind/rados/build/
> src/pybind/rados/rados.c
> src/pybind/rados/rados.egg-info/
> src/pybind/rbd/build/
> src/pybind/rbd/rbd.c
> src/pybind/rbd/rbd.egg-info/
> src/pybind/rgw/build/
> src/pybind/rgw/rgw.c
> src/pybind/rgw/rgw.egg-info/
>
> nothing added to commit but untracked files present (use "git add" to
> track)
> [zdover@fedora ceph]$ cd src/
> [zdover@fedora src]$ ls
> arch                  cstart.sh           nasm-wrapper
> auth                  dmclock             neorados
> bash_completion       doc                 objclass
> blk                   dokan               objsync
> blkin                 erasure-code        ocf
> btrfs_ioc_test.c      etc-rbdmap          os
> c-ares                fmt                 osd
> cephadm               global              osdc
> ceph-clsinfo          googletest          perfglue
> ceph_common.sh        include             perf_histogram.h
> ceph.conf.twoosds     init-ceph.in        powerdns
> ceph-coverage.in      init-radosgw        ps-ceph.pl
> ceph-crash.in         isa-l               push_to_qemu.pl
> ceph-create-keys      jaegertracing       pybind
> ceph-debugpack.in     java                python-common
> ceph_fuse.cc          journal             rapidjson
> ceph.in               json_spirit         rbd_fuse
> ceph_mds.cc           key_value_store     rbdmap
> ceph_mgr.cc           krbd.cc             rbd_replay
> ceph_mon.cc           kv                  rbd-replay-many
> ceph_osd.cc           libcephfs.cc        README
> ceph-osd-prestart.sh  libcephsqlite.cc    rgw
> ceph-post-file.in     libkmip             rocksdb
> ceph-rbdnamer         librados            s3select
> ceph_release          librados-config.cc  sample.ceph.conf
> ceph-run              libradosstriper     script
> ceph_syn.cc           librbd              seastar
> ceph_ver.c            loadclass.sh        SimpleRADOSStriper.cc
> ceph_ver.h.in.cmake   log                 SimpleRADOSStriper.h
> ceph-volume           logrotate.conf      spawn
> civetweb              mds                 spdk
> ckill.sh              messages            stop.sh
> client                mgr                 telemetry
> cls                   mon                 test
> cls_acl.cc            mount               TODO
> cls_crypto.cc         mount.fuse.ceph     tools
> CMakeLists.txt        mrgw.sh             tracing
> cmonctl               mrun                vnewosd.sh
> common                msg                 vstart.sh
> compressor            mstart.sh           xxHash
> crimson               mstop.sh            zstd
> crush                 multi-dump.sh
> crypto                mypy.ini
> [zdover@fedora src]$ git checkout main
> Switched to branch 'main'
> Your branch is up to date with 'origin/main'.
> [zdover@fedora src]$ git pull
> Already up to date.
> [zdover@fedora src]$ git status
> On branch main
> Your branch is up to date with 'origin/main'.
>
> Untracked files:
>   (use "git add ..." to include in what will be committed)
> pybind/cephfs/build/
> pybind/cephfs/cephfs.c
> pybind/cephfs/cephfs.egg-info/
> pybind/rados/build/
> pybind/rados/rados.c
> pybind/rados/rados.egg-info/
> pybind/rbd/build/
> pybind/rbd/rbd.c
> pybind/rbd/rbd.egg-info/
> pybind/rgw/build/
> pybind/rgw/rgw.c
> pybind/rgw/rgw.egg-info/
>
> nothing added to commit but untracked files present (use "git add" to
> track)
> [zdover@fedora src]$
>
> [zdover@fedora ceph]$ git status
> On branch main
> Your branch is up to date with 'origin/main'.
>
> Untracked files:
>   (use "git add ..." to include in what will be committed)
> src/pybind/cephfs/build/
> src/pybind/cephfs/cephfs.c
> src/pybind/cephfs/cephfs.egg-info/
> src/pybind/rados/build/
> src/pybind/rados/rados.c
> src/pybind/rados/rados.egg-info/
> src/pybind/rbd/build/
> src/pybind/rbd/rbd.c
> src/pybind/rbd/rbd.egg-info/
> src/pybind/rgw/build/
> src/pybind/rgw/rgw.c
> src/pybind/rgw/rgw.egg-info/
>
> nothing added to commit but untracked files present (use "git add" to
> track)
> [zdover@fedora ceph]$ git submodule update --force --init --recursive
> Submodule 'ceph-erasure-code-corpus' (
> https://github.com/ceph/ceph-erasure-code-corpus.git) registered for path
> 'ceph-erasure-code-corpus'
> Submodule 'ceph-object-corpus' (
> https://github.com/ceph/ceph-object-corpus.git) registered for path
> 'ceph-object-corpus'
> S

[ceph-users] Re: encrypt OSDs after creation

2022-10-11 Thread Alexander E. Patrakov
Wed, 12 Oct 2022 at 00:32, Ali Akil :
>
> Hello folks,
>
> A couple of months ago I created a Quincy ceph cluster with cephadm. I
> didn't encrypt the OSDs at that time.
> What would be the process to encrypt these OSDs afterwards?
> The documentation only mentions adding `encrypted: true` to the OSD
> manifest, which works only at creation time.

There is no such process. Destroy one OSD, recreate it with the same
ID but with the encryption on, wait for the cluster to heal itself,
then do the same with the next OSD, rinse, repeat. You may want to set
the norebalance flag during the operation.
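
A rough sketch of that loop for one OSD on a cephadm cluster (the OSD id and
host/device names are placeholders; verify each command against the cephadm
documentation for your release before running it):

ceph osd set norebalance
ceph orch osd rm 12 --replace                  # drain and remove osd.12, keep the id reserved
ceph orch osd rm status                        # wait until the removal has finished
ceph orch device zap host01 /dev/sdX --force   # wipe the freed device if needed
# With "encrypted: true" in the OSD service spec, cephadm should then redeploy
# an encrypted (dmcrypt) OSD with the same id on that device.
ceph osd unset norebalance                     # once the cluster is HEALTH_OK again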

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to force PG merging in one step?

2022-10-11 Thread Frank Schilder
Hi Eugen,

thanks, that was a great hint! I have a strong déjà vu feeling; we discussed this
before when increasing pg_num, didn't we? I just set target_max_misplaced_ratio to
1 and it did exactly what I wanted. It's the same number of PGs backfilling, but
pgp_num=1024, so while the rebalancing load is the same, I got rid of any
redundant data movements and I can actually see the progress of the merge just
with ceph status.
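
For the archives, a sketch of the commands this corresponds to (pool name taken
from the quoted message below, so adjust to your own setup):

ceph config set mgr target_max_misplaced_ratio 1   # effectively disables the misplaced-ratio throttling
ceph osd pool set con-fs2-meta2 pg_num 512         # pgp_num can now follow in one step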

Related to that, I have set mon_max_pg_per_osd=300 and do have OSDs with more 
than 400 PGs. Still, I don't see the promised health warning in ceph status. Is 
this a known issue?

Opinion part.

Returning to the above setting, I have to say that the assignment of which
parameter influences what seems a bit unintuitive, if not inconsistent. The
parameter target_max_misplaced_ratio belongs to the balancer module, but merging
PGs is clearly a task of the pg_autoscaler module. I'm not balancing, I'm scaling
PG numbers. Such cross-dependencies make it really hard to find relevant
information in the section of the documentation where one would be looking for
it; the information ends up scattered all over the place.

If it's not possible to have such things separated and specific tasks
consistently explained in a single section, there could at least be a hint
covering PG merging/splitting in the description of target_max_misplaced_ratio,
so that a search for these terms brings up that page. There should also be a
cross-reference from "ceph osd pool set pg[p]_num" to target_max_misplaced_ratio.
Well, it's now here in this message for Google to reveal.

I have to add that, while I understand the motivation behind adding these
babysitting modules, I would really appreciate it if one could disable them. I
personally find them really annoying, especially in emergency situations, but
also in normal operations. I would consider them a nice-to-have that should not
be forced on people who want to be in charge.

For example, in my current situation, I'm halving the PG count of a pool. Doing
the merge in one go or letting target_max_misplaced_ratio "help" me leads to
exactly the same number of PGs backfilling at any time, which means that both
settings, target_max_misplaced_ratio=0.05 and 1, cause exactly the same
interference of rebalancing IO with user IO. The difference is that with
target_max_misplaced_ratio=0.05 this phase of reduced performance takes longer,
because every time the module decides to change pgp_num it will inevitably also
rebalance objects that have already been moved before. I find it difficult to
consider that an improvement. I prefer to avoid any redundant writes at all cost
for the benefit of disk lifetime. If I really need to reduce the impact of
recovery IO, I can set recovery_sleep.

That's my personal opinion, offered to the user group.

Thanks for your help and have a nice evening!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 11 October 2022 14:13:45
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to force PG merging in one step?

Hi Frank,

I don't think it's the autoscaler interfering here but the default 5%
target_max_misplaced_ratio. I haven't tested the impacts of increasing
that to a much higher value, so be careful.

Regards,
Eugen


Zitat von Frank Schilder :

> Hi all,
>
> I need to reduce the number of PGs in a pool from 2048 to 512 and
> would really like to do that in a single step. I executed the set
> pg_num 512 command, but the PGs are not all merged. Instead I get
> this intermediate state:
>
> pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3
> object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512
> pgp_num_target 512 autoscale_mode off last_change 916710 lfor
> 0/0/618995 flags hashpspool,nodelete,selfmanaged_snaps max_bytes
> 107374182400 stripe_width 0 compression_mode none application cephfs
>
> This is really annoying, because it will not only lead to repeated
> redundant data movements and but I also need to rebalance this pool
> onto fewer OSDs, which cannot hold the 1946 PGs it will be merged to
> intermittently. How can I override the autoscaler interfering with
> admin operations in such tight corners?
>
> As you can see, we disabled autoscaler on all pools and also
> globally. Still, it interferes with admin commands in an unsolicited
> way. I would like the PG merge happen on the fly as the data moves
> to the new OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
T

[ceph-users] Re: crush hierarchy backwards and upmaps ...

2022-10-11 Thread Dan van der Ster
Hi Chris,

Just curious, does this rule make sense and help with the multi-level crush
map issue?
(Maybe it also results in zero movement, or at least less then the
alternative you proposed?)

step choose indep 4 type rack
step chooseleaf indep 2 type chassis
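
In full-rule form that would be something like this (a sketch only; rule and
root names borrowed from your mail, 4 racks x 2 chassis = 8 chunks for the
6+2 EC pool):

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type chassis
    step emit
}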

Cheers, Dan




On Tue, Oct 11, 2022, 19:29 Christopher Durham  wrote:

> Dan,
>
> Thank you.
>
> I did what you said regarding --test-map-pgs-dump and it wants to move 3
> OSDs in every PG. Yuk.
>
> So before I do that, I tried this rule, after changing all my 'pod' bucket
> definitions to 'chassis', and compiling and
> injecting the new crushmap to an osdmap:
>
>
> rule mypoolname {
> id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step choose indep 2 type chassis
> step chooseleaf indep 1 type host
> step emit
>
> }
>
> --test-pg-upmap-entries shows there were NO changes to be done after
> comparing it with the original!!!
>
> However, --upmap-cleanup says:
>
> verify_upmap number of buckets 8 exceeds desired number of 2
> check_pg_upmaps verify_upmap of poolid.pgid returning -22
>
> This is output for every current upmap, but I really do want 8 total
> buckets per PG, as my pool is a 6+2.
>
> The upmap-cleanup output wants me to remove all of my upmaps.
>
> This seems consistent with a bug report that says that there is a problem
> with the balancer on a
> multi-level rule such as the above, albeit on 14.2.x. Any thoughts?
>
> https://tracker.ceph.com/issues/51729
>
> I am leaning towards just eliminating the middle step and going directly
> from rack to host, even though it wants to move a LARGE amount of data
> according to a diff of --test-pg-upmap-entries before and after.
> In this scenario, I don't see any unexpected errors with --upmap-cleanup,
> and I do not want to get stuck:
>
> rule mypoolname {
> id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step chooseleaf indep 2 type host
> step emit
> }
>
> -Chris
>
>
> -Original Message-
> From: Dan van der Ster 
> To: Christopher Durham 
> Cc: Ceph Users 
> Sent: Mon, Oct 10, 2022 12:22 pm
> Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...
>
> Hi,
>
> Here's a similar bug: https://tracker.ceph.com/issues/47361
>
> Back then, upmap would generate mappings that invalidate the crush rule. I
> don't know if that is still the case, but indeed you'll want to correct
> your rule.
>
> Something else you can do before applying the new crush map is use
> osdmaptool to compare the PGs placement before and after, something like:
>
> osdmaptool --test-map-pgs-dump osdmap.before > before.txt
>
> osdmaptool --test-map-pgs-dump osdmap.after > after.txt
>
> diff -u before.txt after.txt
>
> The above will help you estimate how much data will move after injecting
> the fixed crush map. So depending on the impact you can schedule the change
> appropriately.
>
> I also recommend to keep a backup of the previous crushmap so that you can
> quickly restore it if anything goes wrong.
>
> Cheers, Dan
>
>
>
>
>
> On Mon, Oct 10, 2022, 19:31 Christopher Durham  wrote:
>
> > Hello,
> > I am using pacific 16.2.10 on Rocky 8.6 Linux.
> >
> > After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr,
> I
> > achieved a near perfect balance of PGs and space on my OSDs. This is
> great.
> >
> > However, I started getting the following errors on my ceph-mon logs,
> every
> three minutes, for each of the OSDs that had been mapped by the balancer:
> >2022-10-07T17:10:39.619+ 7f7c2786d700 1 verify_upmap unable to get
> > parent of osd.497, skipping for now
> >
> > After banging my head against the wall for a bit trying to figure this
> > out, I think I have discovered the issue:
> >
> > Currently, I have my pool EC Pool configured with the following crush
> rule:
> >
> > rule mypoolname {
> >id -5
> >type erasure
> >step take myroot
> >step choose indep 4 type rack
> >step choose indep 2 type pod
> >step chooseleaf indep 1 type host
> >step emit
> > }
> >
> > Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> > each pod, For a total of
> > 8 chunks. (The pool is a is a 6+2). The 4 racks are chosen from the
> myroot
> > root entry, which is as follows.
> >
> >
> > root myroot {
> >id -400
> >item rack1 weight N
> >item rack2 weight N
> >item rack3 weight N
> >item rack4 weight N
> > }
> >
> > This has worked fine since inception, over a year ago. And the PGs are
> all
> > as I expect with OSDs from the 4 racks and not on the same host or pod.
> >
> > The errors above, verify_upmap, started after I had the upmap_
> > max_deviation set to 1 in the balancer and having it
> > move things around, creating pg_upmap entries.
> >
> > I then discovered, while trying to figure this out, that the device types
> > are:
> >
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > ...
> > type 6 pod
> >
> > 

[ceph-users] Re: crush hierarchy backwards and upmaps ...

2022-10-11 Thread Christopher Durham
Dan,
Thank you.
I did what you said regarding --test-map-pgs-dump and it wants to move 3 OSDs 
in every PG. Yuk.
So before I do that, I tried this rule, after changing all my 'pod' bucket
definitions to 'chassis', and compiling and injecting the new crushmap into
an osdmap:


rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step choose indep 2 type chassis
    step chooseleaf indep 1 type host
    step emit
}
--test-pg-upmap-entries shows there were NO changes to be done after comparing 
it with the original!!!
However, --upmap-cleanup says:
verify_upmap number of buckets 8 exceeds desired number of 2
check_pg_upmaps verify_upmap of poolid.pgid returning -22
This is output for every current upmap, but I really do want 8 total buckets 
per PG, as my pool is a 6+2. 

The upmap-cleanup output wants me to remove all of my upmaps.

This seems consistent with a bug report that says that there is a problem with 
the balancer on a 
multi-level rule such as the above, albeit on 14.2.x. Any thoughts?

https://tracker.ceph.com/issues/51729

I am leaning towards just eliminating the middle step and going directly from
rack to host, even though it wants to move a LARGE amount of data according to
a diff of --test-pg-upmap-entries before and after. In this scenario, I don't
see any unexpected errors with --upmap-cleanup, and I do not want to get stuck:

rule mypoolname {
    id -5
    type erasure
    step take myroot
    step choose indep 4 type rack
    step chooseleaf indep 2 type host
    step emit
}
-Chris

 
-Original Message-
From: Dan van der Ster 
To: Christopher Durham 
Cc: Ceph Users 
Sent: Mon, Oct 10, 2022 12:22 pm
Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...

Hi,

Here's a similar bug: https://tracker.ceph.com/issues/47361

Back then, upmap would generate mappings that invalidate the crush rule. I
don't know if that is still the case, but indeed you'll want to correct
your rule.

Something else you can do before applying the new crush map is use
osdmaptool to compare the PGs placement before and after, something like:

osdmaptool --test-map-pgs-dump osdmap.before > before.txt

osdmaptool --test-map-pgs-dump osdmap.after > after.txt

diff -u before.txt after.txt

The above will help you estimate how much data will move after injecting
the fixed crush map. So depending on the impact you can schedule the change
appropriately.

I also recommend to keep a backup of the previous crushmap so that you can
quickly restore it if anything goes wrong.

Cheers, Dan





On Mon, Oct 10, 2022, 19:31 Christopher Durham  wrote:

> Hello,
> I am using pacific 16.2.10 on Rocky 8.6 Linux.
>
> After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr, I
> achieved a near perfect balance of PGs and space on my OSDs. This is great.
>
> However, I started getting the following errors on my ceph-mon logs, every
> three minutes, for each of the OSDs that had been mapped by the balancer:
>    2022-10-07T17:10:39.619+ 7f7c2786d700 1 verify_upmap unable to get
> parent of osd.497, skipping for now
>
> After banging my head against the wall for a bit trying to figure this
> out, I think I have discovered the issue:
>
> Currently, I have my pool EC Pool configured with the following crush rule:
>
> rule mypoolname {
>    id -5
>    type erasure
>    step take myroot
>    step choose indep 4 type rack
>    step choose indep 2 type pod
>    step chooseleaf indep 1 type host
>    step emit
> }
>
> Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> each pod, For a total of
> 8 chunks. (The pool is a is a 6+2). The 4 racks are chosen from the myroot
> root entry, which is as follows.
>
>
> root myroot {
>    id -400
>    item rack1 weight N
>    item rack2 weight N
>    item rack3 weight N
>    item rack4 weight N
> }
>
> This has worked fine since inception, over a year ago. And the PGs are all
> as I expect with OSDs from the 4 racks and not on the same host or pod.
>
> The errors above, verify_upmap, started after I had the upmap_
> max_deviation set to 1 in the balancer and having it
> move things around, creating pg_upmap entries.
>
> I then discovered, while trying to figure this out, that the device types
> are:
>
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> ...
> type 6 pod
>
> So pod is HIGHER in the hierarchy than rack. I have it as lower in my
> rule.
>
> What I want to do is remove the pods completely to work around this.
> Something like:
>
> rule mypoolname {
>        id -5
>        type erasure
>        step take myroot
>        step choose indep 4 type rack
>        step chooseleaf indep 2 type host
>        step emit
> }
>
> This will pick 4 racks and then 2 hosts in each rack. Will this cause any
> problems? I can add the pod stuff back later as 'chassis' instead. I can
> live without the 'pod' separation if needed.
>
> To test this, I tried doing something like this:
>
> 1. grab the osdma

[ceph-users] encrypt OSDs after creation

2022-10-11 Thread Ali Akil

Hello folks,

A couple of months ago I created a Quincy ceph cluster with cephadm. I
didn't encrypt the OSDs at that time.
What would be the process to encrypt these OSDs afterwards?
The documentation only mentions adding `encrypted: true` to the OSD
manifest, which works only at creation time.

Regards,
Ali

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inherited CEPH nightmare

2022-10-11 Thread Tino Todino
Hi Janne,

I've changed some elements of the config now and the results are much better,
but still quite poor relative to what I would consider normal SSD performance.

The osd_memory_target is now set to 12 GB for 3 of the 4 hosts (each of these
hosts has 1.5 TB RAM, so I can allocate loads if necessary). The other host is a
more modern small-form-factor server with only 64 GB on board, so each of the 3
OSDs in that host gets 4 GB. The number of PGs has been increased from 128 to
256. I have not yet run the JJ Balancer.

In terms of performance, I measured the time it takes Proxmox to clone a 127 GB
VM. It now clones in around 18 minutes, rather than the 1 hour 55 minutes before
the config changes, so there is progress here.

I also had a play around with enabling and disabling the drive write cache. I
ran a rudimentary `ceph tell osd.x bench` to see what the performance would be
with it on/off. The results were surprising, as the disks provided far more
IOPS with the cache ENABLED rather than disabled.
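
For anyone who wants to repeat the comparison, it was roughly this (a sketch;
the device name is a placeholder, hdparm applies to SATA drives, and the cache
setting may not survive a reboot):

ceph tell osd.0 bench        # baseline with the current cache setting
hdparm -W 0 /dev/sdX         # disable the drive's volatile write cache
ceph tell osd.0 bench        # run again and compare IOPS
hdparm -W 1 /dev/sdX         # re-enable the cache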

To round out your question, we are on BlueStore, with Ceph v16.2.7.

I still think the next steps are to replace the remaining 6 consumer-grade
devices with the Seagate IronWolf 125 1TB SSDs, which seem to perform much
better according to ceph benchmarks, and after that increase the number of
hosts to 6 and spread the 12 OSDs so that each host has only 2 OSDs.

Any other suggestions are welcome.

Many thanks.

Current ceph.conf:

[global]
 auth_client_required = cephx
 auth_cluster_required = cephx
 auth_service_required = cephx
 cluster_network = 192.168.8.4/24
 fsid = 4a4b4fff-d140-4e11-a35b-cbac0e18a3ce
 mon_allow_pool_delete = true
 mon_host = 192.168.8.5 192.168.8.3 192.168.8.6
 ms_bind_ipv4 = true
 ms_bind_ipv6 = false
 osd_memory_target = 12884901888
 osd_pool_default_min_size = 2
 osd_pool_default_size = 3
 public_network = 192.168.8.4/24

[client]
 keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
 keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.cl1-h1-lv]
 host = cl1-h1-lv
 mds_standby_for_name = pve

[mds.cl1-h2-lv]
 host = cl1-h2-lv
 mds_standby_for_name = pve

[mds.cl1-h3-lv]
 host = cl1-h3-lv
 mds_standby_for_name = pve

[mon.cl1-h1-lv]
 public_addr = 192.168.8.3

[mon.cl1-h3-lv]
 public_addr = 192.168.8.5

[mon.cl1-h4-lv]
 public_addr = 192.168.8.6


And crush map:


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host cl1-h2-lv {
id -3   # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 2.729
alg straw2
hash 0  # rjenkins1
item osd.0 weight 0.910
item osd.5 weight 0.910
item osd.10 weight 0.910
}
host cl1-h3-lv {
id -5   # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 2.729
alg straw2
hash 0  # rjenkins1
item osd.1 weight 0.910
item osd.6 weight 0.910
item osd.11 weight 0.910
}
host cl1-h4-lv {
id -7   # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 2.729
alg straw2
hash 0  # rjenkins1
item osd.7 weight 0.910
item osd.2 weight 0.910
item osd.3 weight 0.910
}
host cl1-h1-lv {
id -9   # do not change unnecessarily
id -10 class ssd# do not change unnecessarily
# weight 2.729
alg straw2
hash 0  # rjenkins1
item osd.4 weight 0.910
item osd.9 weight 0.910
item osd.12 weight 0.910
}
root default {
id -1   # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 10.916
alg straw2
hash 0  # rjenkins1
item cl1-h2-lv weight 2.729
item cl1-h3-lv weight 2.729
item cl1-h4-lv weight 2.729
item cl1-h1-lv weight 2.729
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step

[ceph-users] Autoscaler stopped working after upgrade Octopus -> Pacific

2022-10-11 Thread Andreas Haupt
Dear all,

just upgraded our cluster from Octopus to Pacific (16.2.10). This
introduced an error in autoscaler:

2022-10-11T14:47:40.421+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 17 
has overlapping roots: {-4, -1}
2022-10-11T14:47:40.423+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 22 
has overlapping roots: {-4, -1}
2022-10-11T14:47:40.423+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 23 
has overlapping roots: {-4, -1}
2022-10-11T14:47:40.427+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 27 
has overlapping roots: {-6, -4, -1}
2022-10-11T14:47:40.428+0200 7f3ec2d03700  0 [pg_autoscaler ERROR root] pool 28 
has overlapping roots: {-6, -4, -1}

Autoscaler status is empty:

[cephmon1] /root # ceph osd pool autoscale-status
[cephmon1] /root # 


On https://forum.proxmox.com/threads/ceph-overlapping-roots.104199/ I
found something similar:

---
I assume that you have at least one pool that still has the
"replicated_rule" assigned, which does not make a distinction between the
device class of the OSDs.

This is why you see this error. The autoscaler cannot decide how many PGs
the pools need. Make sure that all pools are assigned a rule that limit
them to a device class and the errors should stop.
---
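
For context, the per-device-class assignment the forum post refers to would
look something like this (rule name "replicated_hdd" and pool name "mypool"
are just examples):

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool set mypool crush_rule replicated_hdd   # repeat for each pool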

Indeed, we have a mixed cluster (hdd + ssd) with some pools spanning hdd-
only, some ssd-only and some both (ec & replicated) which don't care about
the storage device class (e.g. via default "replicated_rule"):

[cephmon1] /root # ceph osd crush rule ls
replicated_rule
ssd_only_replicated_rule
hdd_only_replicated_rule
default.rgw.buckets.data.ec42
test.ec42
[cephmon1] /root #


That worked flawlessly up to and including Octopus. Any idea how to make the
autoscaler work again with that kind of setup? Do pools really have to be
restricted to a single device class in Pacific in order to get a functional
autoscaler?

Thanks,
Andreas
-- 
| Andreas Haupt| E-Mail: andreas.ha...@desy.de
|  DESY Zeuthen| WWW:http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6 | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen | Fax:+49/33762/7-7216

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to force PG merging in one step?

2022-10-11 Thread Eugen Block

Hi Frank,

I don't think it's the autoscaler interfering here but the default 5%
target_max_misplaced_ratio. I haven't tested the impacts of increasing  
that to a much higher value, so be careful.


Regards,
Eugen


Zitat von Frank Schilder :


Hi all,

I need to reduce the number of PGs in a pool from 2048 to 512 and  
would really like to do that in a single step. I executed the set  
pg_num 512 command, but the PGs are not all merged. Instead I get  
this intermediate state:


pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3  
object_hash rjenkins pg_num 2048 pgp_num 1946 pg_num_target 512  
pgp_num_target 512 autoscale_mode off last_change 916710 lfor  
0/0/618995 flags hashpspool,nodelete,selfmanaged_snaps max_bytes  
107374182400 stripe_width 0 compression_mode none application cephfs


This is really annoying, because it will not only lead to repeated  
redundant data movements and but I also need to rebalance this pool  
onto fewer OSDs, which cannot hold the 1946 PGs it will be merged to  
intermittently. How can I override the autoscaler interfering with  
admin operations in such tight corners?


As you can see, we disabled autoscaler on all pools and also  
globally. Still, it interferes with admin commands in an unsolicited  
way. I would like the PG merge happen on the fly as the data moves  
to the new OSDs.


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Invalid crush class

2022-10-11 Thread Eugen Block
The only way I could reproduce this was by removing the existing class
from an OSD and setting a different one:


---snip---
pacific:~ # ceph osd crush rm-device-class 0
done removing class of osd(s): 0

pacific:~ # ceph osd crush set-device-class jbod.hdd 0
set osd(s) 0 to class 'jbod.hdd'

pacific:~ # ceph osd tree
ID  CLASS WEIGHT   TYPE NAME STATUS  REWEIGHT  PRI-AFF
-10.03596  root default
-30.03596  host pacific
 1   hdd  0.01199  osd.1 up   1.0  1.0
 2   hdd  0.01199  osd.2 up   1.0  1.0
 0  jbod.hdd  0.01198  osd.0 up   1.0  1.0

pacific:~ # ceph osd crush class ls
[
"hdd",
"jbod.hdd"
]
---snip---

But if I remove it from the OSD the device class is gone as well:

---snip---
pacific:~ # ceph osd crush rm-device-class 0
done removing class of osd(s): 0

pacific:~ # ceph osd crush set-device-class hdd 0
set osd(s) 0 to class 'hdd'

pacific:~ # ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME STATUS  REWEIGHT  PRI-AFF
-1 0.03596  root default
-3 0.03596  host pacific
 0hdd  0.01198  osd.0 up   1.0  1.0
 1hdd  0.01199  osd.1 up   1.0  1.0
 2hdd  0.01199  osd.2 up   1.0  1.0

pacific:~ # ceph osd crush class ls
[
"hdd"
]
---snip---

So I would have expected that there are one or more OSDs with that
device class, but you already checked that. Do you still find it in the
crushmap? To retrieve it you run:


ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
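
If it does show up there, one possible way out (untested here, so keep the
original binary map around as a backup) is to edit the decompiled map, then
recompile and re-inject it:

crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new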

Regards,
Eugen

Zitat von Michael Thomas :

In 15.2.7, how can I remove an invalid crush class?  I'm surprised  
that I was able to create it in the first place:


[root@ceph1 bin]# ceph osd crush class ls
[
"ssd",
"JBOD.hdd",
"nvme",
"hdd"
]


[root@ceph1 bin]# ceph osd crush class ls-osd JBOD.hdd
Invalid command: invalid chars . in JBOD.hdd
osd crush class ls-osd  :  list all osds belonging to the  
specific 


Error EINVAL: invalid command

There are no devices mapped to this class:

[root@ceph1 bin]# ceph osd crush tree | grep JBOD | wc -l


--Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io