[ceph-users] TOO_MANY_PGS after upgrade from Nautilus to Octopus
Hi,

We are currently upgrading our cluster from Nautilus to Octopus. After upgrading the mons and mgrs, we get warnings about the number of PGs. Which parameter changed during the upgrade to explain these new warnings? Nothing else was changed. Is it risky to change the PGs per pool as proposed in the warnings? In particular, to reduce from 4096 to 64!

Thanks in advance,
Patrick

[root@server4 ~]# ceph -s
  cluster:
    id:     ba00c030-382f-4d75-b150-5b17f77e57fe
    health: HEALTH_WARN
            clients are using insecure global_id reclaim
            6 pools have too few placement groups
            9 pools have too many placement groups

  services:
    mon: 3 daemons, quorum server2,server5,server6 (age 66m)
    mgr: server8(active, since 67m), standbys: server4, server1
    osd: 244 osds: 244 up (since 12m), 244 in (since 2w)
    rgw: 2 daemons active (server1, server4)

  task status:

  data:
    pools:   16 pools, 11441 pgs
    objects: 2.02M objects, 5.9 TiB
    usage:   18 TiB used, 982 TiB / 1000 TiB avail
    pgs:     11441 active+clean

  io:
    client:  862 KiB/s rd, 1.4 MiB/s wr, 61 op/s rd, 100 op/s wr

[root@server4 ~]# ceph health detail
...
[WRN] POOL_TOO_MANY_PGS: 9 pools have too many placement groups
    Pool default.rgw.buckets.index has 128 placement groups, should have 32
    Pool default.rgw.buckets.data has 4096 placement groups, should have 64
    Pool os_glance has 1024 placement groups, should have 32
...

[root@server4 ~]# ceph config get mon mon_max_pg_per_osd
250

In ceph.conf, we also set:
osd_max_pg_per_osd_hard_ratio = 3
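For reference, a sketch of what I understand I could run to inspect (and, if needed, silence) the recommendations, assuming the warnings come from the pg_autoscaler module that Octopus enables by default; the pool name is just one taken from the health detail above:

    # show the autoscaler's view of each pool (current vs. suggested PG_NUM)
    ceph osd pool autoscale-status
    # keep pg_num as it is and stop the recommendation for one pool
    ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off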
[ceph-users] HELP NEEDED: cephadm adopt osd crash
Hi,

We've already converted two PRODUCTION storage nodes on Octopus to cephadm without problems. On the third one, we succeeded in converting only one OSD.

[root@server4 osd]# cephadm adopt --style legacy --name osd.0
Found online OSD at //var/lib/ceph/osd/ceph-0/fsid
objectstore_type is bluestore
Stopping old systemd unit ceph-osd@0...
Disabling old systemd unit ceph-osd@0...
Moving data...
Chowning content...
Chowning /var/lib/ceph/fsid replaced/osd.0/block...
Renaming /etc/ceph/osd/0-2d973f03-82f3-499f-b5dc-d4c28dbe1b3d.json -> /etc/ceph/osd/0-2d973f03-82f3-499f-b5dc-d4c28dbe1b3d.json.adopted-by-cephadm
Disabling host unit ceph-volume@ simple unit...
Moving logs...
Creating new units...

For the others, we get this error:

[root@server4 osd]# cephadm adopt --style legacy --name osd.17
Found online OSD at //var/lib/ceph/osd/ceph-17/fsid
objectstore_type is bluestore
Stopping old systemd unit ceph-osd@17...
Disabling old systemd unit ceph-osd@17...
Moving data...
Traceback (most recent call last):
  File "/sbin/cephadm", line 6251, in 
    r = args.func()
  File "/sbin/cephadm", line 1458, in _default_image
    return func()
  File "/sbin/cephadm", line 4027, in command_adopt
    command_adopt_ceph(daemon_type, daemon_id, fsid);
  File "/sbin/cephadm", line 4170, in command_adopt_ceph
    os.rmdir(data_dir_src)
OSError: [Errno 16] Device or resource busy: '//var/lib/ceph/osd/ceph-17'

The directory /var/lib/ceph/osd/ceph-17 is now empty. The directory /var/lib/ceph//osd.17 contains:

[root@server4 osd.17]# ls -l
total 72
-rw-r--r-- 1 ceph ceph  411 Jan 29  2018 activate.monmap
-rw-r--r-- 1 ceph ceph    3 Jan 29  2018 active
lrwxrwxrwx 1 root root   10 Nov  8 15:54 block -> /dev/sdad2
-rw-r--r-- 1 ceph ceph   37 Jan 29  2018 block_uuid
-rw-r--r-- 1 ceph ceph    2 Jan 29  2018 bluefs
-rw-r--r-- 1 ceph ceph   37 Jan 29  2018 ceph_fsid
-rw-r--r-- 1 ceph ceph 1226 Nov  8 15:53 config
-rw-r--r-- 1 ceph ceph   37 Jan 29  2018 fsid
-rw------- 1 ceph ceph   57 Jan 29  2018 keyring
-rw-r--r-- 1 ceph ceph    8 Jan 29  2018 kv_backend
-rw-r--r-- 1 ceph ceph   21 Jan 29  2018 magic
-rw-r--r-- 1 ceph ceph    4 Jan 29  2018 mkfs_done
-rw-r--r-- 1 ceph ceph    6 Jan 29  2018 ready
-rw------- 1 ceph ceph    3 Nov  8 14:47 require_osd_release
-rw-r--r-- 1 ceph ceph    0 Jan 13  2020 systemd
-rw-r--r-- 1 ceph ceph   10 Jan 29  2018 type
-rw------- 1 root root   22 Nov  8 15:53 unit.image
-rw------- 1 root root 1042 Nov  8 16:30 unit.poststop
-rw------- 1 root root 1851 Nov  8 16:30 unit.run
-rw-r--r-- 1 ceph ceph    3 Jan 29  2018 whoami

When trying to start or redeploy osd.17, podman inspect complains about a non-existent image:

2022-11-08 16:58:58,503 7f930fab3740 DEBUG Running command: /bin/podman inspect --format {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index .Config.Labels "io.ceph.version"}} ceph--osd.17
2022-11-08 16:58:58,591 7f930fab3740 DEBUG /bin/podman: stderr Error: error getting image "ceph--osd.17": unable to find a name and tag match for ceph--osd.17 in repotags: no such image

Is there a way to save osd.17 and create the podman image manually?

Thanks in advance,
Patrick
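P.S. Since os.rmdir() usually fails with EBUSY when the directory is still a mountpoint, I wonder whether the old OSD data mount was simply still active when cephadm tried to remove it. Would something like the following be a sensible check before retrying the adoption? (Just a guess on my side, not a confirmed fix.)

    # is the legacy OSD directory still mounted?
    findmnt /var/lib/ceph/osd/ceph-17
    # if it is, unmount it and retry the adoption
    umount /var/lib/ceph/osd/ceph-17
    cephadm adopt --style legacy --name osd.17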
[ceph-users] setup problem for ingress + SSL for RGW
Hi,

Our cluster runs Pacific on Rocky 8. We have 3 RGWs running on port 7480. I tried to set up an ingress service with a yaml definition of the service: no luck.

service_type: ingress
service_id: rgw.myceph.be
placement:
  hosts:
    - ceph001
    - ceph002
    - ceph003
spec:
  backend_service: rgw.myceph.be
  virtual_ip: 192.168.0.10
  frontend_port: 443
  monitor_port: 9000
  ssl_cert: |
    -----BEGIN PRIVATE KEY-----
    ...
    -----END PRIVATE KEY-----
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----

I tried to set up the ingress service with the dashboard... still no luck. So I started debugging the problem.

1. Even though I entered the certificate and the private key in the form, Ceph complained about a missing haproxy.pem.key file. I added the file manually in the container definition folder, and the haproxy containers started!

2. Looking at the monitoring page of HAProxy, I realized that there was no backend server defined. In the form, I had selected manually the servers running the rgw. In the container definition folder, the backend definition of haproxy.cfg looks like:

...
backend backend
    option forwardfor
    balance static-rr
    option httpchk HEAD / HTTP/1.0

No mention of the servers or of port 7480. Once again, I added the definitions manually:

    server ceph001 192.168.0.1:7480 check
    server ceph004 192.168.0.2:7480 check
    server ceph008 192.168.0.2:7480 check

and redeployed the containers. Now it's working.

Any idea?

Patrick
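For completeness, this is roughly how I applied the spec and checked the resulting daemons (the file name ingress.yaml is just what I call the spec above locally):

    ceph orch apply -i ingress.yaml
    # confirm the ingress service and its haproxy/keepalived daemons were created
    ceph orch ls ingress
    ceph orch ps --daemon-type haproxy
    ceph orch ps --daemon-type keepalived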
[ceph-users] How to submit a bug report?
Hi,

I suspect a bug in cephadm when configuring the ingress service for rgw. Our production cluster was upgraded continuously from Luminous to Pacific. When configuring the ingress service for rgw, the generated haproxy.cfg is incomplete. The same yaml file applied on our test cluster does the job.

Regards,
Patrick
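In case it is useful for the report, this is how I plan to dump the applied ingress spec on both clusters, so I can diff what cephadm actually stored on each of them:

    # export the ingress service spec as cephadm sees it
    ceph orch ls ingress --export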
[ceph-users] Unable to deploy a new manager in Octopus
Hi,

On my test cluster, I migrated from Nautilus to Octopus and then converted most of the daemons to cephadm. I had a lot of problems with podman 1.6.4 on CentOS 7 through an https proxy, because my servers are on a private network. Now I'm unable to deploy new managers and the cluster is in a bizarre situation:

[root@cepht003 f5a025f9-fbe8-4506-8769-453902eb28d6]# ceph -s
  cluster:
    id:     f5a025f9-fbe8-4506-8769-453902eb28d6
    health: HEALTH_WARN
            client is using insecure global_id reclaim
            mons are allowing insecure global_id reclaim
            failed to probe daemons or devices
            42 stray daemon(s) not managed by cephadm
            2 stray host(s) with 39 daemon(s) not managed by cephadm
            1 daemons have recently crashed

  services:
    mon: 5 daemons, quorum cepht003,cepht002,cepht001,cepht004,cephtstor01 (age 19m)
    mgr: cepht004.wyibzh(active, since 29m), standbys: cepht003.aa
    mds: fsdup:1 fsec:1 {fsdup:0=fsdup.cepht001.opiyzk=up:active,fsec:0=fsec.cepht003.giatub=up:active} 7 up:standby
    osd: 40 osds: 40 up (since 92m), 40 in (since 3d)
    rgw: 2 daemons active (cepht001, cepht004)

  task status:

  data:
    pools:   18 pools, 577 pgs
    objects: 6.32k objects, 24 GiB
    usage:   80 GiB used, 102 TiB / 102 TiB avail
    pgs:     577 active+clean

[root@cepht003 f5a025f9-fbe8-4506-8769-453902eb28d6]# ceph orch ps
NAME                         HOST         STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                 IMAGE ID      CONTAINER ID
mds.fdec.cepht004.vbuphb     cepht004     running (62m)  47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  5fad10ffc981
mds.fdec.cephtstor01.gtxsnr  cephtstor01  running (24m)  46s ago    24m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  24e837f6ac8a
mds.fdup.cepht001.nydfzs     cepht001     running (2h)   47s ago    2h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  b1880e343ece
mds.fdup.cepht003.thsnbk     cepht003     running (34m)  45s ago    34m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  ddd4e395e7b3
mds.fsdup.cepht001.opiyzk    cepht001     running (4h)   47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  ad081f718863
mds.fsdup.cepht004.cfnxxw    cepht004     running (62m)  47s ago    20h  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  c6feed82af8f
mds.fsec.cepht002.uebrlc     cepht002     running (20m)  47s ago    20m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  836f448c5708
mds.fsec.cepht003.giatub     cepht003     running (76m)  45s ago    5h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  f235957145cb
mgr.cepht003.aa              cepht003     stopped        45s ago    20h  15.2.6   quay.io/ceph/ceph:v15.2.6  f16a759354cc  770d7cf078ad
mgr.cepht004.wyibzh          cepht004     unknown        47s ago    20h  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  6baa0f625271
mon.cepht001                 cepht001     running (4h)   47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  e7f24769153c
mon.cepht002                 cepht002     running (20m)  47s ago    20m  15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  dbb5be113201
mon.cepht003                 cepht003     running (76m)  45s ago    5h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  6c2d6707b3fe
mon.cepht004                 cepht004     running (62m)  47s ago    4h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  7986b598fd17
mon.cephtstor01              cephtstor01  running (93m)  46s ago    2h   15.2.13  docker.io/ceph/ceph:v15    2cf504fded39  dbd9255aab10
osd.10                       cephtstor01  running (93m)  46s ago    2h   15.2.16  quay.io/ceph/ceph:v15      8d5775c85c6a  01b07c4a75f7

When I try to create a new mgr, I get:

[ceph: root@cepht002 /]# ceph orch daemon add mgr cepht002
Error EINVAL: cephadm exited with an error code: 1, stderr:Deploy daemon mgr.cepht002.kqhnbt ...
Verifying port 8443 ...
ERROR: TCP Port(s) '8443' required for mgr already in use

But nothing runs on that port:

[root@cepht002 f5a025f9-fbe8-4506-8769-453902eb28d6]# ss -lntu
Netid  State   Recv-Q  Send-Q  Local Address:Port      Peer Address:Port
udp    UNCONN  0       0            127.0.0.1:323      *:*
tcp    LISTEN  0       128     192.168.64.152:6789     *:*
tcp    LISTEN  0       128     192.168.64.152:6800     *:*
tcp    LISTEN  0       128     192.168.64.152:6801     *:*
tcp    LISTEN  0       128                  *:22       *:*
tcp    LISTEN  0       100          127.0.0.1:25       *:*
tcp    LISTEN  0       128          127.0.0.1:6010     *:*
tcp    LISTEN  0       128                  *:10050    *:*
tcp    LISTEN  0       128     192.168.64.152:3300     *:*

I get the same error with the command "ceph orch apply mgr ...", and the same on each node of the cluster. I find no answer on Google...

Any idea?

Patrick
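In case it matters: as far as I understand, 8443 is the default SSL port of the dashboard module, so I also checked which ports the dashboard is configured to use and whether any process on the host actually claims 8443 (not sure this is the right track):

    # dashboard ports as configured cluster-wide
    ceph config get mgr mgr/dashboard/ssl_server_port
    ceph config get mgr mgr/dashboard/server_port
    # any listener on 8443, with the owning process
    ss -ltnp | grep 8443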
[ceph-users] ceph orch: list of scheduled tasks
Hi,

When you change the configuration of your cluster with "ceph orch apply ..." or "ceph orch daemon ...", tasks are scheduled:

[root@cephc003 ~]# ceph orch apply mgr --placement="cephc001 cephc002 cephc003"
Scheduled mgr update...

Is there a way to list all the pending tasks?

Regards,
Patrick
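For now I fall back on comparing running vs. expected daemon counts and watching the cephadm module log, which at least shows when a scheduled change is picked up, but it is not a real task list:

    # running vs. target daemon counts per service
    ceph orch ls
    # follow what the cephadm module is doing
    ceph -W cephadm
    # or show its recent log entries
    ceph log last cephadm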
[ceph-users] all PGs remapped after OSD server reinstallation (Pacific)
Hi,

I use a Ceph test infrastructure with only two storage servers running the OSDs. Objects are replicated between these servers:

[ceph: root@cepht001 /]# ceph osd dump | grep 'replicated size'
pool 1 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 237 flags hashpspool stripe_width 0 application rgw
pool 2 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 239 flags hashpspool stripe_width 0 application rgw
pool 3 'default.rgw.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 243 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 244 flags hashpspool stripe_width 0 application rgw
pool 6 'rbd_dup' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 975 lfor 0/975/973 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'cephfs_metadata' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 1121 lfor 0/1121/1119 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 8 'cephfs_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1005 lfor 0/1005/1003 flags hashpspool stripe_width 0 application cephfs
pool 9 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 11476 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth

[ceph: root@cepht001 /]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

The Ceph version is 16.2.9 (Pacific).

I reinstalled one storage server (same Ceph version), following these steps:
- set the noout flag
- stop all OSDs on this server
- back up all OSD definitions in /var/lib/ceph//osd.X
- back up all symlinks related to the OSDs in /etc/systemd/system/ceph-.target.wants
- reinstall the OS
- reinstall cephadm, keyring, ...
- move a monitor to this server in order to recreate the /var/lib/ceph/ and /etc/systemd/system/ceph-.target.wants trees
- restore the OSD definitions and the OSD systemd symlinks
- systemctl daemon-reload
- systemctl restart ceph-.target

(As an alternative, I restored the OSD definitions in /var/lib/ceph/ and did a redeploy of each OSD to recreate the symlinks in systemd.)

All daemons are seen as running by the orchestrator and the health of the cluster is OK, but all the PGs are remapped and half of the objects are misplaced... as if the restored OSDs were seen as a new and different group of OSDs.
[ceph: root@cepht001 /]# ceph -s
  cluster:
    id:     1f0f76fa-7d62-43b9-b9d2-ee87da10fc32
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cepht001,cephtstor01,cephtstor02 (age 116m)
    mgr: cepht002.bxlxvc(active, since 18m), standbys: cepht003.ldxygn, cepht001.ljtuai
    mds: 1/1 daemons up, 2 standby
    osd: 41 osds: 41 up (since 2h), 41 in (since 2h); 209 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 209 pgs
    objects: 3.36k objects, 12 GiB
    usage:   29 GiB used, 104 TiB / 104 TiB avail
    pgs:     3361/6722 objects misplaced (50.000%)
             209 active+clean+remapped

How can I recover from this situation? Is there a better way to achieve the OS reinstallation than the steps I followed?

Thanks for your help,
Patrick
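P.S. One thing I still have to rule out (just my own guess: a changed host bucket or CRUSH weight would be one plausible explanation for a 50% remap with only two hosts) is whether the reinstalled server re-registered its OSDs under a different name or position in the CRUSH map:

    # do the restored OSDs sit under the expected host bucket, with the same weights?
    ceph osd tree
    # full CRUSH map, to compare with a dump taken before the reinstallation, if available
    ceph osd crush dump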