[ceph-users] Re: MDS stuck in replay
On Tue, May 31, 2022 at 3:42 AM Magnus HAGDORN wrote:
>
> Hi all,
> it seems to be the time of stuck MDSs. We also have our ceph filesystem
> degraded. The MDS is stuck in replay for about 20 hours now.
>
> We run a nautilus ceph cluster with about 300TB of data and many
> millions of files. We run two MDSs with a particularly large directory
> pinned to one of them. Both MDSs have standby MDSs.
>
> We are in the process of migrating to a new pacific cluster and have
> been syncing files daily. Over the weekend something happened and we
> ended up with slow MDS responses and some directories became very slow
> (as we'd expect). We restarted the second MDS. It came back within a
> minute and the problem disappeared for a little while. The slow MDS
> operations came back and we restarted the other MDS. This one has been
> in replay state since yesterday.

Can you temporarily turn up the MDS debug log level (debug_mds) to check what's happening to this MDS during replay?

ceph config set mds debug_mds 10

Is the health of the MDS host okay? Is it low on memory?

> The cluster is healthy.

Can you share the output of `ceph status`, `ceph fs status` and `ceph --version`?

> So, we are wondering what it is up to, how long it might take, and whether
> there is something we can do to speed up the replay phase.
>
> Regards
> magnus
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.

Regards,
Ramana
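(For reference, a minimal sketch of the diagnostics suggested above; the debug value is only an example and the log path assumes a default package install.)

  # raise MDS debug logging while the daemon is in replay, then watch its log
  ceph config set mds debug_mds 10
  # inspect /var/log/ceph/ceph-mds.<name>.log on the active MDS host

  # overall cluster state, filesystem ranks, and running version
  ceph status
  ceph fs status
  ceph --version

  # drop the log level back to the default afterwards
  ceph config set mds debug_mds 1/5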
[ceph-users] Re: Ceph Repo Branch Rename - May 24
On Wed, 1 Jun 2022 at 23:52, David Galloway wrote:
>
> The master branch has been deleted from all recently active repos except
> ceph.git. I'm slowly retargeting existing PRs from master to main.
>
> The tool I used to rename the branches didn't take care of that for me
> unfortunately so it has to be done manually.
>
> As far as I know, this should conclude the branch renaming. Please let
> me know if you continue to see any issues.

Perhaps the master branch at the ceph-ci repo could have been left in place for a few weeks, since re-running failed jobs from last week's run is no longer possible:

/teuthology/teuthology/suite/util.py", line 76, in schedule_fail
    raise ScheduleFailError(message, name)
teuthology.exceptions.ScheduleFailError: Scheduling rishabh-2022-06-01_18:47:29-fs-wip-vshankar-testing-20220527-073645-distro-basic-smithi failed: Branch 'master' not found in repo: https://github.com/ceph/teuthology!

If this is a valid point, perhaps we can restore it and delete it after a few weeks?
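A possible workaround, assuming the runs were scheduled with teuthology-suite and that its branch-selection flags are available in the version in use (worth confirming with --help before relying on them), would be to re-schedule against the renamed branch explicitly instead of re-running the old jobs, along these lines:

  # flag names from memory; verify with `teuthology-suite --help` first
  teuthology-suite --teuthology-branch main \
      --suite fs \
      --ceph wip-vshankar-testing-20220527-073645 \
      --machine-type smithi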
[ceph-users] Re: Ceph Repo Branch Rename - May 24
The master branch has been deleted from all recently active repos except ceph.git. I'm slowly retargeting existing PRs from master to main.

The tool I used to rename the branches didn't take care of that for me unfortunately so it has to be done manually.

As far as I know, this should conclude the branch renaming. Please let me know if you continue to see any issues.

On 5/25/22 15:46, David Galloway wrote:
> I was successfully able to get a 'main' build completed. This means you
> should be able to push your branches to ceph-ci.git and get a build now.
> Thank you for your patience.

On 5/24/22 18:30, David Galloway wrote:
> This maintenance is ongoing. This was a much larger effort than anticipated.
> I've unpaused Jenkins but fully expect many jobs to fail for the next couple
> days. If you had a PR targeting master, you will need to edit the PR to
> target main now instead. I appreciate your patience.

On 5/19/22 14:38, David Galloway wrote:
> Hi all,
>
> In an effort to use more inclusive language, we will be renaming all Ceph
> repo 'master' branches to 'main' on May 24. I anticipate making the change
> in the morning Eastern US time, merging all 's/master/main' pull requests
> I already have open, then tracking down and fixing any remaining references
> to the master branch.
>
> Please excuse the disruption and thank you for your patience.
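For contributors with an existing local clone still tracking the old branch name, a minimal sketch of the usual client-side rename steps (assuming the remote is called origin and a local 'master' branch is checked out):

  git branch -m master main
  git fetch origin
  git branch -u origin/main main
  git remote set-head origin -a
  # optionally drop the now-stale remote-tracking ref
  git remote prune origin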
[ceph-users] radosgw multisite sync /admin/log requests overloading system.
I have a simple multisite radosgw configuration set up for testing. There is 1 realm, 1 zonegroup, and 2 separate clusters, each with its own zone. There is 1 bucket with 1 object in it and no updates currently happening. There is no group sync policy currently defined.

The problem I see is that the radosgw on the secondary zone is flooding the master zone with requests for /admin/log. The radosgw on the secondary is consuming roughly 50% of the CPU cycles. The master zone radosgw is equally active and is flooding the logs (at 1/5 level) with entries like this:

2022-06-01T11:45:06.719-0400 7ff415f8b700 1 == req done req=0x7ff5e02ed680 op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff415f8b700 1 beast: 0x7ff5e02ed680: 10.15.1.40 - syncuser [01/Jun/2022:11:45:06.715 -0400] "GET /admin/log?type=metadata=4=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756 HTTP/1.1" 200 44 - - - latency=0.00440s
2022-06-01T11:45:06.719-0400 7ff446fed700 1 == req done req=0x7ff5e0572680 op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff446fed700 1 beast: 0x7ff5e0572680: 10.15.1.40 - syncuser [01/Jun/2022:11:45:06.715 -0400] "GET /admin/log?type=metadata=5=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756 HTTP/1.1" 200 44 - - - latency=0.00440s

What is going on and how do I fix this? The period on both zones is current and at the same epoch value. Any ideas/suggestions?

thanks,
Wyllys Ingersoll
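Not an answer, but a minimal sketch of the sync-state checks that are usually the first step here (run with radosgw-admin on the secondary zone; output format varies by release):

  # overall data/metadata sync state as seen from this zone
  radosgw-admin sync status

  # metadata sync progress against the master zone's mdlog shards
  radosgw-admin metadata sync status

  # confirm both zones agree on the current period and epoch
  radosgw-admin period get-current
  radosgw-admin period get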
[ceph-users] Moving rbd-images across pools?
Hey guys and girls, newbie question here (still in planning phase).

I'm thinking about starting out with a mini cluster with 4 nodes and perhaps 3x replication, because of budgetary reasons. In a few months or next year, I'll get extra budget and can extend to 7-8 nodes. I will then want to change to EC 4:2.

But how does this work? Can I create a new pool on the same cluster with the different policy? And can I move rbd images across while they are mounted, without user impact? Or do I need to unmount the images, move the images to another pool and then mount them again?

Angelo.
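One possible path, purely as a hedged sketch rather than a definitive answer: recent releases allow an erasure-coded pool to act as an RBD data pool (with a replicated pool still holding the image metadata), and rbd migration can move an existing image's data between pools, although clients do need to close the image around the prepare and commit steps. Pool and image names below are made up:

  # create an EC profile and an EC pool to hold RBD data
  ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create rbd-ec-data 128 128 erasure ec-4-2
  ceph osd pool set rbd-ec-data allow_ec_overwrites true
  ceph osd pool application enable rbd-ec-data rbd

  # migrate an existing image so its data lands in the EC pool
  # (close the image before 'prepare', reopen it after, commit when done)
  rbd migration prepare rbd/myimage rbd/myimage --data-pool rbd-ec-data
  rbd migration execute rbd/myimage
  rbd migration commit rbd/myimage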
[ceph-users] Error CephMgrPrometheusModuleInactive
I have an error in the Ceph dashboard -- CephMgrPrometheusModuleInactive:

description: The mgr/prometheus module at opcpmfpskup0101.p.fnst.10.in-addr.arpa:9283 is unreachable. This could mean that the module has been disabled or the mgr itself is down. Without the mgr/prometheus module, metrics and alerts will no longer function. Open a shell to ceph and use 'ceph -s' to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can check the mgr/prometheus module is loaded with 'ceph mgr module ls' and if it's not listed as enabled, enable it with 'ceph mgr module enable prometheus'.

And in the mgr container log I have this error:

debug 2022-06-01T07:47:13.929+ 7f21d6525700 0 log_channel(cluster) log [DBG] : pgmap v386352: 1 pgs: 1 active+clean; 0 B data, 16 MiB used, 60 GiB / 60 GiB avail
debug 2022-06-01T07:47:14.039+ 7f21c7b08700 0 [progress INFO root] Processing OSDMap change 29..29
debug 2022-06-01T07:47:15.128+ 7f21a7b36700 0 [dashboard INFO request] [10.60.161.64:63651] [GET] [200] [0.011s] [admin] [933.0B] /api/summary
debug 2022-06-01T07:47:15.866+ 7f21bdfe2700 0 [prometheus INFO cherrypy.access.139783044050056] 10.56.0.223 - - [01/Jun/2022:07:47:15] "GET /metrics HTTP/1.1" 200 101826 "" "Prometheus/2.33.4"
10.56.0.223 - - [01/Jun/2022:07:47:15] "GET /metrics HTTP/1.1" 200 101826 "" "Prometheus/2.33.4"
debug 2022-06-01T07:47:15.928+ 7f21d6525700 0 log_channel(cluster) log [DBG] : pgmap v386353: 1 pgs: 1 active+clean; 0 B data, 16 MiB used, 60 GiB / 60 GiB avail
debug 2022-06-01T07:47:16.126+ 7f21a6333700 0 [dashboard INFO request] [10.60.161.64:63651] [GET] [200] [0.003s] [admin] [69.0B] /api/feature_toggles
debug 2022-06-01T07:47:17.129+ 7f21cd313700 0 [progress WARNING root] complete: ev f9e995f4-d172-465f-a91a-de6e35319717 does not exist
debug 2022-06-01T07:47:17.129+ 7f21cd313700 0 [progress WARNING root] complete: ev 1bb8e9ee-7403-42ad-96e4-4324ae6d8c15 does not exist
debug 2022-06-01T07:47:17.130+ 7f21cd313700 0 [progress WARNING root] complete: ev 6b9a0cd9-b185-4c08-ad99-e7fc2f976590 does not exist
debug 2022-06-01T07:47:17.130+ 7f21cd313700 0 [progress WARNING root] complete: ev d9bffc48-d463-43bf-a25b-7853b2f334a0 does not exist
debug 2022-06-01T07:47:17.130+ 7f21cd313700 0 [progress WARNING root] complete: ev c5bf893d-2eac-4bb6-994f-cbcf3822c30c does not exist
debug 2022-06-01T07:47:17.131+ 7f21cd313700 0 [progress WARNING root] complete: ev 43511d64-6636-455e-8df5-bed1aa853f3e does not exist
debug 2022-06-01T07:47:17.131+ 7f21cd313700 0 [progress WARNING root] complete: ev 857aabc5-e61b-4a76-90b2-62631bfeba00 does not exist
10.56.0.221 - - [01/Jun/2022:07:47:00] "GET /metrics HTTP/1.1" 200 101830 "" "Prometheus/2.33.4"
debug 2022-06-01T07:47:01.632+ 7f21a7b36700 0 [dashboard ERROR exception] Internal Server Error
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 58, in serve_file
    st = os.stat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/share/ceph/mgr/dashboard/frontend/dist/en-US/prometheus_receiver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/home.py", line 135, in __call__
    return serve_file(full_path)
  File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 65, in serve_file
    raise cherrypy.NotFound()

But my cluster shows everything is OK:

#ceph -s
  cluster:
    id: 868c3ad2-da76-11ec-b977-005056aa7589
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum opcpmfpskup0105,opcpmfpskup0101,opcpmfpskup0103 (age 38m)
    mgr: opcpmfpskup0105.mureyk(active, since 8d), standbys: opcpmfpskup0101.uvkngk
    osd: 3 osds: 3 up (since 38m), 3 in (since 84m)

  data:
    pools: 1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage: 16 MiB used, 60 GiB / 60 GiB avail
    pgs: 1 active+clean

Can anyone explain this?
[ceph-users] Re: Degraded data redundancy and too many PGs per OSD
Hi,

how did you end up with that many PGs per OSD? According to your output the pg_autoscaler is enabled; if that was done by the autoscaler I would create a tracker issue for it. Then I would either disable it or set the mode to "warn", and then reduce the pg_num for some of the pools.

What does your crush rule 2 look like? Can you share the dump of the rule with the ID 2?

ceph osd crush rule ls
ceph osd crush rule dump

Quoting farhad kh:

> hi, I have a problem in my cluster. I used a cache tier for the rgw data:
> three hosts for the cache and three hosts for the data, with SSDs for the
> cache and HDDs for the data. I set a 20 GiB quota for the cache pool. When
> one host of the cache tier went offline, this warning was raised. I
> decreased the quota to 10 GiB but it was not resolved, and the dashboard
> does not show the correct PG status (1 active+undersized). What is
> happening in my cluster? Why is this not resolved? Can anyone explain this
> situation?

## ceph -s
opcpmfpsksa0101: Mon May 30 12:05:12 2022
  cluster:
    id: 54d2b1d6-207e-11ec-8c73-005056ac51bf
    health: HEALTH_WARN
            1 hosts fail cephadm check
            1 pools have many more objects per pg than average
            Degraded data redundancy: 1750/53232 objects degraded (3.287%), 1 pg degraded, 1 pg undersized
            too many PGs per OSD (259 > max 250)

  services:
    mon: 3 daemons, quorum opcpmfpsksa0101,opcpmfpsksa0103,opcpmfpsksa0105 (age 3d)
    mgr: opcpmfpsksa0101.apmwdm(active, since 5h)
    osd: 12 osds: 10 up (since 95m), 10 in (since 85m)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools: 9 pools, 865 pgs
    objects: 17.74k objects, 41 GiB
    usage: 128 GiB used, 212 GiB / 340 GiB avail
    pgs: 1750/53232 objects degraded (3.287%)
         864 active+clean
         1 active+undersized+degraded

## ceph health detail
HEALTH_WARN 1 hosts fail cephadm check; 1 pools have many more objects per pg than average; Degraded data redundancy: 1665/56910 objects degraded (2.926%), 1 pg degraded, 1 pg undersized; too many PGs per OSD (259 > max 250)
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
    host opcpcfpsksa0101 (10.56.12.210) failed check: Failed to connect to opcpcfpsksa0101 (10.56.12.210).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key
    To add the cephadm SSH key to the host:
    ceph cephadm get-pub-key > ~/ceph.pub
    ssh-copy-id -f -i ~/ceph.pub root@10.56.12.210
    To check that the host is reachable open a new shell with the --no-hosts flag:
    cephadm shell --no-hosts
    Then run the following:
    ceph cephadm get-ssh-config > ssh_config
    ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
    chmod 0600 ~/cephadm_private_key
    ssh -F ssh_config -i ~/cephadm_private_key root@10.56.12.210
[WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
    pool cache-pool objects per pg (1665) is more than 79.2857 times cluster average (21)
[WRN] PG_DEGRADED: Degraded data redundancy: 1665/56910 objects degraded (2.926%), 1 pg degraded, 1 pg undersized
    pg 9.0 is stuck undersized for 88m, current state active+undersized+degraded, last acting [10,11]
[WRN] TOO_MANY_PGS: too many PGs per OSD (259 > max 250)

## ceph osd df tree
ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-1        0.35156        - 340 GiB 128 GiB 121 GiB  12 MiB 6.9 GiB 212 GiB 37.58 1.00   -        root default
-3        0.01959        -     0 B     0 B     0 B     0 B     0 B     0 B     0    0   -        host opcpcfpsksa0101
 0  ssd   0.00980        0     0 B     0 B     0 B     0 B     0 B     0 B     0    0   0   down osd.0
 9  ssd   0.00980        0     0 B     0 B     0 B     0 B     0 B     0 B     0    0   0   down osd.9
-5        0.01959        -  20 GiB 5.1 GiB 4.0 GiB 588 KiB 1.1 GiB  15 GiB 25.29 0.67   -        host opcpcfpsksa0103
 7  ssd   0.00980  0.85004  10 GiB 483 MiB  75 MiB 539 KiB 407 MiB 9.5 GiB  4.72 0.13   3     up osd.7
10  ssd   0.00980  0.55011  10 GiB 4.6 GiB 3.9 GiB  49 KiB 703 MiB 5.4 GiB 45.85 1.22   5     up osd.10
-16       0.01959        -  20 GiB 5.5 GiB 4.0 GiB 542 KiB 1.5 GiB  15 GiB 27.28 0.73   -        host opcpcfpsksa0105
 8  ssd   0.00980  0.70007  10 GiB 851 MiB  75 MiB 121 KiB 775 MiB 9.2 GiB  8.31 0.22  10     up osd.8
11  ssd   0.00980  0.45013  10 GiB 4.6 GiB 3.9 GiB 421 KiB 742 MiB 5.4 GiB 46.24 1.23   5     up osd.11
-10       0.09760        - 100 GiB  39 GiB  38 GiB 207 KiB 963 MiB  61 GiB 38.59 1.03   -        host opcsdfpsksa0101
 1  hdd   0.04880  1.0      50 GiB  19 GiB
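Coming back to the suggestions at the top of this reply, a minimal sketch of the corresponding commands (the pool name is a placeholder and the pg_num value is only an example; pick a sensible target per pool):

  # see what the autoscaler currently recommends for each pool
  ceph osd pool autoscale-status

  # switch the autoscaler to warn-only for a pool instead of acting on it
  ceph osd pool set <pool> pg_autoscale_mode warn

  # reduce the PG count of an oversized pool (pg_num can be decreased on Nautilus and later)
  ceph osd pool set <pool> pg_num 64

  # the crush rule information requested above
  ceph osd crush rule ls
  ceph osd crush rule dump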