[ceph-users] Re: MDS stuck in replay

2022-06-01 Thread Ramana Venkatesh Raja
On Tue, May 31, 2022 at 3:42 AM Magnus HAGDORN  wrote:
>
> Hi all,
> it seems to be the time of stuck MDSs. We also have our ceph filesystem
> degraded. The MDS is stuck in replay for about 20 hours now.
>
> We run a nautilus ceph cluster with about 300TB of data and many
> millions of files. We run two MDSs with a particularly large directory
> pinned to one of them. Both MDSs have standby MDSs.
>
>  We are in the process of migrating to a new pacific cluster and have
> been syncing files daily. Over the weekend something happened and we
> ended up with slow MDS responses and some directories became very slow
> (as we'd expect). We restarted the second MDS. It came back within a
> minute and the problem disappeared for a little while. The slow MDS
> operations came back and we restarted the other MDS. This one has been
> in replay state since yesterday.
>

Can you temporarily turn up the MDS debug log level (debug_mds) to
check what's happening to this MDS during replay?
ceph config set mds debug_mds 10

Is the health of the MDS host okay? Is it low on memory?
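
For reference, a minimal sketch of what I'd run here (the log path and the
MDS name are assumptions, adjust them to your deployment):

# temporarily raise the MDS debug level
ceph config set mds debug_mds 10
# follow the log of the MDS stuck in replay, on its host
tail -f /var/log/ceph/ceph-mds.<name>.log
# check memory pressure on that host
free -h
# revert the debug level afterwards
ceph config set mds debug_mds 1/5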

> The cluster is healthy.
>

Can you share the output of `ceph status`, `ceph fs status` and
`ceph --version`?

> So we are wondering what it is up to, how long it might take, and whether
> there is something we can do to speed up the replay phase.
>
> Regards
> magnus
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336. Is e buidheann carthannais a th’ ann an 
> Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.

Regards,
Ramana



[ceph-users] Re: Ceph Repo Branch Rename - May 24

2022-06-01 Thread Rishabh Dave
On Wed, 1 Jun 2022 at 23:52, David Galloway  wrote:
>
> The master branch has been deleted from all recently active repos except
> ceph.git.  I'm slowly retargeting existing PRs from master to main.
>
> The tool I used to rename the branches didn't take care of that for me
> unfortunately so it has to be done manually.
>
> As far as I know, this should conclude the branch renaming.  Please let
> me know if you continue to see any issues.
>

Perhaps the master branch in the ceph-ci repo could have been left in
place for a few weeks, since re-running failed jobs from last week's runs
is no longer possible:

/teuthology/teuthology/suite/util.py", line 76, in schedule_fail
raise ScheduleFailError(message, name)
teuthology.exceptions.ScheduleFailError: Scheduling
rishabh-2022-06-01_18:47:29-fs-wip-vshankar-testing-20220527-073645-distro-basic-smithi
failed: Branch 'master' not found in repo:
https://github.com/ceph/teuthology!

If this is a valid point, perhaps we can restore the branch now and
delete it again after a few weeks?
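
If that's acceptable, a rough sketch of what restoring it could look like
(this assumes push access to ceph-ci.git and that 'main' still points at
the old 'master' history):

# recreate 'master' as a pointer to 'main' until last week's runs age out
git clone git@github.com:ceph/ceph-ci.git
cd ceph-ci
git branch master origin/main
git push origin master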



[ceph-users] Re: Ceph Repo Branch Rename - May 24

2022-06-01 Thread David Galloway
The master branch has been deleted from all recently active repos except 
ceph.git.  I'm slowly retargeting existing PRs from master to main.


The tool I used to rename the branches didn't take care of that for me 
unfortunately so it has to be done manually.


As far as I know, this should conclude the branch renaming.  Please let 
me know if you continue to see any issues.


On 5/25/22 15:46, David Galloway wrote:

I was successfully able to get a 'main' build completed.

This means you should be able to push your branches to ceph-ci.git and 
get a build now.
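
For example, roughly (remote and branch names are placeholders):

git remote add ci git@github.com:ceph/ceph-ci.git
git push ci my-feature-branch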


Thank you for your patience.

On 5/24/22 18:30, David Galloway wrote:
This maintenance is ongoing. This was a much larger effort than 
anticipated.


I've unpaused Jenkins but fully expect many jobs to fail for the next 
couple days.


If you had a PR targeting master, you will need to edit the PR to 
target main now instead.
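
Local clones need updating too; after a rename the usual sequence is
something like this (remote name assumed to be 'origin'):

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a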


I appreciate your patience.

On 5/19/22 14:38, David Galloway wrote:

Hi all,

In an effort to use more inclusive language, we will be renaming all 
Ceph repo 'master' branches to 'main' on May 24.


I anticipate making the change in the morning Eastern US time, 
merging all 's/master/main' pull requests I already have open, then 
tracking down and fixing any remaining references to the master branch.
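
A quick, rough way to spot leftover references in a checkout:

git grep -nw master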


Please excuse the disruption and thank you for your patience.




[ceph-users] radosgw multisite sync /admin/log requests overloading system.

2022-06-01 Thread Wyll Ingersoll
I have a simple multisite radosgw configuration setup for testing. There is 1 
realm, 1 zonegroup, and 2 separate clusters each with its own zone.  There is 1 
bucket with 1 object in it and no updates currently happening.  There is no 
group sync policy currently defined.

The problem I see is that the radosgw on the secondary zone is flooding the
master zone with requests for /admin/log. The radosgw on the secondary is
consuming roughly 50% of the CPU cycles. The master zone radosgw is equally
active and is flooding the logs (at 1/5 level) with entries like this:

2022-06-01T11:45:06.719-0400 7ff415f8b700  1 == req done req=0x7ff5e02ed680 
op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff415f8b700  1 beast: 0x7ff5e02ed680: 10.15.1.40 
- syncuser [01/Jun/2022:11:45:06.715 -0400] "GET 
/admin/log?type=metadata=4=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756
 HTTP/1.1" 200 44 - - - latency=0.00440s
2022-06-01T11:45:06.719-0400 7ff446fed700  1 == req done req=0x7ff5e0572680 
op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff446fed700  1 beast: 0x7ff5e0572680: 10.15.1.40 
- syncuser [01/Jun/2022:11:45:06.715 -0400] "GET 
/admin/log?type=metadata=5=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756
 HTTP/1.1" 200 44 - - - latency=0.00440s


What is going on and how do I fix this?  The period on both zones is current 
and at the same epoch value.
Any ideas/suggestions?
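
For reference, a sketch of the commands I can run on each zone to inspect
the metadata sync and mdlog state, in case the output from any of these
would help (nothing below is specific to my setup):

radosgw-admin sync status
radosgw-admin metadata sync status
radosgw-admin mdlog status
radosgw-admin period get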

thanks,
   Wyllys Ingersoll



[ceph-users] Moving rbd-images across pools?

2022-06-01 Thread Angelo Hongens

Hey guys and girls, newbie question here (still in planning phase).

I'm thinking about starting out with a mini cluster of 4 nodes and
perhaps 3x replication, for budgetary reasons. In a few months or next
year I'll get extra budget and can extend to 7-8 nodes. I will then want
to change to EC 4:2.


But how does this work? Can I create a new pool on the same cluster with
a different policy? And can I move rbd-images across while they are
mounted, without user impact? Or do I need to unmount the images, move
the images to another pool and then mount them again?
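
A rough sketch of what I imagine this could look like (pool/image names
are placeholders, and I'm assuming the Nautilus-or-later 'rbd migration'
workflow with the EC pool used only as a data pool):

# assumes an EC profile 'my-ec-profile' (e.g. k=4 m=2) created beforehand
ceph osd pool create rbd-ec-data 64 64 erasure my-ec-profile
ceph osd pool set rbd-ec-data allow_ec_overwrites true
# clients have to close the source image before 'prepare' and can then
# re-open the destination; 'execute' copies the blocks in the background
rbd migration prepare rbd/myimage --data-pool rbd-ec-data
rbd migration execute rbd/myimage
rbd migration commit rbd/myimage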


Angelo.


[ceph-users] Error CephMgrPrometheusModuleInactive

2022-06-01 Thread farhad kh
I have this error in the Ceph dashboard:
--
CephMgrPrometheusModuleInactive
description
The mgr/prometheus module at opcpmfpskup0101.p.fnst.10.in-addr.arpa:9283 is
unreachable. This could mean that the module has been disabled or the mgr
itself is down. Without the mgr/prometheus module metrics and alerts will
no longer function. Open a shell to ceph and use 'ceph -s' to determine
whether the mgr is active. If the mgr is not active, restart it, otherwise
you can check the mgr/prometheus module is loaded with 'ceph mgr module ls'
and if it's not listed as enabled, enable it with 'ceph mgr module enable
prometheus'
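
The checks that the alert text suggests, written out as a sketch (the
metrics endpoint to curl is whatever 'ceph mgr services' reports; that
last check is my addition):

ceph -s
ceph mgr module ls
ceph mgr module enable prometheus
ceph mgr services
curl -s http://<active-mgr-host>:9283/metrics | head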

and in the mgr container log I have this error:
-
debug 2022-06-01T07:47:13.929+ 7f21d6525700  0 log_channel(cluster) log
[DBG] : pgmap v386352: 1 pgs: 1 active+clean; 0 B data, 16 MiB used, 60 GiB
/ 60 GiB avail
debug 2022-06-01T07:47:14.039+ 7f21c7b08700  0 [progress INFO root]
Processing OSDMap change 29..29
debug 2022-06-01T07:47:15.128+ 7f21a7b36700  0 [dashboard INFO request]
[10.60.161.64:63651] [GET] [200] [0.011s] [admin] [933.0B] /api/summary
debug 2022-06-01T07:47:15.866+ 7f21bdfe2700  0 [prometheus INFO
cherrypy.access.139783044050056] 10.56.0.223 - - [01/Jun/2022:07:47:15]
"GET /metrics HTTP/1.1" 200 101826 "" "Prometheus/2.33.4"
10.56.0.223 - - [01/Jun/2022:07:47:15] "GET /metrics HTTP/1.1" 200 101826
"" "Prometheus/2.33.4"
debug 2022-06-01T07:47:15.928+ 7f21d6525700  0 log_channel(cluster) log
[DBG] : pgmap v386353: 1 pgs: 1 active+clean; 0 B data, 16 MiB used, 60 GiB
/ 60 GiB avail
debug 2022-06-01T07:47:16.126+ 7f21a6333700  0 [dashboard INFO request]
[10.60.161.64:63651] [GET] [200] [0.003s] [admin] [69.0B]
/api/feature_toggles
debug 2022-06-01T07:47:17.129+ 7f21cd313700  0 [progress WARNING root]
complete: ev f9e995f4-d172-465f-a91a-de6e35319717 does not exist
debug 2022-06-01T07:47:17.129+ 7f21cd313700  0 [progress WARNING root]
complete: ev 1bb8e9ee-7403-42ad-96e4-4324ae6d8c15 does not exist
debug 2022-06-01T07:47:17.130+ 7f21cd313700  0 [progress WARNING root]
complete: ev 6b9a0cd9-b185-4c08-ad99-e7fc2f976590 does not exist
debug 2022-06-01T07:47:17.130+ 7f21cd313700  0 [progress WARNING root]
complete: ev d9bffc48-d463-43bf-a25b-7853b2f334a0 does not exist
debug 2022-06-01T07:47:17.130+ 7f21cd313700  0 [progress WARNING root]
complete: ev c5bf893d-2eac-4bb6-994f-cbcf3822c30c does not exist
debug 2022-06-01T07:47:17.131+ 7f21cd313700  0 [progress WARNING root]
complete: ev 43511d64-6636-455e-8df5-bed1aa853f3e does not exist
debug 2022-06-01T07:47:17.131+ 7f21cd313700  0 [progress WARNING root]
complete: ev 857aabc5-e61b-4a76-90b2-62631bfeba00 does not exist


10.56.0.221 - - [01/Jun/2022:07:47:00] "GET /metrics HTTP/1.1" 200 101830
"" "Prometheus/2.33.4"
debug 2022-06-01T07:47:01.632+ 7f21a7b36700  0 [dashboard ERROR
exception] Internal Server Error
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 58, in
serve_file
st = os.stat(path)
FileNotFoundError: [Errno 2] No such file or directory:
'/usr/share/ceph/mgr/dashboard/frontend/dist/en-US/prometheus_receiver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in
dashboard_exception_handler
return handler(*args, **kwargs)
  File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in
__call__
return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/home.py", line 135, in
__call__
return serve_file(full_path)
  File "/lib/python3.6/site-packages/cherrypy/lib/static.py", line 65, in
serve_file
raise cherrypy.NotFound()

but my cluster shows everything is OK:

#ceph -s
  cluster:
id: 868c3ad2-da76-11ec-b977-005056aa7589
health: HEALTH_OK

  services:
mon: 3 daemons, quorum opcpmfpskup0105,opcpmfpskup0101,opcpmfpskup0103
(age 38m)
mgr: opcpmfpskup0105.mureyk(active, since 8d), standbys:
opcpmfpskup0101.uvkngk
osd: 3 osds: 3 up (since 38m), 3 in (since 84m)

  data:
pools:   1 pools, 1 pgs
objects: 0 objects, 0 B
usage:   16 MiB used, 60 GiB / 60 GiB avail
pgs: 1 active+clean

Can anyone explain this?


[ceph-users] Re: Degraded data redundancy and too many PGs per OSD

2022-06-01 Thread Eugen Block

Hi,

how did you end up with that many PGs per OSD? According to your output
the pg_autoscaler is enabled; if that was done by the autoscaler I would
create a tracker issue for it. Then I would either disable it, or set its
mode to "warn" and then reduce the pg_num for some of the pools.
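
For example, something like this (pool names are placeholders; I'd check
'ceph osd pool autoscale-status' first to see what it currently suggests):

ceph osd pool autoscale-status
ceph osd pool set <pool> pg_autoscale_mode warn
ceph osd pool set <pool> pg_num <lower value>
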
What does your crush rule 2 look like? Can you share the dump of the  
rule with the ID 2?


ceph osd crush rule ls
ceph osd crush rule dump 


Zitat von farhad kh :


hi
I have a problem in my cluster. I used a cache tier for RGW data: three
hosts for the cache and three hosts for the data, with SSDs for the cache
and HDDs for the data. I set a 20 GiB quota for the cache pool.
When one of the cache-tier hosts went offline, this warning was raised. I
decreased the quota to 10 GiB, but it is not resolved, and the dashboard
does not show the correct PG status (1 active+undersized).
What is happening in my cluster? Why is this not resolved? Can anyone
explain this situation?

##ceph -s
opcpmfpsksa0101: Mon May 30 12:05:12 2022

  cluster:
id: 54d2b1d6-207e-11ec-8c73-005056ac51bf
health: HEALTH_WARN
1 hosts fail cephadm check
1 pools have many more objects per pg than average
Degraded data redundancy: 1750/53232 objects degraded (3.287%),
1 pg degraded, 1 pg undersized
too many PGs per OSD (259 > max 250)

  services:
mon: 3 daemons, quorum opcpmfpsksa0101,opcpmfpsksa0103,opcpmfpsksa0105
(age 3d)
mgr: opcpmfpsksa0101.apmwdm(active, since 5h)
osd: 12 osds: 10 up (since 95m), 10 in (since 85m)
rgw: 2 daemons active (2 hosts, 1 zones)

  data:
pools:   9 pools, 865 pgs
objects: 17.74k objects, 41 GiB
usage:   128 GiB used, 212 GiB / 340 GiB avail
pgs: 1750/53232 objects degraded (3.287%)
 864 active+clean
 1   active+undersized+degraded

-
## ceph health detail
HEALTH_WARN 1 hosts fail cephadm check; 1 pools have many more objects per
pg than average; Degraded data redundancy: 1665/56910 objects degraded
(2.926%), 1 pg degraded, 1 pg undersized; too many PGs per OSD (259 > max
250)
[WRN] CEPHADM_HOST_CHECK_FAILED: 1 hosts fail cephadm check
host opcpcfpsksa0101 (10.56.12.210) failed check: Failed to connect to
opcpcfpsksa0101 (10.56.12.210).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key

To add the cephadm SSH key to the host:

ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@10.56.12.210


To check that the host is reachable open a new shell with the --no-hosts
flag:

cephadm shell --no-hosts


Then run the following:

ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
chmod 0600 ~/cephadm_private_key
ssh -F ssh_config -i ~/cephadm_private_key root@10.56.12.210

[WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than
average
pool cache-pool objects per pg (1665) is more than 79.2857 times
cluster average (21)
[WRN] PG_DEGRADED: Degraded data redundancy: 1665/56910 objects degraded
(2.926%), 1 pg degraded, 1 pg undersized
pg 9.0 is stuck undersized for 88m, current state
active+undersized+degraded, last acting [10,11]
[WRN] TOO_MANY_PGS: too many PGs per OSD (259 > max 250)
--
ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         0.35156         -  340 GiB  128 GiB  121 GiB   12 MiB  6.9 GiB  212 GiB  37.58  1.00    -          root default
 -3         0.01959         -      0 B      0 B      0 B      0 B      0 B      0 B      0     0      -          host opcpcfpsksa0101
  0    ssd  0.00980         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0      0    down  osd.0
  9    ssd  0.00980         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0      0    down  osd.9
 -5         0.01959         -   20 GiB  5.1 GiB  4.0 GiB  588 KiB  1.1 GiB   15 GiB  25.29  0.67    -          host opcpcfpsksa0103
  7    ssd  0.00980   0.85004   10 GiB  483 MiB   75 MiB  539 KiB  407 MiB  9.5 GiB   4.72  0.13    3      up  osd.7
 10    ssd  0.00980   0.55011   10 GiB  4.6 GiB  3.9 GiB   49 KiB  703 MiB  5.4 GiB  45.85  1.22    5      up  osd.10
-16         0.01959         -   20 GiB  5.5 GiB  4.0 GiB  542 KiB  1.5 GiB   15 GiB  27.28  0.73    -          host opcpcfpsksa0105
  8    ssd  0.00980   0.70007   10 GiB  851 MiB   75 MiB  121 KiB  775 MiB  9.2 GiB   8.31  0.22   10      up  osd.8
 11    ssd  0.00980   0.45013   10 GiB  4.6 GiB  3.9 GiB  421 KiB  742 MiB  5.4 GiB  46.24  1.23    5      up  osd.11
-10         0.09760         -  100 GiB   39 GiB   38 GiB  207 KiB  963 MiB   61 GiB  38.59  1.03    -          host opcsdfpsksa0101
  1    hdd  0.04880   1.0       50 GiB   19 GiB