On 10/21/20 6:47 AM, Frank Schilder wrote:
Hi Michael,

some quick thoughts.


That you can create a pool with 1 PG is a good sign; the crush rule is OK. The
fact that pg query says it doesn't have PG 1.0 points in the right direction:
there is an inconsistency in the cluster. This is also indicated by the fact
that no upmaps seem to exist (the clean-up script was empty). With the osd map
you extracted, you could check what the osd map believes the mapping of the PGs
of pool 1 is:

   # osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.

or if it also claims the PG does not exist. It looks like something went wrong 
during pool creation and you are not the only one having problems with this 
particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . 
Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics 
pool is a way forward. Please look at the procedure mentioned in the thread 
quoted above. Deletion of the pool there led to some crashes and some surgery 
on some OSDs was necessary. However, in your case it might just work, because 
you redeployed the OSDs in question already - if I remember correctly.

That is correct. The original OSDs 0 and 41 were removed and redeployed on new disks.

In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

   # ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.
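
Since the sessions command has to be run once per mon host, a small dry-run helper can generate the per-host invocations. This is only a sketch: the function name mon_session_cmds is hypothetical, and it assumes that each mon ID equals its host name and that the hosts are reachable via ssh, neither of which is stated in this thread.

```shell
# Print one "ceph daemon ... sessions" command per monitor host (dry run).
# Assumption: each mon ID equals its host name; adjust if your cluster differs.
mon_session_cmds() {
    for host in "$@"; do
        printf 'ssh %s ceph daemon mon.%s sessions\n' "$host" "$host"
    done
}
```

Reviewing the printed commands first and only then piping them to a shell (for example "mon_session_cmds ceph1 | sh") keeps the step non-destructive; the output can be searched for a client ID such as client.7524484.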

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.

Unfortunately, in Octopus you cannot disable the devicehealth manager module, and the manager is required for operation. So I just went ahead and removed the pool with everything still running. Fortunately, this did not appear to cause any problems, and the single unknown PG has disappeared from the ceph health output.

I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.

The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).

I found the fs client and rebooted it. The MDS still reports the slow ops, but according to the mds logs the offending ops were established before the client was rebooted, and the offending client session (now defunct) has been blacklisted. I'll check back later to see if the slow ops get cleared from 'ceph status'.

Regards,

--Mike
________________________________________
From: Michael Thomas <w...@caltech.edu>
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/20/20 1:18 PM, Frank Schilder wrote:
Dear Michael,

Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD 
mapping?

I meant here with crush rule replicated_host_nvme. Sorry, forgot.

Seems to have worked fine:

https://pastebin.com/PFgDE4J1

Yes, the OSD was still out when the previous health report was created.

Hmm, this is odd. If this is correct, then it did report a slow op even though 
it was out of the cluster:

from https://pastebin.com/3G3ij9ui:
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
[osd.0,osd.41] have slow ops.

Not sure what to make of that. It looks almost like you have a ghost osd.41.


I think (some of) the slow ops you are seeing are directed to the 
health_metrics pool and can be ignored. If it is too annoying, you could try to 
find out who runs the client with ID client.7524484 and disable it. Might be 
an MGR module.

I'm also pretty certain that the slow ops are related to the health
metrics pool, which is why I've been ignoring them.

What I'm not sure about is whether re-creating the device_health_metrics
pool will cause any problems in the ceph cluster.

Looking at the data you provided and also some older threads of yours
(https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I am starting
to think that we are looking at the fall-out of a past admin operation. One
possibility is that an upmap for PG 1.0 exists that conflicts with the crush
rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG
1.0. For example, the upmap specifies HDDs, but the crush rule requires NVMEs.
The result is an empty set.

So far I've been unable to locate the client with the ID 7524484.  It's
not showing up in the manager dashboard -> Filesystems page, nor in the
output of 'ceph tell mds.ceph1 client ls'.

I'm digging through the compressed logs for the past week to see if I can
find the culprit.

I couldn't really find a simple command to list up-maps. The only 
non-destructive way seems to be to extract the osdmap and create a clean-up 
command file. The cleanup file should contain a command for every PG with an 
upmap. To check this, you can execute (see also 
https://docs.ceph.com/en/latest/man/8/osdmaptool/)

    # ceph osd getmap > osd.map
    # osdmaptool osd.map --upmap-cleanup cleanup.cmd

If you do this, could you please post as usual the contents of cleanup.cmd?

It was empty:

[root@ceph1 ~]# ceph osd getmap > osd.map
got osdmap epoch 52833

[root@ceph1 ~]# osdmaptool osd.map --upmap-cleanup cleanup.cmd
osdmaptool: osdmap file 'osd.map'
writing upmap command output to: cleanup.cmd
checking for upmap cleanups

[root@ceph1 ~]# wc cleanup.cmd
0 0 0 cleanup.cmd
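
As a cross-check on the empty cleanup file, the plain-text output of "ceph osd dump" can be filtered for upmap entries. The sketch below assumes that such entries appear on lines starting with "pg_upmap" or "pg_upmap_items"; that line format is an assumption about the dump output, and list_upmaps is a hypothetical helper name.

```shell
# Filter upmap entries from "ceph osd dump" text supplied on stdin.
# Assumption: upmap entries are lines beginning with "pg_upmap " or
# "pg_upmap_items "; no match simply yields empty output.
list_upmaps() {
    grep -E '^pg_upmap(_items)? ' || true
}
```

It would be used as "ceph osd dump | list_upmaps"; empty output would agree with the empty cleanup.cmd above.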

Also, with the OSD map of your cluster, you can simulate certain admin 
operations and check resulting PG mappings for pools and other things without 
having to touch the cluster; see 
https://docs.ceph.com/en/latest/man/8/osdmaptool/.


To dig a little bit deeper, could you please post as usual the output of:

- ceph pg 1.0 query
- ceph pg 7.39d query

Oddly, it claims that it doesn't have pgid 1.0.

https://pastebin.com/pHh33Dq7

It would also be helpful if you could post the decoded crush map. You can get 
the map as a txt-file as follows:

    # ceph osd getcrushmap -o crush-orig.bin
    # crushtool -d crush-orig.bin -o crush.txt

and post the contents of file crush.txt.
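
To look at a single rule such as replicated_host_nvme without reading the whole decoded map, the rule block can be extracted with awk. This is a minimal sketch: show_crush_rule is a hypothetical helper, and it assumes each rule starts with a line "rule NAME {" and ends with a line beginning with "}".

```shell
# Print the block of one named rule from a decoded crush map.
# Usage: show_crush_rule RULE_NAME CRUSH_TXT_FILE
# Assumption: rules start with "rule NAME {" and end with a line starting "}".
show_crush_rule() {
    awk -v name="$1" '
        $1 == "rule" && $2 == name { inside = 1 }   # start of the wanted rule
        inside { print }                            # print lines inside it
        inside && /^}/ { inside = 0 }               # closing brace ends it
    ' "$2"
}
```

For example, "show_crush_rule replicated_host_nvme crush.txt" would show which device class and failure domain the rule selects.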

https://pastebin.com/EtEGpWy3

Did the slow MDS request complete by now?

Nope.

--Mike

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
