[ceph-users] Re: multiple OSD crash, unfound objects

Frank Schilder Wed, 21 Oct 2020 08:48:41 -0700

Hi Michael,

some quick thoughts.

That you can create a pool with 1 PG is a good sign: the crush rule is OK. That 
pg query says it doesn't have PG 1.0 points in the right direction: there is an 
inconsistency in the cluster. This is also indicated by the fact that no upmaps 
seem to exist (the clean-up script was empty). With the osd map you extracted, 
you can check what the osd map believes the mapping of the PGs of pool 1 is:

  # osdmaptool osd.map --test-map-pgs-dump --pool 1

or whether it also claims the PG does not exist. It looks like something went 
wrong during pool creation, and you are not the only one having problems with 
this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html
Sounds a lot like a bug in cephadm.
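
To look at a single PG instead of the full dump, the osdmaptool output can be 
filtered. This is only a minimal sketch. It assumes the dump prints one line 
per PG that starts with the PG ID, and pg_line is a made-up helper name, not 
part of Ceph:

```shell
# pg_line PGID
# Reads `osdmaptool ... --test-map-pgs-dump` output on stdin and prints the
# line whose first field is PGID. Returns non-zero if there is no such line.
# Assumption: each PG is printed on its own line, starting with the PG ID.
pg_line() {
  # Appending "" forces a string comparison, so 1.1 and 1.10 stay distinct.
  awk -v pg="$1" '($1 "") == (pg "") { print; found = 1 } END { exit !found }'
}

# Usage against the extracted map (does not touch the cluster):
#   osdmaptool osd.map --test-map-pgs-dump --pool 1 | pg_line 1.0 \
#     || echo "osd map has no mapping line for PG 1.0"
```

If no line comes back for 1.0, the osd map itself does not know the PG, which 
would match what pg query reported.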

In principle, deleting and recreating the health metrics pool looks like a way 
forward. Please look at the procedure mentioned in the thread quoted above. 
Deleting the pool there led to some crashes, and some surgery on some OSDs was 
necessary. However, in your case it might just work, because you already 
redeployed the OSDs in question, if I remember correctly.

In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

  # ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.
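
Running the sessions command on every mon host by hand is tedious, so here is 
a rough sketch of a loop that searches all of them for one client, for example 
the client.7524484 from the quoted mail below. It assumes that the mon ID 
equals the host name and that ssh to the mon hosts works. RUN_ON and 
find_client_on_mons are names made up for this sketch. With cephadm the admin 
socket may only be reachable from inside the daemon's container, so the 
command may need adjusting:

```shell
# Hook for running a command on a remote host; plain ssh by default.
# RUN_ON is a hypothetical variable of this sketch, not a Ceph setting.
RUN_ON=${RUN_ON:-ssh}

# find_client_on_mons CLIENT HOST...
# Assumptions: every listed host runs a mon whose ID equals the host name,
# and the admin socket is reachable there (root on the mon host).
find_client_on_mons() {
  client="$1"
  shift
  for host in "$@"; do
    # -F: fixed string, -w: do not match client.75244841 for client.7524484
    $RUN_ON "$host" ceph daemon "mon.$host" sessions 2>/dev/null |
      grep -Fw "$client" | sed "s|^|$host: |" || true
  done
}

# Usage (host names are placeholders):
#   find_client_on_mons client.7524484 ceph1 ceph2 ceph3
```

Each hit is prefixed with the mon host that holds the session. The session 
line should include the client address, which helps to find the machine and 
process behind it.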

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.

I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)

The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).
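
A rough sketch of how the client behind a stuck request could be located. It 
assumes that the op descriptions in the MDS op dump have the form 
"client_request(client.<id>:<tid> ...", which may differ between releases. 
clients_from_ops is a made-up helper name and CLIENT_ID is a placeholder:

```shell
# clients_from_ops
# Reads MDS op dump text on stdin and prints the numeric IDs of the clients
# whose requests appear in it, one per line, without duplicates.
# Assumption: op descriptions look like "client_request(client.<id>:<tid> ...".
clients_from_ops() {
  sed -n 's/.*client_request(client\.\([0-9][0-9]*\):.*/\1/p' | sort -u
}

# Usage on the host of the active MDS (mds.ceph1 is the MDS named in this thread):
#   ceph daemon mds.ceph1 dump_ops_in_flight | clients_from_ops
# Then inspect the session and, if needed, evict it (CLIENT_ID is a placeholder):
#   ceph tell mds.ceph1 client ls
#   ceph tell mds.ceph1 client evict id=CLIENT_ID
```

The client ls output shows the host name and mount point for each session, 
which is usually enough to find the user before evicting.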

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Michael Thomas <w...@caltech.edu>
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/20/20 1:18 PM, Frank Schilder wrote:
> Dear Michael,
>
>>> Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an 
>>> OSD mapping?
>
> I meant here with crush rule replicated_host_nvme. Sorry, forgot.

Seems to have worked fine:

https://pastebin.com/PFgDE4J1

>> Yes, the OSD was still out when the previous health report was created.
>
> Hmm, this is odd. If this is correct, then it did report a slow op even 
> though it was out of the cluster:
>
>> from https://pastebin.com/3G3ij9ui:
>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
>> [osd.0,osd.41] have slow ops.
>
> Not sure what to make of that. It looks almost like you have a ghost osd.41.
>
>
> I think (some of) the slow ops you are seeing are directed to the 
> health_metrics pool and can be ignored. If it is too annoying, you could try 
> to find out who runs the client with ID client.7524484 and disable it. Might 
> be an MGR module.

I'm also pretty certain that the slow ops are related to the health
metrics pool, which is why I've been ignoring them.

What I'm not sure about is whether re-creating the device_health_metrics
pool will cause any problems in the ceph cluster.

> Looking at the data you provided and also some older threads of yours 
> (https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I am 
> starting to think that we are looking at the fall-out of a past admin 
> operation. One possibility is that an upmap for PG 1.0 exists that conflicts 
> with the crush rule replicated_host_nvme and, hence, prevents the assignment 
> of OSDs to PG 1.0. For example, the upmap specifies HDDs, but the crush rule 
> requires NVMes. The result is an empty set.

So far I've been unable to locate the client with the ID 7524484.  It's
not showing up in the manager dashboard -> Filesystems page, nor in the
output of 'ceph tell mds.ceph1 client ls'.

I'm digging through the compressed logs for the past week to see if I can
find the culprit.

> I couldn't really find a simple command to list up-maps. The only 
> non-destructive way seems to be to extract the osdmap and create a clean-up 
> command file. The cleanup file should contain a command for every PG with an 
> upmap. To check this, you can execute (see also 
> https://docs.ceph.com/en/latest/man/8/osdmaptool/)
>
>    # ceph osd getmap > osd.map
>    # osdmaptool osd.map --upmap-cleanup cleanup.cmd
>
> If you do this, could you please post as usual the contents of cleanup.cmd?

It was empty:

[root@ceph1 ~]# ceph osd getmap > osd.map
got osdmap epoch 52833

[root@ceph1 ~]# osdmaptool osd.map --upmap-cleanup cleanup.cmd
osdmaptool: osdmap file 'osd.map'
writing upmap command output to: cleanup.cmd
checking for upmap cleanups

[root@ceph1 ~]# wc cleanup.cmd
0 0 0 cleanup.cmd
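
As a cross-check on the empty cleanup file: as far as I know, 'ceph osd dump' 
prints one pg_upmap or pg_upmap_items line for every PG that has an upmap, so 
the upmaps of one pool can be listed directly. An empty cleanup.cmd may only 
mean that there is nothing to clean up. This is a sketch under that 
assumption, and upmaps_for_pool is a made-up helper name:

```shell
# upmaps_for_pool POOL_ID
# Reads `ceph osd dump` output on stdin and prints the upmap entries whose
# PG ID belongs to pool POOL_ID.
# Assumption: entries look like "pg_upmap_items <pgid> [...]" or
# "pg_upmap <pgid> [...]".
upmaps_for_pool() {
  # index(...) == 1 requires the PG ID to start with "<pool>.", so pool 1
  # does not pick up PGs of pool 11.
  awk -v pool="$1" \
    '($1 == "pg_upmap" || $1 == "pg_upmap_items") && index($2, pool ".") == 1'
}

# Usage (read-only):
#   ceph osd dump | upmaps_for_pool 1
```

No output for pool 1 would confirm that no upmap exists for PG 1.0.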

> Also, with the OSD map of your cluster, you can simulate certain admin 
> operations and check resulting PG mappings for pools and other things without 
> having to touch the cluster; see 
> https://docs.ceph.com/en/latest/man/8/osdmaptool/.
>
>
> To dig a little bit deeper, could you please post as usual the output of:
>
> - ceph pg 1.0 query
> - ceph pg 7.39d query

Oddly, it claims that it doesn't have pgid 1.0.

https://pastebin.com/pHh33Dq7

> It would also be helpful if you could post the decoded crush map. You can get 
> the map as a txt-file as follows:
>
>    # ceph osd getcrushmap -o crush-orig.bin
>    # crushtool -d crush-orig.bin -o crush.txt
>
> and post the contents of file crush.txt.

https://pastebin.com/EtEGpWy3

> Did the slow MDS request complete by now?

Nope.

--Mike
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
