[ceph-users] Re: multiple OSD crash, unfound objects

Frank Schilder Tue, 20 Oct 2020 15:19:30 -0700

Dear Michael,

> > Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an 
> > OSD mapping?


I meant a test pool that uses the crush rule replicated_host_nvme. Sorry, I 
forgot to mention that.


> Yes, the OSD was still out when the previous health report was created.

Hmm, this is odd. If this is correct, then it did report a slow op even though 
it was out of the cluster:

> from https://pastebin.com/3G3ij9ui:
> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
> [osd.0,osd.41] have slow ops.

Not sure what to make of that. It looks almost like you have a ghost osd.41.


I think (some of) the slow ops you are seeing are directed to the 
health_metrics pool and can be ignored. If they are too annoying, you could 
try to find out who runs the client with ID client.7524484 and disable it. It 
might be an MGR module.


Looking at the data you provided and also some older threads of yours 
(https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I am starting 
to suspect that we are looking at the fall-out of a past admin operation. One 
possibility is that an upmap for PG 1.0 exists that conflicts with the crush 
rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 
1.0. For example, the upmap specifies HDDs, but the crush rule requires NVMEs. 
The result is an empty set.

I couldn't really find a simple command to list upmaps. The only 
non-destructive way seems to be to extract the osdmap and create a cleanup 
command file. The cleanup file should contain a command for every PG with an 
upmap. To check this, you can execute (see also 
https://docs.ceph.com/en/latest/man/8/osdmaptool/):

  # ceph osd getmap > osd.map
  # osdmaptool osd.map --upmap-cleanup cleanup.cmd

If you do this, could you please post as usual the contents of cleanup.cmd?
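
As a possible cross-check, existing upmap entries should also show up in the 
plain-text output of "ceph osd dump", on lines starting with pg_upmap or 
pg_upmap_items. The following is only a minimal sketch under that assumption 
about the output format; list_upmaps is a hypothetical helper name, not a ceph 
command:

```shell
# Hypothetical helper: filter upmap entries out of "ceph osd dump" text on stdin.
# Assumption: upmap lines start with "pg_upmap " or "pg_upmap_items ".
list_upmaps() {
    # "|| true" keeps the exit status clean when no upmap lines are present.
    grep -E '^pg_upmap(_items)? ' || true
}

# Usage on a live cluster (read-only):
#   ceph osd dump | list_upmaps
```

An empty result would suggest that no upmaps are set, in which case the 
conflict described above is probably not the cause.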

Also, with the OSD map of your cluster, you can simulate certain admin 
operations and check resulting PG mappings for pools and other things without 
having to touch the cluster; see 
https://docs.ceph.com/en/latest/man/8/osdmaptool/.
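
For example, the mapping of a single PG can be computed from the saved map 
file alone. This is just a sketch, assuming the osd.map file extracted above 
and the --test-map-pg option of osdmaptool; map_pg_offline is a hypothetical 
wrapper name:

```shell
# Hypothetical wrapper: show where a PG maps according to a saved OSD map.
# It only reads the map file; the running cluster is not touched.
map_pg_offline() {
    # $1 = OSD map file, $2 = PG id (for example 1.0)
    osdmaptool "$1" --test-map-pg "$2"
}

# Usage:
#   map_pg_offline osd.map 1.0
```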


To dig a little bit deeper, could you please post as usual the output of:

- ceph pg 1.0 query
- ceph pg 7.39d query

It would also be helpful if you could post the decoded crush map. You can get 
the map as a txt-file as follows:

  # ceph osd getcrushmap -o crush-orig.bin
  # crushtool -d crush-orig.bin -o crush.txt

and post the contents of file crush.txt.
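
If you want to look at the rule yourself first, the relevant block can be cut 
out of crush.txt. A minimal sketch, assuming the usual decompiled layout where 
a rule starts with "rule <name> {" and ends with a line beginning with "}"; 
show_crush_rule is a hypothetical helper name:

```shell
# Hypothetical helper: print one rule block from a decompiled crush map.
# Assumption: the block starts with "rule <name> {" and ends at the next
# line that begins with "}".
show_crush_rule() {
    # $1 = rule name, $2 = decompiled crush map file
    awk -v name="$1" '
        $1 == "rule" && $2 == name { p = 1 }   # start of the wanted rule
        p { print }                            # print lines inside the block
        p && /^}/ { p = 0 }                    # stop after the closing brace
    ' "$2"
}

# Usage:
#   show_crush_rule replicated_host_nvme crush.txt
```

The "step take" line of the rule should show which device class it selects.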


Has the slow MDS request completed by now?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

Contents of previous messages removed.
