Dear Michael,
this is a bit of a tough nut to crack. I can't see anything obvious, but I have
two hypotheses that you might consider testing.
1) Problem with 1 incomplete PG.
In the shadow hierarchy for your cluster I can see quite a lot of nodes like
{
"id": -135,
"name":
Dear Michael,
> Please mark OSD 41 as "in" again and wait for some slow ops to show up.
I forgot to say what comes next. "wait for some slow ops to show up" ... and then what?
Could you please go to the host of the affected OSD and look at the output of
"ceph daemon osd.ID ops" or "ceph daemon osd.ID
Dear Michael,
thanks for this initial work. I will need to look through the files you posted
in more detail. In the meantime:
Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far
as I can see, marking it "out" might have cleared hanging slow ops (there were
1000