[ceph-users] Re: OSDs cannot join cluster anymore

Eugen Block Wed, 21 Jun 2023 01:32:21 -0700

Hi,

Yes, we drained the nodes. It needed two weeks to finish the process, and yes, I think this is the root cause.
So we still have the nodes but when I try to restart one of those OSDs it still cannot join:

If the nodes were drained successfully (can you confirm that all PGs were active+clean after draining, before you removed the nodes?), then the disks on the removed nodes wouldn't have any data to bring back. The question would be why the remaining OSDs still reference the removed OSDs. Or am I misunderstanding something? I think it would help to know the whole story; can you provide more details? Some more general cluster info would also be helpful:

$ ceph -s
$ ceph osd tree
$ ceph health detail
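
The three outputs above could be captured into a single file for posting to the list. This is only a minimal sketch: the function name collect_cluster_info and the default file name are made up for illustration, and it assumes the ceph CLI is on the PATH (on a containerized cluster it may only be available inside the management container or shell).

```shell
#!/usr/bin/env bash
# Sketch: write the output of the three requested commands into one file.
# The function name and default file name are illustrative assumptions.
collect_cluster_info() {
    local out="${1:-ceph-cluster-info.txt}"
    : > "$out"                      # create or truncate the output file
    local sub
    for sub in "-s" "osd tree" "health detail"; do
        echo "### ceph $sub" >> "$out"
        # $sub is left unquoted on purpose so "osd tree" becomes two arguments
        ceph $sub >> "$out" 2>&1 || echo "(command failed)" >> "$out"
    done
    echo "$out"                     # print the path of the file that was written
}
```

Calling `collect_cluster_info /tmp/cluster-info.txt` would write all three outputs, each under its own header line, to that file.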


Quoting Malte Stroem <malte.str...@gmail.com>:

Hello Eugen,

thank you. Yesterday I thought: Well, Eugen can help!
Yes, we drained the nodes. It needed two weeks to finish the process, and yes, I think this is the root cause.
So we still have the nodes but when I try to restart one of those OSDs it still cannot join:
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66
Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66
Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+0000 7fabef5a1200 0 monclient(hunting): authenticate timed out after 300
Same messages on all OSDs.

We still have some nodes running and did not restart those OSDs.

Best,
Malte

On 21.06.23 at 09:50, Eugen Block wrote:
Hi,
Can you share more details about what exactly you did? How did you remove the nodes? Hopefully, you waited for the draining to finish? But if the remaining OSDs are waiting for removed OSDs, it sounds like the draining was not finished.
Quoting Malte Stroem <malte.str...@gmail.com>:
Hello,

we removed some nodes from our cluster. This worked without problems.
Now, lots of OSDs do not want to join the cluster anymore if we reboot one of the still available nodes.
It always runs into timeouts:

--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300

MONs and MGRs are running fine.

The network is working; netcat shows that the MONs' ports are open.
Setting a higher debug level has no effect, even if we add it to the ceph.conf file.
The PGs are pretty unhappy, e.g.:
7.143 87771 0 0 0 0 314744902235 0 0 10081 10081 down 2023-06-20T09:16:03.546158+0000 961275'1395646 961300:9605547 [209,NONE,NONE] 209 [209,NONE,NONE] 209 961231'1395512 2023-06-19T23:46:40.101791+0000 961231'1395512 2023-06-19T23:46:40.101791+0000
The PG query wants us to mark an OSD as lost; however, I do not want to do this.

OSDs are blocked by OSDs from the removed nodes:

ceph osd blocked-by
osd  num_blocked
152           38
244           41
144           54
...
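
To see which of the blocking OSDs holds up the most peers, the output above could be sorted by num_blocked. A minimal sketch, assuming bash with awk and sort available; the function name sort_blocked_by is made up for illustration and reads the command output on stdin:

```shell
#!/usr/bin/env bash
# Sketch: sort "ceph osd blocked-by" output by num_blocked, highest first.
# The function name is an illustrative assumption, not part of Ceph.
sort_blocked_by() {
    # Keep only rows made of two integers (this drops the header and any "..." line),
    # then sort numerically on the second column in descending order.
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $1, $2 }' | sort -k2,2nr
}
```

It could be used as `ceph osd blocked-by | sort_blocked_by`; for the three rows shown above, OSD 144 would come first.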
We added the removed hosts again and tried to start the OSDs on these nodes, and they also ran into the timeout mentioned above.
This is a containerized cluster running version 16.2.10.

Replication is 3, some pools use an erasure coded profile.

Best regards,
Malte


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io