How many active MDSes did you have? (max_mds == 1, right?) Stop the other two MDSes so you can focus on getting exactly one running. Tail its log file and see what it is reporting. Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS while it is rejoining.
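Roughly these steps, assuming systemd-managed daemons, the default log locations and the mimic config database (adjust the host/daemon names to your setup):

  # on the two surplus MDS hosts, stop the daemon so only one MDS is left
  systemctl stop ceph-mds@<host>

  # watch what the remaining MDS reports while it rejoins
  tail -f /var/log/ceph/ceph-mds.<name>.log

  # give it more headroom before the mon marks it laggy/failed
  # (set globally so the mon picks it up as well)
  ceph config set global mds_beacon_grace 600
  # or, without the config db:
  # ceph tell mon.\* injectargs '--mds_beacon_grace=600'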
Is that single MDS running out of memory during the rejoin phase? (A couple of commands to check are sketched below the quoted message.)

-- dan

On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov <an...@aleksandrov.eu> wrote:
>
> Hello community,
>
> we are on ceph 13.2.8 - today something happened with one MDS and ceph
> status reports that the filesystem is degraded. It won't mount either. I
> have taken the server with the MDS that was not working down. There are
> 2 more MDS servers, but they stay in the "rejoin" state. Also only 1 is
> shown in "services", even though there are 2.
>
> Both running MDS servers have these lines in their logs:
>
> heartbeat_map is_healthy 'MDSRank' had timed out after 15
> mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
> 28.8979s ago); MDS internal heartbeat is not healthy!
>
> On one of the MDS nodes I enabled more detailed debug logging, so I am
> also getting:
>
> mds.beacon.mds3 Sending beacon up:standby seq 178
> mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968
>
> Makes no sense and too much stress in my head... Could anyone help, please?
>
> Anton.
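P.S. To see whether the MDS is hitting its cache limit or swapping while in rejoin, something like this on the MDS host (the admin socket commands assume the default socket path; replace <name> with your mds name, e.g. mds2):

  # resident memory of the MDS process vs. available RAM
  ps -o pid,rss,vsz,cmd -C ceph-mds
  free -m

  # what the daemon itself thinks it is using
  ceph daemon mds.<name> cache status
  ceph daemon mds.<name> perf dump | grep -A6 '"mds_mem"'
  ceph daemon mds.<name> config get mds_cache_memory_limit

If the RSS grows far past mds_cache_memory_limit and the box starts swapping, that could explain the missed internal heartbeats during rejoin.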