Thanks John,

I got this in the mds log too:

2017-07-11 07:10:06.293219 7f1836837700  1 mds.beacon.b _send skipping beacon, heartbeat map not healthy
2017-07-11 07:10:08.330979 7f183b942700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

but that respawn happened 2 minutes after I got this:

2017-07-11 07:10:10.948237 7f183993e700  0 mds.beacon.b handle_mds_beacon no longer laggy

This confuses me. Could it be a network issue? Local network
communication was fine at the time, so it might be a bug.

While recovering, it was stuck in the rejoin_joint_start state for almost
50 minutes:
2017-07-11 07:13:36.587188 7f264a112700  1 mds.0.890528 rejoin_joint_start
[...]
2017-07-11 07:56:21.521006 7f0f78917700  1 mds.0.890537 recovery_done -- successful recovery!
2017-07-11 07:56:21.522570 7f0f78917700  1 mds.0.890537 active_start
2017-07-11 07:56:21.533507 7f0f78917700  1 mds.0.890537 cluster recovered.
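
For reference, the gap between two of the log lines above can be checked with a short script. This is only a minimal sketch: it assumes each log line starts with a "YYYY-MM-DD HH:MM:SS.ffffff" timestamp, as in the excerpts here, and the helper names parse_ts and elapsed_seconds are made up for illustration. It measures only the interval between the two lines passed in, not the whole outage.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading timestamp of an MDS log line (format assumed from the excerpts)."""
    # The first two whitespace-separated fields are the date and the time.
    date, clock = line.split()[:2]
    return datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f")


def elapsed_seconds(start_line, end_line):
    """Return the number of seconds between the timestamps of two log lines."""
    return (parse_ts(end_line) - parse_ts(start_line)).total_seconds()


# The two lines quoted above, with the mail client's line wrapping undone.
start = "2017-07-11 07:13:36.587188 7f264a112700  1 mds.0.890528 rejoin_joint_start"
end = ("2017-07-11 07:56:21.521006 7f0f78917700  1 mds.0.890537 "
       "recovery_done -- successful recovery!")

print(f"{elapsed_seconds(start, end) / 60:.1f} minutes between the two lines")
```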

I could see with "ceph daemon mds.b perf dump mds" that it was scanning the
inodes, but when this happens (quite often) I have no way of telling when it
will finish.
On many other occasions this was caused by a crash
(http://tracker.ceph.com/issues/20535), but that was not the case today.
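
One way to at least see whether the counters are still moving is to take repeated perf dump snapshots and compare them. The following is a rough sketch, not a tested tool: it assumes that "ceph daemon mds.b perf dump mds" prints a JSON object with a top-level "mds" section of counters, and the function names are hypothetical. It cannot predict when rejoin will end; it only shows which counters changed between two snapshots.

```python
import json
import subprocess
import time


def numeric_counters(dump, section="mds"):
    """Keep only plain numeric counters from one section of a perf dump."""
    return {
        name: value
        for name, value in dump.get(section, {}).items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def counter_deltas(before, after):
    """Return {counter: change} for counters whose value differs between snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def snapshot(daemon="mds.b"):
    """Run the admin-socket command from this thread and parse its JSON output."""
    out = subprocess.check_output(["ceph", "daemon", daemon, "perf", "dump", "mds"])
    return numeric_counters(json.loads(out))


def watch(daemon="mds.b", interval=10, rounds=6):
    """Print the counters that changed in each interval (hypothetical helper)."""
    prev = snapshot(daemon)
    for _ in range(rounds):
        time.sleep(interval)
        cur = snapshot(daemon)
        print(counter_deltas(prev, cur))
        prev = cur
```

Calling watch() on the MDS host would print one dictionary of changed counters per interval; an empty dictionary would suggest that nothing in that section moved during that interval.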


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Tue, Jul 11, 2017 at 11:36 AM, John Spray <jsp...@redhat.com> wrote:

> On Tue, Jul 11, 2017 at 3:23 PM, Webert de Souza Lima
> <webert.b...@gmail.com> wrote:
> > Hello,
> >
> > today I got a MDS respawn with the following message:
> >
> > 2017-07-11 07:07:55.397645 7ffb7a1d7700  1 mds.b handle_mds_map i
> > (10.0.1.2:6822/28190) dne in the mdsmap, respawning myself
>
> "dne in the mdsmap" is what an MDS says when the monitors have
> concluded that the MDS is dead, but the MDS is really alive.  "dne"
> stands for "does not exist", so the MDS is complaining that it has
> been removed from the mdsmap.
>
> The message could definitely be better worded!
>
> You can see this happen in certain buggy cases where the MDS is
> failing to send beacon messages to the mons, even though it is really
> alive -- if you're stuck in rejoin, then that is probably related: try
> increasing the log verbosity to work out where the MDS is stuck while
> it's sitting in the rejoin state.
>
> John
>
> >
> > it happened 3 times within 5 minutes. After so, the MDS took 50 minutes to
> > recover.
> > I can't find what exactly that message means and how to avoid it.
> >
> > I'll be glad to provide any further information. Thanks!
> >
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
