I'd also try starting only one MDS and waiting until it's fully up and
running before starting the other, rather than booting both at once.
Sometimes the two keep switching states between each other.


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Mar 29, 2018 at 7:32 AM, John Spray <jsp...@redhat.com> wrote:

> On Thu, Mar 29, 2018 at 8:16 AM, Zhang Qiang <dotslash...@gmail.com>
> wrote:
> > Hi,
> >
> > Ceph version 10.2.3. After a power outage, I tried to start the MDS
> > daemons, but they got stuck replaying journals forever. I have no idea
> > why they were taking that long, because this is just a small test
> > cluster with only hundreds of MB of data. I restarted them, and hit
> > the error below.
>
> Usually if an MDS is stuck in replay, it's because it's waiting for
> the OSDs to service the reads of the journal.  Are all your PGs up and
> healthy?
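
One way to answer the PG question is to look at the per-state PG counts in the cluster status. The sketch below is only an illustration: it assumes the JSON printed by `ceph status --format json` has a `pgmap.pgs_by_state` list of `state_name`/`count` entries (the layout may differ between releases), and `unhealthy_pg_states` is a hypothetical helper, not part of Ceph. The sample data is made up.

```python
import json


def unhealthy_pg_states(status_json: str) -> dict:
    """Return {state_name: count} for PG states that are not active and clean.

    Assumes the input looks like the output of `ceph status --format json`,
    with a pgmap.pgs_by_state list (an assumption; verify on your release).
    """
    status = json.loads(status_json)
    states = status.get("pgmap", {}).get("pgs_by_state", [])
    bad = {}
    for entry in states:
        flags = set(entry["state_name"].split("+"))
        # A PG is treated as healthy when it is both "active" and "clean";
        # extra flags such as "scrubbing" are tolerated.
        if not {"active", "clean"} <= flags:
            bad[entry["state_name"]] = entry["count"]
    return bad


# Made-up sample, not taken from the cluster in this thread.
sample = json.dumps({"pgmap": {"pgs_by_state": [
    {"state_name": "active+clean", "count": 60},
    {"state_name": "down+peering", "count": 4},
]}})
print(unhealthy_pg_states(sample))
```

An empty result would suggest the OSDs can serve the journal reads, so the replay stall has some other cause.
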
>
> >
> > Any chance I can restore them?
> >
> > Mar 28 14:20:30 node01 systemd: Started Ceph metadata server daemon.
> > Mar 28 14:20:30 node01 systemd: Starting Ceph metadata server daemon...
> > Mar 28 14:20:30 node01 ceph-mds: 2018-03-28 14:20:30.796255
> > 7f0150c8c180 -1 deprecation warning: MDS id 'mds.0' is invalid and
> > will be forbidden in a future version.  MDS names may not start with a
> > numeric digit.
>
> If you're really using "0" as an MDS name, now would be a good time to
> fix that -- most people use a hostname or something like that.  The
> reason that numeric MDS names are invalid is that it makes commands
> like "ceph mds fail 0" ambiguous (do we mean the name 0 or the rank
> 0?).
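
To make the name-versus-rank ambiguity concrete, here is a small Python sketch. It is not Ceph's actual validation or lookup code; `starts_with_digit`, `interpretations`, and the rank-to-name tables are hypothetical and exist only for illustration.

```python
def starts_with_digit(name: str) -> bool:
    """True if the MDS name begins with 0-9, the case the warning objects to."""
    return name[:1] in set("0123456789")


def interpretations(target: str, rank_to_name: dict) -> list:
    """List every way `target` could be read: as a daemon name or as a rank."""
    hits = []
    if target in rank_to_name.values():
        hits.append(("name", target))
    if target.isdigit() and int(target) in rank_to_name:
        hits.append(("rank", int(target)))
    return hits


# Hypothetical cluster where one daemon is literally named "0" but holds rank 1.
numeric_names = {0: "node01", 1: "0"}
print(interpretations("0", numeric_names))   # matches both a name and a rank

# With hostname-style names, "0" can only mean the rank.
host_names = {0: "node01", 1: "node02"}
print(interpretations("0", host_names))
```

In the first table, "0" could refer to two different daemons, which is the situation that hostname-style names avoid.
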
>
> > Mar 28 14:20:30 node01 ceph-mds: starting mds.0 at :/0
> > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: In function 'const
> > entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f014ac6c700 time
> > 2018-03-28 14:20:30.942480
> > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: 582: FAILED
> > assert(up.count(m))
> > Mar 28 14:20:30 node01 ceph-mds: ceph version 10.2.3
> > (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> > Mar 28 14:20:30 node01 ceph-mds: 1: (ceph::__ceph_assert_fail(char
> > const*, char const*, int, char const*)+0x85) [0x7f01512aba45]
> > Mar 28 14:20:30 node01 ceph-mds: 2: (MDSMap::get_inst(int)+0x20f)
> > [0x7f0150ee5e3f]
> > Mar 28 14:20:30 node01 ceph-mds: 3:
> > (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x7b9)
> > [0x7f0150ed6e49]
>
> This is a weird assertion.  I can't see how it could be reached :-/
>
> John
>
> > Mar 28 14:20:30 node01 ceph-mds: 4:
> > (MDSDaemon::handle_mds_map(MMDSMap*)+0xe3d) [0x7f0150eb396d]
> > Mar 28 14:20:30 node01 ceph-mds: 5:
> > (MDSDaemon::handle_core_message(Message*)+0x7b3) [0x7f0150eb4eb3]
> > Mar 28 14:20:30 node01 ceph-mds: 6:
> > (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f0150eb514b]
> > Mar 28 14:20:30 node01 ceph-mds: 7: (DispatchQueue::entry()+0x78a)
> > [0x7f01513ad4aa]
> > Mar 28 14:20:30 node01 ceph-mds: 8:
> > (DispatchQueue::DispatchThread::entry()+0xd) [0x7f015129098d]
> > Mar 28 14:20:30 node01 ceph-mds: 9: (()+0x7dc5) [0x7f0150095dc5]
> > Mar 28 14:20:30 node01 ceph-mds: 10: (clone()+0x6d) [0x7f014eb61ced]
> > Mar 28 14:20:30 node01 ceph-mds: NOTE: a copy of the executable, or
> > `objdump -rdS <executable>` is needed to interpret this.
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com