Thanks Yan! I did this for the bug ticket and missed these replies. I hope
I did it correctly. Here are pastebin links to the dumps:

https://pastebin.com/kw4bZVZT -- primary
https://pastebin.com/sYZQx0ER -- secondary


They are not that long, so here is the output of one of them:


Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe3b100a700 (LWP 120481)]
0x00005617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
(gdb) t
[Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
(gdb) bt
#0  0x00005617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
#1  0x00005617aacfc98b in Server::dispatch_client_request (this=this@entry=0x5617b5acbcd0, mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
#2  0x00005617aacfce9b in Server::handle_client_request (this=this@entry=0x5617b5acbcd0, req=req@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:1716
#3  0x00005617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:258
#4  0x00005617aac6afac in MDSRank::handle_deferrable_message (this=this@entry=0x5617b5d22000, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/MDSRank.cc:716
#5  0x00005617aac795cb in MDSRank::_dispatch (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700, new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
#6  0x00005617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, m=<optimized out>) at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
#7  0x00005617aaf0207b in Context::complete (r=0, this=0x5617bd568080) at /build/ceph-12.2.5/src/include/Context.h:70
#8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0) at /build/ceph-12.2.5/src/mds/MDSContext.cc:30
#9  0x00005617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000) at /build/ceph-12.2.5/src/mds/MDSRank.cc:776
#10 0x00005617aac7921a in MDSRank::ProgressThread::entry (this=0x5617b5d22d40) at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
#11 0x00007fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at pthread_create.c:333
#12 0x00007fe3ba37241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
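
(Side note: the "Server.cc: No such file or directory" line just means gdb could not
find the Ceph source tree on this box, not that anything else failed. If it helps, I
believe I can point gdb at an unpacked 12.2.5 source tree so it prints the faulting
line; the path below is only an example:)

    (gdb) directory /root/ceph-12.2.5/src   # example path to an unpacked source tree
    (gdb) frame 0                           # the faulting frame, Server::handle_client_getattr
    (gdb) list                              # should now show the code around Server.cc:3065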



To capture this, I:
* set the debug level to mds=20 mon=1,
* attached gdb prior to trying to mount aufs from a separate client,
* typed continue, attempted the mount,
* then backtraced after it segfaulted (rough gdb transcript below).
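
Roughly, the gdb session looked like this (the attach invocation is approximate,
and the pidof call is just how I found the MDS process):

    sudo gdb -p $(pidof ceph-mds)   # attach to the running ceph-mds
    (gdb) continue                  # let it run, then attempt the mount from the other client
    # ... MDS hits SIGSEGV, gdb stops ...
    (gdb) t                         # confirm which thread gdb is stopped in
    (gdb) bt                        # backtrace of that thread (pasted above)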

I hope this is more helpful. Is there something else I should try to get
more info? I was hoping for something closer to a Python traceback, where it
tells you a variable is the wrong type or a delimiter is missing. Womp. I am
definitely out of my depth, but now is a great time to learn! Can anyone
shed some more light on what may be wrong?
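
If it would help, the next time it crashes I can also try to dump more state out of
the faulting frame, something along these lines (mdr and is_lookup are the arguments
visible in the backtrace; whether they print usefully depends on the debug symbols
loading):

    (gdb) frame 0                      # Server::handle_client_getattr
    (gdb) info args                    # this, mdr, is_lookup
    (gdb) info locals                  # locals in the faulting frame
    (gdb) print mdr                    # the client request being dispatched
    (gdb) thread apply all bt full     # every thread's backtrace, with locals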



On Fri, May 4, 2018 at 7:49 PM, Yan, Zheng <uker...@gmail.com> wrote:

> On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan <lookcr...@gmail.com> wrote:
> > Forgot to reply to all:
> >
> > Sure thing!
> >
> > I couldn't install the ceph-mds-dbg packages without upgrading. I just
> > finished upgrading the cluster to 12.2.5. The issue still persists in
> > 12.2.5.
> >
> > From here I'm not really sure how to generate the backtrace, so I hope I
> > did it right. For others on Ubuntu this is what I did:
> >
> > * first, raise debug_mds to 20 and debug_ms to 1:
> > ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
> >
> > * install the debug packages
> > ceph-mds-dbg in my case
> >
> > * I also added these options to /etc/ceph/ceph.conf just in case they
> > restart.
> >
> > * Now allow pids to dump (stolen partly from redhat docs and partly from
> > ubuntu)
> > echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
> > /etc/systemd/system.conf
> > sysctl fs.suid_dumpable=2
> > sysctl kernel.core_pattern=/tmp/core
> > systemctl daemon-reload
> > systemctl restart ceph-mds@$(hostname -s)
> >
> > * A crash was created in /var/crash by apport but gdb can't read it. I
> > used apport-unpack and then ran GDB on what is inside:
> >
>
> core dump should be in /tmp/core
>
> > apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> > cd /root/crash_dump/
> > gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
> > /root/ceph_mds_$(hostname -s)_backtrace
> >
> > * This left me with the attached backtraces (which I think are wrong as I
> > see a lot of ?? yet gdb says
> > /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug
> > was loaded)
> >
> >  kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
> >  kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
> >
>
> Try running ceph-mds inside gdb. It should be easy to locate the bug
> once we have a correct coredump file.
>
> Regards
> Yan, Zheng
>
>
> >
> > The log files are pretty large (one 4.1G and the other 200MB)
> >
> > kh10-8 (200MB) mds log --
> > https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
> > kh09-8 (4.1GB) mds log --
> > https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
> >
> > On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly <pdonn...@redhat.com>
> > wrote:
> >>
> >> Hello Sean,
> >>
> >> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan <lookcr...@gmail.com>
> >> wrote:
> >> > I was creating a new user and mount point. On another hardware node I
> >> > mounted CephFS as admin to mount as root. I created /aufstest and then
> >> > unmounted. From there it seems that both of my mds nodes crashed for
> >> > some
> >> > reason and I can't start them any more.
> >> >
> >> > https://pastebin.com/1ZgkL9fa -- my mds log
> >> >
> >> > I have never had this happen in my tests so now I have live data here.
> >> > If
> >> > anyone can lend a hand or point me in the right direction while
> >> > troubleshooting that would be a godsend!
> >>
> >> Thanks for keeping the list apprised of your efforts. Since this is so
> >> easily reproduced for you, I would suggest that you next get higher
> >> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
> >> a segmentation fault, a backtrace with debug symbols from gdb would
> >> also be helpful.
> >>
> >> --
> >> Patrick Donnelly
> >
> >
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
