Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
On Thu, May 24, 2018 at 12:00 AM, Sean Sullivan wrote:
> Thanks Yan! I did this for the bug ticket and missed these replies. I hope I
> did it correctly. Here are the pastes of the dumps:
>
> https://pastebin.com/kw4bZVZT -- primary
> https://pastebin.com/sYZQx0ER -- secondary
>
> They are not that long; here is the output of one:
>
> Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fe3b100a700 (LWP 120481)]
> 0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0,
>     mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
> 3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
> (gdb) t
> [Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
> (gdb) bt
> #0  0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0,
>     mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
> #1  0x5617aacfc98b in Server::dispatch_client_request (this=this@entry=0x5617b5acbcd0,
>     mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
> #2  0x5617aacfce9b in Server::handle_client_request (this=this@entry=0x5617b5acbcd0,
>     req=req@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:1716
> #3  0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, m=m@entry=0x5617bdfa8700)
>     at /build/ceph-12.2.5/src/mds/Server.cc:258
> #4  0x5617aac6afac in MDSRank::handle_deferrable_message (this=this@entry=0x5617b5d22000,
>     m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/MDSRank.cc:716
> #5  0x5617aac795cb in MDSRank::_dispatch (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700,
>     new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
> #6  0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, m=<optimized out>)
>     at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
> #7  0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080)
>     at /build/ceph-12.2.5/src/include/Context.h:70
> #8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0)
>     at /build/ceph-12.2.5/src/mds/MDSContext.cc:30
> #9  0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000)
>     at /build/ceph-12.2.5/src/mds/MDSRank.cc:776
> #10 0x5617aac7921a in MDSRank::ProgressThread::entry (this=0x5617b5d22d40)
>     at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
> #11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at pthread_create.c:333
> #12 0x7fe3ba37241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
> I:
> * set the debug level to mds=20, mon=1,
> * attached gdb prior to trying to mount aufs from a separate client,
> * typed continue, attempted the mount,
> * then backtraced after it segfaulted.
>
> I hope this is more helpful. Is there something else I should try to get more
> info? I was hoping for something closer to a Python trace, where it says a
> variable is a different type or a delimiter is missing. Womp. I am definitely
> out of my depth, but now is a great time to learn! Can anyone shed some more
> light on what may be wrong?
I updated https://tracker.ceph.com/issues/23972. It's a kernel bug: the kernel client sends a malformed request to the MDS.

Regards,
Yan, Zheng

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Thanks Yan! I did this for the bug ticket and missed these replies. I hope I did it correctly. Here are the pastes of the dumps:

https://pastebin.com/kw4bZVZT -- primary
https://pastebin.com/sYZQx0ER -- secondary

They are not that long; here is the output of one:

Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe3b100a700 (LWP 120481)]
0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0,
    mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
(gdb) t
[Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
(gdb) bt
#0  0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0,
    mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
#1  0x5617aacfc98b in Server::dispatch_client_request (this=this@entry=0x5617b5acbcd0,
    mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
#2  0x5617aacfce9b in Server::handle_client_request (this=this@entry=0x5617b5acbcd0,
    req=req@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:1716
#3  0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, m=m@entry=0x5617bdfa8700)
    at /build/ceph-12.2.5/src/mds/Server.cc:258
#4  0x5617aac6afac in MDSRank::handle_deferrable_message (this=this@entry=0x5617b5d22000,
    m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/MDSRank.cc:716
#5  0x5617aac795cb in MDSRank::_dispatch (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700,
    new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
#6  0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, m=<optimized out>)
    at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
#7  0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080)
    at /build/ceph-12.2.5/src/include/Context.h:70
#8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0)
    at /build/ceph-12.2.5/src/mds/MDSContext.cc:30
#9  0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000)
    at /build/ceph-12.2.5/src/mds/MDSRank.cc:776
#10 0x5617aac7921a in MDSRank::ProgressThread::entry (this=0x5617b5d22d40)
    at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
#11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at pthread_create.c:333
#12 0x7fe3ba37241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I:
* set the debug level to mds=20, mon=1,
* attached gdb prior to trying to mount aufs from a separate client,
* typed continue, attempted the mount,
* then backtraced after it segfaulted.

I hope this is more helpful. Is there something else I should try to get more info? I was hoping for something closer to a Python trace, where it says a variable is a different type or a delimiter is missing. Womp. I am definitely out of my depth, but now is a great time to learn! Can anyone shed some more light on what may be wrong?

On Fri, May 4, 2018 at 7:49 PM, Yan, Zheng wrote:
> On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan wrote:
> > Forgot to reply to all:
> >
> > Sure thing!
> >
> > I couldn't install the ceph-mds-dbg packages without upgrading. I just
> > finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.
> >
> > From here I'm not really sure how to generate the backtrace, so I hope I
> > did it right. For others on Ubuntu, this is what I did:
> >
> > * First, up debug_mds to 20 and debug_ms to 1:
> >     ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
> >
> > * Install the debug packages (ceph-mds-dbg in my case).
> >
> > * I also added these options to /etc/ceph/ceph.conf just in case they
> > restart.
> >
> > * Now allow pids to dump (stolen partly from Red Hat docs and partly from
> > Ubuntu):
> >     echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
> >     sysctl fs.suid_dumpable=2
> >     sysctl kernel.core_pattern=/tmp/core
> >     systemctl daemon-reload
> >     systemctl restart ceph-mds@$(hostname -s)
> >
> > * A crash was created in /var/crash by apport, but gdb can't read it. I used
> > apport-unpack and then ran gdb on what is inside:
>
> core dump should be in /tmp/core
>
> >     apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> >     cd /root/crash_dump/
> >     gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace
> >
> > * This left me with the attached backtraces (which I think are wrong, as I
> > see a lot of ?? yet gdb says
> > /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)
> >
> > kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
> > kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
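An aside, not from the thread: when comparing several `thr a a bt` dumps like the pastebins above, the innermost frame of each crash is usually the interesting part. A minimal sketch (the `extract_frame0` helper name is my own, and the sample input is abridged from the backtrace quoted above):

```shell
# Hypothetical helper: print only frame #0 lines from a gdb backtrace on
# stdin, so the crashing function of many dumps can be compared at a glance.
extract_frame0() {
    awk '/^#0[[:space:]]/ { sub(/^#0[[:space:]]+/, ""); print }'
}

# Sample input abridged from the backtrace quoted above:
extract_frame0 <<'EOF'
#0  0x5617aacc48c2 in Server::handle_client_getattr (this=0x5617b5acbcd0, mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:3065
#1  0x5617aacfc98b in Server::dispatch_client_request (this=0x5617b5acbcd0, mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
EOF
```

On a saved dump you would run something like `extract_frame0 < /root/ceph_mds_$(hostname -s)_backtrace` instead of the heredoc.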
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan wrote:
> Forgot to reply to all:
>
> Sure thing!
>
> I couldn't install the ceph-mds-dbg packages without upgrading. I just
> finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.
>
> From here I'm not really sure how to generate the backtrace, so I hope I
> did it right. For others on Ubuntu, this is what I did:
>
> * First, up debug_mds to 20 and debug_ms to 1:
>     ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
>
> * Install the debug packages (ceph-mds-dbg in my case).
>
> * I also added these options to /etc/ceph/ceph.conf just in case they
> restart.
>
> * Now allow pids to dump (stolen partly from Red Hat docs and partly from
> Ubuntu):
>     echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
>     sysctl fs.suid_dumpable=2
>     sysctl kernel.core_pattern=/tmp/core
>     systemctl daemon-reload
>     systemctl restart ceph-mds@$(hostname -s)
>
> * A crash was created in /var/crash by apport, but gdb can't read it. I used
> apport-unpack and then ran gdb on what is inside:
>     apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
>     cd /root/crash_dump/
>     gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace
>
> * This left me with the attached backtraces (which I think are wrong, as I
> see a lot of ?? yet gdb says
> /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)
>
> kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
> kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY

Try running ceph-mds inside gdb. It should be easy to locate the bug once we have a correct coredump file.

Regards,
Yan, Zheng

> The log files are pretty large (one 4.1GB and the other 200MB):
>
> kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
> kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
>
> On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly wrote:
>> Hello Sean,
>>
>> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan wrote:
>> > I was creating a new user and mount point. On another hardware node I
>> > mounted CephFS as admin to mount as root. I created /aufstest and then
>> > unmounted. From there it seems that both of my mds nodes crashed for some
>> > reason and I can't start them any more.
>> >
>> > https://pastebin.com/1ZgkL9fa -- my mds log
>> >
>> > I have never had this happen in my tests, so now I have live data here.
>> > If anyone can lend a hand or point me in the right direction while
>> > troubleshooting, that would be a godsend!
>>
>> Thanks for keeping the list apprised of your efforts. Since this is so
>> easily reproduced for you, I would suggest that you next get higher
>> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
>> a segmentation fault, a backtrace with debug symbols from gdb would
>> also be helpful.
>>
>> --
>> Patrick Donnelly
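Zheng's suggestion, expanded into a hedged sketch: running the MDS in the foreground under gdb in batch mode turns the next segfault into a full backtrace automatically. The paths and unit names assume the Ubuntu packages used earlier in the thread, and the MDS id is assumed to equal the host's short name; the block only builds and prints the commands to run on the MDS node, since they need a live cluster:

```shell
# Build the commands for running ceph-mds under gdb on the MDS node.
# Assumptions: Ubuntu ceph packages (/usr/bin/ceph-mds, ceph-mds@ systemd
# unit) and an MDS id equal to the host's short name, as in this thread.
MDS_ID=$(hostname -s 2>/dev/null || echo kh09-8)
GDB_CMD="gdb -batch -ex run -ex 'thread apply all bt' --args /usr/bin/ceph-mds -f -i ${MDS_ID}"

# Print them; on the node itself you would stop the unit and run GDB_CMD.
echo "systemctl stop ceph-mds@${MDS_ID}"
echo "$GDB_CMD"
```

Batch mode (`-batch -ex run -ex 'thread apply all bt'`) avoids an interactive session: gdb runs the daemon, and when it crashes the backtrace of every thread is written to the terminal or log.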
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Most of this is over my head, but the last line of the logs on both MDS servers shows something similar to:

    0> 2018-05-01 15:37:46.871932 7fd10163b700 -1 *** Caught signal (Segmentation fault) **
    in thread 7fd10163b700 thread_name:mds_rank_progr

When I search for this in the ceph-users and ceph-devel mailing lists, the only mention I can see is from 12.0.3:

https://marc.info/?l=ceph-devel=149726392820648=2 -- ceph-devel

I don't see any mention of journal.cc in my logs, however, so I hope they are not related. I also have not experienced any major loss in my cluster as of yet, and cephfs-journal-tool shows my journals as healthy.

To trigger this bug I created a cephfs directory and user called aufstest. Here is the part of the log with the crash mentioning aufstest:

https://pastebin.com/EL5ALLuE

I created a new bug ticket on ceph.com with all of the current info, as I believe this isn't a problem with my setup specifically and anyone else trying this will have the same issue:

https://tracker.ceph.com/issues/23972

I hope this is the correct path. If anyone can guide me in the right direction for troubleshooting this further, I would be grateful.

On Tue, May 1, 2018 at 6:19 PM, Sean Sullivan wrote:
> Forgot to reply to all:
>
> Sure thing!
>
> I couldn't install the ceph-mds-dbg packages without upgrading. I just
> finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.
>
> From here I'm not really sure how to generate the backtrace, so I hope I
> did it right. For others on Ubuntu, this is what I did:
>
> * First, up debug_mds to 20 and debug_ms to 1:
>     ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
>
> * Install the debug packages (ceph-mds-dbg in my case).
>
> * I also added these options to /etc/ceph/ceph.conf just in case they
> restart.
>
> * Now allow pids to dump (stolen partly from Red Hat docs and partly from
> Ubuntu):
>     echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
>     sysctl fs.suid_dumpable=2
>     sysctl kernel.core_pattern=/tmp/core
>     systemctl daemon-reload
>     systemctl restart ceph-mds@$(hostname -s)
>
> * A crash was created in /var/crash by apport, but gdb can't read it. I used
> apport-unpack and then ran gdb on what is inside:
>     apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
>     cd /root/crash_dump/
>     gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace
>
> * This left me with the attached backtraces (which I think are wrong, as I
> see a lot of ?? yet gdb says
> /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)
>
> kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
> kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
>
> The log files are pretty large (one 4.1GB and the other 200MB):
>
> kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
> kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
>
> On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly wrote:
>> Hello Sean,
>>
>> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan wrote:
>> > I was creating a new user and mount point. On another hardware node I
>> > mounted CephFS as admin to mount as root. I created /aufstest and then
>> > unmounted. From there it seems that both of my mds nodes crashed for some
>> > reason and I can't start them any more.
>> >
>> > https://pastebin.com/1ZgkL9fa -- my mds log
>> >
>> > I have never had this happen in my tests, so now I have live data here.
>> > If anyone can lend a hand or point me in the right direction while
>> > troubleshooting, that would be a godsend!
>>
>> Thanks for keeping the list apprised of your efforts. Since this is so
>> easily reproduced for you, I would suggest that you next get higher
>> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
>> a segmentation fault, a backtrace with debug symbols from gdb would
>> also be helpful.
>>
>> --
>> Patrick Donnelly
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Forgot to reply to all:

Sure thing!

I couldn't install the ceph-mds-dbg packages without upgrading. I just finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.

From here I'm not really sure how to generate the backtrace, so I hope I did it right. For others on Ubuntu, this is what I did:

* First, up debug_mds to 20 and debug_ms to 1:

    ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'

* Install the debug packages (ceph-mds-dbg in my case).

* I also added these options to /etc/ceph/ceph.conf just in case they restart.

* Now allow pids to dump (stolen partly from Red Hat docs and partly from Ubuntu):

    echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
    sysctl fs.suid_dumpable=2
    sysctl kernel.core_pattern=/tmp/core
    systemctl daemon-reload
    systemctl restart ceph-mds@$(hostname -s)

* A crash was created in /var/crash by apport, but gdb can't read it. I used apport-unpack and then ran gdb on what is inside:

    apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
    cd /root/crash_dump/
    gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace

* This left me with the attached backtraces (which I think are wrong, as I see a lot of ?? yet gdb says /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded):

kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY

The log files are pretty large (one 4.1GB and the other 200MB):

kh10-8 (200MB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
kh09-8 (4.1GB) mds log -- https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log

On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly wrote:
> Hello Sean,
>
> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan wrote:
> > I was creating a new user and mount point. On another hardware node I
> > mounted CephFS as admin to mount as root. I created /aufstest and then
> > unmounted. From there it seems that both of my mds nodes crashed for some
> > reason and I can't start them any more.
> >
> > https://pastebin.com/1ZgkL9fa -- my mds log
> >
> > I have never had this happen in my tests, so now I have live data here. If
> > anyone can lend a hand or point me in the right direction while
> > troubleshooting, that would be a godsend!
>
> Thanks for keeping the list apprised of your efforts. Since this is so
> easily reproduced for you, I would suggest that you next get higher
> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
> a segmentation fault, a backtrace with debug symbols from gdb would
> also be helpful.
>
> --
> Patrick Donnelly
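The core-dump preparation repeated through this thread can be collected into one place. A dry-run sketch: it stages the systemd setting in a temporary file instead of /etc/systemd/system.conf (so it is safe to execute anywhere) and only prints the privileged commands; on a real MDS node you would append to the real file and run the printed commands directly:

```shell
# Stage the systemd core-dump settings in a temp dir (a stand-in for
# /etc/systemd/system.conf) and print the commands to run on the node.
STAGE=$(mktemp -d)
printf 'DefaultLimitCORE=infinity\nPrivateTmp=true\n' >> "${STAGE}/system.conf"

# Privileged commands, printed rather than executed here:
cat <<'EOF'
sysctl fs.suid_dumpable=2
sysctl kernel.core_pattern=/tmp/core
systemctl daemon-reload
systemctl restart ceph-mds@$(hostname -s)
EOF

grep -c 'DefaultLimitCORE=infinity' "${STAGE}/system.conf"
```

With `kernel.core_pattern=/tmp/core` in place, the core lands in /tmp/core (as Zheng notes) rather than going through apport, so gdb can read it directly.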
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Hello Sean,

On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan wrote:
> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests, so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting, that would be a godsend!

Thanks for keeping the list apprised of your efforts. Since this is so easily reproduced for you, I would suggest that you next get higher debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is a segmentation fault, a backtrace with debug symbols from gdb would also be helpful.

--
Patrick Donnelly
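Patrick's debug-level suggestion as commands (a sketch; the restore values 1/5 and 0/5 are my assumption of the Luminous defaults, not stated in the thread). They are built and printed rather than executed, since they need a cluster admin keyring:

```shell
# Raise MDS and messenger debug levels before reproducing, then restore them.
RAISE="ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'"
RESTORE="ceph tell mds.* injectargs '--debug-mds 1/5 --debug-ms 0/5'"
echo "$RAISE"    # run before triggering the crash
echo "$RESTORE"  # run afterwards; debug 20 logs grow very fast (4.1GB above)
```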
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
I forgot that I left my VM mount command running. It hangs my VM, but more alarming is that it crashes my MDS servers on the ceph cluster. The ceph cluster is all hardware nodes, and the OpenStack VM does not have an admin keyring (although the cephx keyring generated for cephfs does have write permissions to the ec42 pool).

The layout:

* Luminous CephFS cluster, version 12.2.4, Ubuntu 16.04, kernel 4.10.0-38-generic on all hardware nodes
* kh08-8: ceph mon server; kh09-8: monitor + Ceph MDS A; kh10-8: monitor + Ceph MDS failover
* Client: OpenStack VM, Ubuntu 16.04, kernel 4.13.0-39-generic, CephFS via the kernel client
* ec42: CephFS data pool, 16384 PGs, erasure coded with a 4/2 profile
* cephfs_metadata: CephFS metadata pool, 4096 PGs, replicated (n=3)

As far as I am aware this shouldn't happen. I will try upgrading as soon as I can, but I didn't see anything like this mentioned in the changelog and am worried this will still exist in 12.2.5. Has anyone seen this before?

On Mon, Apr 30, 2018 at 7:24 PM, Sean Sullivan wrote:
> So I think I can reliably reproduce this crash from a ceph client.
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
>     id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
>     mgr: kh08-8(active)
>     mds: cephfs-1/1/1 up {0=kh09-8=up:active}, 1 up:standby
>     osd: 570 osds: 570 up, 570 in
> ```
>
> Then from a client, try to mount aufs over cephfs:
> ```
> mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
> ```
>
> Now watch as your ceph mds servers fail:
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
>     id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
>     health: HEALTH_WARN
>             insufficient standby MDS daemons available
>
>   services:
>     mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
>     mgr: kh08-8(active)
>     mds: cephfs-1/1/1 up {0=kh10-8=up:active(laggy or crashed)}
> ```
>
> I am now stuck in a degraded state and can't seem to get them to start again.
>
> On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan wrote:
>> I had 2 MDS servers (one active, one standby) and both were down. I took a
>> dumb chance and marked the active as down (it said it was up but laggy).
>> Then I started the primary again, and now both are back up. I have never
>> seen this before, and I am also not sure of what I just did.
>>
>> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan wrote:
>>> I was creating a new user and mount point. On another hardware node I
>>> mounted CephFS as admin to mount as root. I created /aufstest and then
>>> unmounted. From there it seems that both of my mds nodes crashed for some
>>> reason and I can't start them any more.
>>>
>>> https://pastebin.com/1ZgkL9fa -- my mds log
>>>
>>> I have never had this happen in my tests, so now I have live data here.
>>> If anyone can lend a hand or point me in the right direction while
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
So I think I can reliably reproduce this crash from a ceph client.

```
root@kh08-8:~# ceph -s
  cluster:
    id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up {0=kh09-8=up:active}, 1 up:standby
    osd: 570 osds: 570 up, 570 in
```

Then from a client, try to mount aufs over cephfs:

```
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
```

Now watch as your ceph mds servers fail:

```
root@kh08-8:~# ceph -s
  cluster:
    id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
    health: HEALTH_WARN
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
    mgr: kh08-8(active)
    mds: cephfs-1/1/1 up {0=kh10-8=up:active(laggy or crashed)}
```

I am now stuck in a degraded state and can't seem to get them to start again.

On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan wrote:
> I had 2 MDS servers (one active, one standby) and both were down. I took a
> dumb chance and marked the active as down (it said it was up but laggy).
> Then I started the primary again, and now both are back up. I have never
> seen this before, and I am also not sure of what I just did.
>
> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan wrote:
>> I was creating a new user and mount point. On another hardware node I
>> mounted CephFS as admin to mount as root. I created /aufstest and then
>> unmounted. From there it seems that both of my mds nodes crashed for some
>> reason and I can't start them any more.
>>
>> https://pastebin.com/1ZgkL9fa -- my mds log
>>
>> I have never had this happen in my tests, so now I have live data here.
>> If anyone can lend a hand or point me in the right direction while
>> troubleshooting, that would be a godsend!
>>
>> I tried cephfs-journal-tool inspect and it reports that the journal
>> should be fine. I am not sure why it's crashing:
>>
>> /home/lacadmin# cephfs-journal-tool journal inspect
>> Overall journal integrity: OK
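An aside, not from the thread: the `journal inspect` check quoted above is easy to script when repeatedly testing which client actions damage (or spare) the journal. A sketch against the output format shown (`check_journal` is a hypothetical helper name):

```shell
# Read `cephfs-journal-tool journal inspect` output on stdin and report.
check_journal() {
    if grep -q '^Overall journal integrity: OK$'; then
        echo "journal OK"
    else
        echo "journal DAMAGED"
    fi
}

# Sample taken from the thread; on a live cluster you would pipe in:
#   cephfs-journal-tool journal inspect | check_journal
printf 'Overall journal integrity: OK\n' | check_journal
```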
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
I had 2 MDS servers (one active, one standby) and both were down. I took a dumb chance and marked the active as down (it said it was up but laggy). Then I started the primary again, and now both are back up. I have never seen this before, and I am also not sure of what I just did.

On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan wrote:
> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests, so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting, that would be a godsend!
>
> I tried cephfs-journal-tool inspect and it reports that the journal should
> be fine. I am not sure why it's crashing:
>
> /home/lacadmin# cephfs-journal-tool journal inspect
> Overall journal integrity: OK