Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-23 Thread Yan, Zheng
On Thu, May 24, 2018 at 12:00 AM, Sean Sullivan  wrote:
> Thanks Yan! I did this for the bug ticket and missed these replies. I hope I
> did it correctly. Here are the pastes of the dumps:
>
> https://pastebin.com/kw4bZVZT -- primary
> https://pastebin.com/sYZQx0ER -- secondary
>
>
> They are not that long; here is the output of one:
>
> Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fe3b100a700 (LWP 120481)]
> 0x5617aacc48c2 in Server::handle_client_getattr
> (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at
> /build/ceph-12.2.5/src/mds/Server.cc:3065
> 3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
> (gdb) t
> [Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
> (gdb) bt
> #0  0x5617aacc48c2 in Server::handle_client_getattr
> (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at
> /build/ceph-12.2.5/src/mds/Server.cc:3065
> #1  0x5617aacfc98b in Server::dispatch_client_request
> (this=this@entry=0x5617b5acbcd0, mdr=...) at
> /build/ceph-12.2.5/src/mds/Server.cc:1802
> #2  0x5617aacfce9b in Server::handle_client_request
> (this=this@entry=0x5617b5acbcd0, req=req@entry=0x5617bdfa8700) at
> /build/ceph-12.2.5/src/mds/Server.cc:1716
> #3  0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0,
> m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:258
> #4  0x5617aac6afac in MDSRank::handle_deferrable_message
> (this=this@entry=0x5617b5d22000, m=m@entry=0x5617bdfa8700) at
> /build/ceph-12.2.5/src/mds/MDSRank.cc:716
> #5  0x5617aac795cb in MDSRank::_dispatch
> (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700,
> new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
> #6  0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000,
> m=<optimized out>) at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
> #7  0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080) at
> /build/ceph-12.2.5/src/include/Context.h:70
> #8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0) at
> /build/ceph-12.2.5/src/mds/MDSContext.cc:30
> #9  0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000) at
> /build/ceph-12.2.5/src/mds/MDSRank.cc:776
> #10 0x5617aac7921a in MDSRank::ProgressThread::entry
> (this=0x5617b5d22d40) at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
> #11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at
> pthread_create.c:333
> #12 0x7fe3ba37241d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>
>
>
> I:
> * set the debug level to mds=20 mon=1,
> * attached gdb prior to trying to mount aufs from a separate client,
> * typed continue, attempted the mount,
> * then backtraced after it seg faulted.
>
> I hope this is more helpful. Is there something else I should try to get
> more info? I was hoping for something closer to a Python trace, where it
> tells you a variable is the wrong type or a delimiter is missing. womp. I am
> definitely out of my depth, but now is a great time to learn! Can anyone
> shed some more light as to what may be wrong?
>

I updated https://tracker.ceph.com/issues/23972. It's a kernel bug: the
kernel client sends a malformed request to the MDS.

Regards
Yan, Zheng


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-23 Thread Sean Sullivan
Thanks Yan! I did this for the bug ticket and missed these replies. I hope
I did it correctly. Here are the pastes of the dumps:

https://pastebin.com/kw4bZVZT -- primary
https://pastebin.com/sYZQx0ER -- secondary


They are not that long; here is the output of one:


Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe3b100a700 (LWP 120481)]
0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
(gdb) t
[Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
(gdb) bt
#0  0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
#1  0x5617aacfc98b in Server::dispatch_client_request (this=this@entry=0x5617b5acbcd0, mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
#2  0x5617aacfce9b in Server::handle_client_request (this=this@entry=0x5617b5acbcd0, req=req@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:1716
#3  0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:258
#4  0x5617aac6afac in MDSRank::handle_deferrable_message (this=this@entry=0x5617b5d22000, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/MDSRank.cc:716
#5  0x5617aac795cb in MDSRank::_dispatch (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700, new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
#6  0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, m=<optimized out>) at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
#7  0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080) at /build/ceph-12.2.5/src/include/Context.h:70
#8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0) at /build/ceph-12.2.5/src/mds/MDSContext.cc:30
#9  0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000) at /build/ceph-12.2.5/src/mds/MDSRank.cc:776
#10 0x5617aac7921a in MDSRank::ProgressThread::entry (this=0x5617b5d22d40) at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
#11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at pthread_create.c:333
#12 0x7fe3ba37241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109



I:
* set the debug level to mds=20 mon=1,
* attached gdb prior to trying to mount aufs from a separate client,
* typed continue, attempted the mount,
* then backtraced after it seg faulted (a condensed sketch of this workflow is below).
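
For anyone following along, a condensed sketch of that capture workflow (the
PID lookup is illustrative and assumes a single ceph-mds process on the host):

```
# Raise MDS debug logging on all MDS daemons.
ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'

# Attach gdb to the running MDS before triggering the crash.
gdb -p $(pidof ceph-mds)
(gdb) continue

# From the client, attempt the aufs-over-cephfs mount; once the
# SIGSEGV is caught, grab the backtrace:
(gdb) bt
```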

I hope this is more helpful. Is there something else I should try to get
more info? I was hoping for something closer to a Python trace, where it
tells you a variable is the wrong type or a delimiter is missing. womp. I am
definitely out of my depth, but now is a great time to learn! Can anyone
shed some more light as to what may be wrong?



On Fri, May 4, 2018 at 7:49 PM, Yan, Zheng  wrote:

> On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan  wrote:
> > Forgot to reply to all:
> >
> > Sure thing!
> >
> > I couldn't install the ceph-mds-dbg packages without upgrading. I just
> > finished upgrading the cluster to 12.2.5. The issue still persists in
> > 12.2.5.
> >
> > From here I'm not really sure how to generate the backtrace, so I hope I
> > did it right. For others on Ubuntu, this is what I did:
> >
> > * firstly up the debug_mds to 20 and debug_ms to 1:
> > ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
> >
> > * install the debug packages
> > ceph-mds-dbg in my case
> >
> > * I also added these options to /etc/ceph/ceph.conf just in case they
> > restart.
> >
> > * Now allow pids to dump (stolen partly from redhat docs and partly from
> > ubuntu)
> > echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
> > /etc/systemd/system.conf
> > sysctl fs.suid_dumpable=2
> > sysctl kernel.core_pattern=/tmp/core
> > systemctl daemon-reload
> > systemctl restart ceph-mds@$(hostname -s)
> >
> > * A crash was created in /var/crash by apport, but gdb can't read it. I
> > used apport-unpack and then ran GDB on what is inside:
> >
>
The core dump should be in /tmp/core.
>
> > apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> > cd /root/crash_dump/
> > gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
> > /root/ceph_mds_$(hostname -s)_backtrace
> >
> > * This left me with the attached backtraces (which I think are wrong, as I
> > see a lot of ??, yet gdb says
> > /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug
> > was loaded)
> >
> >  kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
> >  kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
> >
>

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-04 Thread Yan, Zheng
On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan  wrote:
> Forgot to reply to all:
>
> Sure thing!
>
> I couldn't install the ceph-mds-dbg packages without upgrading. I just
> finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.
>
> From here I'm not really sure how to generate the backtrace, so I hope I
> did it right. For others on Ubuntu, this is what I did:
>
> * firstly up the debug_mds to 20 and debug_ms to 1:
> ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
>
> * install the debug packages
> ceph-mds-dbg in my case
>
> * I also added these options to /etc/ceph/ceph.conf just in case they
> restart.
>
> * Now allow pids to dump (stolen partly from redhat docs and partly from
> ubuntu)
> echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
> /etc/systemd/system.conf
> sysctl fs.suid_dumpable=2
> sysctl kernel.core_pattern=/tmp/core
> systemctl daemon-reload
> systemctl restart ceph-mds@$(hostname -s)
>
> * A crash was created in /var/crash by apport, but gdb can't read it. I used
> apport-unpack and then ran GDB on what is inside:
>

The core dump should be in /tmp/core.

> apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> cd /root/crash_dump/
> gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
> /root/ceph_mds_$(hostname -s)_backtrace
>
> * This left me with the attached backtraces (which I think are wrong, as I
> see a lot of ??, yet gdb says
> /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was
> loaded)
>
>  kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
>  kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
>

Try running ceph-mds inside gdb. It should be easy to locate the bug
once we have a correct core dump file.
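
A rough sketch of that on Ubuntu (a suggestion, not a verified recipe; the
daemon arguments mirror the stock systemd unit, so check the output of
systemctl cat for the exact command line your unit runs):

```
# Stop the supervised daemon so systemd doesn't respawn it behind gdb.
systemctl stop ceph-mds@$(hostname -s)

# Run the MDS in the foreground under gdb and reproduce the crash.
gdb --args /usr/bin/ceph-mds -f --id $(hostname -s) --setuser ceph --setgroup ceph
(gdb) run
(gdb) thread apply all bt   # after the segfault
```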

Regards
Yan, Zheng


>
> The log files are pretty large (one 4.1G and the other 200MB)
>
> kh10-8 (200MB) mds log --
> https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
> kh09-8 (4.1GB) mds log --
> https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
>
> On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly 
> wrote:
>>
>> Hello Sean,
>>
>> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan 
>> wrote:
>> > I was creating a new user and mount point. On another hardware node I
>> > mounted CephFS as admin to mount as root. I created /aufstest and then
>> > unmounted. From there it seems that both of my mds nodes crashed for
>> > some
>> > reason and I can't start them any more.
>> >
>> > https://pastebin.com/1ZgkL9fa -- my mds log
>> >
>> > I have never had this happen in my tests so now I have live data here.
>> > If
>> > anyone can lend a hand or point me in the right direction while
>> > troubleshooting that would be a godsend!
>>
>> Thanks for keeping the list apprised of your efforts. Since this is so
>> easily reproduced for you, I would suggest that you next get higher
>> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
>> a segmentation fault, a backtrace with debug symbols from gdb would
>> also be helpful.
>>
>> --
>> Patrick Donnelly
>
>
>


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-04 Thread Sean Sullivan
Most of this is over my head, but the last lines of the logs on both mds
servers show something similar to:

 0> 2018-05-01 15:37:46.871932 7fd10163b700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7fd10163b700 thread_name:mds_rank_progr

When I search for this in the ceph-users and ceph-devel mailing lists, the
only mention I can see is from 12.0.3:

https://marc.info/?l=ceph-devel&m=149726392820648&w=2 -- ceph-devel

I don't see any mention of journal.cc in my logs, however, so I hope they
are not related. I also have not experienced any major loss in my cluster as
of yet, and cephfs-journal-tool shows my journals as healthy. To trigger this
bug I created a cephfs directory and user called aufstest. Here is the part
of the log with the crash mentioning aufstest:

https://pastebin.com/EL5ALLuE



I created a new bug ticket on ceph.com with all of the current info, as I
believe this isn't a problem with my setup specifically and anyone else
trying this will hit the same issue:
https://tracker.ceph.com/issues/23972

I hope this is the correct path. If anyone can guide me in the right
direction for troubleshooting this further I would be grateful.

On Tue, May 1, 2018 at 6:19 PM, Sean Sullivan  wrote:

> Forgot to reply to all:
>
>
> Sure thing!
>
> I couldn't install the ceph-mds-dbg packages without upgrading. I just
> finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.
>
> From here I'm not really sure how to generate the backtrace, so I hope I
> did it right. For others on Ubuntu, this is what I did:
>
> * firstly up the debug_mds to 20 and debug_ms to 1:
> ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
>
> * install the debug packages
> ceph-mds-dbg in my case
>
> * I also added these options to /etc/ceph/ceph.conf just in case they
> restart.
>
> * Now allow pids to dump (stolen partly from redhat docs and partly from
> ubuntu)
> echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
> /etc/systemd/system.conf
> sysctl fs.suid_dumpable=2
> sysctl kernel.core_pattern=/tmp/core
> systemctl daemon-reload
> systemctl restart ceph-mds@$(hostname -s)
>
> * A crash was created in /var/crash by apport, but gdb can't read it. I used
> apport-unpack and then ran GDB on what is inside:
>
> apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> cd /root/crash_dump/
> gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
> /root/ceph_mds_$(hostname -s)_backtrace
>
> * This left me with the attached backtraces (which I think are wrong, as I
> see a lot of ??, yet gdb says
> /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)
>
>  kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
>  kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
>
>
> The log files are pretty large (one 4.1G and the other 200MB)
>
> kh10-8 (200MB) mds log --
> https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
> kh09-8 (4.1GB) mds log --
> https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log
>
> On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly 
> wrote:
>
>> Hello Sean,
>>
>> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan 
>> wrote:
>> > I was creating a new user and mount point. On another hardware node I
>> > mounted CephFS as admin to mount as root. I created /aufstest and then
>> > unmounted. From there it seems that both of my mds nodes crashed for
>> some
>> > reason and I can't start them any more.
>> >
>> > https://pastebin.com/1ZgkL9fa -- my mds log
>> >
>> > I have never had this happen in my tests so now I have live data here.
>> If
>> > anyone can lend a hand or point me in the right direction while
>> > troubleshooting that would be a godsend!
>>
>> Thanks for keeping the list apprised of your efforts. Since this is so
>> easily reproduced for you, I would suggest that you next get higher
>> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
>> a segmentation fault, a backtrace with debug symbols from gdb would
>> also be helpful.
>>
>> --
>> Patrick Donnelly
>>
>
>


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-05-01 Thread Sean Sullivan
Forgot to reply to all:

Sure thing!

I couldn't install the ceph-mds-dbg packages without upgrading. I just
finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5.

From here I'm not really sure how to generate the backtrace, so I hope I
did it right. For others on Ubuntu, this is what I did:

* firstly up the debug_mds to 20 and debug_ms to 1:
ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'

* install the debug packages
ceph-mds-dbg in my case

* I also added these options to /etc/ceph/ceph.conf just in case they
restart.
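
A sketch of what those persisted settings might look like in
/etc/ceph/ceph.conf (placing them under [mds] is an assumption; the global
section would also work):

```
[mds]
debug mds = 20
debug ms = 1
```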

* Now allow pids to dump (stolen partly from redhat docs and partly from
ubuntu)
echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a
/etc/systemd/system.conf
sysctl fs.suid_dumpable=2
sysctl kernel.core_pattern=/tmp/core
systemctl daemon-reload
systemctl restart ceph-mds@$(hostname -s)
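
As an aside, a narrower way to get the same core limit (a sketch, assuming
the stock systemd packaging on Ubuntu 16.04) is a drop-in scoped to the MDS
unit instead of a global system.conf change:

```
mkdir -p /etc/systemd/system/ceph-mds@.service.d
cat > /etc/systemd/system/ceph-mds@.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
systemctl daemon-reload
systemctl restart ceph-mds@$(hostname -s)
```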

* A crash was created in /var/crash by apport, but gdb can't read it. I used
apport-unpack and then ran GDB on what is inside:

apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
cd /root/crash_dump/
gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee
/root/ceph_mds_$(hostname -s)_backtrace

* This left me with the attached backtraces (which I think are wrong, as I
see a lot of ??, yet gdb says
/usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was loaded)

 kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
 kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY


The log files are pretty large (one 4.1G and the other 200MB)

kh10-8 (200MB) mds log --
https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh10-8.log
kh09-8 (4.1GB) mds log --
https://griffin-objstore.opensciencedatacloud.org/logs/ceph-mds.kh09-8.log

On Tue, May 1, 2018 at 12:09 AM, Patrick Donnelly 
wrote:

> Hello Sean,
>
> On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan 
> wrote:
> > I was creating a new user and mount point. On another hardware node I
> > mounted CephFS as admin to mount as root. I created /aufstest and then
> > unmounted. From there it seems that both of my mds nodes crashed for some
> > reason and I can't start them any more.
> >
> > https://pastebin.com/1ZgkL9fa -- my mds log
> >
> > I have never had this happen in my tests so now I have live data here. If
> > anyone can lend a hand or point me in the right direction while
> > troubleshooting that would be a godsend!
>
> Thanks for keeping the list apprised of your efforts. Since this is so
> easily reproduced for you, I would suggest that you next get higher
> debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
> a segmentation fault, a backtrace with debug symbols from gdb would
> also be helpful.
>
> --
> Patrick Donnelly
>


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Patrick Donnelly
Hello Sean,

On Mon, Apr 30, 2018 at 2:32 PM, Sean Sullivan  wrote:
> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting that would be a godsend!

Thanks for keeping the list apprised of your efforts. Since this is so
easily reproduced for you, I would suggest that you next get higher
debug logs (debug_mds=20/debug_ms=1) from the MDS. And, since this is
a segmentation fault, a backtrace with debug symbols from gdb would
also be helpful.

-- 
Patrick Donnelly


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
I forgot that I left my VM mount command running. It hangs my VM, but more
alarming is that it crashes my MDS servers on the ceph cluster. The ceph
cluster is all hardware nodes, and the openstack VM does not have an admin
keyring (although the cephx keyring generated for cephfs does have write
permissions to the ec42 pool).


Luminous CephFS cluster (all hardware nodes):
* version 12.2.4, Ubuntu 16.04, kernel 4.10.0-38-generic

Client:
* Openstack VM, Ubuntu 16.04, kernel 4.13.0-39-generic, CephFS via the kernel client

Monitor / MDS layout:
* kh08-8 -- Ceph Monitor A (mon server)
* kh09-8 -- Ceph Monitor B, Ceph MDS A
* kh10-8 -- Ceph Monitor C, Ceph MDS failover

Pools:
* ec42            -- CephFS data pool, erasure coded (4/2 profile), 16384 PGs
* cephfs_metadata -- CephFS metadata pool, replicated (n=3), 4096 PGs
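
For anyone trying to mirror this layout, a hypothetical set of commands that
would produce roughly these pools (the profile and filesystem names are
assumptions; on Luminous an erasure-coded CephFS data pool also needs
overwrites enabled, and ceph fs new wants --force for an EC data pool):

```
ceph osd erasure-code-profile set ec42profile k=4 m=2
ceph osd pool create ec42 16384 16384 erasure ec42profile
ceph osd pool set ec42 allow_ec_overwrites true
ceph osd pool create cephfs_metadata 4096 4096 replicated
ceph fs new cephfs cephfs_metadata ec42 --force
```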

As far as I am aware, this shouldn't happen. I will try upgrading as soon as
I can, but I didn't see anything like this mentioned in the change log and
am worried this will still exist in 12.2.5. Has anyone seen this before?


On Mon, Apr 30, 2018 at 7:24 PM, Sean Sullivan  wrote:

> So I think I can reliably reproduce this crash from a ceph client.
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
> id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
> mgr: kh08-8(active)
> mds: cephfs-1/1/1 up  {0=kh09-8=up:active}, 1 up:standby
> osd: 570 osds: 570 up, 570 in
> ```
>
>
> then from a client try to mount aufs over cephfs:
> ```
> mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
> ```
>
> Now watch as your ceph mds servers fail:
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
> id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
> health: HEALTH_WARN
> insufficient standby MDS daemons available
>
>   services:
> mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
> mgr: kh08-8(active)
> mds: cephfs-1/1/1 up  {0=kh10-8=up:active(laggy or crashed)}
> ```
>
>
> I am now stuck in a degraded state and can't seem to get them to start again.
>
> On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan 
> wrote:
>
>> I had 2 MDS servers (one active, one standby) and both were down. I took a
>> dumb chance and marked the active as down (it said it was up but laggy).
>> Then started the primary again, and now both are back up. I have never seen
>> this before, and I am also not sure what I just did.
>>
>> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan 
>> wrote:
>>
>>> I was creating a new user and mount point. On another hardware node I
>>> mounted CephFS as admin to mount as root. I created /aufstest and then
>>> unmounted. From there it seems that both of my mds nodes crashed for some
>>> reason and I can't start them any more.
>>>
>>> https://pastebin.com/1ZgkL9fa -- my mds log
>>>
>>> I have never had this happen in my tests so now I have live data here.
>>> If anyone can lend a hand or point me in the right direction while
>>> 

Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
So I think I can reliably reproduce this crash from a ceph client.

```
root@kh08-8:~# ceph -s
  cluster:
id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
health: HEALTH_OK

  services:
mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
mgr: kh08-8(active)
mds: cephfs-1/1/1 up  {0=kh09-8=up:active}, 1 up:standby
osd: 570 osds: 570 up, 570 in
```


then from a client try to mount aufs over cephfs:
```
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
```

Now watch as your ceph mds servers fail:

```
root@kh08-8:~# ceph -s
  cluster:
id: 9f58ee5a-7c5d-4d68-81ee-debe16322544
health: HEALTH_WARN
insufficient standby MDS daemons available

  services:
mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
mgr: kh08-8(active)
mds: cephfs-1/1/1 up  {0=kh10-8=up:active(laggy or crashed)}
```
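
One quick way to confirm the daemon actually segfaulted rather than just
going laggy (a sketch; the log path is the Ubuntu default and may differ on
your install):

```
grep -B1 -A2 'Caught signal' /var/log/ceph/ceph-mds.$(hostname -s).log | tail -n 8
```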


I am now stuck in a degraded state and can't seem to get them to start again.

On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan  wrote:

> I had 2 MDS servers (one active, one standby) and both were down. I took a
> dumb chance and marked the active as down (it said it was up but laggy).
> Then started the primary again, and now both are back up. I have never seen
> this before, and I am also not sure what I just did.
>
> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan 
> wrote:
>
>> I was creating a new user and mount point. On another hardware node I
>> mounted CephFS as admin to mount as root. I created /aufstest and then
>> unmounted. From there it seems that both of my mds nodes crashed for some
>> reason and I can't start them any more.
>>
>> https://pastebin.com/1ZgkL9fa -- my mds log
>>
>> I have never had this happen in my tests so now I have live data here. If
>> anyone can lend a hand or point me in the right direction while
>> troubleshooting that would be a godsend!
>>
>> I tried cephfs-journal-tool inspect and it reports that the journal
>> should be fine. I am not sure why it's crashing:
>>
>> /home/lacadmin# cephfs-journal-tool journal inspect
>> Overall journal integrity: OK
>>
>>
>>
>>
>


Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.

2018-04-30 Thread Sean Sullivan
I had 2 MDS servers (one active, one standby) and both were down. I took a
dumb chance and marked the active as down (it said it was up but laggy).
Then started the primary again, and now both are back up. I have never seen
this before, and I am also not sure what I just did.
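
For the archive, that recovery presumably amounted to something like the
following (a reconstruction, not the exact commands; the rank number and
unit name are assumptions):

```
# Mark the laggy active rank as failed so the mons stop waiting on it.
ceph mds fail 0

# Restart the daemon on the former active and watch it rejoin.
systemctl restart ceph-mds@kh09-8
ceph -s
```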

On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan  wrote:

> I was creating a new user and mount point. On another hardware node I
> mounted CephFS as admin to mount as root. I created /aufstest and then
> unmounted. From there it seems that both of my mds nodes crashed for some
> reason and I can't start them any more.
>
> https://pastebin.com/1ZgkL9fa -- my mds log
>
> I have never had this happen in my tests so now I have live data here. If
> anyone can lend a hand or point me in the right direction while
> troubleshooting that would be a godsend!
>
> I tried cephfs-journal-tool inspect and it reports that the journal should
> be fine. I am not sure why it's crashing:
>
> /home/lacadmin# cephfs-journal-tool journal inspect
> Overall journal integrity: OK
>
>
>
>