Re: MDS crash, wont startup again

2012-06-04 Thread Greg Farnum
On Thursday, May 24, 2012 at 5:29 AM, Felix Feinhals wrote:
> Hi,
>  
> I was using the Debian packages, but I have now tried building from source.
> I used the same version from Git
> (cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash
> report.
> Then I applied your patch, but I get the same crash again; I think the
> backtrace is also the same:
>  
> (gdb) thread 1
> [Switching to thread 1 (Thread 9564)]#0 0x7f33a3e58ebb in raise
> (sig=<value optimized out>)
> at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
> 41 in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
> (gdb) backtrace
> #0 0x7f33a3e58ebb in raise (sig=<value optimized out>)
> at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
> #1 0x0081423e in reraise_fatal (signum=11) at
> global/signal_handler.cc:58
> #2 handle_fatal_signal (signum=11) at global/signal_handler.cc:104
> #3 <signal handler called>
> #4 SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
> at mds/snap.cc:112
> #5 0x0055d58b in MDCache::check_realm_past_parents
> (this=0x27a7200, realm=0x0)
> at mds/MDCache.cc:4495
> #6 0x00572eec in
> MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200)
> at mds/MDCache.cc:4533
> #7 0x005931a0 in MDCache::rejoin_gather_finish
> (this=0x27a7200) at mds/MDCache.cc:
> #8 0x0059b9d5 in MDCache::rejoin_send_rejoins
> (this=0x27a7200) at mds/MDCache.cc:3388
> #9 0x004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at
> mds/MDS.cc:1404
> #10 0x004c253a in MDS::handle_mds_map (this=0x27bc000,
> m=<value optimized out>)
> at mds/MDS.cc:968
> #11 0x004c4513 in MDS::handle_core_message (this=0x27bc000,
> m=0x27ab800) at mds/MDS.cc:1651
> #12 0x004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800)
> at mds/MDS.cc:1790
> #13 0x004c628b in MDS::ms_dispatch (this=0x27bc000,
> m=0x27ab800) at mds/MDS.cc:1602
> #14 0x00732609 in Messenger::ms_deliver_dispatch
> (this=0x279f680) at msg/Messenger.h:178
> #15 SimpleMessenger::dispatch_entry (this=0x279f680) at
> msg/SimpleMessenger.cc:363
> #16 0x007207ad in SimpleMessenger::DispatchThread::entry() ()
> #17 0x7f33a3e508ca in start_thread (arg=<value optimized out>) at
> pthread_create.c:300
> #18 0x7f33a26d892d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #19 0x in ?? ()
>  
> Any more ideas? :)
> Or can I get you more debugging output?

Sorry for the delay — I'm afraid that's a hazard of using the MDS before we're 
ready to support it. :(
Anyway, I haven't had a lot of time to look into this, but that makes it look
like there's an actual problem, where one of the inodes can't find the
"SnapRealm" it lives in. Two things will make this easier to diagnose
(in the event that somebody gets the time):
1) Generate high-verbosity debug logs (start up the MDS with "debug mds =
20" added to the config file) and place them somewhere accessible.
2) If you want, try the patch below, which will cause the MDS to dump its
full inode cache upon triggering this bug, so we can see if there's
anything really obvious.
(This is a fine thing to make bug reports on at tracker.newdream.net, btw — and 
that allows attachments of things like log files.)
-Greg

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 143faca..6aa5923 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -4527,6 +4527,11 @@ void MDCache::choose_lock_states_and_reconnect_caps()
dout(15) << " chose lock states on " << *in << dendl;

SnapRealm *realm = in->find_snaprealm();
+ if (!realm) {
+ dout(0) << "serious error, could not find snaprealm for in " << *in
+ << ", triggering cache dump" << dendl;
+ dump_cache();
+ }

check_realm_past_parents(realm);
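
For readers following the backtrace: frame #4 shows this=0x0, meaning a
SnapRealm member function was invoked through a null pointer, and the patch
above logs that condition before the call. A minimal C++ sketch of the same
guard pattern follows. The Realm and Inode types and check_realm_guarded()
are hypothetical stand-ins, not Ceph's classes. Unlike the patch as posted
(which appears to add diagnostics only and still makes the call), this sketch
skips the call when no realm is found.

```cpp
#include <iostream>

// Hypothetical stand-ins; these are NOT Ceph's real SnapRealm or CInode types.
struct Realm {
  bool parents_open = true;
  // Reads a member, so calling this through a null pointer is undefined
  // behaviour and typically segfaults (compare this=0x0 in frame #4).
  bool have_past_parents_open() const { return parents_open; }
};

struct Inode {
  Realm *realm = nullptr;
  Realm *find_realm() const { return realm; }  // may return nullptr
};

// Returns false instead of dereferencing when no realm was found.
bool check_realm_guarded(const Inode &in) {
  Realm *realm = in.find_realm();
  if (!realm) {
    std::cerr << "could not find realm for inode, skipping check\n";
    return false;
  }
  return realm->have_past_parents_open();
}
```

Whether skipping the check is safe inside the real MDS is a separate question;
the sketch only shows where the null test has to sit relative to the call.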



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS crash, wont startup again

2012-05-24 Thread Felix Feinhals
Hi,

I was using the Debian packages, but I have now tried building from source.
I used the same version from Git
(cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash
report.
Then I applied your patch, but I get the same crash again; I think the
backtrace is also the same:

 (gdb) thread 1
[Switching to thread 1 (Thread 9564)]#0  0x7f33a3e58ebb in raise
(sig=<value optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
41  in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
(gdb) backtrace
#0  0x7f33a3e58ebb in raise (sig=<value optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
#1  0x0081423e in reraise_fatal (signum=11) at
global/signal_handler.cc:58
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:104
#3  <signal handler called>
#4  SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
at mds/snap.cc:112
#5  0x0055d58b in MDCache::check_realm_past_parents
(this=0x27a7200, realm=0x0)
at mds/MDCache.cc:4495
#6  0x00572eec in
MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200)
at mds/MDCache.cc:4533
#7  0x005931a0 in MDCache::rejoin_gather_finish
(this=0x27a7200) at mds/MDCache.cc:
#8  0x0059b9d5 in MDCache::rejoin_send_rejoins
(this=0x27a7200) at mds/MDCache.cc:3388
#9  0x004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at
mds/MDS.cc:1404
#10 0x004c253a in MDS::handle_mds_map (this=0x27bc000,
m=<value optimized out>)
at mds/MDS.cc:968
#11 0x004c4513 in MDS::handle_core_message (this=0x27bc000,
m=0x27ab800) at mds/MDS.cc:1651
#12 0x004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800)
at mds/MDS.cc:1790
#13 0x004c628b in MDS::ms_dispatch (this=0x27bc000,
m=0x27ab800) at mds/MDS.cc:1602
#14 0x00732609 in Messenger::ms_deliver_dispatch
(this=0x279f680) at msg/Messenger.h:178
#15 SimpleMessenger::dispatch_entry (this=0x279f680) at
msg/SimpleMessenger.cc:363
#16 0x007207ad in SimpleMessenger::DispatchThread::entry() ()
#17 0x7f33a3e508ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#18 0x7f33a26d892d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#19 0x in ?? ()

Any more ideas? :)
Or can I get you more debugging output?



2012/5/23 Gregory Farnum:
> On Wed, May 23, 2012 at 5:28 AM, Felix Feinhals wrote:
>> Hey,
>>
>> OK, I installed libc-dbg and ran your commands; now this comes up:
>>
>> gdb /usr/bin/ceph-mds core
>>
>> snip
>>
>> GNU gdb (GDB) 7.0.1-debian
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later 
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> ...
>> Reading symbols from /usr/bin/ceph-mds...Reading symbols from
>> /usr/lib/debug/usr/bin/ceph-mds...done.
>> (no debugging symbols found)...done.
>> [New Thread 22980]
>> [New Thread 22984]
>> [New Thread 22986]
>> [New Thread 22979]
>> [New Thread 22970]
>> [New Thread 22981]
>> [New Thread 22971]
>> [New Thread 22976]
>> [New Thread 22973]
>> [New Thread 22975]
>> [New Thread 22974]
>> [New Thread 22972]
>> [New Thread 22978]
>> [New Thread 22982]
>>
>> warning: Can't read pathname for load map: Input/output error.
>> Reading symbols from /lib/libpthread.so.0...Reading symbols from
>> /usr/lib/debug/lib/libpthread-2.11.3.so...done.
>> (no debugging symbols found)...done.
>> Loaded symbols for /lib/libpthread.so.0
>> Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
>> found)...done.
>> Loaded symbols for /usr/lib/libcrypto++.so.8
>> Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
>> Loaded symbols for /lib/libuuid.so.1
>> Reading symbols from /lib/librt.so.1...Reading symbols from
>> /usr/lib/debug/lib/librt-2.11.3.so...done.
>> (no debugging symbols found)...done.
>> Loaded symbols for /lib/librt.so.1
>> Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
>> found)...done.
>> Loaded symbols for /usr/lib/libtcmalloc.so.0
>> Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
>> found)...done.
>> Loaded symbols for /usr/lib/libstdc++.so.6
>> Reading symbols from /lib/libm.so.6...Reading symbols from
>> /usr/lib/debug/lib/libm-2.11.3.so...done.
>> (no debugging symbols found)...done.
>> Loaded symbols for /lib/libm.so.6
>> Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols 
>> found)...done.
>> Loaded symbols for /lib/libgcc_s.so.1
>> Reading symbols from /lib/libc.so.6...Reading symbols from
>> /usr/lib/debug/lib/libc-2.11.3.so...done.
>> (no debugging symbols found)...done.
>> Loaded symbols for /lib/libc.so.6
>> Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
>> from /usr/lib/debug/lib/ld-2.11.3.so...done.
>> (no debugging symbols fou

Re: MDS crash, wont startup again

2012-05-23 Thread Gregory Farnum
On Wed, May 23, 2012 at 5:28 AM, Felix Feinhals wrote:
> Hey,
>
> OK, I installed libc-dbg and ran your commands; now this comes up:
>
> gdb /usr/bin/ceph-mds core
>
> snip
>
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> ...
> Reading symbols from /usr/bin/ceph-mds...Reading symbols from
> /usr/lib/debug/usr/bin/ceph-mds...done.
> (no debugging symbols found)...done.
> [New Thread 22980]
> [New Thread 22984]
> [New Thread 22986]
> [New Thread 22979]
> [New Thread 22970]
> [New Thread 22981]
> [New Thread 22971]
> [New Thread 22976]
> [New Thread 22973]
> [New Thread 22975]
> [New Thread 22974]
> [New Thread 22972]
> [New Thread 22978]
> [New Thread 22982]
>
> warning: Can't read pathname for load map: Input/output error.
> Reading symbols from /lib/libpthread.so.0...Reading symbols from
> /usr/lib/debug/lib/libpthread-2.11.3.so...done.
> (no debugging symbols found)...done.
> Loaded symbols for /lib/libpthread.so.0
> Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libcrypto++.so.8
> Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib/libuuid.so.1
> Reading symbols from /lib/librt.so.1...Reading symbols from
> /usr/lib/debug/lib/librt-2.11.3.so...done.
> (no debugging symbols found)...done.
> Loaded symbols for /lib/librt.so.1
> Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libtcmalloc.so.0
> Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libstdc++.so.6
> Reading symbols from /lib/libm.so.6...Reading symbols from
> /usr/lib/debug/lib/libm-2.11.3.so...done.
> (no debugging symbols found)...done.
> Loaded symbols for /lib/libm.so.6
> Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib/libgcc_s.so.1
> Reading symbols from /lib/libc.so.6...Reading symbols from
> /usr/lib/debug/lib/libc-2.11.3.so...done.
> (no debugging symbols found)...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
> from /usr/lib/debug/lib/ld-2.11.3.so...done.
> (no debugging symbols found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /usr/lib/libunwind.so.7...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libunwind.so.7
> Core was generated by `/usr/bin/ceph-mds -i c --pid-file
> /var/run/ceph/mds.c.pid -c /etc/ceph/ceph.con'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x7f10c00d2ebb in raise (sig=<value optimized out>) at
> ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
> 41      ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
>        in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
>
> snip
>
> Now
>
> thread apply all bt
>
> ...
>
> thread 1
> [Switching to thread 1 (Thread 22977)]#0  0x7f10c00d2ebb in raise
> (sig=<value optimized out>) at
> ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
> 41      in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
>
>
> Thread 1 (Thread 22977):
> ---Type <return> to continue, or q <return> to quit---
> #0  0x7f10c00d2ebb in raise (sig=<value optimized out>) at
> ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
> #1  0x0081469e in reraise_fatal (signum=11) at
> global/signal_handler.cc:58
> #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:104
> #3  <signal handler called>
> #4  SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
> at mds/snap.cc:112
>
> #5  0x0055d58b in MDCache::check_realm_past_parents
> (this=0x2b49200, realm=0x0) at mds/MDCache.cc:4495
> #6  0x00572eec in
> MDCache::choose_lock_states_and_reconnect_caps (this=0x2b49200) at
> mds/MDCache.cc:4533
> #7  0x005931a0 in MDCache::rejoin_gather_finish
> (this=0x2b49200) at mds/MDCache.cc:
> #8  0x0059b9d5 in MDCache::rejoin_send_rejoins
> (this=0x2b49200) at mds/MDCache.cc:3388
> #9  0x004a8721 in MDS::rejoin_joint_start (this=0x2b5e000) at
> mds/MDS.cc:1404
> #10 0x004c253a in MDS::handle_mds_map (this=0x2b5e000,
> m=<value optimized out>) at mds/MDS.cc:968
> #11 0x004c4513 in MDS::handle_core_message (this=0x2b5e000,
> m=0x2b4d800) at mds/MDS.cc:1651
> #12 0x004c45ef in MDS::_dispatch (this=0x2b5e000, m=0x2b4d800)
> at mds/MDS.cc:1790
> #13 0x004c628b in MDS::ms_dispatch (this=0x2b5e000,
> m=0x2b4d800) at mds/MDS.cc:1602
> #14 0x007acb49 in Messenger::ms_deliver_dispatch
> (this=0x2b41680) at msg/Messenger.h:178
> #15 SimpleMessenger::dispatch_entry 

Re: MDS crash, wont startup again

2012-05-23 Thread Felix Feinhals
Hey,

OK, I installed libc-dbg and ran your commands; now this comes up:

gdb /usr/bin/ceph-mds core

snip

GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/ceph-mds...Reading symbols from
/usr/lib/debug/usr/bin/ceph-mds...done.
(no debugging symbols found)...done.
[New Thread 22980]
[New Thread 22984]
[New Thread 22986]
[New Thread 22979]
[New Thread 22970]
[New Thread 22981]
[New Thread 22971]
[New Thread 22976]
[New Thread 22973]
[New Thread 22975]
[New Thread 22974]
[New Thread 22972]
[New Thread 22978]
[New Thread 22982]

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpthread.so.0...Reading symbols from
/usr/lib/debug/lib/libpthread-2.11.3.so...done.
(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libcrypto++.so.8
Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libuuid.so.1
Reading symbols from /lib/librt.so.1...Reading symbols from
/usr/lib/debug/lib/librt-2.11.3.so...done.
(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libtcmalloc.so.0
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libstdc++.so.6
Reading symbols from /lib/libm.so.6...Reading symbols from
/usr/lib/debug/lib/libm-2.11.3.so...done.
(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/libc.so.6...Reading symbols from
/usr/lib/debug/lib/libc-2.11.3.so...done.
(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
from /usr/lib/debug/lib/ld-2.11.3.so...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib/libunwind.so.7...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libunwind.so.7
Core was generated by `/usr/bin/ceph-mds -i c --pid-file
/var/run/ceph/mds.c.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 11, Segmentation fault.
#0  0x7f10c00d2ebb in raise (sig=<value optimized out>) at
../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
41  ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c

snip

Now

thread apply all bt

...

thread 1
[Switching to thread 1 (Thread 22977)]#0  0x7f10c00d2ebb in raise
(sig=<value optimized out>) at
../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
41  in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c


Thread 1 (Thread 22977):
---Type <return> to continue, or q <return> to quit---
#0  0x7f10c00d2ebb in raise (sig=<value optimized out>) at
../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
#1  0x0081469e in reraise_fatal (signum=11) at
global/signal_handler.cc:58
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:104
#3  <signal handler called>
#4  SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
at mds/snap.cc:112

#5  0x0055d58b in MDCache::check_realm_past_parents
(this=0x2b49200, realm=0x0) at mds/MDCache.cc:4495
#6  0x00572eec in
MDCache::choose_lock_states_and_reconnect_caps (this=0x2b49200) at
mds/MDCache.cc:4533
#7  0x005931a0 in MDCache::rejoin_gather_finish
(this=0x2b49200) at mds/MDCache.cc:
#8  0x0059b9d5 in MDCache::rejoin_send_rejoins
(this=0x2b49200) at mds/MDCache.cc:3388
#9  0x004a8721 in MDS::rejoin_joint_start (this=0x2b5e000) at
mds/MDS.cc:1404
#10 0x004c253a in MDS::handle_mds_map (this=0x2b5e000,
m=<value optimized out>) at mds/MDS.cc:968
#11 0x004c4513 in MDS::handle_core_message (this=0x2b5e000,
m=0x2b4d800) at mds/MDS.cc:1651
#12 0x004c45ef in MDS::_dispatch (this=0x2b5e000, m=0x2b4d800)
at mds/MDS.cc:1790
#13 0x004c628b in MDS::ms_dispatch (this=0x2b5e000,
m=0x2b4d800) at mds/MDS.cc:1602
#14 0x007acb49 in Messenger::ms_deliver_dispatch
(this=0x2b41680) at msg/Messenger.h:178
#15 SimpleMessenger::dispatch_entry (this=0x2b41680) at
msg/SimpleMessenger.cc:363
#16 0x007336ed in SimpleMessenger::DispatchThread::entry() ()
#17 0x7f10c00ca8ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#18 0x7f10be95292d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#19 0x in ?

Re: MDS crash, wont startup again

2012-05-22 Thread Greg Farnum


On Tuesday, May 22, 2012 at 3:12 AM, Felix Feinhals wrote:

> I am not quite sure how to get you the core dump info. I installed
> all ceph-dbg packages and executed:
> 
> gdb /usr/bin/ceph-mds core
> 
> snip
> 
> GNU gdb (GDB) 7.0.1-debian
> Copyright (C) 2009 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> ...
> Reading symbols from /usr/bin/ceph-mds...Reading symbols from
> /usr/lib/debug/usr/bin/ceph-mds...done.
> (no debugging symbols found)...done.
> [New Thread 22980]
> [New Thread 22984]
> [New Thread 22986]
> [New Thread 22979]
> [New Thread 22970]
> [New Thread 22981]
> [New Thread 22971]
> [New Thread 22976]
> [New Thread 22973]
> [New Thread 22975]
> [New Thread 22974]
> [New Thread 22972]
> [New Thread 22978]
> [New Thread 22982]
> 
> warning: Can't read pathname for load map: Input/output error.
> Reading symbols from /lib/libpthread.so.0...(no debugging symbols 
> found)...done.
> Loaded symbols for /lib/libpthread.so.0
> Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libcrypto++.so.8
> Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib/libuuid.so.1
> Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib/librt.so.1
> Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libtcmalloc.so.0
> Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libstdc++.so.6
> Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib/libm.so.6
> Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib/libgcc_s.so.1
> Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
> symbols found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /usr/lib/libunwind.so.7...(no debugging symbols
> found)...done.
> Loaded symbols for /usr/lib/libunwind.so.7
> Core was generated by `/usr/bin/ceph-mds -i c --pid-file
> /var/run/ceph/mds.c.pid -c /etc/ceph/ceph.con'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x7f10c00d2ebb in raise () from /lib/libpthread.so.0
> 

Argh. This is finicky and annoying; don't feel bad. :) There are two 
possibilities here:
1) If I remember correctly, the directory gdb searches for debug symbols and the
actual debug symbol install location often don't match up. Check where the
debug packages actually installed to, and make sure gdb is searching that
directory (in gdb, "show debug-file-directory" prints the search path).
2) The default thread you're getting a backtrace on doesn't look to be the one 
we actually care about (notice how the backtrace is through completely 
different parts of the code); it's conceivable that there just aren't any debug 
symbols for those libraries. Try running "thread apply all bt" (I think that's 
the right command) and looking for one that matches the backtrace in the log 
file. Then switch to it ("thread x" where x is the thread number) and get the 
backtrace of that.
-Greg
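
The thread-matching step described above can also be done mechanically on
saved gdb output. This is a minimal sketch, assuming the output of
"thread apply all bt" has been captured into a string and that each section
starts with a header of the form "Thread N (Thread ...):", as gdb prints
elsewhere in this thread. find_thread_with_symbol() is a hypothetical helper,
not part of gdb or Ceph.

```cpp
#include <exception>
#include <sstream>
#include <string>

// Scans saved "thread apply all bt" output and returns the number of the
// first thread whose backtrace mentions `symbol`, or -1 if none does.
// Assumes each section starts with a line like "Thread 1 (Thread 22977):".
int find_thread_with_symbol(const std::string &gdb_output,
                            const std::string &symbol) {
  std::istringstream in(gdb_output);
  std::string line;
  int current = -1;  // thread number of the section being read
  while (std::getline(in, line)) {
    if (line.rfind("Thread ", 0) == 0) {
      // Header line: parse the number that follows "Thread ".
      try {
        current = std::stoi(line.substr(7));
      } catch (const std::exception &) {
        current = -1;  // not a numbered header after all
      }
    } else if (current != -1 && line.find(symbol) != std::string::npos) {
      return current;
    }
  }
  return -1;
}
```

The returned number is what you would pass to gdb's "thread" command before
asking for the backtrace, for example with the symbol
"SnapRealm::have_past_parents_open" taken from the log file.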



Re: MDS crash, wont startup again

2012-05-21 Thread Gregory Farnum
On Mon, May 21, 2012 at 5:38 AM, Felix Feinhals wrote:
> Hi Josh,
>
> I quoted the trace and some other stats in my first email; maybe it
> got stuck in the spam filters.
> Well, next try:
>
> snip
>
> -3> 2012-05-10 14:52:29.509940 7fb1c9351700 1 mds.0.40 handle_mds_map
>  i am now mds.0.40
>  -2> 2012-05-10 14:52:29.509956 7fb1c9351700 1 mds.0.40 handle_mds_map
>  state change up:reconnect --> up:rejoin
>  -1> 2012-05-10 14:52:29.509963 7fb1c9351700 1 mds.0.40 rejoin_joint_start
>  0> 2012-05-10 14:52:29.512503 7fb1c9351700 -1 *** Caught signal
>  (Segmentation fault) **
>  in thread 7fb1c9351700
>
> ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
>  1: ceph-mds() [0x814279]
>  2: (()+0xeff0) [0x7fb1cddbfff0]
>  3: (SnapRealm::have_past_parents_open(snapid_t, snapid_t)+0x4f) [0x6cb5ef]
>  4: (MDCache::check_realm_past_parents(SnapRealm*)+0x2b) [0x55d58b]
>  5: (MDCache::choose_lock_states_and_reconnect_caps()+0x29c) [0x572eec]
>  6: (MDCache::rejoin_gather_finish()+0x90) [0x5931a0]
>  7: (MDCache::rejoin_send_rejoins()+0x2c05) [0x59b9d5]
>  8: (MDS::rejoin_joint_start()+0x131) [0x4a8721]
>  9: (MDS::handle_mds_map(MMDSMap*)+0x2c4a) [0x4c253a]
>  10: (MDS::handle_core_message(Message*)+0x913) [0x4c4513]
>  11: (MDS::_dispatch(Message*)+0x2f) [0x4c45ef]
>  12: (MDS::ms_dispatch(Message*)+0x1fb) [0x4c628b]
>  13: (SimpleMessenger::dispatch_entry()+0x979) [0x7acb49]
>  14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7336ed]
>  15: (()+0x68ca) [0x7fb1cddb78ca]
>  16: (clone()+0x6d) [0x7fb1cc63f92d]

There's nothing obvious here — can you run gdb on the core and get
another backtrace and the info from levels 3-5?

> snip
>
> I thought Ceph chooses which MDS is active and which is standby; I just
> have 3 in the cluster config:
Yes, it does — if you don't increase the number of allowed MDSes
you'll just get one of them active.
-Greg


Re: MDS crash, wont startup again

2012-05-21 Thread Felix Feinhals
Hi Josh,

I quoted the trace and some other stats in my first email; maybe it
got stuck in the spam filters.
Well, next try:

snip

-3> 2012-05-10 14:52:29.509940 7fb1c9351700 1 mds.0.40 handle_mds_map
 i am now mds.0.40
 -2> 2012-05-10 14:52:29.509956 7fb1c9351700 1 mds.0.40 handle_mds_map
 state change up:reconnect --> up:rejoin
 -1> 2012-05-10 14:52:29.509963 7fb1c9351700 1 mds.0.40 rejoin_joint_start
 0> 2012-05-10 14:52:29.512503 7fb1c9351700 -1 *** Caught signal
 (Segmentation fault) **
 in thread 7fb1c9351700

ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: ceph-mds() [0x814279]
 2: (()+0xeff0) [0x7fb1cddbfff0]
 3: (SnapRealm::have_past_parents_open(snapid_t, snapid_t)+0x4f) [0x6cb5ef]
 4: (MDCache::check_realm_past_parents(SnapRealm*)+0x2b) [0x55d58b]
 5: (MDCache::choose_lock_states_and_reconnect_caps()+0x29c) [0x572eec]
 6: (MDCache::rejoin_gather_finish()+0x90) [0x5931a0]
 7: (MDCache::rejoin_send_rejoins()+0x2c05) [0x59b9d5]
 8: (MDS::rejoin_joint_start()+0x131) [0x4a8721]
 9: (MDS::handle_mds_map(MMDSMap*)+0x2c4a) [0x4c253a]
 10: (MDS::handle_core_message(Message*)+0x913) [0x4c4513]
 11: (MDS::_dispatch(Message*)+0x2f) [0x4c45ef]
 12: (MDS::ms_dispatch(Message*)+0x1fb) [0x4c628b]
 13: (SimpleMessenger::dispatch_entry()+0x979) [0x7acb49]
 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7336ed]
 15: (()+0x68ca) [0x7fb1cddb78ca]
 16: (clone()+0x6d) [0x7fb1cc63f92d]

snip

I thought Ceph chooses which MDS is active and which is standby; I just
have 3 in the cluster config:

[mds.a]
host = x

[mds.b]
host = y

[mds.c]
host = z

There is no global MDS config.
Should I reconfigure this?



2012/5/17 Josh Durgin:
> On 05/16/2012 01:11 AM, Felix Feinhals wrote:
>>
>> Hi again,
>>
>> anything on this Problem? Seems that the only choice for me is to
>> reinitialize the whole cephfs (mkcephfs...)
>> :(
>
>
> Hi Felix, it looks like your first mail never reached the list.
>
>
>> 2012/5/10 Felix Feinhals:
>>>
>>> Hi List,
>>>
>>> we installed a ceph cluster with ceph version 0.46.
>>> 3 OSDs, 3 MONs and 3 MDSs.
>>>
>>> After copying a bunch of files to a ceph-fuse mount all MDS daemons
>>> crash and now i cant bring them back online.
>>> I already tried to restart the daemons in different order and also
>>> removed one OSD, nothing really happened only now we have pgs with
>>> active+remapped which i think is normal.
>>> Any hints?
>
>
> Are all three MDS active? At this point, more than one active MDS is
> likely to crash. You can have one active and others standby.
>
> If you've got only one active, what was the backtrace of the crash?
> It'll be at the end of the MDS log (by default in /var/log/ceph).


Re: MDS crash, wont startup again

2012-05-17 Thread Josh Durgin

On 05/16/2012 01:11 AM, Felix Feinhals wrote:
> Hi again,
>
> Any news on this problem? It seems that the only choice for me is to
> reinitialize the whole cephfs (mkcephfs...)
> :(

Hi Felix, it looks like your first mail never reached the list.

> 2012/5/10 Felix Feinhals:
>>
>> Hi List,
>>
>> We installed a Ceph cluster with Ceph version 0.46:
>> 3 OSDs, 3 MONs and 3 MDSs.
>>
>> After copying a bunch of files to a ceph-fuse mount, all MDS daemons
>> crashed, and now I can't bring them back online.
>> I already tried restarting the daemons in a different order and also
>> removed one OSD. Nothing really changed, except that we now have PGs in
>> active+remapped, which I think is normal.
>> Any hints?


Are all three MDS active? At this point, more than one active MDS is
likely to crash. You can have one active and others standby.

If you've got only one active, what was the backtrace of the crash?
It'll be at the end of the MDS log (by default in /var/log/ceph).
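
To pull just the crash frames out of an MDS log, something like the following
could work. It is a sketch that assumes the log format shown elsewhere in this
thread: a "*** Caught signal" line followed by numbered frames such as
" 1: ceph-mds() [0x814279]". extract_crash_frames() is a hypothetical helper,
not a Ceph tool.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Returns the numbered frame lines ("1: ...", "2: ...") that follow the
// last "*** Caught signal" marker in the log text; empty if there is none.
std::vector<std::string> extract_crash_frames(const std::string &log) {
  std::istringstream in(log);
  std::string line;
  std::vector<std::string> frames;
  bool in_dump = false;
  while (std::getline(in, line)) {
    if (line.find("*** Caught signal") != std::string::npos) {
      frames.clear();  // keep only the most recent crash
      in_dump = true;
      continue;
    }
    if (!in_dump) continue;
    // A frame line is optional spaces, then digits, then ':'.
    std::size_t i = line.find_first_not_of(' ');
    if (i == std::string::npos) continue;
    std::size_t j = i;
    while (j < line.size() &&
           std::isdigit(static_cast<unsigned char>(line[j]))) {
      ++j;
    }
    if (j > i && j < line.size() && line[j] == ':') {
      frames.push_back(line.substr(i));  // store without leading spaces
    }
  }
  return frames;
}
```

Only the frames after the last marker are kept, on the assumption that the
most recent crash is the one of interest when a daemon has been restarted
several times.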


Re: MDS crash, wont startup again

2012-05-16 Thread Felix Feinhals
Hi again,

Any news on this problem? It seems that the only choice for me is to
reinitialize the whole cephfs (mkcephfs...)
:(

2012/5/10 Felix Feinhals:
> Hi List,
>
> We installed a Ceph cluster with Ceph version 0.46:
> 3 OSDs, 3 MONs and 3 MDSs.
>
> After copying a bunch of files to a ceph-fuse mount, all MDS daemons
> crashed, and now I can't bring them back online.
> I already tried restarting the daemons in a different order and also
> removed one OSD. Nothing really changed, except that we now have PGs in
> active+remapped, which I think is normal.
> Any hints?
>