Re: MDS crash, wont startup again

2012-05-24 Thread Felix Feinhals
Hi,

I was using the Debian packages, but I have now tried building from source.
I used the same version from Git
(cb7f1c9c7520848b0899b26440ac34a8acea58d1) and compiled it. Same crash
report.
Then I applied your patch, but I got the same crash again; I think the
backtrace is also the same:

 (gdb) thread 1
[Switching to thread 1 (Thread 9564)]#0  0x7f33a3e58ebb in raise
(sig=<value optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
41  in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
(gdb) backtrace
#0  0x7f33a3e58ebb in raise (sig=<value optimized out>)
at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:41
#1  0x0081423e in reraise_fatal (signum=11) at
global/signal_handler.cc:58
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:104
#3  signal handler called
#4  SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
at mds/snap.cc:112
#5  0x0055d58b in MDCache::check_realm_past_parents
(this=0x27a7200, realm=0x0)
at mds/MDCache.cc:4495
#6  0x00572eec in
MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200)
at mds/MDCache.cc:4533
#7  0x005931a0 in MDCache::rejoin_gather_finish
(this=0x27a7200) at mds/MDCache.cc:
#8  0x0059b9d5 in MDCache::rejoin_send_rejoins
(this=0x27a7200) at mds/MDCache.cc:3388
#9  0x004a8721 in MDS::rejoin_joint_start (this=0x27bc000) at
mds/MDS.cc:1404
#10 0x004c253a in MDS::handle_mds_map (this=0x27bc000,
m=<value optimized out>)
at mds/MDS.cc:968
#11 0x004c4513 in MDS::handle_core_message (this=0x27bc000,
m=0x27ab800) at mds/MDS.cc:1651
#12 0x004c45ef in MDS::_dispatch (this=0x27bc000, m=0x27ab800)
at mds/MDS.cc:1790
#13 0x004c628b in MDS::ms_dispatch (this=0x27bc000,
m=0x27ab800) at mds/MDS.cc:1602
#14 0x00732609 in Messenger::ms_deliver_dispatch
(this=0x279f680) at msg/Messenger.h:178
#15 SimpleMessenger::dispatch_entry (this=0x279f680) at
msg/SimpleMessenger.cc:363
#16 0x007207ad in SimpleMessenger::DispatchThread::entry() ()
#17 0x7f33a3e508ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#18 0x7f33a26d892d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#19 0x in ?? ()

Any more ideas? :)
Or can I get you more debugging output?
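
One way to read the backtrace above: frame #4 shows this=0x0 and frame #5 shows realm=0x0, which suggests that a member function was called through a null SnapRealm pointer. The Python sketch below is only an illustration of how such frames could be picked out of a pasted gdb backtrace. The function name null_pointer_frames and the regular expressions are assumptions made for this example and are not part of gdb or Ceph; the sample text is copied from the frames quoted above.

```python
import re

# A frame starts with "#<number>"; wrapped lines that follow belong to it.
FRAME_START = re.compile(r"^#(\d+)\s")
# An argument printed as exactly 0x0 (and not, say, 0x0081423e).
NULL_ARG = re.compile(r"(\w+)=0x0(?![0-9a-fA-F])")


def null_pointer_frames(backtrace):
    """Return (frame_number, [argument names]) for frames with a 0x0 argument."""
    frames = []  # each entry is [frame_number, joined_text]
    for line in backtrace.splitlines():
        start = FRAME_START.match(line)
        if start:
            frames.append([int(start.group(1)), line.strip()])
        elif frames and line.strip():
            # Continuation of a frame that the mail client wrapped.
            frames[-1][1] += " " + line.strip()
    return [
        (number, NULL_ARG.findall(text))
        for number, text in frames
        if NULL_ARG.search(text)
    ]


# Frames #3 to #6 as they appear in the backtrace above.
SAMPLE = """\
#3  signal handler called
#4  SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
at mds/snap.cc:112
#5  0x0055d58b in MDCache::check_realm_past_parents
(this=0x27a7200, realm=0x0)
at mds/MDCache.cc:4495
#6  0x00572eec in
MDCache::choose_lock_states_and_reconnect_caps (this=0x27a7200)
at mds/MDCache.cc:4533
"""

if __name__ == "__main__":
    for number, args in null_pointer_frames(SAMPLE):
        print("frame #%d: null argument(s): %s" % (number, ", ".join(args)))
```

This only flags where a null pointer shows up in the trace; it does not say why the realm was null in the first place.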



2012/5/23 Gregory Farnum g...@inktank.com:
 On Wed, May 23, 2012 at 5:28 AM, Felix Feinhals
 f...@turtle-entertainment.de wrote:
 Hey,

 OK, I installed libc-dbg and ran your commands; now this comes up:

 gdb /usr/bin/ceph-mds core

 snip

 GNU gdb (GDB) 7.0.1-debian
 Copyright (C) 2009 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type show copying
 and show warranty for details.
 This GDB was configured as x86_64-linux-gnu.
 For bug reporting instructions, please see:
 http://www.gnu.org/software/gdb/bugs/...
 Reading symbols from /usr/bin/ceph-mds...Reading symbols from
 /usr/lib/debug/usr/bin/ceph-mds...done.
 (no debugging symbols found)...done.
 [New Thread 22980]
 [New Thread 22984]
 [New Thread 22986]
 [New Thread 22979]
 [New Thread 22970]
 [New Thread 22981]
 [New Thread 22971]
 [New Thread 22976]
 [New Thread 22973]
 [New Thread 22975]
 [New Thread 22974]
 [New Thread 22972]
 [New Thread 22978]
 [New Thread 22982]

 warning: Can't read pathname for load map: Input/output error.
 Reading symbols from /lib/libpthread.so.0...Reading symbols from
 /usr/lib/debug/lib/libpthread-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/libpthread.so.0
 Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libcrypto++.so.8
 Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/libuuid.so.1
 Reading symbols from /lib/librt.so.1...Reading symbols from
 /usr/lib/debug/lib/librt-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/librt.so.1
 Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libtcmalloc.so.0
 Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libstdc++.so.6
 Reading symbols from /lib/libm.so.6...Reading symbols from
 /usr/lib/debug/lib/libm-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/libm.so.6
 Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libgcc_s.so.1
 Reading symbols from /lib/libc.so.6...Reading symbols from
 /usr/lib/debug/lib/libc-2.11.3.so...done.
 (no debugging symbols found)...done.
 Loaded symbols for /lib/libc.so.6
 Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
 from /usr/lib/debug/lib/ld-2.11.3.so...done.
 (no debugging symbols found

Re: MDS crash, wont startup again

2012-05-23 Thread Felix Feinhals
) at
pthread_create.c:300
#18 0x7f10be95292d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#19 0x in ?? ()

So I wonder: is the crash because of the missing file message?

2012/5/22 Greg Farnum g...@inktank.com:


 On Tuesday, May 22, 2012 at 3:12 AM, Felix Feinhals wrote:

 I am not quite sure how to get you the core dump info. I installed
 all ceph-dbg packages and executed:

 gdb /usr/bin/ceph-mds core

 snip

 GNU gdb (GDB) 7.0.1-debian
 Copyright (C) 2009 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law. Type show copying
 and show warranty for details.
 This GDB was configured as x86_64-linux-gnu.
 For bug reporting instructions, please see:
 http://www.gnu.org/software/gdb/bugs/...
 Reading symbols from /usr/bin/ceph-mds...Reading symbols from
 /usr/lib/debug/usr/bin/ceph-mds...done.
 (no debugging symbols found)...done.
 [New Thread 22980]
 [New Thread 22984]
 [New Thread 22986]
 [New Thread 22979]
 [New Thread 22970]
 [New Thread 22981]
 [New Thread 22971]
 [New Thread 22976]
 [New Thread 22973]
 [New Thread 22975]
 [New Thread 22974]
 [New Thread 22972]
 [New Thread 22978]
 [New Thread 22982]

 warning: Can't read pathname for load map: Input/output error.
 Reading symbols from /lib/libpthread.so.0...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libpthread.so.0
 Reading symbols from /usr/lib/libcrypto++.so.8...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libcrypto++.so.8
 Reading symbols from /lib/libuuid.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/libuuid.so.1
 Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
 Loaded symbols for /lib/librt.so.1
 Reading symbols from /usr/lib/libtcmalloc.so.0...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libtcmalloc.so.0
 Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libstdc++.so.6
 Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
 Loaded symbols for /lib/libm.so.6
 Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols 
 found)...done.
 Loaded symbols for /lib/libgcc_s.so.1
 Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
 Loaded symbols for /lib/libc.so.6
 Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
 symbols found)...done.
 Loaded symbols for /lib64/ld-linux-x86-64.so.2
 Reading symbols from /usr/lib/libunwind.so.7...(no debugging symbols
 found)...done.
 Loaded symbols for /usr/lib/libunwind.so.7
 Core was generated by `/usr/bin/ceph-mds -i c --pid-file
 /var/run/ceph/mds.c.pid -c /etc/ceph/ceph.con'.
 Program terminated with signal 11, Segmentation fault.
 #0 0x7f10c00d2ebb in raise () from /lib/libpthread.so.0


 Argh. This is finicky and annoying; don't feel bad. :) There are two 
 possibilities here:
 1) If I remember correctly, PATH and the actual debug symbol install 
 locations often don't match up. Check out where the debug packages actually 
 installed to, and make sure that directory is in PATH when running gdb.
 2) The default thread you're getting a backtrace on doesn't look to be the 
 one we actually care about (notice how the backtrace is through completely 
 different parts of the code); it's conceivable that there just aren't any 
 debug symbols for those libraries. Try running thread apply all bt (I think 
 that's the right command) and looking for one that matches the backtrace in 
 the log file. Then switch to it (thread x where x is the thread number) and 
 get the backtrace of that.
 -Greg
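
The thread-matching step Greg describes can be sketched in Python, assuming the output of thread apply all bt has been saved as text. The function name threads_containing and the sample text are hypothetical and only illustrate the idea; the sample is shortened and is not real output from this core file. The header pattern assumes gdb prints each thread as "Thread N (...):", as gdb 7.x does.

```python
import re

# gdb starts each thread's section with a line like "Thread 3 (Thread 22980):".
THREAD_HEADER = re.compile(r"^Thread (\d+) \(")


def threads_containing(bt_text, needle):
    """Return the gdb thread numbers whose backtrace mentions `needle`."""
    matches = []
    current = None
    for line in bt_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
        elif current is not None and needle in line and current not in matches:
            matches.append(current)
    return matches


# Hypothetical, shortened sample of "thread apply all bt" output.
SAMPLE = """\
Thread 2 (Thread 1002):
#0  0x01 in pthread_cond_wait () from /lib/libpthread.so.0
#1  0x02 in start_thread () from /lib/libpthread.so.0

Thread 1 (Thread 1001):
#0  0x03 in raise () from /lib/libpthread.so.0
#4  SnapRealm::have_past_parents_open (this=0x0, first=..., last=...)
"""

if __name__ == "__main__":
    # Search for a function name taken from the backtrace in the MDS log.
    print(threads_containing(SAMPLE, "SnapRealm::have_past_parents_open"))
```

The numbers returned are the ones to pass to gdb's thread command before running backtrace again.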

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS crash, wont startup again

2012-05-21 Thread Felix Feinhals
Hi Josh,

I quoted the trace and some other stats in my first email; maybe it
got stuck in the spam filters.
Well, next try:

snip

-3 2012-05-10 14:52:29.509940 7fb1c9351700 1 mds.0.40 handle_mds_map
 i am now mds.0.40
 -2 2012-05-10 14:52:29.509956 7fb1c9351700 1 mds.0.40 handle_mds_map
 state change up:reconnect --> up:rejoin
 -1 2012-05-10 14:52:29.509963 7fb1c9351700 1 mds.0.40 rejoin_joint_start
 0 2012-05-10 14:52:29.512503 7fb1c9351700 -1 *** Caught signal
 (Segmentation fault) **
 in thread 7fb1c9351700

ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: ceph-mds() [0x814279]
 2: (()+0xeff0) [0x7fb1cddbfff0]
 3: (SnapRealm::have_past_parents_open(snapid_t, snapid_t)+0x4f) [0x6cb5ef]
 4: (MDCache::check_realm_past_parents(SnapRealm*)+0x2b) [0x55d58b]
 5: (MDCache::choose_lock_states_and_reconnect_caps()+0x29c) [0x572eec]
 6: (MDCache::rejoin_gather_finish()+0x90) [0x5931a0]
 7: (MDCache::rejoin_send_rejoins()+0x2c05) [0x59b9d5]
 8: (MDS::rejoin_joint_start()+0x131) [0x4a8721]
 9: (MDS::handle_mds_map(MMDSMap*)+0x2c4a) [0x4c253a]
 10: (MDS::handle_core_message(Message*)+0x913) [0x4c4513]
 11: (MDS::_dispatch(Message*)+0x2f) [0x4c45ef]
 12: (MDS::ms_dispatch(Message*)+0x1fb) [0x4c628b]
 13: (SimpleMessenger::dispatch_entry()+0x979) [0x7acb49]
 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7336ed]
 15: (()+0x68ca) [0x7fb1cddb78ca]
 16: (clone()+0x6d) [0x7fb1cc63f92d]

snip

I thought Ceph chooses which MDS is active and which is standby; I just
have 3 in the cluster config:

[mds.a]
host = x

[mds.b]
host = y

[mds.c]
host = z

There is no global MDS config.
Should I reconfigure this?



2012/5/17 Josh Durgin josh.dur...@inktank.com:
 On 05/16/2012 01:11 AM, Felix Feinhals wrote:

 Hi again,

 Anything on this problem? It seems that my only choice is to
 reinitialize the whole CephFS (mkcephfs...)
 :(


 Hi Felix, it looks like your first mail never reached the list.


 2012/5/10 Felix Feinhals f...@turtle-entertainment.de:

 Hi List,

 We installed a Ceph cluster with Ceph version 0.46:
 3 OSDs, 3 MONs and 3 MDSs.

 After copying a bunch of files to a ceph-fuse mount, all MDS daemons
 crashed, and now I can't bring them back online.
 I already tried restarting the daemons in a different order and also
 removed one OSD. Nothing really changed, except that we now have PGs in
 active+remapped, which I think is normal.
 Any hints?


 Are all three MDS active? At this point, more than one active MDS is
 likely to crash. You can have one active and others standby.

 If you've got only one active, what was the backtrace of the crash?
 It'll be at the end of the MDS log (by default in /var/log/ceph).


Re: MDS crash, wont startup again

2012-05-16 Thread Felix Feinhals
Hi again,

Anything on this problem? It seems that my only choice is to
reinitialize the whole CephFS (mkcephfs...)
:(

2012/5/10 Felix Feinhals f...@turtle-entertainment.de:
 Hi List,

 We installed a Ceph cluster with Ceph version 0.46:
 3 OSDs, 3 MONs and 3 MDSs.

 After copying a bunch of files to a ceph-fuse mount, all MDS daemons
 crashed, and now I can't bring them back online.
 I already tried restarting the daemons in a different order and also
 removed one OSD. Nothing really changed, except that we now have PGs in
 active+remapped, which I think is normal.
 Any hints?
