It should apply cleanly on top of 0.48.2. There may be a 0.48.3, but it
won't be driven by this patch.
-Greg
On Sat, Nov 3, 2012 at 7:27 PM, Nick Couchman <nick.couch...@seakr.com> wrote:
> Okay - I'm planning to try to go to version 0.48.2, the latest stable - is
> the patch available for that branch, or will there be a 0.48.3 release coming?
>
>>>> Gregory Farnum 11/03/12 11:45 AM >>>
> Sage merged it into master, so whatever you like. If you remove the
> patch and the error happens again, your MDS will fail on replay as it
> did here. If you leave it in, it has no effect other than handling
> that particular bad case.
> -Greg
>
> On Tue, Oct 30, 2012 at 3:22 AM, Nick Couchman wrote:
>> Okay, that patch worked and it seems to be running, again. Should I
>> continue to run with that patch, or go back to the original binaries?
>>
>>>>> Gregory Farnum 10/19/12 4:16 PM >>>
>> I've written a small patch on top of v0.48.1argonaut which should
>> avoid this. It's in branch 3369-mds-session-workaround and will simply
>> log an error in the monitor central log instead of segfaulting. There
>> should shortly be packages available at
>> http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/3369-mds-session-workaround/
>> (for Precise amd64; or elsewhere if you're on a different platform?).
>> -Greg
>>
>> On Fri, Oct 19, 2012 at 1:52 PM, Nick Couchman wrote:
>>> One of the MDSs crashed over the weekend (late Friday night), but I believe
>>> that one was not active and was just in Replay mode. Other than that, I
>>> don't know of anything that would have affected the MDSs.
>>>
>>> -Nick
>>>
>>>>>> On 2012/10/18 at 16:55, Gregory Farnum wrote:
>>>> Okay, looked at this a little bit. Can you describe what was happening
>>>> before you got into this failed-replay loop? (So, why was it in replay
>>>> at all?) I see that the monitor marked it as laggy for some reason;
>>>> was the cluster under load; did the monitors break; something else?
>>>> I can see why it's failed here and I think I can do a simple code
>>>> patch to work around it, but the root cause is something that happened
>>>> while the MDS was still alive.
>>>>
>>>> Basic technical content:
>>>> The MDS journals all open client sessions. It brings them back into
>>>> memory during replay, and then operates on them to do things like open
>>>> new sessions or close ones that it turns out not to need. Your log
>>>> contains two close events for the same client session, and it's
>>>> causing a big freak out. This actually feels somewhat familiar; I'll
>>>> talk about it with our team here and get back to you tomorrow
>>>> sometime.
>>>> -Greg
>>>>
>>>> On Thu, Oct 18, 2012 at 8:56 AM, Nick Couchman wrote:
>>>>> Hopefully this is what you're looking for...
>>>>> (gdb) bt
>>>>> #0 ESession::replay (this=0x7fffcc49a7c0, mds=0x127d5f0) at mds/journal.cc:828
>>>>> #1 0x00000000006a2446 in MDLog::_replay_thread (this=0x1281390) at mds/MDLog.cc:580
>>>>> #2 0x00000000004cf5ed in MDLog::ReplayThread::entry (this=) at mds/MDLog.h:86
>>>>> #3 0x00007ffff764df05 in start_thread () from /lib64/libpthread.so.0
>>>>> #4 0x00007ffff680d10d in clone () from /lib64/libc.so.6
>>>>>
>>>>>>>> On 2012/10/17 at 09:53, Sam Lang wrote:
>>>>>> On 10/17/2012 09:42 AM, Nick Couchman wrote:
>>>>>>> Thanks...here's the backtrace:
>>>>>>> (gdb) bt
>>>>>>> #0 0x00000000004dcfea in ESession::replay(MDS*) ()
>>>>>>> #1 0x00000000006a2446 in MDLog::_replay_thread() ()
>>>>>>> #2 0x00000000004cf5ed in MDLog::ReplayThread::entry() ()
>>>>>>> #3 0x00007ffff764df05 in start_thread () from /lib64/libpthread.so.0
>>>>>>> #4 0x00007ffff680d10d in clone () from /lib64/libc.so.6
>>>>>>
>>>>>> Hi Nick,
>>>>>>
>>>>>> This doesn't have the debug symbols (line numbers in the source) we were
>>>>>> hoping for. Could you install the ceph-dbg package and rerun? You will
>>>>>> probably have to first uninstall the ceph package.
>>>>>>
>>>>>> Thanks,
>>>>>> -sam
>>>>>>
>>>>>>>
>>>>>>>>>> On 2012/10/17 at 07:34, Sam Lang wrote:
>>>>>>>> On 10/16/2012 06:04 PM, Gregory Farnum wrote:
>>>>>>>>> Okay, that's the right debugging but it wasn't quite as helpful on its
>>>>>>>>> own as I expected. Can you get a core dump (you might already have
>>>>>>>>> one, depending on system settings) of the crash and open it up with
>>>>>>>>> gdb and get a full backtrace?
>>>>>>>>
>>>>>>>> You can also run the mds directly in gdb and avoid any core file ulimit
>>>>>>>> settings you have set:
>>>>>>>>
>>>>>>>> > gdb --args ceph-mds -n mds.b -c /etc/ceph/ceph.conf -f
>>>>>>>> ...
>>>>>>>> (gdb) run
>>>>>>>>
>>>>>>>> Once you hit the segfault you can get the backtrace with:
>>>>>>>>
>>>>>>>> (gdb) bt
>>>>>>>>
>>>>>>>> -sam
>>>>>>>>
>>>>>>>>
>>>>>>>>> -Greg
>>>>>>>>>
>>>>>>>>> On Mon, Oct 15, 2012 at 10:59 AM, Nick Couchman wrote:
>>>>>>>>>> Well, hopefully this is still okay...8.5MB bzip2d, 230MB unzipped.
>>>>>>>>>>
>>>>>>>>>> -Nick
>>>>>>>>>>
>>>>>>>>>>>>> On 2012/10/15 at 11:47, Gregory Farnum wrote:
>>>>>>>>>>> Yeah, zip it and post; somebody's going to have to download it and do
>>>>>>>>>>> fun things. :)
>>>>>>>>>>> -Greg
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 15, 2012 at 10:43 AM, Nick Couchman wrote:
>>>>>>>>>>>> Anywhere in particular I should make it available? It's a little over a
>>>>>>>>>>>> million lines of debug in the file - I can put it on a pastebin, if that
>>>>>>>>>>>> works, or perhaps zip it up and throw it somewhere?
>>>>>>>>>>>>
>>>>>>>>>>>> -Nick
>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2012/10/15 at 11:26, Gregory Farnum wrote:
>>>>>>>>>>>>> Something in the MDS log is bad or is poking at a bug in the code. Can
>>>>>>>>>>>>> you turn on MDS debugging and restart a daemon and put that log
>>>>>>>>>>>>> somewhere accessible?
>>>>>>>>>>>>> debug mds = 20
>>>>>>>>>>>>> debug journaler = 20
>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 15, 2012 at 10:02 AM, Nick Couchman wrote:
>>>>>>>>>>>>>> Well, both of my MDSs seem to be down right now, and then continually
>>>>>>>>>>>>>> segfault (every time I try to start them) with the following:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ceph-mdsmon-a:~ # ceph-mds -n mds.b -c /etc/ceph/ceph.conf -f
>>>>>>>>>>>>>> starting mds.b at :/0
>>>>>>>>>>>>>> *** Caught signal (Segmentation fault) **
>>>>>>>>>>>>>> in thread 7fbe0d61d700
>>>>>>>>>>>>>> ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
>>>>>>>>>>>>>> 1: ceph-mds() [0x7ef83a]
>>>>>>>>>>>>>> 2: (()+0xfd00) [0x7fbe15a0cd00]
>>>>>>>>>>>>>> 3: (ESession::replay(MDS*)+0x3ea) [0x4dcfea]
>>>>>>>>>>>>>> 4: (MDLog::_replay_thread()+0x6b6) [0x6a2446]
>>>>>>>>>>>>>> 5: (MDLog::ReplayThread::entry()+0xd) [0x4cf5ed]
>>>>>>>>>>>>>> 6: (()+0x7f05) [0x7fbe15a04f05]
>>>>>>>>>>>>>> 7: (clone()+0x6d) [0x7fbe14bc410d]
>>>>>>>>>>>>>> 2012-10-15 10:57:35.449161 7fbe0d61d700 -1 *** Caught signal (Segmentation fault) **
>>>>>>>>>>>>>> in thread 7fbe0d61d700
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
>>>>>>>>>>>>>> 1: ceph-mds() [0x7ef83a]
>>>>>>>>>>>>>> 2: (()+0xfd00) [0x7fbe15a0cd00]
>>>>>>>>>>>>>> 3: (ESession::replay(MDS*)+0x3ea) [0x4dcfea]
>>>>>>>>>>>>>> 4: (MDLog::_replay_thread()+0x6b6) [0x6a2446]
>>>>>>>>>>>>>> 5: (MDLog::ReplayThread::entry()+0xd) [0x4cf5ed]
>>>>>>>>>>>>>> 6: (()+0x7f05) [0x7fbe15a04f05]
>>>>>>>>>>>>>> 7: (clone()+0x6d) [0x7fbe14bc410d]
>>>>>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 0> 2012-10-15 10:57:35.449161 7fbe0d61d700 -1 *** Caught signal (Segmentation fault) **
>>>>>>>>>>>>>> in thread 7fbe0d61d700
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
>>>>>>>>>>>>>> 1: ceph-mds() [0x7ef83a]
>>>>>>>>>>>>>> 2: (()+0xfd00) [0x7fbe15a0cd00]
>>>>>>>>>>>>>> 3: (ESession::replay(MDS*)+0x3ea) [0x4dcfea]
>>>>>>>>>>>>>> 4: (MDLog::_replay_thread()+0x6b6) [0x6a2446]
>>>>>>>>>>>>>> 5: (MDLog::ReplayThread::entry()+0xd) [0x4cf5ed]
>>>>>>>>>>>>>> 6: (()+0x7f05) [0x7fbe15a04f05]
>>>>>>>>>>>>>> 7: (clone()+0x6d) [0x7fbe14bc410d]
>>>>>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Segmentation fault
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anyone have any hints on recovering? I'm running 0.48.1argonaut - I can
>>>>>>>>>>>>>> attempt to upgrade to 0.48.2 and see if that helps, but I figured if anyone
>>>>>>>>>>>>>> can offer any insight as to what to do to get the replay to run without
>>>>>>>>>>>>>> segfaulting?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --------
>>>>>>>>>>>>>> This e-mail may contain confidential and privileged material for the sole use
>>>>>>>>>>>>>> of the intended recipient. If this email is not intended for you, or you are
>>>>>>>>>>>>>> not responsible for the delivery of this message to the intended recipient,
>>>>>>>>>>>>>> please note that this message may contain SEAKR Engineering (SEAKR)
>>>>>>>>>>>>>> Privileged/Proprietary Information. In such a case, you are strictly
>>>>>>>>>>>>>> prohibited from downloading, photocopying, distributing or otherwise using
>>>>>>>>>>>>>> this message, its contents or attachments in any way.
>>>>>>>>>>>>>> If you have received this message in error, please notify us immediately
>>>>>>>>>>>>>> by replying to this e-mail and delete the message from your mailbox.
>>>>>>>>>>>>>> Information contained in this message that does not relate to the
>>>>>>>>>>>>>> business of SEAKR is neither endorsed by nor attributable to SEAKR.
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html