On the OSD node:

root@cepha0:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:        12.10
Codename:       quantal

root@cepha0:~# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name               Version                  Architecture  Description
+++-==================-========================-=============-==============================
ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf         fast key-value storage library

root@cepha0:~# uname -a
Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013 armv7l armv7l armv7l GNU/Linux
On the MON node:

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:        12.10
Codename:       quantal

# uname -a
Linux 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                  Version                  Architecture  Description
+++-=====================-========================-=============-====================================================
un  leveldb-doc           <none>                                 (no description available)
ii  libleveldb-dev:amd64  0+20120530.gitdd0d562-2  amd64         fast key-value storage library (development files)
ii  libleveldb1:amd64     0+20120530.gitdd0d562-2  amd64         fast key-value storage library

On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just <sam.j...@inktank.com> wrote:
> What version of leveldb is installed?  Ubuntu/version?
> -Sam
>
> On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden <trho...@gmail.com> wrote:
> > Interestingly, the down OSD does not get marked out after 5 minutes.
> > Probably that is already fixed by http://tracker.ceph.com/issues/4822.
> >
> >
> > On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <trho...@gmail.com> wrote:
> >>
> >> Hi Sam,
> >>
> >> I was prepared to write in and say that the problem had gone away.  I
> >> tried restarting several OSDs last night in the hopes of capturing the
> >> problem on an OSD that hadn't failed yet, but didn't have any luck.  So I
> >> did indeed re-create the cluster from scratch (using mkcephfs), and what do
> >> you know -- everything worked.  I got everything into a nice stable state,
> >> then decided to do a full cluster restart, just to be sure.  Sure enough,
> >> one OSD failed to come up, and it has the same stack trace.  So I believe I
> >> have the log you want -- just from the OSD that failed, right?
> >>
> >> Question -- any feeling for which parts of the log you need?  It's 688MB
> >> uncompressed (two hours!), so I'd like to be able to trim some off for you
> >> before making it available.  Do you only need/want the part from after the
> >> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown and
> >> you need some from before that?  If you are fine with that large a file, I
> >> can just make the whole thing available too.  Let me know.
> >>
> >> - Travis
> >>
> >>
> >> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>
> >>> Hi Sam,
> >>>
> >>> No problem, I'll leave that debugging turned up high, do a mkcephfs
> >>> from scratch, and see what happens.  Not sure if it will happen again or not.
> >>> =)
> >>>
> >>> Thanks again.
> >>>
> >>> - Travis
> >>>
> >>>
> >>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <sam.j...@inktank.com> wrote:
> >>>>
> >>>> Hmm, I need logging from when the corruption happened.  If this is
> >>>> reproducible, can you enable that logging on a clean osd (or better, a
> >>>> clean cluster) until the assert occurs?
> >>>> -Sam
> >>>>
> >>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> > Also, I can note that it does not take a full cluster restart to
> >>>> > trigger this.  If I just restart an OSD that was up/in previously, the same
> >>>> > error can happen (though not every time).  So restarting OSDs for me is a
> >>>> > bit like Russian roulette.
> >>>> > =)  Even though restarting an OSD may not always result in the
> >>>> > error, it seems that once it happens that OSD is gone for good.
> >>>> > No amount of restarting has brought any of the dead ones back.
> >>>> >
> >>>> > I'd really like to get to the bottom of it.  Let me know if I can do
> >>>> > anything to help.
> >>>> >
> >>>> > I may also have to try completely wiping/rebuilding to see if I can
> >>>> > make this thing usable.
> >>>> >
> >>>> >
> >>>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> >>
> >>>> >> Hi Sam,
> >>>> >>
> >>>> >> Thanks for being willing to take a look.
> >>>> >>
> >>>> >> I applied the debug settings on one host that has 3 out of 3 OSDs with
> >>>> >> this problem.  Then I tried to start them up.  Here are the resulting logs:
> >>>> >>
> >>>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
> >>>> >>
> >>>> >> - Travis
> >>>> >>
> >>>> >>
> >>>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <sam.j...@inktank.com> wrote:
> >>>> >>>
> >>>> >>> You appear to be missing pg metadata for some reason.  If you can
> >>>> >>> reproduce it with
> >>>> >>> debug osd = 20
> >>>> >>> debug filestore = 20
> >>>> >>> debug ms = 1
> >>>> >>> on all of the OSDs, I should be able to track it down.
> >>>> >>>
> >>>> >>> I created a bug: #4855.
> >>>> >>>
> >>>> >>> Thanks!
> >>>> >>> -Sam
> >>>> >>>
> >>>> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> >>> > Thanks Greg.
> >>>> >>> >
> >>>> >>> > I quit playing with it because every time I restarted the cluster
> >>>> >>> > (service ceph -a restart), I lost more OSDs.  First time it was 1, the
> >>>> >>> > 2nd 10, the 3rd time 13...  All 13 down OSDs show the same stack trace.
> >>>> >>> >
> >>>> >>> > - Travis
> >>>> >>> >
> >>>> >>> >
> >>>> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <g...@inktank.com> wrote:
> >>>> >>> >>
> >>>> >>> >> This sounds vaguely familiar to me, and I see
> >>>> >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
> >>>> >>> >> reproduce" — I think maybe this is fixed in "next" and "master", but
> >>>> >>> >> I'm not sure.  For more than that I'd have to defer to Sage or Sam.
> >>>> >>> >> -Greg
> >>>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>> >>> >>
> >>>> >>> >>
> >>>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> >>> >> > Hey folks,
> >>>> >>> >> >
> >>>> >>> >> > I'm helping put together a new test/experimental cluster, and hit
> >>>> >>> >> > this today when bringing the cluster up for the first time (using
> >>>> >>> >> > mkcephfs).
> >>>> >>> >> >
> >>>> >>> >> > After doing the normal "service ceph -a start", I noticed one OSD
> >>>> >>> >> > was down, and a lot of PGs were stuck creating.  I tried restarting
> >>>> >>> >> > the down OSD, but it wouldn't come up.
> >>>> >>> >> > It always had this error:
> >>>> >>> >> >
> >>>> >>> >> >     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
> >>>> >>> >> >      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
> >>>> >>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> >>>> >>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
> >>>> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
> >>>> >>> >> >
> >>>> >>> >> > ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
> >>>> >>> >> > 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::buffer::list*)+0x1ad) [0x2c3c0a]
> >>>> >>> >> > 2: (OSD::load_pgs()+0x357) [0x28cba0]
> >>>> >>> >> > 3: (OSD::init()+0x741) [0x290a16]
> >>>> >>> >> > 4: (main()+0x1427) [0x2155c0]
> >>>> >>> >> > 5: (__libc_start_main()+0x99) [0xb69bcf42]
> >>>> >>> >> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>>> >>> >> > needed to interpret this.
> >>>> >>> >> >
> >>>> >>> >> >
> >>>> >>> >> > I then did a full cluster restart, and now I have ten OSDs down --
> >>>> >>> >> > each showing the same exception/failed assert.
> >>>> >>> >> >
> >>>> >>> >> > Anybody seen this?
> >>>> >>> >> >
> >>>> >>> >> > I know I'm running a weird version -- it's compiled from source, and
> >>>> >>> >> > was provided to me.  The OSDs are all on ARM, and the mon is x86_64.
> >>>> >>> >> > Just looking to see if anyone has seen this particular stack trace of
> >>>> >>> >> > load_pgs()/peek_map_epoch() before....
> >>>> >>> >> >
> >>>> >>> >> > - Travis
> >>>> >>> >> >
> >>>> >>> >> > _______________________________________________
> >>>> >>> >> > ceph-users mailing list
> >>>> >>> >> > ceph-users@lists.ceph.com
> >>>> >>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>> >>> >> >
> >>>> >>> >
> >>>> >>> >
> >>>> >>
> >>>> >>
> >>>> >
> >>>
> >>>
> >>
> >
> >
>
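P.S. For anyone else following along and trying to reproduce this with the logging Sam asked for, here is a minimal sketch of how I understand those debug settings would go into ceph.conf. Putting them under [osd] is my assumption; [global] would also work if you want the mons at higher verbosity too, and the daemons need to be restarted after the edit so the logging covers startup, which is the window Sam needs.

    [osd]
        ; verbose logging for tracking down the peek_map_epoch assert
        debug osd = 20
        debug filestore = 20
        debug ms = 1

Be warned that these levels generate a lot of output (my two-hour log was 688MB), so plan for disk space on the OSD hosts before leaving them on.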
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com