Also, I should note that it does not take a full cluster restart to
trigger this.  If I just restart an OSD that was previously up/in, the
same error can happen (though not every time).  So restarting OSDs is a
bit like Russian roulette for me.  =)  Even though restarting an OSD
doesn't always result in the error, it seems that once it does happen,
that OSD is gone for good.  No amount of restarting has brought any of
the dead ones back.
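
(To be clear, by "restarting an OSD" I just mean bouncing that one daemon
via the init script on its host, something along the lines of:

    service ceph restart osd.N

with the right id substituted -- not another full "service ceph -a
restart".)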

I'd really like to get to the bottom of it.  Let me know if I can do
anything to help.

I may also have to try completely wiping/rebuilding to see if I can make
this thing usable.


On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <trho...@gmail.com> wrote:

> Hi Sam,
>
> Thanks for being willing to take a look.
>
> I applied the debug settings on one host where 3 out of 3 OSDs have this
> problem, then tried to start them up.  Here are the resulting logs:
>
> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>
>  - Travis
>
>
> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <sam.j...@inktank.com> wrote:
>
>> You appear to be missing pg metadata for some reason.  If you can
>> reproduce it with
>> debug osd = 20
>> debug filestore = 20
>> debug ms = 1
>> on all of the OSDs, I should be able to track it down.
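>>
>> Roughly speaking, that means something like this in the [osd] section
>> of ceph.conf on each OSD host, followed by a daemon restart so the
>> failing boot path ends up in the logs:
>>
>>     [osd]
>>         debug osd = 20
>>         debug filestore = 20
>>         debug ms = 1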
>>
>> I created a bug: #4855.
>>
>> Thanks!
>> -Sam
>>
>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <trho...@gmail.com> wrote:
>> > Thanks Greg.
>> >
>> > I quit playing with it because every time I restarted the cluster
>> > (service ceph -a restart), I lost more OSDs.  The first time it was
>> > 1, the second time 10, the third time 13...  All 13 down OSDs show
>> > the same stack trace.
>> >
>> >  - Travis
>> >
>> >
>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <g...@inktank.com>
>> > wrote:
>> >>
>> >> This sounds vaguely familiar to me, and I see
>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>> >> reproduce" — I think maybe this is fixed in "next" and "master", but
>> >> I'm not sure. For more than that I'd have to defer to Sage or Sam.
>> >> -Greg
>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >>
>> >>
>> >> > On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <trho...@gmail.com>
>> >> > wrote:
>> >> > Hey folks,
>> >> >
>> >> > I'm helping put together a new test/experimental cluster, and hit
>> >> > this today when bringing the cluster up for the first time (using
>> >> > mkcephfs).
>> >> >
>> >> > After doing the normal "service ceph -a start", I noticed one OSD
>> >> > was down, and a lot of PGs were stuck creating.  I tried restarting
>> >> > the down OSD, but it would not come up.  It always had this error:
>> >> >
>> >> >     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>> >> >      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>> >> >
>> >> >  ceph version 0.60-401-g17a3859
>> >> > (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>> >> >  1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>> >> > ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>> >> >  2: (OSD::load_pgs()+0x357) [0x28cba0]
>> >> >  3: (OSD::init()+0x741) [0x290a16]
>> >> >  4: (main()+0x1427) [0x2155c0]
>> >> >  5: (__libc_start_main()+0x99) [0xb69bcf42]
>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> >> > needed to
>> >> > interpret this.
>> >> >
>> >> >
>> >> > I then did a full cluster restart, and now I have ten OSDs down --
>> >> > each showing the same exception/failed assert.
>> >> >
>> >> > Anybody seen this?
>> >> >
>> >> > I know I'm running a weird version -- it's compiled from source,
>> >> > and was provided to me.  The OSDs are all on ARM, and the mon is
>> >> > x86_64.  Just looking to see if anyone has seen this particular
>> >> > stack trace of load_pgs()/peek_map_epoch() before....
>> >> >
>> >> >  - Travis
>> >> >
>> >> > _______________________________________________
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >
>> >
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
