On the OSD node:

root@cepha0:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:        12.10
Codename:       quantal

root@cepha0:~# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name               Version                  Architecture  Description
+++-==================-========================-=============-==============================
ii  libleveldb1:armhf  0+20120530.gitdd0d562-2  armhf         fast key-value storage library

root@cepha0:~# uname -a
Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013 armv7l armv7l armv7l GNU/Linux
On the MON node:

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.10
Release:        12.10
Codename:       quantal

# uname -a
Linux 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                  Version                  Architecture  Description
+++-=====================-========================-=============-====================================================
un  leveldb-doc           <none>                                 (no description available)
ii  libleveldb-dev:amd64  0+20120530.gitdd0d562-2  amd64         fast key-value storage library (development files)
ii  libleveldb1:amd64     0+20120530.gitdd0d562-2  amd64         fast key-value storage library

On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just <sam.j...@inktank.com> wrote:
> What version of leveldb is installed?  Ubuntu/version?
> -Sam
>
> On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden <trho...@gmail.com> wrote:
> > Interestingly, the down OSD does not get marked out after 5 minutes.
> > Probably that is already fixed by http://tracker.ceph.com/issues/4822.
> >
> >
> > On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <trho...@gmail.com> wrote:
> >>
> >> Hi Sam,
> >>
> >> I was prepared to write in and say that the problem had gone away.  I
> >> tried restarting several OSDs last night in the hopes of capturing the
> >> problem on an OSD that hadn't failed yet, but didn't have any luck.  So I
> >> did indeed re-create the cluster from scratch (using mkcephfs), and what do
> >> you know -- everything worked.  I got everything into a nice stable state,
> >> then decided to do a full cluster restart, just to be sure.  Sure enough,
> >> one OSD failed to come up, and it has the same stack trace.  So I believe I
> >> have the log you want -- just from the OSD that failed, right?
> >>
> >> Question -- any feeling for which parts of the log you need?  It's 688MB
> >> uncompressed (two hours!), so I'd like to be able to trim some off for you
> >> before making it available.  Do you only need/want the part from after the
> >> OSD was restarted?  Or perhaps the corruption happens on OSD shutdown and
> >> you need some from before that?  If you are fine with that large a file, I
> >> can just make the whole thing available too.  Let me know.
> >>
> >> - Travis
> >>
> >>
> >> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>
> >>> Hi Sam,
> >>>
> >>> No problem, I'll leave that debugging turned up high, do a mkcephfs
> >>> from scratch, and see what happens.  Not sure if it will happen again or not.
> >>> =)
> >>>
> >>> Thanks again.
> >>>
> >>> - Travis
> >>>
> >>>
> >>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <sam.j...@inktank.com> wrote:
> >>>>
> >>>> Hmm, I need logging from when the corruption happened.  If this is
> >>>> reproducible, can you enable that logging on a clean osd (or better, a
> >>>> clean cluster) until the assert occurs?
> >>>> -Sam
> >>>>
> >>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> > Also, I can note that it does not take a full cluster restart to
> >>>> > trigger this.  If I just restart an OSD that was up/in previously, the same
> >>>> > error can happen (though not every time).  So restarting OSDs for me is a
> >>>> > bit like Russian roulette.
> >>>> > =)  Even though restarting an OSD may not always result in the
> >>>> > error, it seems that once it happens that OSD is gone for good.
> >>>> > No amount of restarting has brought any of the dead ones back.
> >>>> >
> >>>> > I'd really like to get to the bottom of it.  Let me know if I can do
> >>>> > anything to help.
> >>>> >
> >>>> > I may also have to try completely wiping/rebuilding to see if I can
> >>>> > make this thing usable.
> >>>> >
> >>>> >
> >>>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> >>
> >>>> >> Hi Sam,
> >>>> >>
> >>>> >> Thanks for being willing to take a look.
> >>>> >>
> >>>> >> I applied the debug settings on one host that has 3 out of 3 OSDs with
> >>>> >> this problem.  Then I tried to start them up.  Here are the resulting logs:
> >>>> >>
> >>>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
> >>>> >>
> >>>> >> - Travis
> >>>> >>
> >>>> >>
> >>>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <sam.j...@inktank.com> wrote:
> >>>> >>>
> >>>> >>> You appear to be missing pg metadata for some reason.  If you can
> >>>> >>> reproduce it with
> >>>> >>> debug osd = 20
> >>>> >>> debug filestore = 20
> >>>> >>> debug ms = 1
> >>>> >>> on all of the OSDs, I should be able to track it down.
> >>>> >>>
> >>>> >>> I created a bug: #4855.
> >>>> >>>
> >>>> >>> Thanks!
> >>>> >>> -Sam
> >>>> >>>
> >>>> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> >>> > Thanks Greg.
> >>>> >>> >
> >>>> >>> > I quit playing with it because every time I restarted the cluster
> >>>> >>> > (service ceph -a restart), I lost more OSDs.  First time it was 1, the
> >>>> >>> > 2nd 10, the 3rd time 13...  All 13 down OSDs show the same stack trace.
> >>>> >>> >
> >>>> >>> > - Travis
> >>>> >>> >
> >>>> >>> >
> >>>> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <g...@inktank.com> wrote:
> >>>> >>> >>
> >>>> >>> >> This sounds vaguely familiar to me, and I see
> >>>> >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
> >>>> >>> >> reproduce" — I think maybe this is fixed in "next" and "master", but
> >>>> >>> >> I'm not sure.  For more than that I'd have to defer to Sage or Sam.
> >>>> >>> >> -Greg
> >>>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>> >>> >>
> >>>> >>> >>
> >>>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <trho...@gmail.com> wrote:
> >>>> >>> >> > Hey folks,
> >>>> >>> >> >
> >>>> >>> >> > I'm helping put together a new test/experimental cluster, and hit
> >>>> >>> >> > this today when bringing the cluster up for the first time (using
> >>>> >>> >> > mkcephfs).
> >>>> >>> >> >
> >>>> >>> >> > After doing the normal "service ceph -a start", I noticed one OSD
> >>>> >>> >> > was down, and a lot of PGs were stuck creating.  I tried restarting
> >>>> >>> >> > the down OSD, but it wouldn't come up.
> >>>> >>> >> > It always had this error:
> >>>> >>> >> >
> >>>> >>> >> >     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
> >>>> >>> >> >      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
> >>>> >>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> >>>> >>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
> >>>> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
> >>>> >>> >> >
> >>>> >>> >> > ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
> >>>> >>> >> > 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::buffer::list*)+0x1ad) [0x2c3c0a]
> >>>> >>> >> > 2: (OSD::load_pgs()+0x357) [0x28cba0]
> >>>> >>> >> > 3: (OSD::init()+0x741) [0x290a16]
> >>>> >>> >> > 4: (main()+0x1427) [0x2155c0]
> >>>> >>> >> > 5: (__libc_start_main()+0x99) [0xb69bcf42]
> >>>> >>> >> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>>> >>> >> > needed to interpret this.
> >>>> >>> >> >
> >>>> >>> >> >
> >>>> >>> >> > I then did a full cluster restart, and now I have ten OSDs down --
> >>>> >>> >> > each showing the same exception/failed assert.
> >>>> >>> >> >
> >>>> >>> >> > Anybody seen this?
> >>>> >>> >> >
> >>>> >>> >> > I know I'm running a weird version -- it's compiled from source, and
> >>>> >>> >> > was provided to me.  The OSDs are all on ARM, and the mon is x86_64.
> >>>> >>> >> > Just looking to see if anyone has seen this particular stack trace of
> >>>> >>> >> > load_pgs()/peek_map_epoch() before....
> >>>> >>> >> >
> >>>> >>> >> > - Travis
> >>>> >>> >> >
> >>>> >>> >> > _______________________________________________
> >>>> >>> >> > ceph-users mailing list
> >>>> >>> >> > ceph-users@lists.ceph.com
> >>>> >>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>> >>> >> >
> >>>> >>> >
> >>>> >>> >
> >>>> >>
> >>>> >>
> >>>> >
> >>>
> >>>
> >>
> >
> >
>
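P.S. For anyone else following along and trying to reproduce this with the logging Sam asked for, here is a minimal sketch of how I understand those debug settings would go into ceph.conf. Putting them under [osd] is my assumption; [global] would also work if you want the mons at higher verbosity too, and the daemons need to be restarted after the edit so the logging covers startup, which is the window Sam needs.

    [osd]
        ; verbose logging for tracking down the peek_map_epoch assert
        debug osd = 20
        debug filestore = 20
        debug ms = 1

Be warned that these levels generate a lot of output (my two-hour log was 688MB), so plan for disk space on the OSD hosts before leaving them on.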
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com