> On 17 August 2016 at 23:54, Dan Jakubiec <dan.jakub...@gmail.com> wrote:
> 
> 
> Hi Wido,
> 
> Thank you for the response:
> 
> > On Aug 17, 2016, at 16:25, Wido den Hollander <w...@42on.com> wrote:
> > 
> > 
> >> On 17 August 2016 at 17:44, Dan Jakubiec <dan.jakub...@gmail.com> wrote:
> >> 
> >> 
> >> Hello, we have a Ceph cluster with 8 OSDs that recently lost power to all 8 
> >> machines.  We've managed to recover the XFS filesystems on 7 of the 
> >> machines, but the OSD service is only starting on 1 of them.
> >> 
> >> The other 5 machines all have complaints similar to the following:
> >> 
> >>    2016-08-17 09:32:15.549588 7fa2f4666800 -1 
> >> filestore(/var/lib/ceph/osd/ceph-1) Error initializing leveldb : 
> >> Corruption: 6 missing files; e.g.: 
> >> /var/lib/ceph/osd/ceph-1/current/omap/042421.ldb
> >> 
> >> How can we repair the leveldb to allow the OSDs to start up?
> >> 
> > 
> > My first question would be: How did this happen?
> > 
> > What hardware are you using underneath? Is there a RAID controller which is 
> > not flushing properly? This should not happen during a power failure.
> > 
> 
> Each OSD drive is connected to an onboard hardware RAID controller and 
> configured in RAID 0 mode as individual virtual disks.  The RAID controller 
> is an LSI 3108.
> 

Was that controller in writeback mode without a BBU?

> I agree -- I am finding it bizarre that 7 of our 8 OSDs (one per machine) did 
> not survive the power outage.  
> 

As Christian already asked: was the FS mounted with nobarrier?

> We did have some problems with the stock Ubuntu xfs_repair (3.1.9) 
> segfaulting, which we eventually overcame by building a newer version of 
> xfs_repair (4.7.0).  But it did finally repair clean.
> 

Not good. An xfs_repair should not be required after a power failure. A 
properly mounted journaling filesystem with a good controller underneath should 
mount and just replay its journal.

> We actually have some different errors on other OSDs.  A few of them are 
> failing with "Missing map in load_pgs" errors.  But generally speaking it 
> appears to be missing files of various types causing different kinds of 
> failures.
> 

Missing files are not good; very bad, actually. This should never happen and 
points to something which is not Ceph's fault: a controller in writeback mode, 
the nobarrier mount option, etc.
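
To rule out the nobarrier question quickly, something like the following could 
be run on each OSD host. It is only a rough sketch: the function name 
find_nobarrier_mounts is made up for illustration, and it assumes the usual 
/proc/mounts layout (device, mount point, filesystem type, options) with 
barriers disabled via either "nobarrier" or "barrier=0".

```python
def find_nobarrier_mounts(mounts_text):
    """Return (device, mountpoint, fstype) for mounts with barriers disabled.

    mounts_text is expected to look like the contents of /proc/mounts:
    one mount per line, whitespace-separated fields, options in field 4.
    """
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        opts = options.split(",")
        # Assumption: barriers are turned off with one of these two options.
        if "nobarrier" in opts or "barrier=0" in opts:
            risky.append((device, mountpoint, fstype))
    return risky
```

Passing it the text read from /proc/mounts and printing the result would show 
whether any OSD data filesystem is mounted without write barriers. It says 
nothing about the controller's own cache mode, which has to be checked with the 
vendor tools.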

> I'm really nervous now about the OSD's inability to start with any 
> inconsistencies and no repair utilities (that I can find).  Any advice on how 
> to recover?
> 

I am afraid that you won't be able to recover from this. You are missing 
essential files from the OSDs. Without them they won't be able to start.

Maybe, maybe, maybe something will be able to reconstruct the leveldb of the 
other OSDs with data from the one surviving OSD, but that's a very big maybe.

Wido

> > I don't know the answer to your question, but lost files are not good.
> > 
> > You might find them in a lost+found directory if XFS repair worked?
> > 
> 
> Sadly this directory is empty.
> 
> -- Dan
> 
> > Wido
> > 
> >> Thanks,
> >> 
> >> -- Dan J
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>