Bug#773361: ceph: osd dies, something corrupts journal

Dmitry Smirnov Sun, 15 Mar 2015 17:18:31 -0700

On Wed, 17 Dec 2014 14:05:30 Clint Adams wrote:
> This only seems to occur on one particular machine.


Which may be evidence of a hardware problem, such as an overheating CPU,
faulty RAM, or buggy HDD firmware or a failing HDD controller.


> At some point, an OSD will die.  I am unable to
> restart the OSD unless I zero out the beginning of
> the journal and then mkjournal it first.

It may be more dangerous than it seems due to Ceph's total disregard for data
integrity. I actually lost a whole cluster once when one OSD flushed junk
from its journal and spread rubbish to other OSDs, which caused a massive
cascade of OSD crashes.


> I am retrying with journal_aio set to false to
> see if this recurs.

I doubt that changing journal_aio will help (please let us know if I'm
wrong). I recommend removing the problematic OSD ASAP to avoid bigger
problems.

Besides, this problem seems unrelated to Debian, so you have a better chance
of getting help from upstream. I was the only active Ceph maintainer in Debian
for almost a year, but I'm no longer interested in Ceph (very disappointed),
and with my retirement from the team I doubt that bugs like this will receive
much attention.

-- 
Best wishes,
 Dmitry Smirnov
 GPG key : 4096R/53968D1B

---

Not a lack of belief, but adherence to false knowledge is the enemy of
progress. And certain that we have found everything worth searching for,
we see no point in further search and inquiry. Believing what is
unworthy of belief, believing falsehood as if it were incontrovertible
truth, and sure that we know everything we will ever need to know, we
are worse than ignorant.
        -- Chester Dolan, "Blind Faith"
