On Wed, 17 Dec 2014 14:05:30 Clint Adams wrote:
> This only seems to occur on one particular machine.

Which may be evidence of a hardware problem such as an overheating CPU, faulty RAM, HDD firmware, or the HDD controller.

> At some point, an OSD will die. I am unable to
> restart the OSD unless I zero out the beginning of
> the journal and then mkjournal it first.

This may be more dangerous than it seems due to Ceph's total disregard for data integrity. I actually lost a whole cluster once when one OSD flushed junk from its journal and spread rubbish to other OSDs, which caused a massive cascade of OSD crashes.

> I am retrying with journal_aio set to false to
> see if this recurs.

I doubt that changing journal_aio will help (please let us know if I'm wrong). I recommend removing the problematic OSD ASAP to avoid bigger problems.

Besides, this problem seems unrelated to Debian, so you have a better chance of getting help from upstream. I was the only active Ceph maintainer in Debian for almost a year, but I'm no longer interested in Ceph (very disappointed), and with my retirement from the team I doubt that bugs like this will receive much attention.

-- 
Best wishes,
 Dmitry Smirnov
 GPG key : 4096R/53968D1B

---

Not a lack of belief, but adherence to false knowledge is the enemy of
progress. And certain that we have found everything worth searching for, we
see no point in further search and inquiry. Believing what is unworthy of
belief, believing falsehood as if it were incontrovertible truth, and sure
that we know everything we will ever need to know, we are worse than
ignorant.
        -- Chester Dolan, "Blind Faith"