My initial experience was similar to Mike's, causing a similar level of paranoia. :-) I'm dealing with RadosGW though, so I can tolerate higher latencies.
I was running my cluster with noout and nodown set for weeks at a time. Recovery of a single OSD might cause other OSDs to crash. In the primary cluster, I was always able to get it under control before it cascaded too wide. In my secondary cluster, it did spiral out to 40% of the OSDs, with 2-5 OSDs down at any time.

I traced my problems to a combination of an osd max backfills value that was too high for my cluster and mkfs.xfs arguments that were causing memory starvation. I lowered osd max backfills, added SSD journals, and reformatted every OSD with better mkfs.xfs arguments. Now both clusters are stable, and I don't want to break that.

I only have 45 OSDs, so the risk that comes with a 24-48 hour recovery time is acceptable to me. It will become a problem as I scale up, but scaling up will also help with the latency problems.

On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson <mike.daw...@cloudapt.com> wrote:
>
> We use 3x replication and have drives with relatively high steady-state
> IOPS. Therefore, we tend to prioritize client-side IO more than a
> reduction from 3 copies to 2 during the loss of one disk. The disruption
> to client IO is so great on our cluster that we don't want it to be in a
> recovery state without operator supervision.
>
> Letting OSDs get marked out without operator intervention was a disaster
> in the early going of our cluster. For example, an OSD daemon crash would
> trigger automatic recovery where none was needed. Ironically, the
> unneeded recovery would often trigger additional daemon crashes, making a
> bad situation worse. During the recovery, rbd client IO would often go
> to 0.
>
> To deal with this issue, we set "mon osd down out interval = 14400", so
> as operators we have 4 hours to intervene before Ceph attempts to
> self-heal. When hardware is at fault, we remove the osd, replace the
> drive, re-add the osd, and then allow backfill to begin, thereby
> completely skipping step B in your timeline above.
>
> - Mike
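
For anyone wanting the concrete knobs, here is a rough sketch of the settings discussed above. The values are illustrative, not recommendations (the backfill value of 1 is just an example of "low"; the down-out interval is the one Mike quoted), and the right numbers depend on your hardware:

    # Pause automatic marking of OSDs down/out while investigating
    # (unset them again once the cluster is healthy):
    ceph osd set noout
    ceph osd set nodown
    ceph osd unset noout
    ceph osd unset nodown

    # Throttle backfill so recovery doesn't starve client IO
    # (inject at runtime; persist the same value in ceph.conf under [osd]):
    ceph tell osd.* injectargs '--osd-max-backfills 1'

    # ceph.conf on the monitors: give operators a 4-hour window before
    # Ceph marks a down OSD out and starts to self-heal on its own
    [mon]
        mon osd down out interval = 14400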