I wouldn't say that's a pretty common failure. The flaw here, perhaps, is the design of the cluster: it was relying on a single power source. Power sources fail. Dual power supplies connected to A and B power feeds in the data centre are pretty standard.
On Tuesday, July 2, 2019, Bryan Henderson <bry...@giraffe-data.com> wrote:
>> Normally in the case of a restart then somebody who used to have a
>> connection to the OSD would still be running and flag it as dead. But
>> if *all* the daemons in the cluster lose their soft state, that can't
>> happen.
>
> OK, thanks. I guess that explains it. But that's a pretty serious design
> flaw, isn't it? What I experienced is a pretty common failure mode: a power
> outage caused the entire cluster to die simultaneously, then when power came
> back, some OSDs didn't (the most common time for a server to fail is at
> startup).
>
> I wonder if I could close this gap with additional monitoring of my own. I
> could have a cluster bringup protocol that detects OSD processes that aren't
> running after a while and mark those OSDs down. It would be cleaner, though,
> if I could just find out from the monitor what OSDs are in the map but not
> connected to the monitor cluster. Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay down
> until I give a command to mark it back up, or will the monitor detect signs of
> life and declare it up again on its own?
>
> --
> Bryan Henderson                                  San Jose, California
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
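As for the bring-up check described above, something along these lines, run on each OSD host a little while after boot, is roughly what it could look like. This is only an untested sketch: it assumes the stock `ceph` CLI can reach the monitors from that host, the default /var/lib/ceph/osd/ceph-<id> data directory layout, and systemd units named ceph-osd@<id>; adjust those assumptions for your deployment.

    #!/usr/bin/env python3
    """Untested sketch of a post-boot OSD check, not a supported tool.

    After a grace period following power-on, look at the OSDs hosted on this
    machine; if the cluster map still shows one of them as "up" but its daemon
    never started, mark it down so recovery is not held up waiting for it.
    Assumptions: `ceph` CLI on PATH, default /var/lib/ceph/osd/ceph-<id>
    layout, and daemons managed as systemd units named ceph-osd@<id>.
    """
    import json
    import os
    import re
    import subprocess

    OSD_DIR = "/var/lib/ceph/osd"   # default layout; change for your setup

    def local_osd_ids():
        """OSD ids whose data directories exist on this host."""
        ids = []
        for name in os.listdir(OSD_DIR):
            m = re.match(r"ceph-(\d+)$", name)
            if m:
                ids.append(int(m.group(1)))
        return ids

    def up_in_map():
        """Set of OSD ids the monitors currently consider 'up'."""
        out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
        return {o["osd"] for o in json.loads(out)["osds"] if o["up"]}

    def daemon_active(osd_id):
        """True if the local ceph-osd@<id> systemd unit is running."""
        return subprocess.call(
            ["systemctl", "is-active", "--quiet", f"ceph-osd@{osd_id}"]) == 0

    def main():
        still_up = up_in_map()
        for osd_id in local_osd_ids():
            if osd_id in still_up and not daemon_active(osd_id):
                # 'ceph osd down' only clears the up flag; if the daemon
                # later starts and reports in, it will be marked up again.
                subprocess.check_call(["ceph", "osd", "down", str(osd_id)])

    if __name__ == "__main__":
        main()

That last comment also touches on the related question: `ceph osd down` is not sticky. If the daemon is running (or starts later) and reports to the monitors, the OSD gets marked up again on its own; you don't have to issue a command to bring it back.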