I wouldn't say that's a common failure mode. The flaw here is arguably in
the design of the cluster, in that it relied on a single power source.
Power sources fail. Dual power supplies connected to separate A and B power
feeds in the data centre are pretty standard.
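
On the bringup-protocol idea in the quoted message below: that check is
doable from outside Ceph. Here is a minimal sketch, not a built-in Ceph
mechanism, assuming admin credentials for the ceph CLI, probing each OSD
that the map still shows as "up" with "ceph tell osd.N version", and using
an arbitrary 10-second probe timeout:

#!/usr/bin/env python3
"""Post-bringup check: mark unresponsive OSDs down.

Rough sketch only.  Assumes the ceph CLI is installed with admin
credentials and that PROBE_TIMEOUT (an arbitrary choice here) is long
enough for a healthy OSD daemon to answer a 'version' request.
"""
import json
import subprocess

PROBE_TIMEOUT = 10  # seconds; arbitrary for this sketch


def osd_ids_marked_up():
    """Return the ids of OSDs the osdmap currently considers 'up'."""
    dump = subprocess.run(
        ["ceph", "osd", "dump", "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    osdmap = json.loads(dump.stdout)
    return [o["osd"] for o in osdmap["osds"] if o["up"] == 1]


def daemon_responds(osd_id):
    """Probe the daemon directly; a dead process cannot answer."""
    try:
        subprocess.run(
            ["ceph", "tell", f"osd.{osd_id}", "version"],
            check=True, capture_output=True, timeout=PROBE_TIMEOUT,
        )
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False


def main():
    for osd_id in osd_ids_marked_up():
        if not daemon_responds(osd_id):
            # Tell the monitors this OSD is really down so peering and
            # recovery can proceed without waiting for it.
            subprocess.run(["ceph", "osd", "down", str(osd_id)], check=True)
            print(f"marked osd.{osd_id} down")


if __name__ == "__main__":
    main()

Run once, a few minutes after the cluster hosts are powered back on, it
closes the window where dead OSDs linger as "up" because no surviving
peer ever reported them.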

On Tuesday, July 2, 2019, Bryan Henderson <bry...@giraffe-data.com> wrote:
>> Normally in the case of a restart then somebody who used to have a
>> connection to the OSD would still be running and flag it as dead. But
>> if *all* the daemons in the cluster lose their soft state, that can't
>> happen.
>
> OK, thanks.  I guess that explains it.  But that's a pretty serious design
> flaw, isn't it?  What I experienced is a pretty common failure mode: a
> power outage caused the entire cluster to die simultaneously, then when
> power came back, some OSDs didn't (the most common time for a server to
> fail is at startup).
>
> I wonder if I could close this gap with additional monitoring of my own.
> I could have a cluster bringup protocol that detects OSD processes that
> aren't running after a while and marks those OSDs down.  It would be
> cleaner, though, if I could just find out from the monitor what OSDs are
> in the map but not connected to the monitor cluster.  Is that possible?
>
> A related question: If I mark an OSD down administratively, does it stay
> down until I give a command to mark it back up, or will the monitor
> detect signs of life and declare it up again on its own?
>
> --
> Bryan Henderson                                   San Jose, California
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
