Just adding our feedback: this is affecting us as well. We reboot
nodes periodically to test the durability of the clusters we run, and
this is fairly impactful. I could also see power loss or other
scenarios in which this could end quite poorly for anyone with
less-than-perfect redundancy across multiple racks/PDUs/etc. in their
DCs. I see https://github.com/ceph/ceph/pull/42690 has been submitted,
but I'd definitely argue for treating it as a 'very high' priority so
it hopefully gets a review in time for 16.2.6. :)
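
For anyone following along, here is a rough Python sketch of the kind
of guard the scheduler presumably needs, based on Adam's description
further down in the thread. It is purely illustrative and not
cephadm's actual code; the Host/Daemon records and the offline/drained
flags are my own simplification of what the real scheduler tracks.

# Illustrative sketch only -- not cephadm's actual scheduler code.
# Daemons on hosts that are merely offline (e.g. mid-reboot) should
# never be scheduled for removal; only explicitly drained hosts are.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    offline: bool = False   # temporarily unreachable, e.g. rebooting
    drained: bool = False   # operator explicitly asked to drain it

@dataclass
class Daemon:
    daemon_type: str        # e.g. "mgr", "mon", "osd"
    hostname: str

def daemons_to_remove(daemons, hosts):
    """Return only daemons that may safely be removed.

    Daemons on hosts an operator has drained are candidates; hosts
    that are merely offline are skipped entirely.
    """
    by_name = {h.name: h for h in hosts}
    removable = []
    for d in daemons:
        host = by_name.get(d.hostname)
        if host is None:
            continue        # unknown host: leave the daemon alone
        if host.offline and not host.drained:
            continue        # rebooting host: keep its daemons
        if host.drained:
            removable.append(d)
    return removable

if __name__ == "__main__":
    hosts = [Host("node1", offline=True), Host("node2", drained=True)]
    daemons = [Daemon("mgr", "node1"), Daemon("mon", "node2")]
    # Only the daemon on the explicitly drained host is returned.
    print([(d.daemon_type, d.hostname)
           for d in daemons_to_remove(daemons, hosts)])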

David

On Tue, Aug 10, 2021 at 4:36 AM Sebastian Wagner <sewag...@redhat.com> wrote:
>
> Good morning Robert,
>
> On 10.08.21 at 09:53, Robert Sander wrote:
> > Hi,
> >
> > On 09.08.21 at 20:44, Adam King wrote:
> >
> >> This issue looks the same as https://tracker.ceph.com/issues/51027
> >> which is being worked on. Essentially, it seems that hosts that were
> >> being rebooted were temporarily marked as offline, and cephadm had an
> >> issue where it would try to remove all daemons (outside of OSDs, I
> >> believe) from offline hosts.
> >
> > Sorry if this sounds rude, but how on earth does one come up with
> > the idea of automatically removing components from a cluster,
> > without any operator intervention, just because one node is
> > currently rebooting?
>
> Obviously no one :-). We already have over 750 tests for the cephadm
> scheduler, and I expect we'll add some more covering this case as
> well.
>
> Kind regards,
>
> Sebastian