Hi Patrick.

> To be clear in case there is any confusion: once you do `fs fail`, the
> MDS are removed from the cluster and they will respawn. They are not
> given any time to flush remaining I/O.

This is fine, there is not enough time to flush anything anyway. As long as they 
leave the metadata and data pools in a consistent state, that is, after an "fs set 
<fs_name> joinable true" the MDSes replay the journal etc. and the FS comes up 
healthy, everything is fine. If user IO in flight gets lost in this process, that 
is not a problem. A problem would be corruption of the file system itself.
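
For reference, the cycle I mean is just the standard commands (a minimal sketch; 
"cephfs" stands in for our actual FS name):

  # take the FS down hard; all ranks are marked failed and the MDSes respawn as standbys
  ceph fs fail cephfs

  # later, when we want the FS back: let MDSes take ranks again,
  # they replay the journal and the FS should come up healthy
  ceph fs set cephfs joinable true

  # check that all ranks are active again before letting clients back in
  ceph fs status cephfs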

In my experience, an mds fail is a clean (non-destructive) operation; I have never 
had an FS corruption due to an mds fail. As long as an "fs fail" is also 
non-destructive, it is the best way I can see to cut off all user IO as fast as 
possible and bring all hardware to rest. What I would like to avoid is a power 
loss on a busy cluster, where I would have to rely on too many things being 
implemented correctly. With >800 disks you start seeing unusual firmware failures, 
and disk failures after power-up are not uncommon either. I just want to take as 
much as possible out of the "does this really work in all corner cases" equation 
and instead rely on "I did this 100 times in the past without a problem" 
situations.

That users may have to repeat a task is not a problem. Damaging the file system 
itself, on the other hand, is.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonn...@redhat.com>
Sent: 25 October 2022 14:51:33
To: Frank Schilder
Cc: Dan van der Ster; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Temporary shutdown of subcluster and cephfs

On Tue, Oct 25, 2022 at 3:48 AM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Patrick,
>
> thanks for your answer. This is exactly the behaviour we need.
>
> For future reference some more background:
>
> We need to prepare quite a large installation for planned power outages. Even 
> though they are called planned, we will not be able to handle these manually 
> in good time, for reasons irrelevant here. Our installation is protected by a 
> UPS, but the guaranteed uptime on outage is only 6 minutes. So, we are talking 
> more about transient protection than uninterrupted power supply. Although we 
> have survived more than 20-minute power outages without loss of power to the 
> DC, we need to plan for these 6 minutes.
>
> In these 6 minutes, we need to wait for at least 1-2 minutes to avoid 
> unintended shutdowns. In the remaining 4 minutes, we need to take down a 
> 500-node HPC cluster and a 1000-OSD+12-MDS+2-MON ceph sub-cluster. Part of 
> this ceph cluster will continue running on another site with higher power 
> redundancy. This leaves maybe 1-2 minutes of response time for the ceph 
> cluster, and the best we can do is try to achieve a "consistent at rest" state 
> and hope we can cleanly power down the system before the power is cut.
>
> Why am I so concerned about a "consistent at rest" state?
>
> It's because, while not all instances of a power loss lead to data loss, all 
> instances of data loss I know of that were not caused by admin errors were 
> caused by a power loss (see https://tracker.ceph.com/issues/46847). We were 
> asked to prepare for a worst case of weekly power cuts, so there is no room 
> for taking too many chances here. Our approach is: unmount as much as 
> possible, quickly fail the FS to stop all remaining IO, give OSDs and MDSes a 
> chance to flush pending operations to disk or journal, and then attempt a 
> clean shutdown.

To be clear in case there is any confusion: once you do `fs fail`, the
MDS are removed from the cluster and they will respawn. They are not
given any time to flush remaining I/O.

FYI as this may interest you: we have a ticket to set a flag on the
file system to prevent new client mounts:
https://tracker.ceph.com/issues/57090

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
