On Tue, Nov 06, 2012 at 12:33:08PM +0000, Michael Kjörling wrote:
> Can btrfs deal reasonably gracefully with sudden shutdowns? (I'm
> mainly thinking of power outages which lead to logical structure
> damage but not physical media damage.)

   In theory (i.e. by the design of the FS), you should be able to
pull the plug on btrfs at any point, and the FS will always be
consistent.

   This makes some assumptions: that writing a single page to the FS
is atomic, and that the hardware reports barriers to the OS reliably,
i.e. if the hardware says it has fully stored the data, then it
actually has.

   There are also some caveats: while the FS should always be
consistent, the latest transaction write may not have been completed,
so you could potentially lose up to 30 seconds of writes to the FS
from immediately before the crash.
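
   For an application that can't tolerate that window for a particular
file, the usual approach is to fsync explicitly rather than wait for
the next transaction commit. A minimal sketch in Python, assuming a
POSIX system; durable_write is a hypothetical helper name, not part of
btrfs or of any library, and it still relies on the hardware
assumptions above:

```python
import os
import tempfile


def durable_write(path, data):
    """Replace `path` with `data` (bytes), flushing to stable storage.

    Hypothetical helper for illustration only. It depends on the
    hardware honouring flushes, as discussed above.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Write to a temporary file in the same directory, so the final
    # rename stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the kernel
            os.fsync(f.fileno())   # ask the kernel to flush to the device
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Don't leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # fsync the containing directory so the rename itself is flushed.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

   After a crash, a reader of the file should then see either the
complete old contents or the complete new contents, not a partial
write.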

   If the FS does corrupt over a power failure, and the hardware can
be demonstrated to be good, then we have a bug that needs to be
tracked down. (There have been a number of these over the development
of the FS so far, but they do get fixed).

> What would be the risk points, file-system-wise?
> 
> Can for example a rotating snapshot schedule mitigate some or all
> issues relating to sudden shutdowns, if any? (_For example_, take a
> snapshot every minute, keeping the last five; if the main file system
> fails to mount, then could the most recent usable snapshot be used as
> a fallback, or is it likely to be equally damaged or inconsistent?)

   No, snapshots give you no additional guarantees -- if the FS
corrupts and is unmountable, a snapshot is part of the same FS and
will also be unmountable.

> Obviously a UPS or other form of fallback power is preferable to no
> UPS if power outages are a concern, so as to allow a controlled system
> shutdown (or fail-over to a more long-term backup power supply) in the
> event of a prolonged power outage, but I'm wondering about situations
> where such don't exist or even fail.

   As I said above, the FS structures _should_ be completely reliable
in the face of power loss; that they haven't been in the past is
definitely a bug, and those bugs have been / are being fixed as
they're found. We've had very few transid match failures recently,
which used to be the main failure mode for these bugs. I don't know
whether that's because people aren't reporting them, or because
they're not happening nearly so often these days. I suspect the
latter.

   I guess the question for you is: are you after the _expected_
behaviour of the FS (should always be consistent on good hardware, but
you may lose up to 30 seconds of writes), or are you after mitigation
strategies in the face of FS bugs (keep off-site backups and be
prepared to use them)?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
        --- emacs:  Eighty Megabytes And Constantly Swapping. ---        
