On Tue, Nov 06, 2012 at 12:33:08PM +0000, Michael Kjörling wrote:
> Can btrfs deal reasonably gracefully with sudden shutdowns? (I'm
> mainly thinking of power outages which lead to logical structure
> damage but not physical media damage.)

In theory (i.e. by the design of the FS), you should be able to pull
the plug on btrfs at any point, and the FS will always be consistent.
This makes some assumptions:

 * That writing a single page to the FS is atomic.
 * That the hardware reports barriers to the OS reliably, i.e. if the
   hardware says it's fully stored data without losing it, then it
   actually has.

There are also some caveats: while the FS should always be
consistent, the latest transaction write may not have been completed,
so you could potentially lose up to 30 seconds of writes to the FS
from immediately before the crash.

If the FS does corrupt over a power failure, and the hardware can be
demonstrated to be good, then we have a bug that needs to be tracked
down. (There have been a number of these over the development of the
FS so far, but they do get fixed.)

> What would be the risk points, file-system-wise?
>
> Can for example a rotating snapshot schedule mitigate some or all
> issues relating to sudden shutdowns, if any? (_For example_, take a
> snapshot every minute, keeping the last five; if the main file system
> fails to mount, then could the most recent usable snapshot be used as
> a fallback, or is it likely to be equally damaged or inconsistent?)

No, snapshots give you no additional guarantees -- if the FS corrupts
and is unmountable, a snapshot is part of the same FS and will also
be unmountable.

> Obviously a UPS or other form of fallback power is preferable to no
> UPS if power outages are a concern, so as to allow a controlled system
> shutdown (or fail-over to a more long-term backup power supply) in the
> event of a prolonged power outage, but I'm wondering about situations
> where such don't exist or even fail.

As I said above, the FS structures _should_ be completely reliable in
the face of power loss; that they haven't been in the past is
definitely a bug, and those bugs have been / are being fixed as
they're found.

We've had very few transid match failures recently, which used to be
the main failure mode for these bugs. I don't know whether that's
because people aren't reporting them, or because they're not
happening nearly so often these days. I suspect the latter.

I guess the question for you is: are you after the _expected_
behaviour of the FS (should always be consistent on good hardware,
but you may lose up to 30 seconds of writes), or are you after
mitigation strategies in the face of FS bugs (keep off-site backups
and be prepared to use them)?

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
       --- emacs: Eighty Megabytes And Constantly Swapping. ---
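
A footnote on the 30-second window mentioned above: an application
that cannot afford to lose a particular write can ask the kernel to
flush it explicitly instead of waiting for the next transaction
commit. The following is only a minimal Python sketch, assuming a
Linux/POSIX system. The name durable_write is hypothetical (it is not
part of btrfs or of any library), and the result is only as reliable
as the hardware assumptions listed above.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, flushing to stable storage before returning.

    Hypothetical helper for illustration; assumes a POSIX system where a
    directory can be opened read-only and passed to fsync().
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Write to a temporary file in the same directory, so that the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the kernel
            os.fsync(f.fileno())   # ask the kernel to push it to the device
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # fsync the directory as well, so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling durable_write("state.bin", b"...") returns only after the
kernel reports the data as stored. If the hardware misreports that,
as discussed under the assumptions above, the write can still be
lost.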