Martin Steigerwald posted on Sun, 12 Oct 2014 12:14:01 +0200 as excerpted:

> I always thought with a controller and device and driver combination
> that honors fsync with BTRFS it would either be the new state or the
> last known good state *anyway*. So where does the need to rollback
> arise from?
My understanding here is...

With btrfs, a full-tree commit is atomic: you should get either the old tree or the new tree. However, due to the cascading nature of updates in copy-on-write structures, these full-tree commits are done by default only every 30 seconds (there's a mount option to adjust the interval). Between these atomic commits, partial updates may have occurred. The btrfs log (the one that btrfs-zero-log kills) is limited to between-commit updates, and thus to the up to 30 seconds (default) worth of changes since the last full-tree atomic commit.

In addition to that, a history of tree-root commits is kept, with the superblocks pointing to the most recent one. btrfs-find-root can be used to list this history. The recovery mount option simply allows btrfs to fall back to this history, should the current root be corrupted. btrfs restore can be used to list tree roots as well, and can be pointed at an appropriate one if necessary.

Fsync forces the file and its corresponding metadata update to the log, and barring hardware or software bugs should not return until it's safely in the log, but I'm not sure whether it forces a full-tree commit. Either way the guarantees should be the same: if the log can be replayed, or a full-tree commit has occurred since the fsync, the new copy should appear. If the log can't be replayed, the rollback to the last atomic tree commit should return an intact copy of the file as of that point. If the recovery mount option is used and a further rollback to an earlier full-tree commit is forced, then provided the file existed at that commit, the intact file as of that point should appear.

So if the current tree root is a good one, the log will replay the last up-to-30-seconds of activity on top of that last atomic tree root. If the current root tree itself is corrupt, the recovery mount option will let an earlier one be used. Obviously in that case the log will be discarded, since it applies to a later root tree that has itself been discarded.
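As a sketch, the knobs described above look roughly like this from the command line (device names and mount points are hypothetical, and exact option spellings depend on your kernel and btrfs-progs versions):

```shell
# Shorten the window between atomic full-tree commits (default 30s):
mount -o commit=15 /dev/sdb1 /mnt/data

# List the historical tree roots on the (unmounted) device:
btrfs-find-root /dev/sdb1

# Point btrfs restore at an earlier root, using a bytenr taken from
# the listing above, copying what it can recover into /mnt/rescue:
btrfs restore -t <bytenr> /dev/sdb1 /mnt/rescue

# Or let the kernel itself fall back to an earlier root at mount time:
mount -o recovery /dev/sdb1 /mnt/data

# Last resort only -- this discards the up-to-30-seconds of logged
# updates made since the last full-tree commit:
btrfs-zero-log /dev/sdb1
```

These are administrative commands against a real block device, so treat them as a config fragment rather than something to run blindly; in particular, zeroing the log throws data away by design.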
The debate is whether recovery should be automated, so the admin doesn't have to care about it, or whether having to manually add that option serves as a necessary notifier to the admin that something /did/ go wrong and that an earlier root is being used instead, so more than a few seconds' worth of data may have disappeared.

As someone else has already suggested, I'd argue that as long as btrfs continues to be under the sort of development it's under now, keeping recovery as a non-default option is desirable. Once btrfs is optimized and considered stable, arguably recovery should be made the default, perhaps with a no-recovery option for those who prefer that in-the-face notification, in the form of a mount error, whenever btrfs would otherwise fall back to an earlier tree-root commit.

What worries me, however, is that IMO the recent warning stripping was premature. Btrfs is certainly NOT fully stable or optimized for normal use at this point. We're still using the even/odd-PID balancing scheme for raid1 reads, for instance, and multi-device writes are still serialized when they could be parallelized to a much larger degree (though keeping some serialization is arguably good for data safety). Optimizing that now would arguably be premature optimization, since the code itself is still subject to change, so I'm not complaining. But by that very same token, the code *IS* still subject to change, which by definition means it's *NOT* stable, so why are we removing all the warnings and giving the impression that it IS stable?

The decision wasn't mine to make and I don't know, but while it's a nice suggestion, making recovery-by-default the measure of when btrfs goes stable simply won't work: surely the same folks behind the warning stripping would ensure that this indicator, too, said btrfs was stable, while the state of the code itself continued to say otherwise.
Meanwhile, if your distributed-transactions scenario doesn't account for a crash and loss of data on one side with real-time backup/redundancy, such that the loss of a few seconds' worth of transactions on a single local filesystem kills the entire scenario, I don't think too much of that scenario in the first place. Regardless, btrfs, certainly in its current state, is definitely NOT an appropriate base for it. Use appropriate tools for the task; btrfs, at least at this point, is simply not an appropriate tool for that one.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman