Kai Krakow posted on Fri, 07 Feb 2014 23:26:34 +0100 as excerpted:

> So the question is: Do btrfs snapshots give the same guarantees on the
> filesystem level that write-barriers give on the storage level which
> exactly those processes rely upon? The cleanest solution would be if
> processes could give btrfs hints about what belongs to their
> transactions so in the moment of a snapshot the data file would be in
> clean state. I guess snapshots are atomic in that way, that pending
> writes will never reach the snapshots just taken, which is good.

Keep in mind that btrfs' metadata is COW-based also.  Like reiser4 in 
this way, in theory at least, commits are atomic -- they've either made it 
to disk or they haven't; there's no halfway state.  Commits at the leaf 
level propagate up the tree, and are not finalized until the top-level 
root node is written.  AFAIK, if there's dirty data to write, btrfs 
triggers a root node commit every 30 seconds (the default commit 
interval).  Until that root is rewritten, it points to the last 
consistent-state root node that was written.  Once it's rewritten, it 
points to the new one, and a new set of writes starts, only to be 
finalized at the next root node write.
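The commit scheme above can be sketched in a few lines.  This is a toy 
model of mine, not btrfs code: nodes are never modified once "written", 
every change builds new nodes from the leaf up to a new root, and the 
commit is a single pointer swap -- so a crash mid-update leaves the old 
tree fully intact.

```python
# Toy sketch (not btrfs code) of why a COW commit is atomic: changes
# write NEW nodes from leaf to root, and the commit is one pointer swap.

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = tuple(children)   # immutable once "written"

def cow_update(root, path, new_value):
    """Return a brand-new root; nothing reachable from `root` changes."""
    if not path:
        return Node(new_value, root.children)
    i = path[0]
    new_child = cow_update(root.children[i], path[1:], new_value)
    children = list(root.children)
    children[i] = new_child
    return Node(root.value, children)

# the "superblock" points at the current committed root
committed_root = Node("root", [Node("a"), Node("b")])

# build the new tree; a crash here leaves committed_root untouched
pending_root = cow_update(committed_root, [0], "a'")
assert committed_root.children[0].value == "a"   # old tree still valid

committed_root = pending_root                    # the one atomic step
```

The key property is that the old root stays valid right up until the 
single pointer assignment at the end -- which is what lets a snapshot or 
a crash recovery fall back to it.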

And I believe that final write simply updates a pointer to point at the 
latest root node.  There's also a history of root nodes; this is what 
the btrfs-find-root tool uses, in combination with btrfs restore if 
necessary, to find a valid root from the root node log when the system 
crashed in the middle of that final update and the pointer ended up 
pointing at garbage.

Meanwhile, I'm a bit blurry on this, but if I understand things 
correctly, between root node writes/full-filesystem-commits there's a 
log of completed (fsynced) transactions at the atomic individual-
transaction level, such that even transactions completed between root 
node writes can normally be replayed.  Of course this is only ~30 
seconds' worth of activity at most, since root node writes should occur 
every 30 seconds, but this log is what btrfs-zero-log zeroes out, if/
when needed.  You'll lose those few seconds of log replay since the last 
root node write, but if the log was garbage because it was being written 
when the system actually went down, dropping those few extra seconds of 
log can allow the filesystem to mount properly from the last full root 
node commit, where it couldn't otherwise.
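That trade-off -- replay the log when it's good, discard it and fall 
back to the last full commit when it's not -- can be sketched as below.  
A toy model of mine, assuming a simple key/value "state" standing in for 
filesystem contents:

```python
# Toy sketch of the between-commits log: a full commit captures the
# whole state; fsync'd changes since then sit in a small log that a
# normal mount replays -- or that gets discarded (a la btrfs-zero-log),
# falling back to the last consistent commit.

committed_state = {"file_a": 1}        # state as of the last root commit
log = [("file_b", 2), ("file_a", 3)]   # fsync'd changes since then

def mount(state, log, log_is_garbage=False):
    state = dict(state)                # never touch the committed copy
    if not log_is_garbage:
        for key, value in log:         # normal mount: replay the log
            state[key] = value
    return state                       # zeroed log: lose ~30s of writes

assert mount(committed_state, log) == {"file_a": 3, "file_b": 2}
assert mount(committed_state, log, log_is_garbage=True) == {"file_a": 1}
```

Either way the result is a consistent state; the only question is 
whether those last few seconds of fsync'd writes survive.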

It's actually those metadata trees and the atomic root-node-commit 
feature that btrfs snapshots depend on, and why they're normally so fast 
to create.  When a snapshot is taken, btrfs simply keeps a record of the 
current root node instead of letting it recede into history and fall off 
the end of the root node log, labeling that record with the snapshot's 
name for humans as well as the object-ID that btrfs uses internally.  
That root node is by definition a record of the filesystem in a 
consistent state, so any snapshot that references it is, by the same 
definition, in a consistent state.
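In the toy terms used above, a snapshot is nothing more than pinning a 
name to the current (immutable) root, which is why it's O(1) and copies 
no data.  A minimal sketch of mine, with a dict standing in for a root 
node:

```python
# Toy sketch: committed roots are never modified in place, so a snapshot
# is just a named, pinned reference to the current root.

current_root = {"tree": "root@gen5"}   # stands in for an immutable root
snapshots = {}

def take_snapshot(name):
    snapshots[name] = current_root      # O(1): pin the root, copy nothing

take_snapshot("daily-2014-02-07")

# later writes produce a *new* root; the snapshot still sees the old one
current_root = {"tree": "root@gen6"}
assert snapshots["daily-2014-02-07"]["tree"] == "root@gen5"
```

Since the pinned root was only ever written as part of a consistent 
commit, the snapshot inherits that consistency for free.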

So normally, files in the process of being written out (created) simply 
won't appear in the snapshot.  Preexisting files will of course appear 
(and fallocated files are simply the blanked-out special case of 
preexisting), but again, for normal COW-based files at least, they'll 
exist in a state either from before the latest transaction started or 
from after it finished.  This is where fsync comes in, since that's how 
userspace apps communicate transaction boundaries to the filesystem.
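Concretely, an application marks a transaction boundary by flushing its 
own buffers and then calling fsync, and only treats the transaction as 
committed once fsync returns.  A minimal example (standard Python 
os.fsync, on a throwaway temp file):

```python
# How an app marks a transaction boundary to the filesystem: write,
# flush userspace buffers, then fsync before considering it committed.

import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "w") as f:
    f.write("transaction payload\n")
    f.flush()                 # push Python's buffer into the kernel
    os.fsync(f.fileno())      # ask the kernel to make it durable
# only now may the app treat the transaction as committed

with open(path) as f:
    data = f.read()
assert data == "transaction payload\n"
```

On btrfs, that fsync is also what lands the transaction in the log 
discussed above, so it can be replayed even if the next full commit 
never happens.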

And of course, in addition to COW, btrfs normally does checksumming as 
well, and again, the filesystem, including those checksums, will be 
self-consistent when a root node is written -- or the root node won't be 
written until the filesystem /is/ self-consistent.  If data fails 
checksum verification when btrfs reads it back, btrfs defines that data 
as garbage and will refuse to use it.  If there's a second copy 
somewhere (as with raid1 mode), it'll try to restore from that copy.  
If it can't, btrfs will return an error and simply won't let you access 
that file.
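The verify-then-fall-back-to-mirror logic looks roughly like this toy 
sketch (mine, using crc32 as a stand-in for btrfs' actual checksums):

```python
# Toy sketch of checksummed reads with a raid1-style fallback: data that
# fails its checksum is treated as garbage, the mirror copy is tried
# next, and an error is returned only if no copy verifies.

import zlib

def write_block(data):
    return (zlib.crc32(data), data)            # store checksum with data

def read_block(copies):
    for checksum, data in copies:              # try each mirror in turn
        if zlib.crc32(data) == checksum:
            return data                        # first copy that verifies
    raise IOError("checksum mismatch on all copies")

good = write_block(b"hello")
bad = (good[0], b"hellO")                      # bit-rotted primary copy

assert read_block([bad, good]) == b"hello"     # healed from the mirror
try:
    read_block([bad, bad])                     # no good copy anywhere
except IOError:
    pass                                       # ... so an error, not garbage
```

The point is the failure mode: you get either verified data or an 
error, never silently corrupted contents.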

So one way or another, a snapshot is deterministic and atomic.  No 
partial transactions, at least on ordinary COW and checksummed files.

Which brings us to NOCOW files -- and for btrfs, NOCOW also turns off 
checksumming.  Btrfs writes these files in place, and as a result the 
transaction-integrity guarantee that ordinary files get doesn't apply 
to them.

*HOWEVER*, the situation isn't as bad as it might seem, because most 
files for which NOCOW is recommended (database files, VM images, pre-
allocated torrent files, etc.) are created and managed by applications 
that already have their own data-integrity management/verification/
repair methods, since they're designed to work on filesystems without 
the data integrity guarantees btrfs normally provides.

In fact, it's possible, even likely in case of a crash, that the 
application's own data integrity mechanisms will fight with those of 
btrfs.  Letting btrfs scrub restore what it thinks is a good copy can 
actually interfere with the application's own integrity and repair 
functionality: the application often goes to quite some lengths to 
repair damage, or simply reverts to a checkpoint if it has to, but it 
doesn't expect the filesystem to be making such changes underneath it 
and isn't prepared to deal with filesystems that do!  There have in 
fact been several reports to the list of what appears to be exactly 
that happening.

So in fact it's often /better/ to turn off both COW and checksumming via 
NOCOW, if you know your application manages such things itself.  That 
way the filesystem doesn't try to repair the damage after a crash, 
leaving the application's own functionality free to handle it and 
repair or roll back as it was designed to do.

That's with crashes.  The one quirk left to deal with is how snapshots 
handle NOCOW files.  As explained earlier, snapshots leave a NOCOW file 
as-is initially, but will COW each block ONCE, the first time that 
snapshotted NOCOW file-block is written to, thus diverging it from the 
shared version.

A snapshot thus looks much like a crash in terms of NOCOW file 
integrity, since the blocks of a NOCOW file are simply snapshotted in 
place, and there's already no checksumming or integrity verification on 
such files -- they're simply written in place directly (with the 
exception of the single COW write when a written-to, snapshotted NOCOW 
file block diverges from the shared snapshot version).
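That COW-once behaviour can be sketched like so -- a toy model of mine, 
with a Python list cell standing in for a shared on-disk block:

```python
# Toy sketch of COW-once: a NOCOW file's blocks stay shared with the
# snapshot until the first post-snapshot write, which copies the block
# one time; later writes to that block go in place again.

blocks = {0: ["v1"]}             # block number -> storage cell
live = {0: blocks[0]}            # live file references the cell
snapshot = dict(live)            # snapshot shares the same cell

def write_nocow(live, snapshot, n, data):
    if live[n] is snapshot.get(n):   # block still shared with a snapshot?
        live[n] = [data]             # COW once: diverge into a new cell
    else:
        live[n][0] = data            # NOCOW steady state: overwrite in place

write_nocow(live, snapshot, 0, "v2")   # first write after snapshot: copies
cell_after_divergence = live[0]
write_nocow(live, snapshot, 0, "v3")   # second write: in place, no new cell

assert snapshot[0] == ["v1"]           # snapshot keeps its frozen block
assert live[0] is cell_after_divergence and live[0] == ["v3"]
```

Note the snapshot's copy is frozen exactly as it was on disk at snapshot 
time -- which, for a NOCOW file mid-write, is precisely the crash-like 
state discussed next.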

But as I said, the applications themselves are normally designed to 
handle and recover from crashes, and in fact, having btrfs try to manage 
it too only complicates things and can actually make it impossible for 
the app to recover what it would have otherwise recovered just fine.

So it should be with these NOCOW in-place snapshotted files, too.  If a 
NOCOW file is put back into operation from a snapshot, and the file was 
being written to at snapshot time, it'll very likely trigger exactly 
the same response from the application as a crash while writing would 
have triggered.  But the point is, such applications are normally 
designed to deal with just that, and thus should recover just as they 
would from a crash.  If they could recover from a crash, it shouldn't 
be an issue.  If they couldn't, well...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
