On 2016-07-21 09:34, Chris Murphy wrote:
On Thu, Jul 21, 2016 at 6:46 AM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:
On 2016-07-20 15:58, Chris Murphy wrote:

On Sun, Jul 17, 2016 at 3:08 AM, Hendrik Friedel <hend...@friedels.name>
wrote:

Well, btrfs writes data very differently from many other file systems. On
every write, the data is copied to another place, even if just one bit
changed. That's special, and I am wondering whether that could cause
problems.


It depends on the application. In practice, the program most
responsible for writing the file often does a faux-COW by writing a
whole new (temporary) file somewhere; when that operation completes,
it then deletes the original and move+renames the temporary one into
its place, doing an fsync between each of those operations. I think
some of this is done via the VFS also. It's all much more
metadata-centric than what Btrfs would do on its own.

I'm pretty certain that the VFS itself does not do replace-by-rename
type stuff.

I can't tell what does it. But so far every program I've tried (vi,
gedit, GIMP) writes out a new file: it has a different inode number
and every extent has a different address. It would rather surprise me
if every program reimplemented this faux-COW instead of just letting
the VFS do it for everyone. I think that since ancient times,
literally overwriting files has been a bad idea that pretty much
guarantees loss of both old and new data if something interrupts the
overwrite.
This really isn't fake COW; it's COW, just at a higher level than most programmers would think of it. The rename-to-replace is the pointer update, and the copy granularity varies with the size of the file.
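Your observation above is easy to reproduce in a few lines. A sketch (a deliberately simplified save with a predictable temporary name, purely for illustration; real editors randomize it):

```python
import os
import tempfile

# Simulate an editor's replace-by-rename save and watch the
# inode number change, just as seen with vi, gedit, and GIMP.
d = tempfile.mkdtemp()
path = os.path.join(d, "file.txt")
with open(path, "w") as f:
    f.write("original")
before = os.stat(path).st_ino

# The save: write a whole new file, then rename it over the old one.
tmp = path + ".tmp"
with open(tmp, "w") as f:
    f.write("edited")
os.rename(tmp, path)

after = os.stat(path).st_ino
# after != before: the "same" file is now a different inode,
# and every extent lives at a different address.
```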

The whole practice is used by just about everything, and dates back to before even SVR4, because traditional filesystems will corrupt files if they're being written when a power loss or crash occurs. It's also popular because it breaks hard links, which have often been used as a poor man's form of deduplication. Even on newer journaled filesystems, things aren't always safe across a power loss if you don't do this. It can't legitimately be done in the VFS though, because POSIX requires that the inode not change if the file is just overwritten or rewritten in place.

Vi (which is probably vim on your system, although all other implementations I know of do likewise) does this by itself. Most graphical applications have it happen through libraries they link to (I know for a fact that Qt has an option to do this, and I'm pretty certain GLib does too, but I don't know whether they do it by default). In general though, it's really not all that much duplicated code, maybe 20 lines tops, assuming they don't use predictable file names and open-code the temporary name generation.
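For reference, those ~20 lines look roughly like this. A minimal sketch (function name and details are illustrative, not any particular editor's or library's implementation):

```python
import os
import tempfile

def atomic_save(path, data):
    """Replace-by-rename: write a new file, fsync it, then rename it
    over the original, so readers see either the old or the new
    contents, never a torn write."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Unpredictable temporary name, created in the same directory
    # (and therefore the same filesystem, so rename() stays atomic).
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # new contents safely on disk
        os.rename(tmp, path)       # the atomic "pointer update"
    except BaseException:
        os.unlink(tmp)             # don't leave the temp file behind
        raise
    # fsync the directory so the rename itself survives a crash
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

A crash before the rename leaves the original untouched; a crash after it leaves the complete new file in place.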

BTRFS by nature technically does this, though; it's the same idea as a COW
update, just at a higher level, so we're technically doing the same thing
for every single block that changes.  The only issue I can think of in this
context with a replace-by-rename is that you end up hitting the metadata
trees twice.

Do programs have a way to communicate what portion of a data file is
modified, so that only changed blocks are COW'd? When I change a
single pixel in a 400MiB image and do a save (to overwrite the
original file), it takes just as long to overwrite as to write it out
as a new file. It'd be neat if that could be optimized but I don't see
it being the case at the moment.
AFAIUI, in BTRFS (and also ZFS), whatever blocks get rewritten get COW'ed. So rewriting the whole file will COW the whole file, not just the blocks that are different. Trying to check in the FS itself what changed is actually rather inefficient (you will almost always spend more time comparing data than you would save by not writing it all out, if you're using fast storage, and every write would potentially imply a huge number of reads), and relying on the application to tell us is dangerous.

That said, most of the required infrastructure is already present in the in-band deduplication stuff, and in fact it may do this for files that get rewritten frequently enough that they don't get pushed out of its cache (I haven't tested this, and I don't have the time or expertise to read through the code to see if it will, but based on my current understanding of how it works, it should do this implicitly).

The whole thing is a trade-off though: only COW'ing the parts that changed leads to higher levels of fragmentation. That's part of why database and disk-image files have such issues with fragmentation, and why making them NOCOW helps; they only get spot rewrites.
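At the application level, a program that already knows which byte range changed can overwrite just that range in place (no truncate, no rename), and only the extents covering that range get COW'ed rather than the whole file. A sketch (hypothetical helper, not a btrfs-specific API; it trades the crash safety of replace-by-rename for less rewriting, which is why databases do this on NOCOW files):

```python
import os

def patch_in_place(path, offset, new_bytes):
    """Overwrite only the changed byte range of an existing file.
    On a COW filesystem, only the extents covering
    [offset, offset + len(new_bytes)) are rewritten; the cost,
    as noted above, is fragmentation over time."""
    fd = os.open(path, os.O_WRONLY)  # no O_TRUNC: untouched extents survive
    try:
        os.pwrite(fd, new_bytes, offset)  # positional write, no seek needed
        os.fsync(fd)
    finally:
        os.close(fd)
```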