Rohan Kadekodi posted on Sat, 09 Sep 2017 18:50:09 -0500 as excerpted:

> Hello,
>
> I was trying to understand how file renames are handled in Btrfs. I read
> the code documentation, but had a problem understanding a few things.
>
> During a file rename, btrfs_commit_transaction() is called which is
> because Btrfs has to commit the whole FS before storing the information
> related to the new renamed file. It has to commit the FS because a
> rename first does an unlink, which is not recorded in the btrfs_rename()
> transaction and so is not logged in the log tree. Is my understanding
> correct? If yes, my questions are as follows:

I'm not a dev, but am a btrfs user and list regular, and can try my hand
at answering... and if I'm wrong, a dev's reply can correct my
misconceptions as well. =:^)

> 1. What does committing the whole FS mean? Blktrace shows that there are
> 2 256KB writes, which are essentially writes to the data of the
> root directory of the file system (which I found out through
> btrfs-debug-tree). Is this equivalent to doing a shell sync, as the same
> block groups are written during a shell sync too? Also, does it imply
> that all the metadata held by the log tree is now checkpointed to the
> respective trees?

A btrfs commit is the equivalent of a *single* filesystem sync, yes. The
difference compared to the sync(1) command is that sync applies to all
filesystems of all types, not just a single btrfs filesystem. See also
the btrfs filesystem sync command (btrfs-filesystem(8) manpage), which
applies to a single btrfs, but also triggers deleted subvolume cleanup.

But these are not writes to the /data/ of the root directory. In btrfs,
data and metadata are separated, and these are writes to the /metadata/
of the filesystem, including writing a new filesystem top-level (aka
root) block and the superblock and its backups.

Yes, the log is synced too.

But regarding the log, in btrfs, because btrfs is atomic cow-based
(copy-on-write), at each commit the filesystem is designed to be entirely
self-consistent, with the result being that most actions don't need to be
and are not logged. At a crash and later remount, the filesystem as of
the last atomically-written root-block state will be mounted. Anything
being written at the time of the crash will either have been entirely
written and committed (the top-level root tree block will have been
updated to reflect it), or that update will not have happened yet, in
which case the state of the filesystem will be that of the last root
tree block commit, with newer in-process actions lost.

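To make that atomic-commit behavior a bit more concrete, here's a toy
model in Python. It is only a sketch of the general copy-on-write idea,
not btrfs code and not its on-disk format: the names Node, cow_set,
cow_delete, lookup and ToyFS are all made up for illustration, the
"superblock" is just an attribute holding a reference to the committed
root, and there is no fsync log at all. What it tries to show is that an
update writes new copies of every block on the path up to the root, and
that nothing becomes visible after a crash until the committed root
pointer is switched.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A toy tree block. Frozen: once 'written' it is never changed in place."""
    entries: tuple = ()  # sorted (name, child) pairs; child is a Node or a str


def cow_set(node: Node, path: Tuple[str, ...], value) -> Node:
    """Return a NEW node with `path` set to `value`; `node` is left untouched."""
    name, rest = path[0], path[1:]
    entries = dict(node.entries)
    if rest:
        # Recurse, then point this (new) block at the new copy of the child.
        entries[name] = cow_set(entries.get(name, Node()), rest, value)
    else:
        entries[name] = value
    return Node(tuple(sorted(entries.items())))


def cow_delete(node: Node, path: Tuple[str, ...]) -> Node:
    """Return a NEW node with `path` removed; `node` is left untouched."""
    name, rest = path[0], path[1:]
    entries = dict(node.entries)
    if rest:
        entries[name] = cow_delete(entries[name], rest)
    else:
        del entries[name]
    return Node(tuple(sorted(entries.items())))


def lookup(node, path: Tuple[str, ...]):
    """Walk `path` down from `node`; raises KeyError if a name is missing."""
    for name in path:
        node = dict(node.entries)[name]
    return node


class ToyFS:
    """Hypothetical stand-in for a filesystem with one atomic root pointer."""

    def __init__(self) -> None:
        self.committed_root = Node()              # what the "superblock" points at
        self.working_root = self.committed_root   # uncommitted, in-progress state

    def write(self, path, value) -> None:
        self.working_root = cow_set(self.working_root, path, value)

    def rename(self, old, new) -> None:
        # Modelled as unlink of the old name plus a link under the new name.
        value = lookup(self.working_root, old)
        root = cow_delete(self.working_root, old)
        self.working_root = cow_set(root, new, value)

    def commit(self) -> None:
        # The single atomic step: switch the root pointer to the new tree.
        self.committed_root = self.working_root

    def crash_and_remount(self) -> None:
        # Everything not reachable from the committed root is simply lost.
        self.working_root = self.committed_root


if __name__ == "__main__":
    fs = ToyFS()
    fs.write(("dir", "a"), "inode-1")
    fs.commit()
    fs.rename(("dir", "a"), ("dir", "b"))
    fs.crash_and_remount()   # crash before commit: the rename never happened
    print(lookup(fs.working_root, ("dir", "a")))
```

In this sketch a rename followed by crash_and_remount() leaves the old
name in place, while the same rename followed by commit() makes only the
new name reachable from the committed root.
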
The btrfs log is an exception, a compromise in the interest of fsync
speed. The only things it logs are fsyncs (filesyncs, as opposed to whole
filesystem syncs) that would otherwise not return until the next commit
(with commits on a 30-second timer by default), since the filesystem
would otherwise be unable to guarantee that the fsync had been entirely
written to permanent media and thus should survive a crash.

The log ensures the fsynced file's new data (if any) is written to its
new location on the media (cow so new block location), updates the
metadata (also cow so written to a new location), then logs the metadata
update so it can be committed at log replay if necessary, and returns.
If a crash happens before the next full filesystem atomic commit, the
fsync can be replayed from the log, thus satisfying the fsync guarantee
without forcing a wait for a full atomic commit.

But once that full filesystem atomic commit happens (again, with a
30-second default timeout), all updates are now reflected in the new
filesystem state as registered in the new root tree block, and the
previous log is now dead/unreferenced on the media (because the new root
block doesn't refer to it any longer, referring instead to a new log).

> 2. Why are there 2 complete writes to the data held by the root
> directory and not just 1? These writes are 256KB each, which is the size
> of the extent allocated to the root directory

I'm not sure on this one, hopefully a btrfs dev can clarify, but at a
guess, you may be seeing writes to the superblock and its backup -- on a
large enough filesystem there are two backups, but your filesystem may be
small enough to have just one backup.

It's also possible you're seeing the new copy of the metadata tree being
written out, then the root block and superblocks (and backups) being
updated.

> 3. Why are the writes being done to the root directory of the file
> system / subvolume and not just the parent directory where the unlink
> happened?

Remember, everything's in trees, and updates are cowed, with updates at
lower levels of the tree not reflected in the atomic state of the
filesystem until they've recursed up the tree and a new root tree block
is written, pointing at the new trees instead of the old ones, with the
superblock and backups then updated to point at the new root tree block.

So nothing's local-only. First, the old (meta)data along with any updates
to it is written to a new location, then higher tree entries must be
updated and written to new locations, all the way to the top. Until that
top entry is updated, the state of the filesystem reflects the old state,
without any in-process changes -- it's as if your rename hasn't happened
yet because the atomic filesystem state doesn't point to the newly
written location yet. Once the updates reach the top, the new state is
reflected.

The fsync log is an exception, of course, as mentioned above, but it too
is renewed by the full filesystem commit, with the old log freed to be
garbage-collected, and a new initially empty log pointed at by the newly
written root block, which is in turn pointed at by the newly rewritten
superblock and its backups.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman