On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote:
> Hi,
> 
> The following seems to be a crash consistency bug on btrfs, where in
> the link count is not persisted even after a fsync on the original
> file.
> 
> Consider the following workload :
> creat foo
> link (foo, A/bar)
> fsync(foo)
> ---Crash---
> 
> Now, on recovery we expect the metadata of foo to be persisted i.e
> have a link count of 2. However in btrfs, the link count is 1 and file
> A/bar is not persisted. The expected behaviour would be to persist the
> dependencies of inode foo. That is to say, shouldn't fsync of foo
> persist A/bar and correctly update the link count?

Those dependencies are backward.  foo's inode doesn't depend on anything
but the data in the file foo, and foo's inode itself.

"foo" and "A/bar" are dirents that both depend on the inode of foo, which
implies that "A" and "." must be updated atomically with foo's inode.
If you had called fsync(A) then we'd expect A/bar to exist and the inode
to have a link count of 2.  If you'd called fsync(.) then...well, you
didn't modify "." at all, so I guess either outcome is valid as long as
the inode link count matches the number of dirents referencing the inode.

But then...why does foo exist at all?  I'd expect at least some tests
would end without foo on disk either, since all that was fsync()ed was the
foo inode, not the foo dirent in the directory '.'.  Does btrfs combine
creating foo and updating foo's inode into a single atomic operation?
I vaguely recall that it does exactly that, in order to solve a bug
some years ago.  What happens if you add a rename, e.g.

        unlink foo2     # make sure foo2 doesn't exist
        creat foo
        rename(foo, foo2)
        link(foo2, A/bar)
        fsync(foo2)

Do you get foo or foo2?  I'd expect foo since you didn't fsync '.',
but maybe rename implies flush and you get foo2.

That's not to say that fsync is not a rich source of filesystem bugs.
btrfs did once have (and maybe still has?) a bug where renames and fsync
can create a dirent with no inode, e.g.

        loop continuously:
                creat foo
                write(foo, data)
                fsync(foo)
                rename(foo, bar)

and crash somewhere in the middle of the loop, which will create a
dirent "foo" that points to a non-existent inode.

Removing the "fsync" works around the bug.  rename() does a flush anyway,
so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem
inconsistency, especially when Googling recommends app developers to
sprinkle fsync()s indiscriminately in their code to prevent their data
from being mangled.

I haven't been tracking to see if that's fixed yet.  I last saw it on
4.11, but I have been aggressively avoiding fsync with eatmydata for
some years now.

> Note that ext4, xfs and f2fs recover to the correct link count of 2
> for the above workload.

Do those filesystems also work if you remove the fsync?  That may be
your answer:  they could be flushing the other metadata earlier, before
you call fsync().

> Let us know what you think about this behavior.
> 
> Thanks,
> Jayashree Mohan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Attachment: signature.asc
Description: PGP signature

Reply via email to