subject:"Hard link not persisted on fsync"

Re: Hard link not persisted on fsync

2018-04-20 Thread Jayashree Mohan

Hi Zygo,

Thanks for the reply. Here are few responses about btrfs behavior that
you had questions about in your email.

On Thu, Apr 19, 2018 at 4:41 PM, Zygo Blaxell
 wrote:
>
> On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote:
> > Hi,
> >
> > The following seems to be a crash consistency bug on btrfs, where in
> > the link count is not persisted even after a fsync on the original
> > file.
> >
> > Consider the following workload :
> > creat foo
> > link (foo, A/bar)
> > fsync(foo)
> > ---Crash---
> >
> > Now, on recovery we expect the metadata of foo to be persisted i.e
> > have a link count of 2. However in btrfs, the link count is 1 and file
> > A/bar is not persisted. The expected behaviour would be to persist the
> > dependencies of inode foo. That is to say, shouldn't fsync of foo
> > persist A/bar and correctly update the link count?
>
> Those dependencies are backward.  foo's inode doesn't depend on anything
> but the data in the file foo, and foo's inode itself.
>
> "foo" and "A/bar" are dirents that both depend on the inode of foo, which
> implies that "A" and "." must be updated atomically with foo's inode.
> If you had called fsync(A) then we'd expect A/bar to exist and the inode
> to have a link count of 2.  If you'd called fsync(.) then...well, you
> didn't modify "." at all, so I guess either outcome is valid as long as
> the inode link count matches the number of dirents referencing the inode.
>
> But then...why does foo exist at all?  I'd expect at least some tests
> would end without foo on disk either, since all that was fsync()ed was the
> foo inode, not the foo dirent in the directory '.'.  Does btrfs combine
> creating foo and updating foo's inode into a single atomic operation?
> I vaguely recall that it does exactly that, in order to solve a bug
> some years ago.  What happens if you add a rename, e.g.
>
> unlink foo2 # make sure foo2 doesn't exist
> creat foo
> rename(foo, foo2)
> link(foo2, A/bar)
> fsync(foo2)
>
> Do you get foo or foo2?  I'd expect foo since you didn't fsync '.',
> but maybe rename implies flush and you get foo2.

This is the workload we tried:

creat foo
rename(foo, foo2)
mkdir A
link(foo2, A/bar)
fsync(foo2)

Btrfs persists foo2 here, and A/bar is not persisted. Also, the link
count of foo2 is 1, while on the other filesystems, A/bar persist as
well and foo2 has a link count 2.
I don't think it's the rename here that implies the flush - removing
the fsync(foo2) ends up not persisting A/bar. So looks like, its the
fsync that forces all dependencies to the disk.

As a variant of the above workload, consider the following:

mkdir A
sync
creat foo
rename(foo, foo2)
link(foo2, A/bar)
fsync(foo2)

The only difference between this workload and the one above is, where
directory A is created and whether its persisted previously or not. In
this modified workload, foo2 has a link count of 2 and ends up
persisting A/bar. Why is A/bar being persisted here even when we did
not explicitly call a fsync on directory A?

We don't seem to correctly understand the reason behind these
different behaviors. It will be useful if you could provide more
insight into this.

> That's not to say that fsync is not a rich source of filesystem bugs.
> btrfs did once have (and maybe still has?) a bug where renames and fsync
> can create a dirent with no inode, e.g.
>
> loop continuously:
> creat foo
> write(foo, data)
> fsync(foo)
> rename(foo, bar)
>
> and crash somewhere in the middle of the loop, which will create a
> dirent "foo" that points to a non-existent inode.
>
> Removing the "fsync" works around the bug.  rename() does a flush anyway,
> so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem
> inconsistency, especially when Googling recommends app developers to
> sprinkle fsync()s indiscriminately in their code to prevent their data
> from being mangled.
>
> I haven't been tracking to see if that's fixed yet.  I last saw it on
> 4.11, but I have been aggressively avoiding fsync with eatmydata for
> some years now.
>
> > Note that ext4, xfs and f2fs recover to the correct link count of 2
> > for the above workload.
>
> Do those filesystems also work if you remove the fsync?  That may be
> your answer:  they could be flushing the other metadata earlier, before
> you call fsync().

>> creat foo
>> link (foo, A/bar)
> >fsync(foo)
>>---Crash---

Actually, they don't seem to work if we remove the fsync. The behavior
changes and we neither persist A/bar nor update the link count of foo
to 2. So this confirms that fsync is forcing the metadata and not the
link() syscall itself right?

Thanks,
Jayashree
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hard link not persisted on fsync

2018-04-19 Thread Zygo Blaxell

On Mon, Apr 16, 2018 at 09:35:24AM -0500, Jayashree Mohan wrote:
> Hi,
> 
> The following seems to be a crash consistency bug on btrfs, where in
> the link count is not persisted even after a fsync on the original
> file.
> 
> Consider the following workload :
> creat foo
> link (foo, A/bar)
> fsync(foo)
> ---Crash---
> 
> Now, on recovery we expect the metadata of foo to be persisted i.e
> have a link count of 2. However in btrfs, the link count is 1 and file
> A/bar is not persisted. The expected behaviour would be to persist the
> dependencies of inode foo. That is to say, shouldn't fsync of foo
> persist A/bar and correctly update the link count?

Those dependencies are backward.  foo's inode doesn't depend on anything
but the data in the file foo, and foo's inode itself.

"foo" and "A/bar" are dirents that both depend on the inode of foo, which
implies that "A" and "." must be updated atomically with foo's inode.
If you had called fsync(A) then we'd expect A/bar to exist and the inode
to have a link count of 2.  If you'd called fsync(.) then...well, you
didn't modify "." at all, so I guess either outcome is valid as long as
the inode link count matches the number of dirents referencing the inode.

But then...why does foo exist at all?  I'd expect at least some tests
would end without foo on disk either, since all that was fsync()ed was the
foo inode, not the foo dirent in the directory '.'.  Does btrfs combine
creating foo and updating foo's inode into a single atomic operation?
I vaguely recall that it does exactly that, in order to solve a bug
some years ago.  What happens if you add a rename, e.g.

unlink foo2 # make sure foo2 doesn't exist
creat foo
rename(foo, foo2)
link(foo2, A/bar)
fsync(foo2)

Do you get foo or foo2?  I'd expect foo since you didn't fsync '.',
but maybe rename implies flush and you get foo2.

That's not to say that fsync is not a rich source of filesystem bugs.
btrfs did once have (and maybe still has?) a bug where renames and fsync
can create a dirent with no inode, e.g.

loop continuously:
creat foo
write(foo, data)
fsync(foo)
rename(foo, bar)

and crash somewhere in the middle of the loop, which will create a
dirent "foo" that points to a non-existent inode.

Removing the "fsync" works around the bug.  rename() does a flush anyway,
so the fsync() wasn't needed, but fsync() shouldn't _create_ a filesystem
inconsistency, especially when Googling recommends app developers to
sprinkle fsync()s indiscriminately in their code to prevent their data
from being mangled.

I haven't been tracking to see if that's fixed yet.  I last saw it on
4.11, but I have been aggressively avoiding fsync with eatmydata for
some years now.

> Note that ext4, xfs and f2fs recover to the correct link count of 2
> for the above workload.

Do those filesystems also work if you remove the fsync?  That may be
your answer:  they could be flushing the other metadata earlier, before
you call fsync().

> Let us know what you think about this behavior.
> 
> Thanks,
> Jayashree Mohan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

signature.asc
Description: PGP signature

Re: Hard link not persisted on fsync

2018-04-18 Thread Filipe Manana

On Wed, Apr 18, 2018 at 1:02 AM, Jayashree Mohan
 wrote:
> Hi,
>
> A gentle reminder on the crash consistency bug that we found on btrfs:

Why do you call it a consistency bug?
The filesystem does not stay in inconsistent state. The link count
stays 1 and the dentry used for fsync (foo) is persisted.
An inconsistency would be if we ended up with a link count of 2 and
only one of the dentries was persisted, or if we ended up with a link
count of 1 and both dentries were persisted.
Those cases would be detected by an fsck and would could fs operations
to fail unexpectedly (attemping to remove a dentry, etc).

> Link count of a file is not persisted even after a fsync. We believe a
> filesystem that ensures strictly ordered metadata behaviour should be
> able to persist the hard link after a fsync on the original file.

The thing is there's no written specification about what's the
expected and correct behavior.
Yes, I'm aware of Dave's explanation on strictly ordered metadata on
the other thread and transaction details, but things on btrfs work
very differently from xfs (I'm not saying he's wrong).

For me it makes more sense to persist the hard link, but again,
there's a lack of a specification that demands such behaviour, and I'm
not aware of applications needing that behaviour nor user reports
about it.

> Could you comment on why btrfs does not exhibit this behavior, and if
> it's something you'd want to fix?

I don't know the "why", as I am not the original author of the log
tree (what is used to implement fsync on btrfs), so either it's
accidental behavior (the most likely) or intentional.

Since it's not something that causes any corruption, fs inconsistency,
crash, etc, nor there are user reports complaining about this (afaik),
for me it would be far from a priority at the moment as I'm trying to
fix corruptions and similar issues (not necessarily caused by fsync).

Plus, adding such behaviour would have to be done carefully to not
impact performance, as checking if the node has multiple hard links
and which ones need to be persisted (created in the current,
uncommitted, transaction) may have a measurable impact. The current
behaviour it to only guarantee persisting the dentry used for the
fsync call.

>
> Thanks,
> Jayashree Mohan
>
>
>
> On Mon, Apr 16, 2018 at 9:35 AM, Jayashree Mohan
>  wrote:
>> Hi,
>>
>> The following seems to be a crash consistency bug on btrfs, where in
>> the link count is not persisted even after a fsync on the original
>> file.
>>
>> Consider the following workload :
>> creat foo
>> link (foo, A/bar)
>> fsync(foo)
>> ---Crash---
>>
>> Now, on recovery we expect the metadata of foo to be persisted i.e
>> have a link count of 2. However in btrfs, the link count is 1 and file
>> A/bar is not persisted. The expected behaviour would be to persist the
>> dependencies of inode foo. That is to say, shouldn't fsync of foo
>> persist A/bar and correctly update the link count?
>> Note that ext4, xfs and f2fs recover to the correct link count of 2
>> for the above workload.
>>
>> Let us know what you think about this behavior.
>>
>> Thanks,
>> Jayashree Mohan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hard link not persisted on fsync

2018-04-17 Thread Jayashree Mohan

Hi,

A gentle reminder on the crash consistency bug that we found on btrfs:
Link count of a file is not persisted even after a fsync. We believe a
filesystem that ensures strictly ordered metadata behaviour should be
able to persist the hard link after a fsync on the original file.
Could you comment on why btrfs does not exhibit this behavior, and if
it's something you'd want to fix?

Thanks,
Jayashree Mohan



On Mon, Apr 16, 2018 at 9:35 AM, Jayashree Mohan
 wrote:
> Hi,
>
> The following seems to be a crash consistency bug on btrfs, where in
> the link count is not persisted even after a fsync on the original
> file.
>
> Consider the following workload :
> creat foo
> link (foo, A/bar)
> fsync(foo)
> ---Crash---
>
> Now, on recovery we expect the metadata of foo to be persisted i.e
> have a link count of 2. However in btrfs, the link count is 1 and file
> A/bar is not persisted. The expected behaviour would be to persist the
> dependencies of inode foo. That is to say, shouldn't fsync of foo
> persist A/bar and correctly update the link count?
> Note that ext4, xfs and f2fs recover to the correct link count of 2
> for the above workload.
>
> Let us know what you think about this behavior.
>
> Thanks,
> Jayashree Mohan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hard link not persisted on fsync

2018-04-16 Thread Jayashree Mohan

Hi,

The following seems to be a crash consistency bug on btrfs, where in
the link count is not persisted even after a fsync on the original
file.

Consider the following workload :
creat foo
link (foo, A/bar)
fsync(foo)
---Crash---

Now, on recovery we expect the metadata of foo to be persisted i.e
have a link count of 2. However in btrfs, the link count is 1 and file
A/bar is not persisted. The expected behaviour would be to persist the
dependencies of inode foo. That is to say, shouldn't fsync of foo
persist A/bar and correctly update the link count?
Note that ext4, xfs and f2fs recover to the correct link count of 2
for the above workload.

Let us know what you think about this behavior.

Thanks,
Jayashree Mohan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hard link not persisted on fsync

Re: Hard link not persisted on fsync

Re: Hard link not persisted on fsync

Re: Hard link not persisted on fsync

Hard link not persisted on fsync

5 matches

Site Navigation

Mail list logo

Footer information