>Recently there's been discussion [1] in the Linux community about how
>filesystems should deal with rename(2), particularly in the case of a crash.
>ext4 was found to truncate files after a crash, that had been written with
>open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
>because ext4 uses delayed allocation and may not write the contents to disk
>immediately, but commits metadata changes quite frequently. So when
>rename("foo.tmp","foo") is committed to disk, it has a length of zero which
>is later updated when the data is written to disk. This means after a crash,
>"foo" is zero-length, and both the new and the old data have been lost, which
>is undesirable. This doesn't happen when using ext3's default settings
>because ext3 writes data to disk before metadata (which has performance
>problems, see Firefox 3 and fsync[2]).

The belief that "metadata" is somehow more important than "other data"
should have been put to rest with UFS.  Yes, it's easier to fsck the
filesystem when the metadata is correct, and that gets you a valid
filesystem, but it doesn't mean you get a filesystem with valid contents.

>Ted Ts'o's (the main author of ext3 and ext4) response is that applications
>which perform open(),write(),close(),rename() in the expectation that they
>will either get the old data or the new data, but not no data at all, are
>broken, and instead should call open(),write(),fsync(),close(),rename().
>Most other people are arguing that POSIX says rename(2) is atomic, and while
>POSIX doesn't specify crash recovery, returning no data at all after a crash
>is clearly wrong, and excessive use of fsync is overkill and
>counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for
>fsync). I've omitted a lot of detail, but I think this is the core of the
>argument.


As long as POSIX believes that systems don't crash, then clearly there is
nothing in the standard which would help the argument on either side.

It is a "quality of implementation" property.  Apparently, Ts'o feels
that reordering filesystem operations is fine.
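
For concreteness, the open(),write(),fsync(),close(),rename() sequence
quoted above might look roughly like the sketch below.  It is only an
illustration in Python, not taken from any particular application: the
function name replace_file_durably and the ".tmp" suffix are made up for
the example.  It also does not fsync the containing directory, which
some filesystems may need before the rename itself is durable.

```python
import os


def replace_file_durably(path, data):
    """Replace `path` with `data` via a temporary file and rename.

    Hypothetical helper: the name and the ".tmp" suffix are assumptions
    made for this example only.
    """
    tmp_path = path + ".tmp"

    # open(): create (or truncate) the temporary file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # write(): loop, because a single write may be partial.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]

        # fsync(): ask the kernel to push the file's data to stable
        # storage before the rename can be committed.
        os.fsync(fd)
    finally:
        # close()
        os.close(fd)

    # rename(): replace the old name with the new file.  On POSIX
    # systems this calls rename(2).
    os.rename(tmp_path, path)
```

Dropping the os.fsync() call gives the open(),write(),close(),rename()
variant whose behaviour after a crash is the subject of the argument.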


>Now the question I have, is how does ZFS deal with
>open(),write(),close(),rename() in the case of a crash? Will it always
>return the new data or the old data, or will it sometimes return no data? Is
>returning no data defensible, either under POSIX or common sense? Comments
>about other filesystems, eg UFS are also welcome. As a counter-point, XFS
>(written by SGI) is notorious for data-loss after a crash, but its authors
>defend the behaviour as POSIX-compliant.

I didn't know about XFS behaviour on crash.  I don't know exactly how ZFS 
commits transaction groups; the ZFS authors can tell and I hope they chime 
in.

The only time POSIX comes into question is when a fileserver crashes:
does the NFS server keep its promises or not?  Some typical Linux
configurations would break some of those promises.

Casper

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
