Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
 Background: The eXplode file system checker found a bug in ext2 fsync
 behavior.  Do the following: truncate file A, create file B which
 reallocates one of A's old indirect blocks, fsync file B.  If you then
 crash before file A's metadata is all written out, fsck will complete
 the truncate for file A... thereby deleting file B's data.  So fsync
 file B doesn't guarantee data is on disk after a crash.  Details:

It's actually not the case that fsck will complete the truncate for
file A.  The problem is that while e2fsck is processing indirect
blocks in pass 1, the block which is marked as file A's indirect block
(but which actually contains file B's data) gets fixed when e2fsck
sees block numbers which look like illegal block numbers.  So this
ends up corrupting file B's data.
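
For reference, here is a minimal Python sketch of the syscall sequence under discussion.  It is only an illustration: the function name run_sequence and its parameters are made up, and running it does not by itself reproduce the corruption.  That would also need an ext2 mount, a crash in the window marked in the comments, and the allocator happening to hand file B one of file A's old indirect blocks.

```python
import os


def run_sequence(directory, block_size=4096, blocks_in_a=64):
    """Issue the truncate-A / create-B / fsync-B sequence from the report.

    Hypothetical helper for illustration; not taken from eXplode.
    """
    path_a = os.path.join(directory, "A")
    path_b = os.path.join(directory, "B")

    # File A must be large enough to need at least one indirect block
    # (ext2 keeps the first 12 block pointers in the inode itself).
    with open(path_a, "wb") as a:
        a.write(b"a" * block_size * blocks_in_a)
        a.flush()
        os.fsync(a.fileno())

    # Truncate A: its blocks are freed in memory, but the on-disk inode
    # may still point at them until the metadata is written back.
    os.truncate(path_a, 0)

    # Create B; the allocator is free to hand B one of A's old blocks.
    payload = b"b" * block_size
    with open(path_b, "wb") as b:
        b.write(payload)
        b.flush()
        os.fsync(b.fileno())  # B's data is on disk once this returns

    # A crash here, before A's metadata reaches the disk, is the window
    # being discussed in this thread.
    return path_a, path_b, payload
```

On a healthy system the sequence simply leaves A empty and B holding its payload; the interesting part is what e2fsck does if the machine dies right after the last fsync.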

This is actually a legal end result, BTW, since POSIX states the
result of fsync() is undefined if the system crashes.  Technically
fsync() did actually guarantee that file B's data is on disk; the
problem is that e2fsck would corrupt the data afterwards.  Ironically,
fsync()'ing file B actually makes it more likely that it might get
corrupted afterwards, since normally filesystem metadata gets sync'ed
out on 5 second intervals, while data gets sync'ed out at 30 second
intervals.

 * Rearrange order of duplicate block checking and fixing file size in
   fsck.  Not sure how hard this is. (Ted?)

It's not a matter of changing when we deal with fixing the file size,
as described above.  At fsck time, we would need to keep backup
copies of any indirect blocks that get modified for whatever reason,
and then in pass 1D, when we clone a block that has been claimed by
multiple inodes, the inodes which claim the block as a data block
should get a copy of the block before it was modified by e2fsck.
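
A toy model of that idea, as a Python sketch.  This is not e2fsck code: blocks are modelled as lists of block numbers in a dict, and the names fix_indirect_block, clone_for_data_owner and backups are invented for illustration.

```python
def fix_indirect_block(block_no, disk, backups, max_block):
    """Zero illegal entries in an indirect block, saving the original first.

    `disk` maps block number -> list of block pointers (toy model).
    `backups` keeps the pre-fix contents of every block that gets modified.
    """
    original = disk[block_no]
    fixed = [n if 0 <= n < max_block else 0 for n in original]
    if fixed != original:
        # Keep the contents as they were before the fix-up.
        backups.setdefault(block_no, list(original))
        disk[block_no] = fixed
    return fixed


def clone_for_data_owner(block_no, disk, backups):
    """Pass-1D style clone: an inode that claims the block as data gets
    the contents from before any fix-up, if a backup exists."""
    return list(backups.get(block_no, disk[block_no]))
```

In this model the inode that claims the block as an indirect block keeps the cleaned-up version, while the inode that claims it as data gets back exactly what was fsync()'ed.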

 * Keep a set of still allocated on disk block bitmaps that gets
   flushed whenever a sync happens.  Don't allocate these blocks.
   Journaling file systems already have to do this.

A list would be more efficient, as others have pointed out.  That
would work, although the hard part is knowing when entries could be
removed from the list.  The machinery for knowing when metadata has
been updated isn't present in ext2, and that's a fair amount of
complexity.  You
could clear the list/bitmap after the 5 second metadata flush command
has been kicked off, or if you associate a data block with the
previous inode's owner, you could clear the entry when the inode's
dirty bit has been cleared, but that doesn't completely get rid of the
race unless you tie it to when the write has completed (and this
assumes write barriers to make sure the block was actually flushed to
the media).
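
A rough Python sketch of the "don't reallocate recently freed blocks" idea, to make the bookkeeping concrete.  The class and method names are hypothetical, and the hard part described above (knowing when the metadata write has really completed) is hidden behind a single method that the caller is assumed to invoke at the right time.

```python
class DeferredReuseAllocator:
    """Toy block allocator that holds back freed blocks until a flush."""

    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))
        # Freed in memory, but possibly still referenced by on-disk metadata.
        self.pending = set()

    def allocate(self):
        if not self.free:
            raise RuntimeError("no safely reusable blocks until metadata is flushed")
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block):
        # Do not make the block reusable yet: the old owner's on-disk
        # metadata may still point at it.
        self.pending.add(block)

    def metadata_flush_completed(self):
        # Assumed to be called only after the metadata write has actually
        # reached the media (which is exactly the difficult part).
        self.free |= self.pending
        self.pending.clear()
```

If metadata_flush_completed() is called when the flush is merely kicked off rather than completed, the race described above is still there; the sketch only shows the list handling.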

Another very heavyweight approach would be to simply force a full sync
of the filesystem whenever fsync() is called.  Not pretty, and without
the proper write ordering, the race is still potentially there.

I'd say that the best way to handle this is in fsck, but quite frankly
it's a relatively low priority bug to handle, since a much simpler
workaround is to tell people to use ext3 instead.

Regards,

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Dave Kleikamp
On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:

 Another very heavyweight approach would be to simply force a full sync
 of the filesystem whenever fsync() is called.  Not pretty, and without
 the proper write ordering, the race is still potentially there.

I don't think this race is an issue, in that it would require the crash
to happen before the fsync completed, so there would be no expectation
that the data is safe.  It's a moot point, since I don't think this is
an acceptable solution anyway.

 I'd say that the best way to handle this is in fsck, but quite frankly
 it's a relatively low priority bug to handle, since a much simpler
 workaround is to tell people to use ext3 instead.

Right.  Who's still using ext2?
-- 
David Kleikamp
IBM Linux Technology Center



Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread sfaibish
On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp
[EMAIL PROTECTED] wrote:

 On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:

  Another very heavyweight approach would be to simply force a full sync
  of the filesystem whenever fsync() is called.  Not pretty, and without
  the proper write ordering, the race is still potentially there.

 I don't think this race is an issue, in that it would require the crash
 to happen before the fsync completed, so there would be no expectation
 that the data is safe.  It's a moot point, since I don't think this is
 an acceptable solution anyway.

  I'd say that the best way to handle this is in fsck, but quite frankly
  it's a relatively low priority bug to handle, since a much simpler
  workaround is to tell people to use ext3 instead.

 Right.  Who's still using ext2?

It was my understanding from Dawson's presentation that ext3 and jfs
have the same problem. It is not an ext2-only problem. Also, whatever
solution we adopt, we need to be sure that we test it using the eXplode
methodology.

/Sorin


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Dave Kleikamp
On Thu, 2007-02-15 at 10:59 -0500, sfaibish wrote:
 On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp  
 [EMAIL PROTECTED] wrote:
 
  On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
 
  Another very heavyweight approach would be to simply force a full sync
  of the filesystem whenever fsync() is called.  Not pretty, and without
  the proper write ordering, the race is still potentially there.
 
  I don't think this race is an issue, in that it would require the crash
  to happen before the fsync completed, so there would be no expectation
  that the data is safe.  It's a moot point, since I don't think this is
  an acceptable solution anyway.
 
  I'd say that the best way to handle this is in fsck, but quite frankly
  it's a relatively low priority bug to handle, since a much simpler
  workaround is to tell people to use ext3 instead.
 
  Right.  Who's still using ext2?
 It was my understanding from the presentation of Dawson that ext3 and jfs
 have the same problem.

Hmm.  If jfs has the problem, it is a bug.  jfs is designed to handle
this correctly.  I'm pretty sure I've fixed at least one bug that
eXplode has uncovered in the past.  I'm not sure what was mentioned in
the presentation though.  I'd like any information about current
problems in jfs.

 It is not an ext2-only problem. Also, whatever solution we adopt,
 we need to be sure that we test it using the eXplode methodology.
 
 /Sorin
-- 
David Kleikamp
IBM Linux Technology Center



Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote:
  It was my understanding from the presentation of Dawson that ext3 and jfs
  have the same problem.
 
 Hmm.  If jfs has the problem, it is a bug.  jfs is designed to handle
 this correctly.  I'm pretty sure I've fixed at least one bug that
 eXplode has uncovered in the past.  I'm not sure what was mentioned in
 the presentation though.  I'd like any information about current
 problems in jfs.

That was not my understanding of the charts that were presented
earlier this week.  Ext3 journaling code will deal with this case
explicitly, just as jfs does.  

- Ted


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread sfaibish

On Thu, 15 Feb 2007 12:15:59 -0500, Theodore Tso [EMAIL PROTECTED] wrote:

 On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote:
   It was my understanding from the presentation of Dawson that ext3 and
   jfs have the same problem.

  Hmm.  If jfs has the problem, it is a bug.  jfs is designed to handle
  this correctly.  I'm pretty sure I've fixed at least one bug that
  eXplode has uncovered in the past.  I'm not sure what was mentioned in
  the presentation though.  I'd like any information about current
  problems in jfs.

 That was not my understanding of the charts that were presented
 earlier this week.  Ext3 journaling code will deal with this case
 explicitly, just as jfs does.

My mistake: there were fsync bugs in JFS and ext2 that cannot be
fixed by fsck.  It is not the same for JFS and ext2.  See the quote:

 There were two interesting fsync errors, one in JFS
 and one in ext2. The ext2 bug is a case where an
 implementation error points out a deeper design problem.
 ...
 We found two bugs (one in JFS, one in Reiser4) where crashed
 disks cannot be recovered by fsck.

 - Ted

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Dave Kleikamp
On Thu, 2007-02-15 at 11:11 -0800, Junfeng Yang wrote:
  Hmm.  If jfs has the problem, it is a bug.  jfs is designed to handle
  this correctly.  I'm pretty sure I've fixed at least one bug that
  eXplode has uncovered in the past.  I'm not sure what was mentioned in
  the presentation though.  I'd like any information about current
  problems in jfs.
 
 
 I believe you have fixed the JFS fsync bug, Dave.  It was caused by
 reusing a directory inode as a file inode.  If the machine crashes
 later, fsck would think this file is a directory, and clear all its
 data. 

Yeah.  That one was fixed a while back.  Thanks for clearing this up.

Shaggy

-- 
David Kleikamp
IBM Linux Technology Center



Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Dawson Engler
 It was my understanding from the presentation of Dawson that ext3 and jfs
 have the same problem. It is not an ext2-only problem. Also, whatever
 solution we adopt, we need to be sure that we test it using the eXplode
 methodology.

apologies for dropping in randomly into the discussion: if this is
about the crash-during-recovery bugs, the specific ones i discussed
have been fixed in jfs and ext3 (junfeng: this is correct, right?).

i should have made this clear in the talk (along with many other things:
grabbing junfeng's slides and blathering about them w/o preparation is
not the right algorithm for giving a good talk.)

the other error --- fsync of file data on ext2 that reuses a freed inode
from a file that was not flushed to disk --- is still open.


Re: [PATCH] Fix d_path for lazy unmounts

2007-02-15 Thread Jan Engelhardt
Hi,

On Feb 14 2007 14:57, Andreas Gruenbacher wrote:
 [2]

 pipe:                  pipe:[439336] (or pipe/[439336])

 [3] Always make disconnected paths double-slashed:
 --
 pipe:                  //pipe/[439336]
 lazily unmounted dir:  //dir/file
 lazily unmounted fs:   //file
 unreachable root:      //

 Opinions?

As for [2]/[3]:

What's the point in changing pipefs... you can *never*
reach it *anyway*, even if it was a /-style path, since
pipefs is a NOMNT filesystem.

That said, programs like lsof might break when it changes
away from pipe:[integer] (same goes for socket:, etc.)


Jan


Re: [PATCH] Fix d_path for lazy unmounts

2007-02-15 Thread Andreas Gruenbacher
On Thursday 15 February 2007 04:53, Jan Engelhardt wrote:
 What's the point in changing pipefs... you can *never*
 reach it *anyway*, even if it was a /-style path, since
 pipefs is a NOMNT filesystem.

The point is that we could then get rid of the special case for MS_NOUSER 
filesystems like pipefs in __d_path(). (This special case caused the lazy 
unmounted dir bug in the first place.) It is likely not really worth it, 
though.

Andreas


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Thu, Feb 15, 2007 at 11:28:46AM -0800, Junfeng Yang wrote:
 
 Actually,  we found a crash-during-recovery bug in ext3 too.  It's a race
 between resetting the journal super block and replay of the journal.  This
 bug was fixed by Ted a long time ago (3 years?).

That was found in your original work (using UML) not the more recent
work using EXPLODE, correct?

- Ted



Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-15 Thread sfaibish
On Thu, 15 Feb 2007 14:46:34 -0500, Jan Engelhardt
[EMAIL PROTECTED] wrote:

 On Feb 15 2007 21:38, Andi Kleen wrote:

  Also I would expect your design to be slow for metadata read intensive
  workloads. E.g. have you tried to boot a root partition with dual fs?
  That's a very important IO benchmark for desktop Linux systems.

 Did someone say metadata intensive? Try kernel tarballs, they're a
 perfect workload. Tons of directories, and even more files here and
 there. Works wonders.

I just did, per your request. To make things more relevant I
created a file structure from the 2.4.19 kernel sources and repeated it
recursively into the deepest dir level (10) 4 times, ending up with
7280 directories, 40 levels of directory depth, and a 1 GB
data set. I ran both tar and untar operations on the tree
for ext3, reiserfs, jfs and DualFS. I remounted the FS before
each test. Both the tar file and the directory tree were on the FS
under test.

Here are the results (elapsed time in seconds):

            tar   untar
ext3:       144   143
reiserfs:   100   100
JFS:        196   140
DualFS:      63    54

Hope this helps.

 Jan

/Sorin
