Re: Fix(es) for ext2 fsync bug
On Thu, Feb 15, 2007 at 09:20:21AM -0500, Theodore Tso wrote:
> It's actually not the case that fsck will complete the truncate for file A. The problem is that while e2fsck is processing indirect blocks in pass 1, the block which is marked as file A's indirect block (but which actually contains file B's data) gets fixed when e2fsck sees block numbers which look like illegal block numbers. So this ends up corrupting file B's data.

Ah, that's what happens. Thanks for the clarification.

> This is actually a legal end result, BTW, since POSIX states the result of fsync() is undefined if the system crashes. Technically [...]

And POSIX also states that sync() is only required to schedule the writes, but may return before the actual writing is done. Looks like the only way you can guarantee data is on-disk according to POSIX is to reboot the system after every synchronous write. Man, we file systems developers sure have it easy!

-VAL
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fix(es) for ext2 fsync bug
> On Tue, Feb 20, 2007 at 01:30:25PM -0800, Junfeng Yang wrote:
> > On 2/20/07, Valerie Henson [EMAIL PROTECTED] wrote:
> > > Google. (GoogleFS runs on top of ext2.)
> >
> > It's surprising to know that... I guess they rely on GoogleFS's own replication and checksumming for consistency.
>
> Yep, they just want a local file system with ultrafast on-line performance. They don't care about recovery time particularly because of the GoogleFS replication (although I heard rumors they have some fast fsck scheme, maybe resembling the dirty bit stuff I did last year).

Actually, according to the GFS paper (which may be out of date), for the chunkservers that is true, but for their master they really want fast recovery as a way to reduce mean-time-to-repair (and thus increase availability). Though, given that they have shadow masters, perhaps everyone is happy as long as master recovery is usually fast.

Dawson
Re: Fix(es) for ext2 fsync bug
In message [EMAIL PROTECTED], Valerie Henson writes:
> On Thu, Feb 15, 2007 at 09:20:21AM -0500, Theodore Tso wrote:
[...]
> And POSIX also states that sync() is only required to schedule the writes, but may return before the actual writing is done. Looks like

One more reason to form a group to discuss POSIX updates/changes (as per LSF last week).

> the only way you can guarantee data is on-disk according to POSIX is to reboot the system after every synchronous write. Man, we file systems developers sure have it easy!

No need to be that extreme. :-) It should be enough to just unmount all file systems, unload all fs and disk drivers, then reload+remount everything. No?

> -VAL

Erez.
Re: Fix(es) for ext2 fsync bug
On Tue, 2007-02-20 at 21:39 +, Valerie Henson wrote:
> On Tue, Feb 20, 2007 at 01:30:25PM -0800, Junfeng Yang wrote:
> > On 2/20/07, Valerie Henson [EMAIL PROTECTED] wrote:
> > > Google. (GoogleFS runs on top of ext2.)
> >
> > It's surprising to know that... I guess they rely on GoogleFS's own replication and checksumming for consistency.
>
> Yep, they just want a local file system with ultrafast on-line performance. They don't care about recovery time particularly because of the GoogleFS replication (although I heard rumors they have some fast fsck scheme, maybe resembling the dirty bit stuff I did last year).

I wonder if they would consider this an important bug? I know nothing about GoogleFS, but I would guess that they have more sophisticated recovery than relying on an fsync shortly before a crash to ensure data integrity.

Shaggy
--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details:
>
> http://www.stanford.edu/~engler/explode-osdi06.pdf

It's actually not the case that fsck will complete the truncate for file A. The problem is that while e2fsck is processing indirect blocks in pass 1, the block which is marked as file A's indirect block (but which actually contains file B's data) gets fixed when e2fsck sees block numbers which look like illegal block numbers. So this ends up corrupting file B's data.

This is actually a legal end result, BTW, since POSIX states the result of fsync() is undefined if the system crashes. Technically fsync() did actually guarantee that file B's data is on disk; the problem is that e2fsck would corrupt the data afterwards.

Ironically, fsync()'ing file B actually makes it more likely that it might get corrupted afterwards, since normally filesystem metadata gets sync'ed out on 5 second intervals, while data gets sync'ed out at 30 second intervals.

> * Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?)

It's not a matter of changing when we deal with fixing the file size, as described above. At fsck time, we would need to keep backup copies of any indirect blocks that get modified for whatever reason, and then in pass 1D, when we clone a block that has been claimed by multiple inodes, the inodes which claim the block as a data block should get a copy of the block as it was before e2fsck modified it.

> * Keep a set of still allocated on disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks. Journaling file systems already have to do this.

A list would be more efficient, as others have pointed out. That would work, although the hard part is knowing when entries can be removed from the list. The machinery for knowing when metadata has been updated isn't present in ext2, and that's a fair amount of complexity. You could clear the list/bitmap after the 5 second metadata flush command has been kicked off, or if you associate a data block with its previous owner inode, you could clear the entry when the inode's dirty bit has been cleared, but that doesn't completely get rid of the race unless you tie it to when the write has completed (and this assumes write barriers to make sure the block was actually flushed to the media).

Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fsync() is called. Not pretty, and without the proper write ordering, the race is still potentially there.

I'd say that the best way to handle this is in fsck, but quite frankly it's a relatively low-priority bug to handle, since a much simpler workaround is to tell people to use ext3 instead.

Regards,

- Ted
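The pass 1 behaviour Ted describes can be illustrated with a toy model. This is only a sketch, not e2fsck code: the block size, the block count, and the helper name scrub_indirect_block are all assumptions made up for the example. It treats a block as an array of 32-bit block numbers and zeroes every entry that falls outside the valid range, which is roughly what happens when a block that really holds file B's data is still referenced as file A's indirect block.

```python
import struct

# Toy model only: these values and names are assumptions, not e2fsck internals.
BLOCK_SIZE = 1024        # bytes per block
TOTAL_BLOCKS = 8192      # number of blocks in the toy filesystem
FIRST_DATA_BLOCK = 1     # block 0 is treated as reserved


def scrub_indirect_block(raw):
    """Interpret raw as little-endian 32-bit block numbers and zero the illegal ones.

    Returns the rewritten block and the number of entries that were cleared.
    """
    entries = list(struct.unpack("<%dI" % (len(raw) // 4), raw))
    cleared = 0
    for i, blk in enumerate(entries):
        # 0 means "no block"; anything else must lie inside the filesystem.
        if blk != 0 and not (FIRST_DATA_BLOCK <= blk < TOTAL_BLOCKS):
            entries[i] = 0
            cleared += 1
    return struct.pack("<%dI" % len(entries), *entries), cleared


# The block used to be file A's indirect block. After the truncate and the
# reallocation it holds file B's text, but A's stale on-disk inode still
# points at it as an indirect block.
file_b_data = (b"hello from file B! " * 64)[:BLOCK_SIZE]
after_fsck, cleared = scrub_indirect_block(file_b_data)
```

ASCII text read as block numbers yields huge values, so in this model every entry looks illegal and the whole block is zeroed, even though file B's data had reached the disk before the crash.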
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
> Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fsync() is called. Not pretty, and without the proper write ordering, the race is still potentially there.

I don't think this race is an issue, in that it would require the crash to happen before the fsync completed, so there would be no expectation that the data is safe. It's a moot point, since I don't think this is an acceptable solution anyway.

> I'd say that the best way to handle this is in fsck, but quite frankly it's a relatively low-priority bug to handle, since a much simpler workaround is to tell people to use ext3 instead.

Right. Who's still using ext2?

--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp [EMAIL PROTECTED] wrote:
> On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
> > Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fsync() is called. Not pretty, and without the proper write ordering, the race is still potentially there.
>
> I don't think this race is an issue, in that it would require the crash to happen before the fsync completed, so there would be no expectation that the data is safe. It's a moot point, since I don't think this is an acceptable solution anyway.
>
> > I'd say that the best way to handle this is in fsck, but quite frankly it's a relatively low-priority bug to handle, since a much simpler workaround is to tell people to use ext3 instead.
>
> Right. Who's still using ext2?

It was my understanding from Dawson's presentation that ext3 and jfs have the same problem. It is not an ext2-only problem. Also, whatever solution we adopt, we need to be sure that we test it using the eXplode methodology.

/Sorin
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 10:59 -0500, sfaibish wrote:
> On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp [EMAIL PROTECTED] wrote:
> > On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
> > > Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fsync() is called. Not pretty, and without the proper write ordering, the race is still potentially there.
> >
> > I don't think this race is an issue, in that it would require the crash to happen before the fsync completed, so there would be no expectation that the data is safe. It's a moot point, since I don't think this is an acceptable solution anyway.
> >
> > > I'd say that the best way to handle this is in fsck, but quite frankly it's a relatively low-priority bug to handle, since a much simpler workaround is to tell people to use ext3 instead.
> >
> > Right. Who's still using ext2?
>
> It was my understanding from Dawson's presentation that ext3 and jfs have the same problem.

Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs.

> It is not an ext2-only problem. Also, whatever solution we adopt, we need to be sure that we test it using the eXplode methodology.
>
> /Sorin

--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote:
> > It was my understanding from Dawson's presentation that ext3 and jfs have the same problem.
>
> Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs.

That was not my understanding of the charts that were presented earlier this week. Ext3 journaling code will deal with this case explicitly, just as jfs does.

- Ted
Re: Fix(es) for ext2 fsync bug
On Thu, 15 Feb 2007 12:15:59 -0500, Theodore Tso [EMAIL PROTECTED] wrote:
> On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote:
> > > It was my understanding from Dawson's presentation that ext3 and jfs have the same problem.
> >
> > Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs.
>
> That was not my understanding of the charts that were presented earlier this week. Ext3 journaling code will deal with this case explicitly, just as jfs does.

My mistake: there were fsync bugs in JFS and ext2 that cannot be fixed by fsck. They are not the same bug for JFS and ext2. See quote:

  There were two interesting fsync errors, one in JFS and one in ext2. The ext2 bug is a case where an implementation error points out a deeper design problem. ... We found two bugs (one in JFS, one in Reiser4) where crashed disks cannot be recovered by fsck.

> - Ted

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 11:11 -0800, Junfeng Yang wrote:
> > Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs.
>
> I believe you have fixed the JFS fsync bug, Dave. It was caused by reusing a directory inode as a file inode. If the machine crashes later, fsck would think this file is a directory, and clear all its data.

Yeah. That one was fixed a while back. Thanks for clearing this up.

Shaggy
--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
> It was my understanding from Dawson's presentation that ext3 and jfs have the same problem. It is not an ext2-only problem. Also, whatever solution we adopt, we need to be sure that we test it using the eXplode methodology.

apologies for dropping in randomly into the discussion: if this is about the crash-during-recovery bugs, the specific ones i discussed have been fixed in jfs and ext3 (junfeng: this is correct, right?). i should have made this clear in the talk (along with many other things: grabbing junfeng's slides and blathering about them w/o preparation is not the right algorithm for giving a good talk.)

the other error --- fsync of file data on ext2 that reuses a freed inode from a file that was not flushed to disk --- is still open.
Re: Fix(es) for ext2 fsync bug
On Thu, Feb 15, 2007 at 11:28:46AM -0800, Junfeng Yang wrote:
> Actually, we found a crash-during-recovery bug in ext3 too. It's a race between resetting the journal super block and replay of the journal. This bug was fixed by Ted a long time ago (3 years?).

That was found in your original work (using UML), not the more recent work using EXPLODE, correct?

- Ted
Fix(es) for ext2 fsync bug
Just some quick notes on possible ways to fix the ext2 fsync bug that eXplode found. Whether or not anyone will bother to implement it is another matter.

Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details:

http://www.stanford.edu/~engler/explode-osdi06.pdf

Two possible solutions I can think of:

* Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?)

* Keep a set of still allocated on disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks. Journaling file systems already have to do this.

-VAL
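The three steps in the Background paragraph correspond to a short syscall sequence, sketched below in Python. This is a minimal illustration under stated assumptions: the function name, file names and sizes are made up for the example, and the code only issues the calls. Actually hitting the bug would also need an ext2 mount, an allocator that happens to hand A's old indirect block to B, and a crash before A's inode is written back, none of which the sketch arranges or checks.

```python
import os
import tempfile


def truncate_then_fsync_other(dirpath, size_a=1 << 20, payload=b"B" * 4096):
    """Issue the sequence: truncate file A, create file B, fsync file B.

    dirpath would have to live on an ext2 mount for the bug to be reachable;
    on any other filesystem this only demonstrates the order of the calls.
    """
    path_a = os.path.join(dirpath, "A")
    path_b = os.path.join(dirpath, "B")

    # File A must be big enough to need indirect blocks on ext2
    # (anything beyond its 12 direct blocks).
    with open(path_a, "wb") as fa:
        fa.write(b"A" * size_a)
        fa.flush()
        os.fsync(fa.fileno())

    # Step 1: truncate A. Its blocks are freed in memory, but the on-disk
    # inode may keep pointing at the old indirect block for a while.
    os.truncate(path_a, 0)

    # Steps 2 and 3: create B (which may be given one of A's old blocks)
    # and fsync it, so B's data is on disk before A's metadata is.
    with open(path_b, "wb") as fb:
        fb.write(payload)
        fb.flush()
        os.fsync(fb.fileno())

    return path_a, path_b


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(truncate_then_fsync_other(d))
```

Note that the sketch never calls fsync on A after the truncate; that gap is the window the thread is discussing.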
Re: Fix(es) for ext2 fsync bug
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> Just some quick notes on possible ways to fix the ext2 fsync bug that eXplode found. Whether or not anyone will bother to implement it is another matter.
>
> Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details:
>
> http://www.stanford.edu/~engler/explode-osdi06.pdf
>
> Two possible solutions I can think of:
>
> * Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?)
>
> * Keep a set of still allocated on disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks. Journaling file systems already have to do this.

You don't need anything on disk or in fsck to fix this problem - just avoid it completely by keeping a list of recently truncated blocks in memory and don't reuse them until the old owner inode is sync'd to disk.

XFS solves this problem in exactly this manner - it keeps a list of recently freed blocks whose freeing transactions have not yet been committed to disk, to prevent them from being reused before it is safe to do so. See xfs_alloc_search_busy() and callers - if we try to reallocate a busy extent, we force the log to get the free transaction on disk before allowing the block to be reused...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
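The in-memory busy list Dave describes can be sketched as a toy allocator. This is a hedged illustration in Python, not XFS or ext2 code: the class name BusyListAllocator, its methods, and the per-block owner bookkeeping are assumptions invented for the example. Freed blocks go onto a busy map keyed by block number, the allocator skips them, and if only busy blocks are left it first simulates writing the old owner inode to disk (standing in for forcing the log) before handing the block out.

```python
class BusyListAllocator:
    """Toy block allocator that refuses to reuse recently freed blocks.

    All names here are hypothetical; a real filesystem would track extents
    and tie the busy list to actual I/O completion.
    """

    def __init__(self, nblocks):
        self.free = set(range(nblocks))
        # block number -> inode number of the previous owner whose
        # truncate has not reached the disk yet
        self.busy = {}

    def free_block(self, blk, owner_ino):
        """Return blk to the free pool, but mark it busy until owner_ino is synced."""
        self.free.add(blk)
        self.busy[blk] = owner_ino

    def inode_synced(self, ino):
        """Called once inode ino is on disk: its old blocks are safe to reuse."""
        self.busy = {b: o for b, o in self.busy.items() if o != ino}

    def allocate(self):
        # Prefer any free block that is not busy.
        for blk in sorted(self.free):
            if blk not in self.busy:
                self.free.remove(blk)
                return blk
        if self.free:
            # Only busy blocks remain: get the old owner on disk first
            # (simulated here), then the block may be reused.
            blk = min(self.free)
            self.inode_synced(self.busy[blk])
            self.free.remove(blk)
            return blk
        raise RuntimeError("no free blocks")
```

A short usage example: with four blocks, after allocating 0, 1 and 2 and then freeing block 1 from inode 7, the next allocation returns block 3 rather than 1, and block 1 only comes back once inode 7 has been synced.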
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 07:31 +1100, David Chinner wrote:
> On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> > Just some quick notes on possible ways to fix the ext2 fsync bug that eXplode found. Whether or not anyone will bother to implement it is another matter.
> >
> > Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details:
> >
> > http://www.stanford.edu/~engler/explode-osdi06.pdf
> >
> > Two possible solutions I can think of:
> >
> > * Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?)
> >
> > * Keep a set of still allocated on disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks. Journaling file systems already have to do this.
>
> You don't need anything on disk or in fsck to fix this problem - just avoid it completely by keeping a list of recently truncated blocks in memory and don't reuse them until the old owner inode is sync'd to disk.

I think that's pretty much what Val is suggesting. She suggests bitmaps rather than a list though. Maybe she should have used a better term than "flushed", as this list only needs to be cleared, rather than written to disk.

--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
On Wed, Feb 14, 2007 at 03:26:22PM -0600, Dave Kleikamp wrote:
> On Thu, 2007-02-15 at 07:31 +1100, David Chinner wrote:
> > On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> > > Just some quick notes on possible ways to fix the ext2 fsync bug that eXplode found. Whether or not anyone will bother to implement it is another matter.
> > >
> > > Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details:
> > >
> > > http://www.stanford.edu/~engler/explode-osdi06.pdf
> > >
> > > Two possible solutions I can think of:
> > >
> > > * Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?)
> > >
> > > * Keep a set of still allocated on disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks. Journaling file systems already have to do this.
> >
> > You don't need anything on disk or in fsck to fix this problem - just avoid it completely by keeping a list of recently truncated blocks in memory and don't reuse them until the old owner inode is sync'd to disk.
>
> I think that's pretty much what Val is suggesting. She suggests bitmaps rather than a list though. Maybe she should have used a better term than "flushed", as this list only needs to be cleared, rather than written to disk.

Yeah, probably was - I parsed the "still allocated on disk block bitmaps" phrase differently from what may have been intended...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: Fix(es) for ext2 fsync bug
Val,

Maybe it is not only our (FS people's) problem. We probably need to bring in the kernel people to judge, as ext2 and ext3 are the base Linux file systems. I am adding the kernel list for opinions.

/Sorin

On Wed, 14 Feb 2007 14:54:54 -0500, Valerie Henson [EMAIL PROTECTED] wrote:
> Just some quick notes on possible ways to fix the ext2 fsync bug that eXplode found. Whether or not anyone will bother to implement it is another matter.
>
> Background: The eXplode file system checker found a bug in ext2 fsync behavior. Do the following: truncate file A, create file B which reallocates one of A's old indirect blocks, fsync file B. If you then crash before file A's metadata is all written out, fsck will complete the truncate for file A... thereby deleting file B's data. So fsync file B doesn't guarantee data is on disk after a crash. Details:
>
> http://www.stanford.edu/~engler/explode-osdi06.pdf
>
> Two possible solutions I can think of:
>
> * Rearrange order of duplicate block checking and fixing file size in fsck. Not sure how hard this is. (Ted?)
>
> * Keep a set of still allocated on disk block bitmaps that gets flushed whenever a sync happens. Don't allocate these blocks. Journaling file systems already have to do this.
>
> -VAL