Re: [PATCH] ext34: ensure do_split leaves enough free space in both blocks
Hi,

On Mon, 2007-09-17 at 11:06 -0500, Eric Sandeen wrote:

> The do_split() function for htree dir blocks is intended to split a
> leaf block to make room for a new entry.  It sorts the entries in the
> original block by hash value, then moves the last half of the entries to
> the new block - without accounting for how much space this actually moves.

Nasty.

> Also add a few comments to the functions involved.

A big improvement. :)

> +	/* Split the existing block in the middle, size-wise */
> +	size = 0;
> +	move = 0;
> +	for (i = count-1; i >= 0; i--) {
> +		/* is more than half of this entry in 2nd half of the block? */
> +		if (size + map[i].size/2 > blocksize/2)
> +			break;
> +		size += map[i].size;
> +		move++;
> +	}
> +	/* map index at which we will split */
> +	split = count - move;

This is the guts of it, and I spent a while looking just to convince
myself it was getting things right in the corner case and was splitting
at _exactly_ the right point.  If we have large dirents and 1k
blocksize, even getting the split point off-by-one will still leave us
vulnerable to the bug.  But it all looks fine.

My only other comment would be that "move" is redundant, you could lose
it completely and just do

	split = i+1;

but I think the logic's easier to understand the way it is.

Nice catch.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH] [188/2many] MAINTAINERS - EXT4 FILE SYSTEM
Hi,

On Mon, 2007-08-13 at 03:01 -0600, Andreas Dilger wrote:

> To be honest, Stephen and Andrew haven't been directly involved in
> the ext4 development.  It probably makes more sense to have e.g.
> Eric Sandeen, Ted Ts'o, and MingMing Cao in their place.

Works for me.

--Stephen
Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk
Hi,

On Sun, 2007-06-10 at 18:27 +0000, Pavel Machek wrote:

> > Once a f/s is read-only, there should be NO writing to
> > it.  Right?
>
> Linux happily writes to filesystems mounted read-only.  It will replay
> journal on them.

Only at mount time, not on unmount; and it does check whether the
underlying device is truly readonly or not first (assuming
bdev_read_only() is working on the device in question.)

--Stephen
Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk
Hi,

On Thu, 2007-06-07 at 12:01 -0400, Mark Lord wrote:

> >> mount /var/lib/mythtv -oremount,ro
> >> sync
> >> umount /var/lib/mythtv
> >
> > Did this succeed?  If the application is still truncating that file, the
> > umount should have failed.
>
> Actually, what I expect to happen is for the remount,ro
> to block until the file deletion completes.  But it doesn't.

No -- all that the remount,ro sees is that there's a fd still open for
write.  It has no idea if or when that fd is going to get closed, so it
should fail.  Returning -EBUSY is the only thing it _can_ do: waiting
for the fs to be remountable might wait forever.

So the fs is still writable after that point.

--Stephen
Re: EXT3-fs error (device hda8): ext3_free_blocks: Freeing blocks not in datazone
Hi,

On Mon, 2005-09-05 at 17:24, Riccardo Castellani wrote:

> I'm using FC3 with Kernel 2.6.12-1.1376.
> After few hours file system on /dev/hda8 EXT3 partition has a problem so it
> remounted in only read mode.

> Sep  5 17:34:40 mrtg kernel: EXT3-fs error (device hda8): ext3_free_blocks:
> Freeing blocks not in datazone - block = 134217728, count = 1

That block number is 0x8000000 in hex.  It's a single-bit flip error;
that strongly sounds like hardware, and I'd run memtest86 on that box
next.

> Sep  5 17:34:40 mrtg kernel: Aborting journal on device hda8.
> Sep  5 17:34:40 mrtg kernel: EXT3-fs error (device hda8) in
> ext3_reserve_inode_write: Journal has aborted

...

> I tried several times to run fsck on this partition and I also tried to
> remount fs in a new partition, but it happened nothing !

What do you mean?  fsck found nothing wrong?  remount failed?  You
_did_ unmount before fscking, did you?

--Stephen
Re: [Linux-cluster] Re: GFS, what's remaining
Hi,

On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:

> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> >   no fencing needed for failed node that was mounted as specatator)
>
> I'd call it "real-read-only", and yes, that's very usefull
> mount.  Could we get it for ext3, too?

I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all.  But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.

But mount -o readonly gives you most of what you want already.  An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access.
That's not necessarily a good thing.

And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node.  So is
this really a useful combination?

About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find.  That's asking for data corruption,
but that may be better than getting no data at all.  But that is
something that could be done with a "-o skip-recovery" mount option,
which would necessarily imply always-readonly behaviour.

--Stephen
Re: [PATCH] Ext3 online resizing locking issue
Hi,

On Wed, 2005-08-31 at 12:35, Glauber de Oliveira Costa wrote:

> At a first look, i thought about locking gdt-related data.  But in a
> closer one, it seemed to me that we're in fact modifying a little bit
> more than that in the resize code.  But all these modifications seem to
> be somehow related to the ext3 super block specific data in
> ext3_sb_info.  My first naive approach would be adding a lock to that
> struct

I took great care when making that code SMP-safe to avoid such locks,
for performance reasons.  See the comments at

	 * We need to protect s_groups_count against other CPUs seeing
	 * inconsistent state in the superblock.

in fs/ext3/resize.c for the rules.

But basically the way it works is that we only usually modify data that
cannot be in use by other parts of the kernel --- and that's fairly easy
to guarantee, since by definition extending the fs is something that is
touching bits that aren't already in use.  Only once all the new data is
safely installed do we atomically update the s_groups_count field, which
instantly makes the new data visible.  We enforce this ordering via smp
read barriers before reading s_groups_count and write barriers after
modifying it, but we don't actually have locks as such.

The only use of locking in the resize is hence the superblock lock,
which is not really there to protect the resize from the rest of the fs
--- the s_groups_count barriers do that.  All the sb lock is needed for
is to prevent two resizes from progressing at the same time; and that
could easily be abstracted into a separate resize lock.

Cheers,
 Stephen
Re: [PATCH] Ext3 online resizing locking issue
Hi,

On Thu, 2005-08-25 at 21:43, Glauber de Oliveira Costa wrote:

> Just a question here.  With s_lock held by the remount code, we're
> altering the struct super_block, and believing we're safe.  We try to
> acquire it inside the resize functions, because we're trying to modify
> this same data.  Thus, if we rely on another lock, aren't we probably
> messing up something ?

The two different uses of the superblock lock are really quite
different; I don't see any particular problem with using two different
locks for the two different things.  Mount and the namespace code are
not locking the same thing --- the fact that the resize code uses the
superblock lock is really a historical side-effect of the fact that we
used to use the same overloaded superblock lock in the ext2/ext3 block
allocation layers to guard bitmap access.

--Stephen
Re: [patch] Real-Time Preemption, -RT-2.6.13-rc4-V0.7.52-01
Hi,

On Fri, 2005-08-26 at 12:20, Steven Rostedt wrote:

> > could you try a), how clean does it get?  Personally i'm much more in
> > favor of cleanliness.  On the vanilla kernel a spinlock is zero bytes on
> > UP [the most RAM-sensitive platform], and it's a word on typical SMP.

It's a word, maybe; but it's a word used only by ext3 afaik, and it's
getting added to the core buffer_head.  Not very nice.

It certainly looks like the easiest short-term way out for a
development patch series, though.

--Stephen
Re: [patch] Real-Time Preemption, -RT-2.6.13-rc4-V0.7.52-01
Hi,

On Fri, 2005-08-26 at 05:24, Steven Rostedt wrote:

> Well, I just spent several hours trying to use the b_update_lock in
> implementing something to replace the bit spinlocks for RT.  It's
> getting really ugly and I just hit a stone wall.
>
> The problem is that I have two locks to work with.  A
> jbd_lock_bh_journal_head and a jbd_lock_bh_state.

...

For now, yes.

> So, the only other solutions that I can think of is:
>
> a) add yet another (bloat) lock to the buffer head.

This one looks like the right answer for now, just to get the patch
series running.

I've got a WIP patch under development which removes the
bh_journal_head lock entirely; if that works out, you may find things
get a bit easier.

--Stephen
Re: [PATCH] Ext3 online resizing locking issue
Hi,

On Wed, 2005-08-24 at 22:03, Glauber de Oliveira Costa wrote:

> This simple patch provides a fix for a locking issue found in the online
> resizing code.  The problem actually happened while trying to resize the
> filesystem trough the resize=xxx option in a remount.

NAK, this is wrong:

> +	lock_super(sb);
>  	err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
> +	unlock_super(sb);

This basically reverses the order of locking between lock_super() and
journal_start() (the latter acts like a lock because it can block on a
resource if the journal is too full for the new transaction.)  That's
the opposite order to normal, and will result in a potential deadlock.

> +	{Opt_resize, "resize=%u"},
>  	{Opt_err, NULL},
> -	{Opt_resize, "resize"},

Right, that's disabled for now.  I guess the easy fix here is just to
remove the code entirely, given that we have locking problems with
trying to fix it!

But the _right_ fix, if you really want to keep that code, is probably
to move all the resize locking to a separate lock that ranks outside
the journal_start.  The easy workaround is to drop the superblock lock
and reacquire it around the journal_start(); it would be pretty easy to
make that work robustly as far as ext3 is concerned, but I suspect
there may be VFS-layer problems if we start dropping the superblock
lock in the middle of the s_ops->remount() call --- Al?

--Stephen
Re: [-mm PATCH 2/32] fs: fix-up schedule_timeout() usage
Hi,

On Mon, 2005-08-15 at 19:08, Nishanth Aravamudan wrote:

> Description: Use schedule_timeout_{,un}interruptible() instead of
> set_current_state()/schedule_timeout() to reduce kernel size.

> +++ 2.6.13-rc5-mm1-dev/fs/jbd/transaction.c	2005-08-10 15:03:33.000000000 -0700
> @@ -1340,8 +1340,7 @@ int journal_stop(handle_t *handle)
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> -		schedule_timeout(1);
> +		schedule_timeout_uninterruptible(1);

This chunk at least is fine.

--Stephen
Inotify patch missed arch/x86_64/ia32/sys_ia32.c
Hi,

The inotify patch just added a line

	+	fsnotify_open(f->f_dentry);

to sys_open, but it missed the x86_64 compatibility sys32_open()
equivalent in arch/x86_64/ia32/sys_ia32.c.

Andi, perhaps it's time to factor out the guts of sys_open from the
flag munging to keep as much of that code as common as possible, and
avoid this sort of maintenance problem in the future?

--Stephen
Re: [PATCH] kjournald() missing JFS_UNMOUNT check
Hi,

On Mon, 2005-07-11 at 22:19, Mark Fasheh wrote:

> Can we please merge this patch?  I sent it to ext2-devel for comments
> last week and haven't hear anything back.  It seems trivially correct
> and is testing fine - famous last words, I know :)

ACK --- looks fine to me.

--Stephen
Re: [PATCH] Fix race in do_get_write_access()
Hi,

On Mon, 2005-07-11 at 17:10, Jan Kara wrote:

> attached patch should fix the following race:

...

> and we have sent wrong data to disk...  We now clean the dirty buffer
> flag under buffer lock in all cases and hence we know that whenever a
> buffer is starting to be journaled we either finish the pending
> write-out before attaching a buffer to a transaction or we won't write
> the buffer until the transaction is going to be committed...  Please
> apply.

Looks good to me.  Btw, how did you find this?  Were you able to
reproduce this in practice?

Cheers,
 Stephen
Re: [PATCH] Fix JBD race in t_forget list handling
Hi,

On Mon, 2005-07-11 at 16:34, Jan Kara wrote:

> attached patch should close the possible race between
> journal_commit_transaction() and journal_unmap_buffer() (which adds
> buffers to committing transaction's t_forget list) that could leave
> some buffers on transaction's t_forget list (hence leading to an
> assertion failure later when transaction is dropped).  The patch is
> against 2.6.13-rc2 kernel.  The race was really happening to David Wilk
> <[EMAIL PROTECTED]> (thanks for testing) so please apply if you find
> the patch correct.

Indeed we can be adding to the committing transaction's t_forget list
here.  The fix looks correct: we're taking the j_list_lock anyway in
that loop for the checkpoint processing, so we're not taking it any
more frequently by grabbing it for the loop condition itself too ---
just holding it a little longer.

--Stephen
Re: ext3 allocate-with-reservation latencies
Hi,

On Fri, 2005-04-15 at 21:32, Mingming Cao wrote:
> Sorry for the delaying. I was not in office these days.

No problem.

> > > Also I am concerned about the possible
> > > starvation on writers.
> > In what way?
> I was worried about the rw lock case.:)

OK, so we're both on the same track here. :-)

> > Note that there _is_ some additional complexity here. It's not entirely
> > free. Specifically, if we have other tasks waiting on our provisional
> > window, then we need to be very careful about the life expectancy of the
> > wait queues involved, so that if the first task fixes its window and
> > then deletes it before the waiters have woken up, they don't get
> > confused by the provisional window vanishing while being waited for.
> This approach(provisional window) sounds interesting to me, but it's a
> little more complicated than I thought:(

Yep. Once you have other tasks waiting on your window while it's not locked, object lifetime becomes a much bigger problem.

> alloc_new_reservation()
> retry:
> 	lock the tree
> 	search for a new space start from the given startblock
> 	found a new space
> 	if the new space is not a "provisional window"

I was thinking rather that we _start_ with the window, and _then_ look for new space. So we'd start with:

	if we already have a window, mark it provisional;
	else,
		search for a new window;
		if the immediately preceding window is provisional,
			wait for that window; continue;
		allocate the window and mark it provisional;
		break

At this point we have a provisional window, and we know that nobody else is going to allocate either in it, or in the empty space following it (because if they tried to, they'd bump into our own provisional window as their predecessor and would wait for us.) So even if the window doesn't occupy the _whole_ unreserved area, we can still keep searching without fear of multiple tasks trying to do so in the same space at the same time.
--Stephen
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi,

On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote:
> I have measured the bh refcount before the buffer_uptodate() for a few days.
> I found out that the bh refcount sometimes reached to 0 .
> So, I think following modifications are effective.
>
> diff -Nru 2.4.30-rc3/fs/jbd/commit.c 2.4.30-rc3_patch/fs/jbd/commit.c
> --- 2.4.30-rc3/fs/jbd/commit.c	2005-04-06 17:14:47.0 +0900
> +++ 2.4.30-rc3_patch/fs/jbd/commit.c	2005-04-06 17:18:49.0 +0900
> @@ -302,11 +303,14 @@
>  		if (unlikely(!buffer_uptodate(bh)))
>  			err = -EIO;
>  		/* the journal_head may have been removed now */
> +		put_bh(bh);
>  		lock_journal(journal);
>  		goto write_out_data;
...

In further testing, this chunk is causing some problems. It is entirely legal for a buffer to be !uptodate here, although the path is somewhat convoluted. The trouble is, once the IO has completed, journal_dirty_data() can steal the buffer from the committing transaction. And once that happens, journal_unmap_buffer() can then release the buffer and clear the uptodate bit. I've run this under buffer tracing and can show you this exact path on a trace. It only takes a few seconds to reproduce under fsx.

For this to have happened, though, journal_dirty_data needs to have waited on the data, so it has an opportunity to see the !uptodate state at that point. I think it's safe enough to sidestep this by double-checking to see whether the bh is still on the committing transaction after we've waited for the buffer and retaken journaling locks, but before testing buffer_uptodate().

--Stephen
Re: ext3 allocate-with-reservation latencies
Hi,

On Wed, 2005-04-13 at 00:27, Mingming Cao wrote:
> > I wonder if there's not a simple solution for this --- mark the window
> > as "provisional", and if any other task tries to allocate in the space
> > immediately following such a window, it needs to block until that window
> > is released.
> Sounds interesting. However that implies we need a write lock to mark
> the window as provisional and block other files looking for windows near
> it: we need to insert the provisional window into the tree and then mark
> it as a temporary window, to really let other file notice this kind of
> "hold".

We need a lock for the tree modification, yes.

> I wonder if the benefit of read/write lock is worth all the hassles now.

The point of the provisional window is that you don't need read/write locks at all. The provisional window lets you unlock the tree completely while you do the bitmap scan, so there's really no reason to have rwlocks for the tree any more.

> If the new window's neighbor stays the same, we only need to roll
> forward once. If not, after a successful scan, we need to grab the
> write lock, and make sure the window is still there.

When we take the provisional window, we can make a note of how much space we have before the next window. And because all future allocators will stall if they try to allocate at this point due to the provisional window, we know that that space will remain outside any other window until we come to fix the provisional window and potentially roll it forward to the space we found.

> If we dropped the
> lock without holding the new space, we have to search the whole tree
> again to find out if the space is still there

As long as the space is within the area between the provisional window and its successor, we don't need to do that. (It gets more complex if the bitmap search returns a bit _beyond_ the next window, though.)

> Also I am concerned about the possible
> starvation on writers.

In what way?

> I am thinking, maybe back to the spin_lock is not that bad with the
> "mark provisional" suggest you made?

Right, that was the intent --- sorry if I didn't make it clear.

> It allows us to mark the new space
> as provisional if we find a new space(prevent other window searching run
> into the same neighborhood). We could release the lock and scan the
> bitmap without worry about the new space will be taking by others.

Exactly.

Note that there _is_ some additional complexity here. It's not entirely free. Specifically, if we have other tasks waiting on our provisional window, then we need to be very careful about the life expectancy of the wait queues involved, so that if the first task fixes its window and then deletes it before the waiters have woken up, they don't get confused by the provisional window vanishing while being waited for. The easy solution is a global wait queue for that, but that doesn't scale well. The complex solution is a per-window wait queue and reference count, which is obviously a bit of bloat, though probably worth it for the high-load case.

Cheers,
Stephen
Re: Problem in log_do_checkpoint()?
Hi,

On Mon, 2005-04-11 at 12:36, Jan Kara wrote:
> > The prevention of multiple writes in this case should also improve
> > performance a little.
> >
> > That ought to be pretty straightforward, I think. The existing cases
> > where we remove buffers from a checkpoint shouldn't have to care about
> > which list_head we're removing from; those cases already handle buffers
> > in both states. It's only when doing the flush/wait that we have to
> > distinguish the two.
> Yes, AFAICS the changes should remain local to the checkpointing code
> (plus __unlink_buffer()). Should I write the patch or will you?

Feel free, but please let me know if you start. I'm doing a bit of chasing of leaks and dealing with that O_SYNC thing for 2.4 right now, but I'll get back to the checkpoint code after that if you haven't started by then.

Cheers,
Stephen
Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests
Hi,

On Wed, 2005-04-06 at 02:23, Andrew Morton wrote:
> Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to
> be 100x slower. Which is fortunate because this one is going to take a
> long time to fix. I'll poke at it some more.

OK, I'm now at the stage where I can kick off that fsx test on a kernel without your leak fix, kill it, umount and get

Whoops: found 43 unfreeable buffers still on the superblock debug list for sb 0100296b2d48. Tracing one...
buffer trace for buffer at 0x01003edaa9c8 (I am CPU 0)
...

with a trace pointing to journal_unmap_buffer(). I'll try with the fix in place to see if there are any other cases showing up with the same problem.

--Stephen
Re: ext3 allocate-with-reservation latencies
Hi,

On Tue, 2005-04-12 at 07:41, Mingming Cao wrote:
> > Note that this may improve average case latencies, but it's not likely
> > to improve worst-case ones. We still need a write lock to install a new
> > window, and that's going to have to wait for us to finish finding a free
> > bit even if that operation starts using a read lock.
>
> Yes indeed. However nothing is free and there are always trade-offs.:)
>
> By worse case you mean multiple writes trying to allocate blocks around
> same area?

It doesn't matter where they are; multiple new file opens will all be looking for a write lock. You only need one long-held read lock and all the writers still block. The worst-case latencies can't be properly solved with r/w locks --- those let the readers go more quickly (assuming they are in the majority), which helps the average case, but writers still have to wait for exclusive access. We only really help them by dropping the lock entirely.

> Even if we take out the whole
> reservation, we still possibility run into this kind of latency: the
> bitmap on disk and on journal are extremely inconsistent so we need to
> keep searching them both until we find a free bit on both map.

Quite possibly. But as long as that code is running without locks, it's much easier to deal with those latencies: they won't impact other CPUs, cond_resched() is easier, and there's even CONFIG_PREEMPT.

> > I'm not really sure what to do about worst-case here. For that, we
> > really do want to drop the lock entirely while we do the bitmap scan.
> Hmm...if we drop the lock entirely while scan the bitmap, assuming you
> mean drop the read lock, then I am afraid we have to re-check the tree
> (require a read or write lock ) to see if the new window space is still
> there after the scan succeed.

Sure. You basically start off with a provisional window, and then if necessary roll it forward just the same way you roll normal windows forward when they get to their end.

That means you can still drop the lock while you search for new space. When you get there, reacquire the lock and check that the intervening space is still available. That's really cheap for the common case. The difficulty is when you have many parallel allocations hitting the same bg: they allocate provisional windows, find the same free area later on in the bg, and then stomp on each other as they try to move their windows there.

I wonder if there's not a simple solution for this --- mark the window as "provisional", and if any other task tries to allocate in the space immediately following such a window, it needs to block until that window is released.

--Stephen
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi,

On Mon, 2005-04-11 at 21:46, Andrew Morton wrote:
> "Stephen C. Tweedie" <[EMAIL PROTECTED]> wrote:
> >
> > Andrew, what was the exact illegal state of the pages you were seeing
> > when fixing that recent leak? It looks like it's nothing more complex
> > than dirty buffers on an anon page.
>
> Correct.

Good --- I think I've got the debugging solid against 2.4 for that case, patching against 2.6 now.

> > I think that simply calling
> > try_to_release_page() for all the remaining buffers at umount time will
>
> Presumably these pages have no ->mapping, so try_to_release_page() will
> call try_to_free_buffers().

Exactly.

> > The only thing that would be required on
> > top of that would be a check that the page is also on the VM LRU lists.
>
> Why do we have dirty buffers left over at umount time?

In the leak case we're worried about, the buffers are dirty but the page is anon so there's nothing to clean them up. The buffers _will_ be destroyed by unmount, sure; invalidate_bdev() should see to that. But I'm doing the bh leak check before we get to the final invalidate_bdev(), so they should still be available for testing at that point.

--Stephen
Re: ext3 allocate-with-reservation latencies
Hi,

On Mon, 2005-04-11 at 19:38, Mingming Cao wrote:
> I agree. We should not skip the home block group of the file. I guess
> what I was suggesting is, if allocation from the home group failed and
> we continuing the linear search the rest of block groups, we could
> probably try to skip the block groups without enough usable free blocks
> to make a reservation.

Fair enough. Once those are the only bgs left, performance is going to drop pretty quickly, but that's not really avoidable on a very full filesystem.

> Ah.. I see the win with the read lock now: once the a reservation window
> is added, updating it (either winding it forward and or searching for a
> avaliable window) probably is the majorirty of the operations on the
> tree, and using read lock for that should help reduce the latency.

Right. The down side is that for things like a kernel "tar xf", we'll be doing lots of small file unpacks, and hopefully most files will be just one or two reservations --- so there's little winding forward going on. The searching will still be improved in that case.

Note that this may improve average case latencies, but it's not likely to improve worst-case ones. We still need a write lock to install a new window, and that's going to have to wait for us to finish finding a free bit even if that operation starts using a read lock.

I'm not really sure what to do about worst-case here. For that, we really do want to drop the lock entirely while we do the bitmap scan. That leaves two options. Hold a reservation while we do that; or don't. Holding one poses the problems we discussed before: either you hold a large reservation (bad for disk layout in the presence of concurrent allocators), or you hold smaller ones (high cost as you continually advance the window, which requires some read lock on the tree to avoid bumping into the next window.)

Just how bad would it be if we didn't hold a lock _or_ a window at all while doing the search for new window space? I didn't like that alternative at first because of the problem when you've got multiple tasks trying to allocate in the same space at the same time; but given the locking overhead of the alternatives, I'm wondering if this is actually the lesser evil.

--Stephen
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi,

On Thu, 2005-04-07 at 16:51, Stephen C. Tweedie wrote:
> I'm currently running with the buffer-trace debug patch, on 2.4, with a
> custom patch to put every buffer jbd ever sees onto a per-superblock
> list, and remove it only when the bh is destroyed in
> put_unused_buffer_head(). At unmount time, we can walk that list to
> find stale buffers attached to data pages (invalidate_buffers() already
> does such a walk for metadata.)

After a 50-process dbench run, unmount is showing ~300 buffers unclaimed; that seems to be OK, it's a large stress test doing _lots_ of create/unlink and we expect to see a few buffers around at the end where they were truncated during commit and couldn't be instantly reclaimed.

The main thing now is to test these buffers to make 100% sure that they are in a state where the VM can reclaim them. That looks fine: the buffers I see are unjournaled, have no jh, and are clean and on the BUF_CLEAN lru.

Andrew, what was the exact illegal state of the pages you were seeing when fixing that recent leak? It looks like it's nothing more complex than dirty buffers on an anon page. I think that simply calling try_to_release_page() for all the remaining buffers at umount time will be enough to catch these; if that function fails, it tells us that the VM can't reclaim these pages. The only thing that would be required on top of that would be a check that the page is also on the VM LRU lists.

--Stephen
Re: ext3 allocate-with-reservation latencies
Hi,

On Fri, 2005-04-08 at 19:10, Mingming Cao wrote:
> > It still needs to be done under locking to prevent us from expanding
> > over the next window, though. And having to take and drop a spinlock a
> > dozen times or more just to find out that there are no usable free
> > blocks in the current block group is still expensive, even if we're not
> > actually fully unlinking the window each time.
> Isn't this a rare case? The whole group is relatively full and the free
> bits are all reserved by other files.

Well, that's not much different from the case where there _are_ usable bits but they are all near the end of the bitmap. And that's common enough as you fill up a block group with small files, for example. Once the bg becomes nearly full, new file opens-for-write will still try to allocate in that bg (because that's where the inode was allocated), but as it's a new fd we have no prior knowledge of _where_ in the bh to allocate, and we'll have to walk it to the end to find any free space. This is the access pattern I'd expect of (for example) "tar xvjf linux-2.6.12.tar.bz2", not exactly a rare case.

> Probably we should avoid trying
> to make reservation in this block group at the first place

Not in this case, because it's the "home" of the file in question, and skipping to another bg would just leave useful space empty --- and that leads to fragmentation.

> You are proposing that we hold the read lock first, do the window search
> and bitmap scan, then once we confirm there is free bit in the candidate
> window, we grab the write lock and update the tree?

No, I'm suggesting that if we need the write lock for tree updates, we may still be able to get away with just a read lock when updating an individual window. If all we're doing is winding the window forwards a bit, that's not actually changing the structure of the tree.

> However I am still worried that the rw lock will allow concurrent files
> trying to lock the same window at the same time.

Adding a new window will need the write lock, always. But with the read lock, we can still check the neighbouring windows (the structure is pinned so those remain valid), so advancing a window with just that lock can still be done without risking overlapping the next window.

--Stephen
Re: Problem in log_do_checkpoint()?
Hi, On Fri, 2005-04-08 at 18:14, Badari Pulavarty wrote: > I get OOPs in log_do_checkpoint() while using ext3 quotas. > Is this anyway related to what you are working on ? > > Unable to handle kernel NULL pointer dereference at virtual address > Doesn't look like it, no. If we understand it right, Jan's problem would only ever manifest itself as an assert failure, not as an oops. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Thu, 2005-04-07 at 16:51, Stephen C. Tweedie wrote: > I'm currently running with the buffer-trace debug patch, on 2.4, with a > custom patch to put every buffer jbd ever sees onto a per-superblock > list, and remove it only when the bh is destroyed in > put_unused_buffer_head(). At unmount time, we can walk that list to > find stale buffers attached to data pages (invalidate_buffers() already > does such a walk for metadata.) After a 50-process dbench run, unmount is showing ~300 buffers unclaimed; that seems to be OK, it's a large stress test doing _lots_ of create/unlink and we expect to see a few buffers around at the end where they were truncated during commit and couldn't be instantly reclaimed. The main thing now is to test these buffers to make 100% sure that they are in a state where the VM can reclaim them. That looks fine: the buffers I see are unjournaled, have no jh, and are clean and on the BUF_CLEAN lru. Andrew, what was the exact illegal state of the pages you were seeing when fixing that recent leak? It looks like it's nothing more complex than dirty buffers on an anon page. I think that simply calling try_to_release_page() for all the remaining buffers at umount time will be enough to catch these; if that function fails, it tells us that the VM can't reclaim these pages. The only thing that would be required on top of that would be a check that the page is also on the VM LRU lists. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
Hi, On Mon, 2005-04-11 at 19:38, Mingming Cao wrote: > I agree. We should not skip the home block group of the file. I guess > what I was suggesting is, if allocation from the home group failed and > we continuing the linear search the rest of block groups, we could > probably try to skip the block groups without enough usable free blocks > to make a reservation. Fair enough. Once those are the only bgs left, performance is going to drop pretty quickly, but that's not really avoidable on a very full filesystem. > Ah.. I see the win with the read lock now: once the a reservation > window is added, updating it (either winding it forward and or > searching for a avaliable window) probably is the majorirty of the > operations on the tree, and using read lock for that should help > reduce the latency. Right. The down side is that for things like a kernel tar xf, we'll be doing lots of small file unpacks, and hopefully most files will be just one or two reservations --- so there's little winding forward going on. The searching will still be improved in that case. Note that this may improve average case latencies, but it's not likely to improve worst-case ones. We still need a write lock to install a new window, and that's going to have to wait for us to finish finding a free bit even if that operation starts using a read lock. I'm not really sure what to do about worst-case here. For that, we really do want to drop the lock entirely while we do the bitmap scan. That leaves two options. Hold a reservation while we do that; or don't. Holding one poses the problems we discussed before: either you hold a large reservation (bad for disk layout in the presence of concurrent allocators), or you hold smaller ones (high cost as you continually advance the window, which requires some read lock on the tree to avoid bumping into the next window.) Just how bad would it be if we didn't hold a lock _or_ a window at all while doing the search for new window space? 
I didn't like that alternative at first because of the problem when you've got multiple tasks trying to allocate in the same space at the same time; but given the locking overhead of the alternatives, I'm wondering if this is actually the lesser evil. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Mon, 2005-04-11 at 21:46, Andrew Morton wrote: > Stephen C. Tweedie [EMAIL PROTECTED] wrote: > > Andrew, what was the exact illegal state of the pages you were seeing > > when fixing that recent leak? It looks like it's nothing more complex > > than dirty buffers on an anon page. > Correct. Good --- I think I've got the debugging solid against 2.4 for that case, patching against 2.6 now. > > I think that simply calling try_to_release_page() for all the > > remaining buffers at umount time will > Presumably these pages have no ->mapping, so try_to_release_page() will > call try_to_free_buffers(). Exactly. > > The only thing that would be required on top of that would be a check > > that the page is also on the VM LRU lists. > Why do we have dirty buffers left over at umount time? In the leak case we're worried about, the buffers are dirty but the page is anon so there's nothing to clean them up. The buffers _will_ be destroyed by unmount, sure; invalidate_bdev() should see to that. But I'm doing the bh leak check before we get to the final invalidate_bdev(), so they should still be available for testing at that point. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
Hi, On Fri, 2005-04-08 at 00:37, Mingming Cao wrote: > Actually, we do not have to do an rbtree link and unlink for every > window we search. If the reserved window(old) has no free bit and the > new reservable window's is right after the old one, no need to unlink > the old window from the rbtree and then link the new window, just update > the start and end of the old one. It still needs to be done under locking to prevent us from expanding over the next window, though. And having to take and drop a spinlock a dozen times or more just to find out that there are no usable free blocks in the current block group is still expensive, even if we're not actually fully unlinking the window each time. I wonder if this can possibly be done safely without locking? It would be really good if we could rotate windows forward with no global locks. At the very least, a fair rwlock would let us freeze the total layout of the tree, while still letting us modify individual windows safely. As long as we use wmb() to make sure that we always extend the end of the window before we shrink the start of it, I think we could get away with such shared locking. And rw locking is much better for concurrency, so we might be able to hold it over the whole bitmap search rather than taking it and dropping it at each window advance. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem in log_do_checkpoint()?
Hi, On Mon, 2005-04-04 at 10:04, Jan Kara wrote: > In log_do_checkpoint() we go through the t_checkpoint_list of a > transaction and call __flush_buffer() on each buffer. Suppose there is > just one buffer on the list and it is dirty. __flush_buffer() sees it and > puts it to an array of buffers for flushing. Then the loop finishes, > retry=0, drop_count=0, batch_count=1. So __flush_batch() is called - we > drop all locks and sleep. While we are sleeping somebody else comes and > makes the buffer dirty again (OK, that is not probable, but I think it > could be possible). Yes, there's no reason why not at that point. > Now we wake up and call __cleanup_transaction(). > It's not able to do anything and returns 0. I think the _right_ answer here is to have two separate checkpoint lists: the current one, plus one for which the checkpoint write has already been submitted. That way, we can wait for IO completion on submitted writes without (a) getting conned into doing multiple rewrites if there's somebody else dirtying the buffer; or (b) getting confused about how much progress we're making. Buffers on the pre-write list get written; buffers on the post-write list get waited for; and both count as progress (eliminating the false assert-failure when we failed to detect progress). The prevention of multiple writes in this case should also improve performance a little. That ought to be pretty straightforward, I think. The existing cases where we remove buffers from a checkpoint shouldn't have to care about which list_head we're removing from; those cases already handle buffers in both states. It's only when doing the flush/wait that we have to distinguish the two. Sounds good? --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Wed, 2005-04-06 at 21:10, Stephen C. Tweedie wrote: > However, 2.6 is suspected of still having leaks in ext3. To be certain > that we're not just backporting one of those to 2.4, we need to > understand who exactly is going to clean up these bh's if they are in > fact unused once we complete this code and do the final put_bh(). > > I'll give this patch a spin with Andrew's fsx-based leak stress and see > if anything unpleasant appears. I'm currently running with the buffer-trace debug patch, on 2.4, with a custom patch to put every buffer jbd ever sees onto a per-superblock list, and remove it only when the bh is destroyed in put_unused_buffer_head(). At unmount time, we can walk that list to find stale buffers attached to data pages (invalidate_buffers() already does such a walk for metadata.) I just ran it with a quick test and it found 300,000 buffers still present at unmount. Whoops, I guess I need to move the check to _after_ the final invalidate_inodes() call. :-) But this method ought to allow us to test for this sort of leak a lot more reliably, and to report via the usual buffer history tracing the most recent activity on any bh that does leak. Andrew, I'll give this a try on 2.6 once I've got this 2.4 patch tested for Marcelo. I've got uptodate buffer_trace for 2.6 anyway, so it shouldn't be too hard. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
Hi, On Thu, 2005-04-07 at 09:14, Ingo Molnar wrote: > doesnt the first option also allow searches to be in parallel? In terms of CPU usage, yes. But either we use large windows, in which case we *can't* search remotely near areas of the disk in parallel; or we use small windows, in which case failure to find a free bit becomes disproportionately expensive (we have to do an rbtree link and unlink for each window we search.) > This > particular one took over 1 msec, so it seems there's a fair room for > parallellizing multiple writers and thus improving scalability. (or is > this codepath serialized globally anyway?) No, the rbtree ops are serialised per-sb, and allocations within any one inode are serialised, but the bitmap searches need not be serialised globally. (They are in the case of new window searches right now, which is what we're trying to fix.) --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote: > I have measured the bh refcount before the buffer_uptodate() for a few days. > I found out that the bh refcount sometimes reached to 0 . > So, I think following modifications are effective. Thanks --- it certainly looks like this should fix the dereferencing of invalid bh's, and this code mirrors what 2.6 does around that area. However, 2.6 is suspected of still having leaks in ext3. To be certain that we're not just backporting one of those to 2.4, we need to understand who exactly is going to clean up these bh's if they are in fact unused once we complete this code and do the final put_bh(). I'll give this patch a spin with Andrew's fsx-based leak stress and see if anything unpleasant appears. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
Hi, On Wed, 2005-04-06 at 17:53, Mingming Cao wrote: > > Possible, but not necessarily nice. If you've got a nearly-full disk, > > most bits will be already allocated. As you scan the bitmaps, it may > > take quite a while to find a free bit; do you really want to (a) lock > > the whole block group with a temporary window just to do the scan, or > > (b) keep allocating multiple smaller windows until you finally find a > > free bit? The former is bad for concurrency if you have multiple tasks > > trying to allocate nearby on disk --- you'll force them into different > > block groups. The latter is high overhead. > I am not quite understand what you mean about (a). In this proposal, we > will drop the lock before the scan. s/lock/reserve/. > And for (b), maybe I did not make myself clear: I am not proposing to > keeping allocating multiple smaller windows until finally find a free > bit. I mean, we book the window(just link the node into the tree) before > we drop the lock, if there is no free bit inside that window, we will go > back search for another window(call find_next_reserveable_window()), > inside it, we will remove the temporary window we just created and find > next window. SO we only have one temporary window at a time. And that's the problem. Either we create small temporary windows, in which case we may end up thrashing through vast numbers of them before we find a bit that's available --- very expensive as the disk gets full --- or we use large windows but get worse layout when there are parallel allocators going on. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote: > >Certainly it's normal for a short read/write to imply either error or > >EOF, without the error necessarily needing to be returned explicitly. > >I'm not convinced that the Singleunix language actually requires that, > >but it seems the most obvious and consistent behaviour. > When an O_SYNC flag is set , if commit_write() succeed but > generic_osync_inode() return > error due to I/O failure, write() must fail . Yes. But it is conventional to interpret a short write as being a failure. Returning fewer bytes than were requested in the write indicates that the rest failed. It just doesn't give the exact nature of the failure (EIO vs ENOSPC etc.) For regular files, a short write is never permitted unless there are errors of some description. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
Hi, On Wed, 2005-04-06 at 06:35, Mingming Cao wrote: > It seems we are holding the rsv_block while searching the bitmap for a > free bit. Probably something to avoid! > In alloc_new_reservation(), we first find a available to > create a reservation window, then we check the bitmap to see if it > contains any free block. If not, we will search for next available > window, so on and on. During the whole process we are holding the global > rsv_lock. We could, and probably should, avoid that. Just unlock the > rsv_lock before the bitmap search and re-grab it after it. We need to > make sure that the available space that are still available after we re- > grab the lock. Not necessarily. As long as the windows remain mutually exclusive in the rbtree, it doesn't matter too much if we occasionally allocate a full one --- as long as that case is rare, the worst that happens is that we fail to allocate from the window and have to repeat the window reserve. The difficulty will be in keeping it rare. What we need to avoid is the case where multiple tasks need a new window, they all drop the lock, find the same bits free in the bitmap, then all try to take that window. One will succeed, the others will fail; but as the files in that bit of the disk continue to grow, we risk those processes *continually* repeating the same stomp-on-each-others'-windows operation and raising the allocation overhead significantly. > Another option is to hold that available window before we release the > rsv_lock, and if there is no free bit inside that window, we will remove > it from the tree in the next round of searching for next available > window. Possible, but not necessarily nice. If you've got a nearly-full disk, most bits will be already allocated. 
As you scan the bitmaps, it may take quite a while to find a free bit; do you really want to (a) lock the whole block group with a temporary window just to do the scan, or (b) keep allocating multiple smaller windows until you finally find a free bit? The former is bad for concurrency if you have multiple tasks trying to allocate nearby on disk --- you'll force them into different block groups. The latter is high overhead. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
Hi, On Wed, 2005-04-06 at 06:35, Mingming Cao wrote: It seems we are holding the rsv_block while searching the bitmap for a free bit. Probably something to avoid! In alloc_new_reservation(), we first find a available to create a reservation window, then we check the bitmap to see if it contains any free block. If not, we will search for next available window, so on and on. During the whole process we are holding the global rsv_lock. We could, and probably should, avoid that. Just unlock the rsv_lock before the bitmap search and re-grab it after it. We need to make sure that the available space that are still available after we re- grab the lock. Not necessarily. As long as the windows remain mutually exclusive in the rbtree, it doesn't matter too much if we occasionally allocate a full one --- as long as that case is rare, the worst that happens is that we fail to allocate from the window and have to repeat the window reserve. The difficulty will be in keeping it rare. What we need to avoid is the case where multiple tasks need a new window, they all drop the lock, find the same bits free in the bitmap, then all try to take that window. One will succeed, the others will fail; but as the files in that bit of the disk continue to grow, we risk those processes *continually* repeating the same stomp-on-each-others'-windows operation and raising the allocation overhead significantly. Another option is to hold that available window before we release the rsv_lock, and if there is no free bit inside that window, we will remove it from the tree in the next round of searching for next available window. Possible, but not necessarily nice. If you've got a nearly-full disk, most bits will be already allocated. As you scan the bitmaps, it may take quite a while to find a free bit; do you really want to (a) lock the whole block group with a temporary window just to do the scan, or (b) keep allocating multiple smaller windows until you finally find a free bit? 
The former is bad for concurrency if you have multiple tasks trying to allocate nearby on disk --- you'll force them into different block groups. The latter is high overhead. --Stephen - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote: Certainly it's normal for a short read/write to imply either error or EOF, without the error necessarily needing to be returned explicitly. I'm not convinced that the Singleunix language actually requires that, but it seems the most obvious and consistent behaviour. When an O_SYNC flag is set , if commit_write() succeed but generic_osync_inode() return error due to I/O failure, write() must fail . Yes. But it is conventional to interpret a short write as being a failure. Returning less bytes than were requested in the write indicates that the rest failed. It just doesn't give the exact nature of the failure (EIO vs ENOSPC etc.) For regular files, a short write is never permitted unless there are errors of some description. --Stephen - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 allocate-with-reservation latencies
Hi, On Wed, 2005-04-06 at 17:53, Mingming Cao wrote: > > Possible, but not necessarily nice. If you've got a nearly-full disk, most bits will be already allocated. As you scan the bitmaps, it may take quite a while to find a free bit; do you really want to (a) lock the whole block group with a temporary window just to do the scan, or (b) keep allocating multiple smaller windows until you finally find a free bit? The former is bad for concurrency if you have multiple tasks trying to allocate nearby on disk --- you'll force them into different block groups. The latter is high overhead. > I am not quite understand what you mean about (a). In this proposal, we will drop the lock before the scan. s/lock/reserve/. > And for (b), maybe I did not make myself clear: I am not proposing to keeping allocating multiple smaller windows until finally find a free bit. I mean, we book the window (just link the node into the tree) before we drop the lock; if there is no free bit inside that window, we will go back search for another window (call find_next_reserveable_window()), inside it, we will remove the temporary window we just created and find next window. So we only have one temporary window at a time. And that's the problem. Either we create small temporary windows, in which case we may end up thrashing through vast numbers of them before we find a bit that's available --- very expensive as the disk gets full --- or we use large windows but get worse layout when there are parallel allocators going on. --Stephen
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Wed, 2005-04-06 at 11:01, Hifumi Hisashi wrote: > I have measured the bh refcount before the buffer_uptodate() for a few days. I found out that the bh refcount sometimes reached to 0. So, I think the following modifications are effective. Thanks --- it certainly looks like this should fix the dereferencing of invalid bh's, and this code mirrors what 2.6 does around that area. However, 2.6 is suspected of still having leaks in ext3. To be certain that we're not just backporting one of those to 2.4, we need to understand who exactly is going to clean up these bh's if they are in fact unused once we complete this code and do the final put_bh(). I'll give this patch a spin with Andrew's fsx-based leak stress and see if anything unpleasant appears. Cheers, Stephen
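The fix under test boils down to the classic pin-before-inspect pattern: take a reference so b_count cannot reach zero (and the bh be freed) while its state is examined. A simplified userspace model with stand-in types, not the real 2.4 buffer_head:

```c
#include <assert.h>

/* Minimal stand-in for the kernel's buffer_head refcounting. */
struct buffer_head {
    int b_count;     /* 0 means the bh may be freed at any moment */
    int b_uptodate;
};

static void get_bh(struct buffer_head *bh) { bh->b_count++; }
static void put_bh(struct buffer_head *bh) { bh->b_count--; }

/* Inspect the buffer's state only while it is pinned. */
static int buffer_uptodate_pinned(struct buffer_head *bh)
{
    get_bh(bh);              /* refcount can no longer drop to zero */
    int up = bh->b_uptodate;
    put_bh(bh);              /* whoever does the final put_bh() cleans up */
    return up;
}
```

The open question in the mail --- who cleans up an unused bh after the final put_bh() --- is exactly the part this toy model glosses over.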
Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)
Hi, On Wed, 2005-03-30 at 12:59, Marcelo Tosatti wrote: > > I'm not certain that this is right, but it seems possible and would > > explain the symptoms. Maybe Stephen or Andrew could comments? > > Andrew, Stephen? Sorry, was offline for a week last week; I'll try to look at this more closely tomorrow. Checking the buffer_uptodate() without either a refcount or a lock certainly looks unsafe at first glance. There are lots of ways to pin the bh in that particular bit of the code. The important thing will be to do so without causing leaks if we're truly finished with the buffer after this flush. > > If some of the write succeeded and some failed, then I believe the > > correct behaviour is to return the number of bytes that succeeded. > > However this change to the return status (remember the above patch is > > a reversal) causes any failure to over-ride any success. This, I > > think, is wrong. > > Yeap, that part also looks wrong. Certainly it's normal for a short read/write to imply either error or EOF, without the error necessarily needing to be returned explicitly. I'm not convinced that the Singleunix language actually requires that, but it seems the most obvious and consistent behaviour. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: OOM problems on 2.6.12-rc1 with many fsx tests
Hi, On Mon, 2005-04-04 at 02:35, Andrew Morton wrote: > Without the below patch it's possible to make ext3 leak at around a > megabyte per minute by arranging for the fs to run a commit every 50 > milliseconds, btw. Ouch! > (Stephen, please review...) Doing so now. > The patch teaches journal_unmap_buffer() about buffers which are on the > committing transaction's t_locked_list. These buffers have been written and > I/O has completed. Agreed. The key here is that the buffer is locked before journal_unmap_buffer() is called, so we can indeed rely on it being safely on disk. > We can take them off the transaction and undirty them > within the context of journal_invalidatepage()->journal_unmap_buffer(). Right - the committing transaction can't be doing any more writes, and the current transaction has explicitly told us to throw away its own writes if we get here. Unfiling the buffer should be safe. > + if (jh->b_jlist == BJ_Locked) { > + /* > + * The buffer is on the committing transaction's locked > + * list. We have the buffer locked, so I/O has > + * completed. So we can nail the buffer now. > + */ > + may_free = __dispose_buffer(jh, transaction); > + goto zap_buffer; > + } ACK. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 journalling BUG on full filesystem
Hi, On Thu, 2005-03-24 at 19:38, Chris Wright wrote: > OK, good to know. When I last checked you were working on a higher risk > yet more complete fix, and I thought we'd wait for that one to stabilize. > Looks like the one Jan attached is the better -stable candidate? Definitely; it's the one I gave akpm. The lock reworking is going to remove one layer of locks, so it's worthwhile from that point of view; but it's longer-term, and I don't know for sure of any paths to chaos with that simpler journal_unmap_buffer() fix in place. (It's just very hard to _prove_ all cases are correct without the locking rework.) --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ext3 journalling BUG on full filesystem
Hi, On Thu, 2005-03-24 at 10:39, Jan Kara wrote: > Actually the patch you atached showed in the end as not covering all > the cases and so Stephen agreed to stay with the first try (attached) > which should cover all known cases (although it's not so nice). Right. The later patch is getting reworked into a proper locking overhaul for the journal_put_journal_head() code. The earlier one (that Jan attached) is the one that's appropriate in the mean time; it covers all of the holes we know about for sure and has proven robust in testing. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] ext3 bug in ftruncate() with O_SYNC?
Hi, On Tue, 2005-03-22 at 03:51, Andrew Morton wrote: > The spec says "Write I/O operations on the file descriptor shall complete > as defined by synchronized I/O file integrity completion". > > Is ftruncate a "write I/O operation"? No. SUS seems to be pretty clear on this. The syscall descriptions for write(2) and pwrite(2) explicitly describe O_SYNC as requiring synchronized I/O file integrity completion. ftruncate() has no such requirement. It would certainly be a reasonable thing to do, but I don't think it strictly counts as a bug that we're not honouring O_SYNC here. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
e2fsprogs bug [was Re: ext2/3 file limits to avoid overflowing i_blocks]
Hi, On Thu, 2005-03-17 at 17:23, Stephen C. Tweedie wrote: > I wrote a small program to calculate the total indirect tree overhead > for any given file size, and 0x1ff7fffe000 turned out to be the largest > file we can get without the total i_blocks overflowing 2^32. > > But in testing, that *just* wrapped --- we need to limit the file to be > one page smaller than that to deal with the possibility of an EA/ACL > block being accounted against i_blocks. On a side issue, e2fsck was unable to find any problem on that filesystem after the i_blocks had wrapped exactly to zero. The bug seems to be in e2fsck/pass1.c: we do the numblocks checking inside process_block(), which is called as an inode block iteration function in check_blocks(). Then, later, check_blocks() does if (inode->i_file_acl && check_ext_attr(ctx, pctx, block_buf)) pb.num_blocks++; pb.num_blocks *= (fs->blocksize / 512); but without any further testing to see if pb.num_blocks has exceeded the max_blocks. So by the time we've got to the end of check_blocks(), we're testing the wrapped i_blocks on disk against the wrapped num_blocks in memory, and so e2fsck fails to notice anything wrong. The fix may be as simple as just moving the if (inode->i_file_acl && check_ext_attr(ctx, pctx, block_buf)) pb.num_blocks++; earlier in the function; Ted, do you see any problems with that? --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
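The essence of the suggested fix is an ordering-and-width issue: the EA block must be counted, and the overflow test performed, before the 512-byte-sector scaling can wrap a 32-bit counter the same way i_blocks wrapped on disk. A toy model (not the real pass1.c code; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the inode's block count would wrap the 32-bit on-disk
 * i_blocks field.  The count is kept in 64 bits and the EA/ACL block is
 * added *before* scaling, so the comparison happens pre-wraparound. */
static int blocks_overflowed(uint64_t num_fs_blocks, int has_acl_block,
                             unsigned int blocksize)
{
    uint64_t sectors;

    if (has_acl_block)
        num_fs_blocks++;            /* account the EA block first */
    sectors = num_fs_blocks * (blocksize / 512);
    return sectors > 0xffffffffULL; /* would wrap on-disk i_blocks */
}
```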
[Patch] ext2/3 file limits to avoid overflowing i_blocks
Hi all, As discussed before, we can overflow i_blocks in ext2/ext3 inodes by growing a file up to 2TB. That gives us 2^32 sectors of data in the file; but once you add on the indirect tree and possible EA/ACL metadata, i_blocks will wrap beyond 2^32. Consensus seemed to be that the best way to avoid this was simply to stop files getting so large that this was a problem in the first place; anything else would lead to complications if a sparse file tried to overflow that 2^32 sector limit while filling in holes. I wrote a small program to calculate the total indirect tree overhead for any given file size, and 0x1ff7fffe000 turned out to be the largest file we can get without the total i_blocks overflowing 2^32. But in testing, that *just* wrapped --- we need to limit the file to be one page smaller than that to deal with the possibility of an EA/ACL block being accounted against i_blocks. So this patch has been tested, at least on ext3, by letting a file grow densely to its maximum size permitted by the kernel; at 0x1ff7fffe000, stat shows the file to have wrapped back exactly to 0 st_blocks, but with the limit at 0x1ff7fffd000, du shows it occupying the expected 2TB-blocksize bytes. Cheers, Stephen [Patch] ext2/3 file limits to avoid overflowing i_blocks ext2/3 can currently overflow i_blocks past the maximum 2^32 sector (2TB) limit. The ext[23]_max_size functions try to enforce an upper limit of 2^32-1 blocks by limiting the file size to 2TB-blocksize, but this fails to account for indirect block metadata and the possibility of an extended attribute block also being accounted against i_blocks. The new file size limit of 0x1ff7fffd000 allows a file to grow to 2^32-blocksize sectors, assuming 4k blocksize. (We don't have to worry about smaller blocksizes as those have indirect tree limits that don't let their size grow near this upper bound in the first place.) 
Signed-off-by: Stephen Tweedie <[EMAIL PROTECTED]> --- linux-2.6-ext3/fs/ext2/super.c.=K0001=.orig +++ linux-2.6-ext3/fs/ext2/super.c @@ -518,12 +518,18 @@ static int ext2_check_descriptors (struc static loff_t ext2_max_size(int bits) { loff_t res = EXT2_NDIR_BLOCKS; + /* This constant is calculated to be the largest file size for a +* dense, 4k-blocksize file such that the total number of +* sectors in the file, including data and all indirect blocks, +* does not exceed 2^32. */ + const loff_t upper_limit = 0x1ff7fffd000LL; + res += 1LL << (bits-2); res += 1LL << (2*(bits-2)); res += 1LL << (3*(bits-2)); res <<= bits; - if (res > (512LL << 32) - (1 << bits)) - res = (512LL << 32) - (1 << bits); + if (res > upper_limit) + res = upper_limit; return res; } --- linux-2.6-ext3/fs/ext3/super.c.=K0001=.orig +++ linux-2.6-ext3/fs/ext3/super.c @@ -1193,12 +1193,18 @@ static void ext3_orphan_cleanup (struct static loff_t ext3_max_size(int bits) { loff_t res = EXT3_NDIR_BLOCKS; + /* This constant is calculated to be the largest file size for a +* dense, 4k-blocksize file such that the total number of +* sectors in the file, including data and all indirect blocks, +* does not exceed 2^32. */ + const loff_t upper_limit = 0x1ff7fffd000LL; + res += 1LL << (bits-2); res += 1LL << (2*(bits-2)); res += 1LL << (3*(bits-2)); res <<= bits; - if (res > (512LL << 32) - (1 << bits)) - res = (512LL << 32) - (1 << bits); + if (res > upper_limit) + res = upper_limit; return res; }
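The "small program" mentioned above is not included in the mail; the following is a reconstruction of the same calculation for the 4k-blocksize case. It counts data blocks plus every single/double/triple indirect block and returns the total in 512-byte sectors. At 0x1ff7fffe000 bytes the total comes to 2^32 - 8 sectors, so one extra EA/ACL block (8 sectors) lands exactly on 2^32 --- consistent with the wrap to zero observed in testing:

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKSIZE 4096ULL
#define PTRS      (BLOCKSIZE / 4)   /* block pointers per indirect block */

static uint64_t div_ceil(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

/* Total 512-byte sectors for a dense file of `size` bytes, including
 * all indirect-tree metadata blocks (but no EA/ACL block). */
static uint64_t total_sectors(uint64_t size)
{
    uint64_t blocks = div_ceil(size, BLOCKSIZE);
    uint64_t meta = 0, n = blocks;

    if (n > 12) {                               /* past the direct blocks */
        n -= 12;
        meta += 1;                              /* single indirect block */
        if (n > PTRS) {
            n -= PTRS;
            uint64_t d = n < PTRS * PTRS ? n : PTRS * PTRS;
            meta += 1 + div_ceil(d, PTRS);      /* double indirect tree */
            if (n > PTRS * PTRS) {
                n -= PTRS * PTRS;               /* triple indirect tree */
                meta += 1 + div_ceil(n, PTRS * PTRS) + div_ceil(n, PTRS);
            }
        }
    }
    return (blocks + meta) * (BLOCKSIZE / 512);
}
```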
Re: Devices/Partitions over 2TB
Hi, On Tue, 2005-03-15 at 04:54, jmerkey wrote: > Good Question. Where are the standard tools in FC2 and FC3 for these types? For LVM, the lvm2 package contains all the necessary tools. I know Alasdair did some kernel fixes for lvm2 striping on >2TB partitions recently, though, so older kernels might not work perfectly if you're using stripes. To use genuine partitions > 2TB, though, you need alternative partitioning; the GPT disk label supports that, and "parted" can create and partition such disk labels. (Note that most x86 BIOSes can't boot off them, though, so don't do this on your boot disk!) --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] ext3/jbd race: releasing in-use journal_heads
Hi, On Wed, 2005-03-09 at 13:28, Jan Kara wrote: > Hmm. I see for example a place at jbd/commit.c, line 287 (which you > did not change in your patch) which does this and doesn't seem to be > protected against journal_unmap_buffer() (but maybe I miss something). > Not that I'd find that race probable but in theory... Indeed; I can't see why that wouldn't trigger (at least without the existing, low-risk journal_unmap_buffer() patch.) Andrew, I think we just go with that simple patch for now --- it catches the cases we actually see in testing. And rather than mess with temporary states where b_transaction goes NULL, I think the *correct* long-term fix is to hide those states using locking rather than by list tricks. Merging the bh_journal_head and bh_state locks really seems like the safe solution here, as the latter seems to be held nearly everywhere where we need protection against journal_put_journal_head() (and where it's not held --- as it wasn't in journal_unmap_buffer() --- it's a bug.) --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] ext3/jbd race: releasing in-use journal_heads
Hi, On Tue, 2005-03-08 at 15:12, Jan Kara wrote: > Isn't also the following scenario dangerous? > > __journal_unfile_buffer(jh); > journal_remove_journal_head(bh); It depends. I think the biggest problem here is that there's really no written rule protecting this stuff universally. But in testing, we just don't see this triggering. Why not? We actually have a raft of other locks protecting us in various places. As long as there's some overlap in the locks, we're OK. But because it's not written down, not everybody relies on the same set of locks; it's only because journal_unmap_buffer() was dropping the jh without enough locks that we saw a problem there. So the scenario: __journal_unfile_buffer(jh); journal_remove_journal_head(bh); *might* be dangerous, if called carelessly; but in practice it works out OK. Remember, that journal_remove_journal_head() call still takes the bh_journal_head lock and still checks b_transaction with that held. I think it's time I went and worked out *why* it works out OK, though, so that we can write it down and make sure it stays working! And we may well be able to simplify the locking when we do this; collapsing the two bh state locks, for example, may help improve the robustness as well as improving performance through removing some redundant locking operations. Fortunately, there are really only three classes of operation that can remove a jh. There are the normal VFS/VM operations, like writepage, try_to_free_buffer, etc; these are all called with the page lock. There are metadata updates, which take a liberal dose of buffer, bh_state and journal locks. Finally there are the commit/checkpoint functions, which run asynchronously to the rest of the journaling and which don't relate to the current transaction, but to previous ones. This is the danger area, I think. But as long as at least the bh_state lock is held, we can protect these operations against the data/metadata update operations. 
--Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] ext3/jbd race: releasing in-use journal_heads
Hi,

On Mon, 2005-03-07 at 21:08, Stephen C. Tweedie wrote:

> Right, that was what I was thinking might be possible.  But for now I've
> just done the simple patch --- make sure we don't clear
> jh->b_transaction when we're just refiling buffers from one list to
> another.  That should have the desired effect here without dangerous
> messing around with locks.

And here it is; it has had about 3 hours' testing in conjunction with
the O_DIRECT livelock fix so far, running fsx and fsstress in parallel.

This one is higher-risk than the previous simple fix.  However, it has
other advantages: the original patch closed the race completely in the
one place I know for sure it was showing up, but this version removes
the opportunity for that particular race in the first place by avoiding
temporary jh->b_transaction=NULL conditions completely during temporary
list refiling.

I'll try to get some targeted testing done by the people who were able
to reproduce this in the first place, too.

--Stephen

Fix destruction of in-use journal_head

journal_put_journal_head() can destroy a journal_head at any time as
long as the jh's b_jcount is zero and b_transaction is NULL.  It has no
locking protection against the rest of the journaling code, as the lock
it uses to protect b_jcount and bh->b_private is not used elsewhere in
jbd.

However, there are small windows where b_transaction is getting set
temporarily to NULL during normal operations; typically this is
happening in

	__journal_unfile_buffer(jh);
	__journal_file_buffer(jh, ...);

call pairs, as __journal_unfile_buffer() will set b_transaction to NULL
and __journal_file_buffer() re-sets it afterwards.  A truncate running
in parallel can lead to journal_unmap_buffer() destroying the jh if it
occurs between these two calls.

Fix this by adding a variant of __journal_unfile_buffer() which is only
used for these temporary jh unlinks, and which leaves the b_transaction
field intact so that we never leave a window open where b_transaction is
NULL.
Additionally, trap this error if it does occur, by checking against
jh->b_jlist being non-null when we destroy a jh.

Signed-off-by: Stephen Tweedie <[EMAIL PROTECTED]>

--- linux-2.6-ext3/fs/jbd/commit.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/commit.c
@@ -258,7 +258,7 @@ write_out_data:
 			BUFFER_TRACE(bh, "locked");
 			if (!inverted_lock(journal, bh))
 				goto write_out_data;
-			__journal_unfile_buffer(jh);
+			__journal_temp_unlink_buffer(jh);
 			__journal_file_buffer(jh, commit_transaction,
 						BJ_Locked);
 			jbd_unlock_bh_state(bh);
--- linux-2.6-ext3/fs/jbd/journal.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/journal.c
@@ -1767,6 +1767,7 @@ static void __journal_remove_journal_hea
 	if (jh->b_transaction == NULL &&
 		jh->b_next_transaction == NULL &&
 		jh->b_cp_transaction == NULL) {
+		J_ASSERT_JH(jh, jh->b_jlist == BJ_None);
 		J_ASSERT_BH(bh, buffer_jbd(bh));
 		J_ASSERT_BH(bh, jh2bh(jh) == bh);
 		BUFFER_TRACE(bh, "remove journal_head");
--- linux-2.6-ext3/fs/jbd/transaction.c.=K0003=.orig
+++ linux-2.6-ext3/fs/jbd/transaction.c
@@ -1044,7 +1044,12 @@ int journal_dirty_data(handle_t *handle,
 		/* journal_clean_data_list() may have got there first */
 		if (jh->b_transaction != NULL) {
 			JBUFFER_TRACE(jh, "unfile from commit");
-			__journal_unfile_buffer(jh);
+			__journal_temp_unlink_buffer(jh);
+			/* It still points to the committing
+			 * transaction; move it to this one so
+			 * that the refile assert checks are
+			 * happy. */
+			jh->b_transaction = handle->h_transaction;
 		}
 		/* The buffer will be refiled below */

@@ -1058,7 +1063,8 @@ int journal_dirty_data(handle_t *handle,
 	if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) {
 		JBUFFER_TRACE(jh, "not on correct data list: unfile");
 		J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow);
-		__journal_unfile_buffer(jh);
+		__journal_temp_unlink_buffer(jh);
+		jh->b_transaction = handle->h_transaction;
 		JBUFFER_TRACE(jh, "file as data");
 		__journal_file_buffer(jh, handle->h_transaction,
 					BJ_SyncData);
@@ -1233,8 +1239,6 @@ int journal_forget (handle_t *handle, st
Re: Linux 2.6.11-ac1
Hi,

On Tue, 2005-03-08 at 06:49, Chris Wright wrote:

> Yes, we are intending to pick up bits from -ac (you might have missed
> that in another thread).

There's actually a successor patch to that which I'm just about to get
feedback on here and on ext2-devel.  It's higher-risk than the one Alan
has already picked up, but also hopefully closes the race much more
tightly throughout the whole of the jbd layer.

--Stephen
[PATCH] invalidate/o_direct livelock {was Re: [RFC] ext3/jbd race: releasing in-use journal_heads}
Hi,

On Tue, 2005-03-08 at 09:28, Stephen C. Tweedie wrote:

> I think it should be OK just to move the page->mapping != mapping test
> above the page->index > end test.  Sure, if all the pages have been
> stolen by the time we see them, then we'll repeat without advancing
> "next"; but we're still making progress in that case because pages that
> were previously in this mapping *have* been removed, just by a different
> process.  If we're really concerned about that case we can play the
> trick that invalidate_mapping_pages() tries and do a "next++" in that
> case.

Variant on akpm's last patch with this change, plus one extra
s/page->index/page_index/ that he missed.  I've moved the whole
next=page_index+1 and wrapping check up to just after we've tested the
mapping, so even if we break early all of that updating has already been
done reliably and we don't end up setting next in two places in the
code.

A 400-task fsstress used to livelock reliably within minutes inside
O_DIRECT without any fix; this patch has been running for over an hour
now without problem.

--Stephen

Fix livelock in invalidate_inode_pages2_range

invalidate_inode_pages2_range can loop indefinitely if pagevec_lookup
returns only pages beyond the end of the range being invalidated.  Fix
it by advancing the restart pointer, "next", every time we see a page on
the correct mapping, not just if the page is within the range; and break
out early when we encounter such a page.  Based on a proposed fix by
akpm.
Signed-off-by: Stephen Tweedie <[EMAIL PROTECTED]>

--- linux-2.6-ext3/mm/truncate.c.=K0002=.orig
+++ linux-2.6-ext3/mm/truncate.c
@@ -268,25 +268,31 @@ int invalidate_inode_pages2_range(struct
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
+			pgoff_t page_index;
 			int was_dirty;

 			lock_page(page);
-			if (page->mapping != mapping || page->index > end) {
+			if (page->mapping != mapping) {
 				unlock_page(page);
 				continue;
 			}
-			wait_on_page_writeback(page);
-			next = page->index + 1;
+			page_index = page->index;
+			next = page_index + 1;
 			if (next == 0)
 				wrapped = 1;
+			if (page_index > end) {
+				unlock_page(page);
+				break;
+			}
+			wait_on_page_writeback(page);
 			while (page_mapped(page)) {
 				if (!did_range_unmap) {
 					/*
 					 * Zap the rest of the file in one hit.
 					 */
 					unmap_mapping_range(mapping,
-					    page->index << PAGE_CACHE_SHIFT,
-					    (end - page->index + 1)
+					    page_index << PAGE_CACHE_SHIFT,
+					    (end - page_index + 1)
 							<< PAGE_CACHE_SHIFT,
 					    0);
 					did_range_unmap = 1;
@@ -295,7 +301,7 @@ int invalidate_inode_pages2_range(struct
 					 * Just zap this page
 					 */
 					unmap_mapping_range(mapping,
-					  page->index << PAGE_CACHE_SHIFT,
+					  page_index << PAGE_CACHE_SHIFT,
 					  PAGE_CACHE_SIZE, 0);
 				}
 			}
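The progress argument in the changelog can be checked with a small
user-space model.  This is a hedged sketch only: lookup(), old_loop()
and new_loop() are invented here, a fixed array of page indices stands
in for pagevec_lookup(), and an iteration cap stands in for "loops
forever".

```c
#include <assert.h>

#define BATCH 4 /* stand-in for PAGEVEC_SIZE */

/* Stand-in for pagevec_lookup(): return up to BATCH sorted page indices
 * from the "mapping" starting at index `start`. */
static int lookup(const int *pages, int npages, int start, int *out)
{
	int n = 0, i;
	for (i = 0; i < npages && n < BATCH; i++)
		if (pages[i] >= start)
			out[n++] = pages[i];
	return n;
}

/* Old loop: `next` only advances for in-range pages, so a batch that is
 * entirely beyond `end` makes no progress.  Returns the number of
 * batches processed, or -1 if we hit the cap (the livelock). */
static int old_loop(const int *pages, int npages, int end)
{
	int next = 0, out[BATCH], iters;
	for (iters = 0; iters < 1000; iters++) {
		int n = lookup(pages, npages, next, out), i;
		if (n == 0)
			return iters; /* nothing left: done */
		for (i = 0; i < n; i++)
			if (out[i] <= end)
				next = out[i] + 1; /* progress only here */
		/* if every page was beyond end, next is unchanged and the
		 * same batch comes back on the following iteration */
	}
	return -1;
}

/* Patched loop: advance `next` for every page seen on the mapping, and
 * break out once a page beyond `end` shows up. */
static int new_loop(const int *pages, int npages, int end)
{
	int next = 0, out[BATCH], iters;
	for (iters = 0; iters < 1000; iters++) {
		int n = lookup(pages, npages, next, out), i;
		if (n == 0)
			return iters;
		for (i = 0; i < n; i++) {
			next = out[i] + 1;   /* progress on every page */
			if (out[i] > end)
				return iters + 1; /* past the range: done */
		}
	}
	return -1;
}
```

With pages {10, 11, 12, 13} and an invalidate range ending at 5, the old
loop refetches the same batch indefinitely, while the patched loop
terminates on its first batch.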
Re: [RFC] ext3/jbd race: releasing in-use journal_heads
Hi,

On Mon, 2005-03-07 at 23:50, Andrew Morton wrote:

> truncate_inode_pages_range() seems to dtrt here.  Can we do it in the
> same manner in invalidate_inode_pages2_range()?
>
> Something like:
> -		if (page->mapping != mapping || page->index > end) {
> +		page_index = page->index;
> +		if (page_index > end) {
> +			next = page_index;
> +			unlock_page(page);
> +			break;
> +		}
> +		if (page->mapping != mapping) {
> 			unlock_page(page);
> 			continue;
> 		}

Yes, breaking early seems fine for this.  But don't we need to test
page->mapping == mapping again with the page lock held before we can
reliably break on page_index?

I think it should be OK just to move the page->mapping != mapping test
above the page->index > end test.  Sure, if all the pages have been
stolen by the time we see them, then we'll repeat without advancing
"next"; but we're still making progress in that case because pages that
were previously in this mapping *have* been removed, just by a different
process.  If we're really concerned about that case we can play the
trick that invalidate_mapping_pages() tries and do a "next++" in that
case.

--Stephen
Re: [RFC] ext3/jbd race: releasing in-use journal_heads
Hi,

On Mon, 2005-03-07 at 21:22, Stephen C. Tweedie wrote:

> altgr-scrlck is showing a range of EIPs all in ext3_direct_IO->
> invalidate_inode_pages2_range().  I'm seeing
>
> invalidate_inode_pages2_range()->pagevec_lookup()->find_get_pages()

In invalidate_inode_pages2_range(), what happens if we look up a
pagevec, get a bunch of pages back, but all the pages in the vec are
beyond the end of the range we want?  That's quite possible:
pagevec_lookup() gets told how many pages we want, but not the index of
the end of the range.

The loop over the pages in the pagevec contains this:

		lock_page(page);
		if (page->mapping != mapping || page->index > end) {
			unlock_page(page);
			continue;
		}
		wait_on_page_writeback(page);
		next = page->index + 1;

Now, if all of the pages have page->index > end, we'll skip them all...
but we'll do it before we've advanced "next" to indicate that we've made
progress.  The next iteration is just going to start in the same place.

Truncate always invalidates to EOF, which is probably why we haven't
seen this before.  But O_DIRECT is doing limited range cache
invalidates, and is getting stuck here pretty repeatably.

I'm currently testing the small patch below; it seems to be running OK
so far, although it hasn't been going for long yet.  I'll run a longer
test in the morning.

--Stephen

--- linux-2.6-ext3/mm/truncate.c.=K0002=.orig
+++ linux-2.6-ext3/mm/truncate.c
@@ -271,12 +271,13 @@ int invalidate_inode_pages2_range(struct
 			int was_dirty;

 			lock_page(page);
+			if (page->mapping == mapping)
+				next = page->index + 1;
 			if (page->mapping != mapping || page->index > end) {
 				unlock_page(page);
 				continue;
 			}
 			wait_on_page_writeback(page);
-			next = page->index + 1;
 			if (next == 0)
 				wrapped = 1;
 			while (page_mapped(page)) {
Re: [RFC] ext3/jbd race: releasing in-use journal_heads
Hi,

On Mon, 2005-03-07 at 21:11, Andrew Morton wrote:

> > I'm having trouble testing it, though --- I seem to be getting livelocks
> > in O_DIRECT running 400 fsstress processes in parallel; ring any bells?
>
> Nope.  I don't think anyone has been that cruel to ext3 for a while.
> I assume this workload used to succeed?

I think so.  I remember getting a lockup a couple of months ago that I
couldn't get to the bottom of and that looked very similar, so it may
just be a matter of it being much more reproducible now.  But I seem to
be able to livelock the directIO bits in minutes now, quite reliably.
I'll try to narrow things down.

altgr-scrlck is showing a range of EIPs all in ext3_direct_IO->
invalidate_inode_pages2_range().  I'm seeing

	invalidate_inode_pages2_range()->pagevec_lookup()->find_get_pages()

as by far the most common trace, but also

	invalidate_inode_pages2_range()->__might_sleep()
	invalidate_inode_pages2_range()->unlock_page()

and a few other similar paths.  invalidate_inode_pages2_range() does do
a cond_resched(), so the machine carries on despite having 316 fsstress
tasks in system-mode R state at the moment. :-)  The scheduler is doing
a rather impressive job.

--Stephen
Re: [RFC] ext3/jbd race: releasing in-use journal_heads
Hi,

On Mon, 2005-03-07 at 20:31, Andrew Morton wrote:

> jbd_lock_bh_journal_head() is supposed to be a
> finegrained innermost lock whose mandate is purely for atomicity of adding
> and removing the journal_head and the b_jcount refcounting.  I don't recall
> there being any deeper meaning than that.
>
> It could be that we can optimise jbd_lock_bh_journal_head() away, as you
> mention.  If we have an assertion in there to check that
> jbd_lock_bh_state() is held...

Right, that was what I was thinking might be possible.  But for now I've
just done the simple patch --- make sure we don't clear
jh->b_transaction when we're just refiling buffers from one list to
another.  That should have the desired effect here without dangerous
messing around with locks.

I'm having trouble testing it, though --- I seem to be getting livelocks
in O_DIRECT running 400 fsstress processes in parallel; ring any bells?
I get it with today's bk tree without any ext3 changes too, so at least
it's not a regression.

--Stephen