Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 10:16:32AM -0500, Theodore Tso wrote: > On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote: > > I tried to copy that filesystem once (when it was much smaller) with "rsync > > -a -H", but after 3 days, rsync was still building an index and didn't copy > > any file. > > If you're going to copy the whole filesystem don't use rsync! Yes, I managed to kill systems (drive them really badly into oom and get very long swap storms) with rsync -H in the past too. Something is very wrong with the rsync implementation of this. > Use cp > or a tar pipeline to move the files. Are you sure cp handles hardlinks correctly? I know tar does, but I have my doubts about cp. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: very poor ext3 write performance on big filesystems?
On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote: > ext3 tries to keep inodes in the same block group as their containing > directory. If you have lots of hard links, obviously it can't really > do that, especially since we don't have a good way at mkdir time to > tell the filesystem, "Psst! This is going to be a hard link clone of > that directory over there, put it in the same block group". Hmm, you think such a hint interface would be worth it? > > > What has helped a bit was to recreate the file system with -O^dir_index > > dir_index seems to cause more seeks. > > Part of it may have simply been recreating the filesystem, not Undoubtedly. > necessarily removing the dir_index feature. Dir_index speeds up > individual lookups, but it slows down workloads that do a readdir But only for large directories right? For kernel source like directory sizes it seems to be a general loss. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: very poor ext3 write performance on big filesystems?
Tomasz Chmielewski <[EMAIL PROTECTED]> writes: > > Is it normal to expect the write speed go down to only few dozens of > kilobytes/s? Is it because of that many seeks? Can it be somehow > optimized? I have similar problems on my linux source partition which also has a lot of hard linked files (although probably not quite as many as you do). It seems like hard linking prevents some of the heuristics ext* uses to generate non fragmented disk layouts and the resulting seeking makes things slow. What has helped a bit was to recreate the file system with -O^dir_index dir_index seems to cause more seeks. Also keeping enough free space is also a good idea because that allows the file system code better choices on where to place data. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8][for -mm] mem_notify v6: memory_pressure_notify() caller
KOSAKI Motohiro <[EMAIL PROTECTED]> writes: > > to be honest, I don't think at mem-cgroup until now. There is not only mem-cgroup BTW, but also NUMA node restrictons from NUMA memory policy. So this means a process might not be able to access all memory. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Monday 28 January 2008 14:38:57 Alan Cox wrote: > > Also worse really fixing it would be a major change to the VFS > > because of the way ->read/write are defined :/ > > I don't see a problem there. ->read and ->write update the passed pointer > which is not the real f_pos anyway. Just the copies need fixing. They are effectually doing a decoupled read/modify/write cycle. e.g.: A B read fpos read fpos fpos += A fpos += B write fpos write fpos So you get overlapping reads. Probably not good. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Monday 28 January 2008 13:56:05 Alan Cox wrote: > > > No specific spec, just general quality of implementation. > > > > I completely agree. If one thread writes A and another writes B then the > > kernel should record either A or B, not ((A & 0x) | (B & > > 0x)) > > Agree entirely: the spec doesn't allow for random scribbling in the wrong > place. It doesn't cover which goes first or who "wins" the race but > provides pwrite/pread for that situation. Writing somewhere unrelated is > definitely not to spec Actually it would probably -- i guess it's undefined and in undefined country such things can happen. Also to be fair I think it's only a problem for the 4GB wrapping case which is presumably rare (otherwise we would have heard about it) Also worse really fixing it would be a major change to the VFS because of the way ->read/write are defined :/ -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
> I completely agree. If one thread writes A and another writes B then the > kernel should record either A or B, not ((A & 0x) | (B & > 0x)) The problem is pretty nasty unfortunately. To solve it properly I think the file_operations->read/write prototypes would need to be changed because otherwise it is not possible to do atomic relative updates of f_pos. Right now the actual update is burrowed deeply in the low level read/write implementation. But that would be a huge impact all over the tree :/ Or maybe define a new read/write64 and keep the default as 32bit only-- i suppose most users don't really need 64bit. Still would be a nasty API change. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [3/18] BKL-removal: Convert ext3 to use unlocked_ioctl
On Monday 28 January 2008 06:33, Andrew Morton wrote: > On Sun, 27 Jan 2008 03:17:09 +0100 (CET) Andi Kleen <[EMAIL PROTECTED]> wrote: > > I checked ext3_ioctl and it looked largely safe to not be used > > without BKL. So convert it over to unlocked_ioctl. > > > > The only case where I wasn't quite sure was for the > > dynamic fs grow ioctls versus umounting -- I kept the BKL for those. > > Please cpoy linux-ext4 on ext2/3/4 material. Ok I'll resubmit those to tytso/ext4-devel (or perhaps he has already seen them) > > I skippped a lot of these patches because I just got bored of fixing > rejects. Now is a very optimistic time to be raising patches against > mainline. JFS and CIFS are already taken care of by the maintainers. This leaves remote_llseek which touches a couple of file systems. Could you perhaps take that one only please? And perhaps Nick's minix patchkit which looks safe to me and is unlikely to cause conflicts. > > + /* AK: not sure the BKL is needed, but this might prevent > > +* races against umount */ > > + lock_kernel(); > > err = ext3_group_add(sb, &input); > > journal_lock_updates(EXT3_SB(sb)->s_journal); > > journal_flush(EXT3_SB(sb)->s_journal); > > journal_unlock_updates(EXT3_SB(sb)->s_journal); > > + unlock_kernel(); > > The ext3_ioctl() caller has an open fd against the fs - should be > sufficient to keep unmount away? True. I am still conservative because group_add is a lot of code which I didn't fully check. But with the open fd it's likely safe to not take the BKL because there is nothing else (except readdir?) in ext* that takes it. > It's all reached the stage of stupid. I'll resubmit ->fasync_unlocked against -mm. Also I wanted to recheck the ->f_flags locking. I found one bug in those already and I can extract the bug fix for that one. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Monday 28 January 2008 05:13:09 Trond Myklebust wrote: > > On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote: > > The problem is that it's not a race in who gets to do its thing first, but > > a > > parallel reader can actually see a corrupted value from the two independent > > words on 32bit (e.g. during a 4GB). And this could actually completely > > corrupt > > f_pos when it happens with two racing relative seeks or read/write()s > > > > I would consider that a bug. > > I disagree. The corruption occurs because this isn't a situation that is > allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring > to here? No specific spec, just general quality of implementation. We normally don't have non thread safe system calls even if it was in theory allowed by some specification. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [0/18] Implement some low hanging BKL removal fruit in fs/*
> BTW. here is a patch I did a while back for minix. I know it isn't > a big deal, but the work is done so I guess I should send it along. Looks safe, although I'm surprised it actually gets around with such little locking in general. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Monday 28 January 2008 00:08:56 Trond Myklebust wrote: > > On Sun, 2008-01-27 at 16:18 -0600, Steve French wrote: > > If two seeks overlap, can't you end up with an f_pos value that is > > different than what either thread seeked to? or if you have a seek and > > a read overlap can't you end up with the read occurring in the midst > > of an update of f_pos (which takes more than one instruction on > > various architectures), e.g. reading an f_pos, which has only the > > lower half of a 64 bit field updated? I agree that you shouldn't > > have seeks racing in parallel but I think it is preferable to get > > either the updated f_pos or the earlier f_pos not something 1/2 > > updated. > > Why? The threads are doing something inherently liable to corrupt data > anyway. If they can race over the seek, why wouldn't they race over the > read or write too? > The race in lseek() should probably be the least of your worries in this > case. The problem is that it's not a race in who gets to do its thing first, but a parallel reader can actually see a corrupted value from the two independent words on 32bit (e.g. during a 4GB). And this could actually completely corrupt f_pos when it happens with two racing relative seeks or read/write()s I would consider that a bug. Fixes would be either to always take a spinlock to update this (nasty on platforms where spinlocks are expensive like P4) or define some architecture specific way to read/write 64bit values consistently. In theory also some lazy locking seqlock like mechanism could be used, but that had the disadvantage of being theoretically starvable. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Sunday 27 January 2008 23:18:26 Steve French wrote: > If two seeks overlap, can't you end up with an f_pos value that is > different than what either thread seeked to? Yes you can on 32bit. Especially during the 4GB wrap -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Sunday 27 January 2008 17:57:14 Steve French wrote: > Don't you need to a spinlock/spinunlock(i_lock) or something similar > (there isn't a spinlock in the file struct unfortunately) around the > reads and writes from f_pos in fs/read_write.c in remote_llseek with > your patch since the reads/writes from that field are not necessarily > atomic and threads could be racing in seek on the same file struct? Funny that you mention it. I actually noticed this too while working on this, but noticed that it is wrong everywhere (as in even plain sys_write/read gets it wrong). So I decided to not address it because it is already broken. I did actually send email to a few people about this, but no answer yet. I agree it's probably all broken on 32bit platforms, but I'm not sure how to best address this. When it is comprehensively addressed remote_llseek can use that new method too. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [14/18] BKL-removal: Add unlocked_fasync v2
> No goto if you use unlocked_fasync? Indeed. There was another problem in the patch too. Here's an updated patch that also fixes another latent bug. The whole f_flags still seems to be somewhat fragile because the checks tend to happen without any lock, but that has not changed to the previous state. --- Add unlocked_fasync v2 Add a new fops entry point to allow fasync without BKL. While it's arguably unclear this entry point is called often enough for it really matters it was still relatively easy to do. And there are far less async users in the tree than ioctls so it's likely they can be all converted eventually and then the non unlocked async entry point could be dropped. There was still the problem of the actual flags change being protected against other setters of flags. Instead of using BKL for this use the i_mutex now. I also added a mutex_lock against one other flags change that was lockless and could potentially lose updates. Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- Documentation/filesystems/vfs.txt |5 - fs/fcntl.c|6 +- fs/ioctl.c|5 - include/linux/fs.h|1 + 4 files changed, 14 insertions(+), 3 deletions(-) Index: linux/fs/fcntl.c === --- linux.orig/fs/fcntl.c +++ linux/fs/fcntl.c @@ -238,18 +238,26 @@ static int setfl(int fd, struct file * f if (error) return error; - lock_kernel(); + /* AK: potential race here. Since test is unlocked fasync could + be called several times in a row with on. We hope the drivers + deal with it. */ if ((arg ^ filp->f_flags) & FASYNC) { - if (filp->f_op && filp->f_op->fasync) { - error = filp->f_op->fasync(fd, filp, (arg & FASYNC) != 0); - if (error < 0) - goto out; + int on = !!(arg & FASYNC); + if (filp->f_op && filp->f_op->unlocked_fasync) + error = filp->f_op->unlocked_fasync(fd, filp, on); + else if (filp->f_op && filp->f_op->fasync) { + lock_kernel(); + error = filp->f_op->fasync(fd, filp, on); + unlock_kernel(); } + /* AK: no else error = -EINVAL here? */ + if (error < 0) + return error; } + mutex_lock(&filp->f_dentry->d_inode->i_mutex); filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK); - out: - unlock_kernel(); + mutex_unlock(&filp->f_dentry->d_inode->i_mutex); return error; } Index: linux/fs/ioctl.c === --- linux.orig/fs/ioctl.c +++ linux/fs/ioctl.c @@ -105,10 +105,14 @@ int vfs_ioctl(struct file *filp, unsigne if(O_NONBLOCK != O_NDELAY) flag |= O_NDELAY; #endif + + mutex_lock(&filp->f_dentry->d_inode->i_mutex); if (on) filp->f_flags |= flag; else filp->f_flags &= ~flag; + + mutex_unlock(&filp->f_dentry->d_inode->i_mutex); break; case FIOASYNC: @@ -117,8 +121,13 @@ int vfs_ioctl(struct file *filp, unsigne flag = on ? FASYNC : 0; /* Did FASYNC state change ? */ + /* AK: potential race here: ->fasync could be called with on two times + in a row. We hope the drivers deal with it. */ if ((flag ^ filp->f_flags) & FASYNC) { - if (filp->f_op && filp->f_op->fasync) { + if (filp->f_op && filp->f_op->unlocked_fasync) + error = filp->f_op->unlocked_fasync(fd, + filp, on); + else if (filp->f_op && filp->f_op->fasync) { lock_kernel(); error = filp->f_op->fasync(fd, filp, on); unlock_kernel(); @@ -128,10 +137,12 @@ int vfs_ioctl(struct file *filp, unsigne if (error != 0) break; + mutex_lock(&filp->f_dentry->d_inode->i_mutex); if (on) filp
[PATCH] BKL-Removal: Convert pipe to use unlocked_ioctl too
Here's another patch that was missing in the previous BKL-removal series. No BKL needed in pipe_ioctl Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> Index: linux/fs/pipe.c === --- linux.orig/fs/pipe.c +++ linux/fs/pipe.c @@ -576,9 +576,7 @@ bad_pipe_w(struct file *filp, const char return -EBADF; } -static int -pipe_ioctl(struct inode *pino, struct file *filp, - unsigned int cmd, unsigned long arg) +static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct inode *inode = filp->f_path.dentry->d_inode; struct pipe_inode_info *pipe; @@ -785,7 +783,7 @@ const struct file_operations read_fifo_f .aio_read = pipe_read, .write = bad_pipe_w, .poll = pipe_poll, - .ioctl = pipe_ioctl, + .unlocked_ioctl = pipe_ioctl, .open = pipe_read_open, .release= pipe_read_release, .unlocked_fasync = pipe_read_fasync, @@ -797,7 +795,7 @@ const struct file_operations write_fifo_ .write = do_sync_write, .aio_write = pipe_write, .poll = pipe_poll, - .ioctl = pipe_ioctl, + .unlocked_ioctl = pipe_ioctl, .open = pipe_write_open, .release= pipe_write_release, .unlocked_fasync = pipe_write_fasync, @@ -810,7 +808,7 @@ const struct file_operations rdwr_fifo_f .write = do_sync_write, .aio_write = pipe_write, .poll = pipe_poll, - .ioctl = pipe_ioctl, + .unlocked_ioctl = pipe_ioctl, .open = pipe_rdwr_open, .release= pipe_rdwr_release, .unlocked_fasync = pipe_rdwr_fasync, @@ -822,7 +820,7 @@ static const struct file_operations read .aio_read = pipe_read, .write = bad_pipe_w, .poll = pipe_poll, - .ioctl = pipe_ioctl, + .unlocked_ioctl = pipe_ioctl, .open = pipe_read_open, .release= pipe_read_release, .unlocked_fasync = pipe_read_fasync, @@ -834,7 +832,7 @@ static const struct file_operations writ .write = do_sync_write, .aio_write = pipe_write, .poll = pipe_poll, - .ioctl = pipe_ioctl, + .unlocked_ioctl = pipe_ioctl, .open = pipe_write_open, .release= pipe_write_release, .unlocked_fasync = pipe_write_fasync, @@ -847,7 +845,7 @@ static const struct file_operations rdwr .write = do_sync_write, .aio_write = pipe_write, .poll = pipe_poll, - .ioctl = pipe_ioctl, + .unlocked_ioctl = pipe_ioctl, .open = pipe_rdwr_open, .release= pipe_rdwr_release, .unlocked_fasync = pipe_rdwr_fasync, - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [6/18] BKL-removal: Convert ext4 to use unlocked_ioctl
I checked ext4_ioctl and it looked largely safe to not be used without BKL. So convert it over to unlocked_ioctl. The only case where I wasn't quite sure was for the dynamic fs grow ioctls versus umounting -- I kept the BKL for those. Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/ext4/dir.c |2 +- fs/ext4/file.c |2 +- fs/ext4/ioctl.c | 20 +++- include/linux/ext4_fs.h |3 +-- 4 files changed, 14 insertions(+), 13 deletions(-) Index: linux/fs/ext4/dir.c === --- linux.orig/fs/ext4/dir.c +++ linux/fs/ext4/dir.c @@ -42,7 +42,7 @@ const struct file_operations ext4_dir_op .llseek = generic_file_llseek, .read = generic_read_dir, .readdir= ext4_readdir, /* we take BKL. needed?*/ - .ioctl = ext4_ioctl, /* BKL held */ + .unlocked_ioctl = ext4_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif Index: linux/fs/ext4/file.c === --- linux.orig/fs/ext4/file.c +++ linux/fs/ext4/file.c @@ -112,7 +112,7 @@ const struct file_operations ext4_file_o .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = ext4_file_write, - .ioctl = ext4_ioctl, + .unlocked_ioctl = ext4_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif Index: linux/fs/ext4/ioctl.c === --- linux.orig/fs/ext4/ioctl.c +++ linux/fs/ext4/ioctl.c @@ -17,9 +17,9 @@ #include #include -int ext4_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, - unsigned long arg) +long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { + struct inode *inode = filp->f_dentry->d_inode; struct ext4_inode_info *ei = EXT4_I(inode); unsigned int flags; unsigned short rsv_window_size; @@ -224,10 +224,14 @@ flags_err: if (get_user(n_blocks_count, (__u32 __user *)arg)) return -EFAULT; + /* AK: not sure the BKL is needed, but this might prevent +* races against umount. */ + lock_kernel(); err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count); jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); jbd2_journal_flush(EXT4_SB(sb)->s_journal); jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); + unlock_kernel(); return err; } @@ -246,10 +250,14 @@ flags_err: sizeof(input))) return -EFAULT; + /* AK: not sure the BKL is needed, but this might prevent +* races against umount. */ + lock_kernel(); err = ext4_group_add(sb, &input); jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal); jbd2_journal_flush(EXT4_SB(sb)->s_journal); jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); + unlock_kernel(); return err; } @@ -262,9 +270,6 @@ flags_err: #ifdef CONFIG_COMPAT long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { - struct inode *inode = file->f_path.dentry->d_inode; - int ret; - /* These are just misnamed, they actually get/put from/to user an int */ switch (cmd) { case EXT4_IOC32_GETFLAGS: @@ -304,9 +309,6 @@ long ext4_compat_ioctl(struct file *file default: return -ENOIOCTLCMD; } - lock_kernel(); - ret = ext4_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg)); - unlock_kernel(); - return ret; + return ext4_ioctl(file, cmd, (unsigned long) compat_ptr(arg)); } #endif Index: linux/include/linux/ext4_fs.h === --- linux.orig/include/linux/ext4_fs.h +++ linux/include/linux/ext4_fs.h @@ -939,8 +939,7 @@ extern int ext4_block_truncate_page(hand struct address_space *mapping, loff_t from); /* ioctl.c */ -extern int ext4_ioctl (struct inode *, struct file *, unsigned int, - unsigned long); +extern long ext4_ioctl(struct file *, unsigned int, unsigned long); extern long ext4_compat_ioctl (struct file *, unsigned int, unsigned long); /* namei.c */ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [5/18] BKL-removal: Remove incorrect comment refering to lock_kernel() from jbd/jbd2
None of the callers of this function does actually take the BKL as far as I can see. So remove the comment refering to the BKL. Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/jbd/recovery.c |2 +- fs/jbd2/recovery.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux/fs/jbd/recovery.c === --- linux.orig/fs/jbd/recovery.c +++ linux/fs/jbd/recovery.c @@ -354,7 +354,7 @@ static int do_one_pass(journal_t *journa struct buffer_head *obh; struct buffer_head *nbh; - cond_resched(); /* We're under lock_kernel() */ + cond_resched(); /* If we already know where to stop the log traversal, * check right now that we haven't gone past the end of Index: linux/fs/jbd2/recovery.c === --- linux.orig/fs/jbd2/recovery.c +++ linux/fs/jbd2/recovery.c @@ -364,7 +364,7 @@ static int do_one_pass(journal_t *journa struct buffer_head *obh; struct buffer_head *nbh; - cond_resched(); /* We're under lock_kernel() */ + cond_resched(); /* If we already know where to stop the log traversal, * check right now that we haven't gone past the end of - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [4/18] ext3: Remove incorrect BKL comment
There is no BKL held on entry in ->fsync nor in the low level ext3_sync_file. Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/ext3/dir.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/fs/ext3/dir.c === --- linux.orig/fs/ext3/dir.c +++ linux/fs/ext3/dir.c @@ -46,7 +46,7 @@ const struct file_operations ext3_dir_op #ifdef CONFIG_COMPAT .compat_ioctl = ext3_compat_ioctl, #endif - .fsync = ext3_sync_file, /* BKL held */ + .fsync = ext3_sync_file, .release= ext3_release_dir, }; - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [12/18] BKL-removal: Convert CIFS over to unlocked_ioctl
cifs_ioctl doesn't seem to need the BKL for anything, so convert it over to use unlocked_ioctl. Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/cifs/cifsfs.c | 10 +- fs/cifs/cifsfs.h |4 ++-- fs/cifs/ioctl.c |4 ++-- 3 files changed, 9 insertions(+), 9 deletions(-) Index: linux/fs/cifs/cifsfs.c === --- linux.orig/fs/cifs/cifsfs.c +++ linux/fs/cifs/cifsfs.c @@ -625,7 +625,7 @@ const struct file_operations cifs_file_o .splice_read = generic_file_splice_read, .llseek = cifs_llseek, #ifdef CONFIG_CIFS_POSIX - .ioctl = cifs_ioctl, + .unlocked_ioctl = cifs_ioctl, #endif /* CONFIG_CIFS_POSIX */ #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -645,7 +645,7 @@ const struct file_operations cifs_file_d .flush = cifs_flush, .splice_read = generic_file_splice_read, #ifdef CONFIG_CIFS_POSIX - .ioctl = cifs_ioctl, + .unlocked_ioctl = cifs_ioctl, #endif /* CONFIG_CIFS_POSIX */ .llseek = cifs_llseek, #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -665,7 +665,7 @@ const struct file_operations cifs_file_n .splice_read = generic_file_splice_read, .llseek = cifs_llseek, #ifdef CONFIG_CIFS_POSIX - .ioctl = cifs_ioctl, + .unlocked_ioctl = cifs_ioctl, #endif /* CONFIG_CIFS_POSIX */ #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -684,7 +684,7 @@ const struct file_operations cifs_file_d .flush = cifs_flush, .splice_read = generic_file_splice_read, #ifdef CONFIG_CIFS_POSIX - .ioctl = cifs_ioctl, + .unlocked_ioctl = cifs_ioctl, #endif /* CONFIG_CIFS_POSIX */ .llseek = cifs_llseek, #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -699,7 +699,7 @@ const struct file_operations cifs_dir_op #ifdef CONFIG_CIFS_EXPERIMENTAL .dir_notify = cifs_dir_notify, #endif /* CONFIG_CIFS_EXPERIMENTAL */ - .ioctl = cifs_ioctl, + .unlocked_ioctl = cifs_ioctl, }; static void Index: linux/fs/cifs/cifsfs.h === --- linux.orig/fs/cifs/cifsfs.h +++ linux/fs/cifs/cifsfs.h @@ -99,8 +99,8 @@ extern intcifs_setxattr(struct dentry size_t, int); extern ssize_t cifs_getxattr(struct dentry *, const char *, void *, size_t); extern ssize_t cifs_listxattr(struct dentry *, char *, size_t); -extern int cifs_ioctl(struct inode *inode, struct file *filep, - unsigned int command, unsigned long arg); +extern long cifs_ioctl(struct file *filep, unsigned int command, + unsigned long arg); #ifdef CONFIG_CIFS_EXPERIMENTAL extern const struct export_operations cifs_export_ops; Index: linux/fs/cifs/ioctl.c === --- linux.orig/fs/cifs/ioctl.c +++ linux/fs/cifs/ioctl.c @@ -30,9 +30,9 @@ #define CIFS_IOC_CHECKUMOUNT _IO(0xCF, 2) -int cifs_ioctl (struct inode *inode, struct file *filep, - unsigned int command, unsigned long arg) +long cifs_ioctl(struct file *filep, unsigned int command, unsigned long arg) { + struct inode *inode = filep->f_dentry->d_inode; int rc = -ENOTTY; /* strange error - but the precedent */ int xid; struct cifs_sb_info *cifs_sb; - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [11/18] BKL-removal: Convert ocfs2 over to unlocked_ioctl
As far as I can see there is nothing in ocfs2_ioctl that requires the BKL, so use unlocked_ioctl Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/ocfs2/file.c |4 ++-- fs/ocfs2/ioctl.c | 12 +++- fs/ocfs2/ioctl.h |3 +-- 3 files changed, 6 insertions(+), 13 deletions(-) Index: linux/fs/ocfs2/ioctl.c === --- linux.orig/fs/ocfs2/ioctl.c +++ linux/fs/ocfs2/ioctl.c @@ -111,9 +111,9 @@ bail: return status; } -int ocfs2_ioctl(struct inode * inode, struct file * filp, - unsigned int cmd, unsigned long arg) +long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { + struct inode *inode = filp->f_path.dentry->d_inode; unsigned int flags; int status; struct ocfs2_space_resv sr; @@ -148,9 +148,6 @@ int ocfs2_ioctl(struct inode * inode, st #ifdef CONFIG_COMPAT long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg) { - struct inode *inode = file->f_path.dentry->d_inode; - int ret; - switch (cmd) { case OCFS2_IOC32_GETFLAGS: cmd = OCFS2_IOC_GETFLAGS; @@ -167,9 +164,6 @@ long ocfs2_compat_ioctl(struct file *fil return -ENOIOCTLCMD; } - lock_kernel(); - ret = ocfs2_ioctl(inode, file, cmd, arg); - unlock_kernel(); - return ret; + return ocfs2_ioctl(file, cmd, arg); } #endif Index: linux/fs/ocfs2/ioctl.h === --- linux.orig/fs/ocfs2/ioctl.h +++ linux/fs/ocfs2/ioctl.h @@ -10,8 +10,7 @@ #ifndef OCFS2_IOCTL_H #define OCFS2_IOCTL_H -int ocfs2_ioctl(struct inode * inode, struct file * filp, - unsigned int cmd, unsigned long arg); +long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg); long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg); #endif /* OCFS2_IOCTL_H */ Index: linux/fs/ocfs2/file.c === --- linux.orig/fs/ocfs2/file.c +++ linux/fs/ocfs2/file.c @@ -2212,7 +2212,7 @@ const struct file_operations ocfs2_fops .open = ocfs2_file_open, .aio_read = ocfs2_file_aio_read, .aio_write = ocfs2_file_aio_write, - .ioctl = ocfs2_ioctl, + .unlocked_ioctl = ocfs2_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ocfs2_compat_ioctl, #endif @@ -2224,7 +2224,7 @@ const struct file_operations ocfs2_dops .read = generic_read_dir, .readdir= ocfs2_readdir, .fsync = ocfs2_sync_file, - .ioctl = ocfs2_ioctl, + .unlocked_ioctl = ocfs2_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ocfs2_compat_ioctl, #endif - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [13/18] BKL-removal: Add compat_ioctl for cifs
Similar to the compat handlers of other file systems. The ioctls are compatible except that they have different numbers. Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/cifs/cifsfs.c | 15 +++ fs/cifs/cifsfs.h |2 ++ fs/cifs/ioctl.c | 19 +++ 3 files changed, 36 insertions(+) Index: linux/fs/cifs/cifsfs.c === --- linux.orig/fs/cifs/cifsfs.c +++ linux/fs/cifs/cifsfs.c @@ -626,6 +626,9 @@ const struct file_operations cifs_file_o .llseek = cifs_llseek, #ifdef CONFIG_CIFS_POSIX .unlocked_ioctl = cifs_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = cifs_compat_ioctl, +#endif #endif /* CONFIG_CIFS_POSIX */ #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -646,6 +649,9 @@ const struct file_operations cifs_file_d .splice_read = generic_file_splice_read, #ifdef CONFIG_CIFS_POSIX .unlocked_ioctl = cifs_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = cifs_compat_ioctl, +#endif #endif /* CONFIG_CIFS_POSIX */ .llseek = cifs_llseek, #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -666,6 +672,9 @@ const struct file_operations cifs_file_n .llseek = cifs_llseek, #ifdef CONFIG_CIFS_POSIX .unlocked_ioctl = cifs_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = cifs_compat_ioctl, +#endif #endif /* CONFIG_CIFS_POSIX */ #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -685,6 +694,9 @@ const struct file_operations cifs_file_d .splice_read = generic_file_splice_read, #ifdef CONFIG_CIFS_POSIX .unlocked_ioctl = cifs_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = cifs_compat_ioctl, +#endif #endif /* CONFIG_CIFS_POSIX */ .llseek = cifs_llseek, #ifdef CONFIG_CIFS_EXPERIMENTAL @@ -700,6 +712,9 @@ const struct file_operations cifs_dir_op .dir_notify = cifs_dir_notify, #endif /* CONFIG_CIFS_EXPERIMENTAL */ .unlocked_ioctl = cifs_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = cifs_compat_ioctl, +#endif }; static void Index: linux/fs/cifs/cifsfs.h === --- linux.orig/fs/cifs/cifsfs.h +++ linux/fs/cifs/cifsfs.h @@ -101,6 +101,8 @@ extern ssize_t cifs_getxattr(struct dent extern ssize_t cifs_listxattr(struct dentry *, char *, size_t); extern long cifs_ioctl(struct file *filep, unsigned int command, unsigned long arg); +extern long cifs_compat_ioctl(struct file *filep, unsigned int command, + unsigned long arg); #ifdef CONFIG_CIFS_EXPERIMENTAL extern const struct export_operations cifs_export_ops; Index: linux/fs/cifs/ioctl.c === --- linux.orig/fs/cifs/ioctl.c +++ linux/fs/cifs/ioctl.c @@ -108,3 +108,22 @@ long cifs_ioctl(struct file *filep, unsi FreeXid(xid); return rc; } + +#ifdef CONFIG_COMPAT +#define FS_IOC_GETFLAGS32_IOR('f', 1, int) +#define FS_IOC_SETFLAGS32_IOR('f', 2, int) + +long cifs_compat_ioctl(struct file *f, unsigned int command, unsigned long arg) +{ + /* Native ioctl is just mistyped, the real type is always int */ + switch (command) { + case FS_IOC_GETFLAGS32: + command = FS_IOC_GETFLAGS; + break; + case FS_IOC_SETFLAGS32: + command = FS_IOC_SETFLAGS; + break; + } + return cifs_ioctl(f, command, arg); +} +#endif - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [7/18] BKL-removal: Remove incorrect comments refering to BKL from ext4
BKL is not hold in any of those Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/ext4/dir.c |2 +- fs/ext4/inode.c |1 - 2 files changed, 1 insertion(+), 2 deletions(-) Index: linux/fs/ext4/dir.c === --- linux.orig/fs/ext4/dir.c +++ linux/fs/ext4/dir.c @@ -46,7 +46,7 @@ const struct file_operations ext4_dir_op #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif - .fsync = ext4_sync_file, /* BKL held */ + .fsync = ext4_sync_file, .release= ext4_release_dir, }; Index: linux/fs/ext4/inode.c === --- linux.orig/fs/ext4/inode.c +++ linux/fs/ext4/inode.c @@ -778,7 +778,6 @@ err_out: * * `handle' can be NULL if create == 0. * - * The BKL may not be held on entry here. Be sure to take it early. * return > 0, # of blocks mapped or allocated. * return = 0, if plain lookup failed. * return < 0, error case. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
- Replace remote_llseek with remote_llseek_unlocked (to force compilation failures in all users) - Change all users to either use remote_llseek directly or take the BKL around. I changed the file systems who don't use the BKL for anything (CIFS, GFS) to call it directly. NCPFS and SMBFS and NFS take the BKL, but explicitely in their own source now. I moved them all over in a single patch to avoid unbisectable sections. Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/cifs/cifsfs.c |2 +- fs/gfs2/ops_file.c |4 ++-- fs/ncpfs/file.c| 12 +++- fs/nfs/file.c |6 +- fs/read_write.c|7 +++ fs/smbfs/file.c| 11 ++- include/linux/fs.h |3 ++- 7 files changed, 34 insertions(+), 11 deletions(-) Index: linux/fs/cifs/cifsfs.c === --- linux.orig/fs/cifs/cifsfs.c +++ linux/fs/cifs/cifsfs.c @@ -549,7 +549,7 @@ static loff_t cifs_llseek(struct file *f if (retval < 0) return (loff_t)retval; } - return remote_llseek(file, offset, origin); + return remote_llseek_unlocked(file, offset, origin); } static struct file_system_type cifs_fs_type = { Index: linux/fs/gfs2/ops_file.c === --- linux.orig/fs/gfs2/ops_file.c +++ linux/fs/gfs2/ops_file.c @@ -107,11 +107,11 @@ static loff_t gfs2_llseek(struct file *f error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &i_gh); if (!error) { - error = remote_llseek(file, offset, origin); + error = remote_llseek_unlocked(file, offset, origin); gfs2_glock_dq_uninit(&i_gh); } } else - error = remote_llseek(file, offset, origin); + error = remote_llseek_unlocked(file, offset, origin); return error; } Index: linux/fs/ncpfs/file.c === --- linux.orig/fs/ncpfs/file.c +++ linux/fs/ncpfs/file.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include "ncplib_kernel.h" @@ -281,9 +282,18 @@ static int ncp_release(struct inode *ino return 0; } +static loff_t ncp_remote_llseek(struct file *file, loff_t offset, int origin) +{ + loff_t ret; + lock_kernel(); + ret = remote_llseek_unlocked(file, offset, origin); + unlock_kernel(); + return ret; +} + const struct file_operations ncp_file_operations = { - .llseek = remote_llseek, + .llseek = ncp_remote_llseek, .read = ncp_file_read, .write = ncp_file_write, .ioctl = ncp_ioctl, Index: linux/fs/read_write.c === --- linux.orig/fs/read_write.c +++ linux/fs/read_write.c @@ -58,11 +58,10 @@ loff_t generic_file_llseek(struct file * EXPORT_SYMBOL(generic_file_llseek); -loff_t remote_llseek(struct file *file, loff_t offset, int origin) +loff_t remote_llseek_unlocked(struct file *file, loff_t offset, int origin) { long long retval; - lock_kernel(); switch (origin) { case SEEK_END: offset += i_size_read(file->f_path.dentry->d_inode); @@ -73,15 +72,15 @@ loff_t remote_llseek(struct file *file, retval = -EINVAL; if (offset>=0 && offset<=file->f_path.dentry->d_inode->i_sb->s_maxbytes) { if (offset != file->f_pos) { + /* AK: do we need a lock for those? */ file->f_pos = offset; file->f_version = 0; } retval = offset; } - unlock_kernel(); return retval; } -EXPORT_SYMBOL(remote_llseek); +EXPORT_SYMBOL(remote_llseek_unlocked); loff_t no_llseek(struct file *file, loff_t offset, int origin) { Index: linux/fs/smbfs/file.c === --- linux.orig/fs/smbfs/file.c +++ linux/fs/smbfs/file.c @@ -422,9 +422,18 @@ smb_file_permission(struct inode *inode, return error; } +static loff_t smb_remote_llseek(struct file *file, loff_t offset, int origin) +{ + loff_t ret; + lock_kernel(); + ret = remote_llseek_unlocked(file, offset, origin); + unlock_kernel(); + return ret; +} + const struct file_operations smb_file_operations = { - .llseek = remote_llseek, + .llseek = smb_remote_llseek, .read = do_sync_read, .aio_read = smb_file_aio_re
[PATCH] [16/18] BKL-removal: Convert socket fasync to unlocked_fasync
Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- net/socket.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/net/socket.c === --- linux.orig/net/socket.c +++ linux/net/socket.c @@ -131,7 +131,7 @@ static const struct file_operations sock .mmap = sock_mmap, .open = sock_no_open, /* special open code to disallow open via /proc */ .release = sock_close, - .fasync = sock_fasync, + .unlocked_fasync = sock_fasync, .sendpage = sock_sendpage, .splice_write = generic_splice_sendpage, }; - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [17/18] BKL-removal: Convert fuse to unlocked_fasync
Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/fuse/dev.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/fs/fuse/dev.c === --- linux.orig/fs/fuse/dev.c +++ linux/fs/fuse/dev.c @@ -1040,7 +1040,7 @@ const struct file_operations fuse_dev_op .aio_write = fuse_dev_write, .poll = fuse_dev_poll, .release= fuse_dev_release, - .fasync = fuse_dev_fasync, + .unlocked_fasync = fuse_dev_fasync, }; static struct miscdevice fuse_miscdevice = { - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [18/18] BKL-removal: Convert bad_inode to unlocked_fasync
Not that it matters much, but it was easy. Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/bad_inode.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/fs/bad_inode.c === --- linux.orig/fs/bad_inode.c +++ linux/fs/bad_inode.c @@ -174,7 +174,7 @@ static const struct file_operations bad_ .release= bad_file_release, .fsync = bad_file_fsync, .aio_fsync = bad_file_aio_fsync, - .fasync = bad_file_fasync, + .unlocked_fasync = bad_file_fasync, .lock = bad_file_lock, .sendpage = bad_file_sendpage, .get_unmapped_area = bad_file_get_unmapped_area, - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [14/18] BKL-removal: Add unlocked_fasync
Add a new fops entry point to allow fasync without BKL. While it's arguably unclear this entry point is called often enough for it really matters it was still relatively easy to do. And there are far less async users in the tree than ioctls so it's likely they can be all converted eventually and then the non unlocked async entry point could be dropped. Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- Documentation/filesystems/vfs.txt |5 - fs/fcntl.c|6 +- fs/ioctl.c|5 - include/linux/fs.h|1 + 4 files changed, 14 insertions(+), 3 deletions(-) Index: linux/fs/fcntl.c === --- linux.orig/fs/fcntl.c +++ linux/fs/fcntl.c @@ -240,11 +240,15 @@ static int setfl(int fd, struct file * f lock_kernel(); if ((arg ^ filp->f_flags) & FASYNC) { - if (filp->f_op && filp->f_op->fasync) { + if (filp->f_op && filp->f_op->unlocked_fasync) + error = filp->f_op->unlocked_fasync(fd, filp, + !!(arg & FASYNC)); + else if (filp->f_op && filp->f_op->fasync) { error = filp->f_op->fasync(fd, filp, (arg & FASYNC) != 0); if (error < 0) goto out; } + /* AK: no else error = -EINVAL here? */ } filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK); Index: linux/fs/ioctl.c === --- linux.orig/fs/ioctl.c +++ linux/fs/ioctl.c @@ -118,7 +118,10 @@ int vfs_ioctl(struct file *filp, unsigne /* Did FASYNC state change ? */ if ((flag ^ filp->f_flags) & FASYNC) { - if (filp->f_op && filp->f_op->fasync) { + if (filp->f_op && filp->f_op->unlocked_fasync) + error = filp->f_op->unlocked_fasync(fd, + filp, on); + else if (filp->f_op && filp->f_op->fasync) { lock_kernel(); error = filp->f_op->fasync(fd, filp, on); unlock_kernel(); Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h +++ linux/include/linux/fs.h @@ -1179,6 +1179,7 @@ struct file_operations { int (*fsync) (struct file *, struct dentry *, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); + int (*unlocked_fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); Index: linux/Documentation/filesystems/vfs.txt === --- linux.orig/Documentation/filesystems/vfs.txt +++ linux/Documentation/filesystems/vfs.txt @@ -769,6 +769,7 @@ struct file_operations { int (*fsync) (struct file *, struct dentry *, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); + int (*unlocked_fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); @@ -828,7 +829,9 @@ otherwise noted. fsync: called by the fsync(2) system call fasync: called by the fcntl(2) system call when asynchronous - (non-blocking) mode is enabled for a file + (non-blocking) mode is enabled for a file. BKL hold + + unlocked_fasync: like fasync, but without BKL lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [15/18] BKL-removal: Convert pipe over to unlocked_fasync
Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/pipe.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux/fs/pipe.c === --- linux.orig/fs/pipe.c +++ linux/fs/pipe.c @@ -788,7 +788,7 @@ const struct file_operations read_fifo_f .ioctl = pipe_ioctl, .open = pipe_read_open, .release= pipe_read_release, - .fasync = pipe_read_fasync, + .unlocked_fasync = pipe_read_fasync, }; const struct file_operations write_fifo_fops = { @@ -800,7 +800,7 @@ const struct file_operations write_fifo_ .ioctl = pipe_ioctl, .open = pipe_write_open, .release= pipe_write_release, - .fasync = pipe_write_fasync, + .unlocked_fasync = pipe_write_fasync, }; const struct file_operations rdwr_fifo_fops = { @@ -813,7 +813,7 @@ const struct file_operations rdwr_fifo_f .ioctl = pipe_ioctl, .open = pipe_rdwr_open, .release= pipe_rdwr_release, - .fasync = pipe_rdwr_fasync, + .unlocked_fasync = pipe_rdwr_fasync, }; static const struct file_operations read_pipe_fops = { @@ -825,7 +825,7 @@ static const struct file_operations read .ioctl = pipe_ioctl, .open = pipe_read_open, .release= pipe_read_release, - .fasync = pipe_read_fasync, + .unlocked_fasync = pipe_read_fasync, }; static const struct file_operations write_pipe_fops = { @@ -837,7 +837,7 @@ static const struct file_operations writ .ioctl = pipe_ioctl, .open = pipe_write_open, .release= pipe_write_release, - .fasync = pipe_write_fasync, + .unlocked_fasync = pipe_write_fasync, }; static const struct file_operations rdwr_pipe_fops = { @@ -850,7 +850,7 @@ static const struct file_operations rdwr .ioctl = pipe_ioctl, .open = pipe_rdwr_open, .release= pipe_rdwr_release, - .fasync = pipe_rdwr_fasync, + .unlocked_fasync = pipe_rdwr_fasync, }; struct pipe_inode_info * alloc_pipe_info(struct inode *inode) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [10/18] BKL-removal: Implement a compat_ioctl handler for JFS
The ioctls were already compatible except for the actual values so this was fairly easy to do. Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/jfs/file.c |3 +++ fs/jfs/ioctl.c | 18 ++ fs/jfs/jfs_dinode.h |2 ++ fs/jfs/jfs_inode.h |1 + fs/jfs/namei.c |3 +++ 5 files changed, 27 insertions(+) Index: linux/fs/jfs/file.c === --- linux.orig/fs/jfs/file.c +++ linux/fs/jfs/file.c @@ -113,4 +113,7 @@ const struct file_operations jfs_file_op .fsync = jfs_fsync, .release= jfs_release, .unlocked_ioctl = jfs_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = jfs_compat_ioctl, +#endif }; Index: linux/fs/jfs/ioctl.c === --- linux.orig/fs/jfs/ioctl.c +++ linux/fs/jfs/ioctl.c @@ -117,3 +117,21 @@ long jfs_ioctl(struct file *filp, unsign } } +#ifdef CONFIG_COMPAT +long jfs_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) +{ + /* While these ioctl numbers defined with 'long' and have different +* numbers than the 64bit ABI, +* the actual implementation only deals with ints and is compatible. +*/ + switch (cmd) { + case JFS_IOC_GETFLAGS32: + cmd = JFS_IOC_GETFLAGS; + break; + case JFS_IOC_SETFLAGS32: + cmd = JFS_IOC_SETFLAGS; + break; + } + return jfs_ioctl(filp, cmd, arg); +} +#endif Index: linux/fs/jfs/jfs_inode.h === --- linux.orig/fs/jfs/jfs_inode.h +++ linux/fs/jfs/jfs_inode.h @@ -23,6 +23,7 @@ struct fid; extern struct inode *ialloc(struct inode *, umode_t); extern int jfs_fsync(struct file *, struct dentry *, int); extern long jfs_ioctl(struct file *, unsigned int, unsigned long); +extern long jfs_compat_ioctl(struct file *, unsigned int, unsigned long); extern void jfs_read_inode(struct inode *); extern int jfs_commit_inode(struct inode *, int); extern int jfs_write_inode(struct inode*, int); Index: linux/fs/jfs/namei.c === --- linux.orig/fs/jfs/namei.c +++ linux/fs/jfs/namei.c @@ -1563,6 +1563,9 @@ const struct file_operations jfs_dir_ope .readdir= jfs_readdir, .fsync = jfs_fsync, .unlocked_ioctl = jfs_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = jfs_compat_ioctl, +#endif }; static int jfs_ci_hash(struct dentry *dir, struct qstr *this) Index: linux/fs/jfs/jfs_dinode.h === --- linux.orig/fs/jfs/jfs_dinode.h +++ linux/fs/jfs/jfs_dinode.h @@ -170,5 +170,7 @@ struct dinode { #define JFS_IOC_GETFLAGS _IOR('f', 1, long) #define JFS_IOC_SETFLAGS _IOW('f', 2, long) +#define JFS_IOC_GETFLAGS32 _IOR('f', 1, int) +#define JFS_IOC_SETFLAGS32 _IOW('f', 2, int) #endif /*_H_JFS_DINODE */ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [9/18] BKL-removal: Use unlocked_ioctl for jfs
Convert jfs_ioctl over to not use the BKL. The only potential race I could see was with two ioctls in parallel changing the flags and losing the updates. Use the i_mutex to protect against this. Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/jfs/file.c |2 +- fs/jfs/ioctl.c | 13 ++--- fs/jfs/jfs_inode.h |3 +-- fs/jfs/namei.c |2 +- 4 files changed, 13 insertions(+), 7 deletions(-) Index: linux/fs/jfs/ioctl.c === --- linux.orig/fs/jfs/ioctl.c +++ linux/fs/jfs/ioctl.c @@ -51,9 +51,9 @@ static long jfs_map_ext2(unsigned long f } -int jfs_ioctl(struct inode * inode, struct file * filp, unsigned int cmd, - unsigned long arg) +long jfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { + struct inode *inode = filp->f_dentry->d_inode; struct jfs_inode_info *jfs_inode = JFS_IP(inode); unsigned int flags; @@ -82,6 +82,10 @@ int jfs_ioctl(struct inode * inode, stru /* Is it quota file? Do not allow user to mess with it */ if (IS_NOQUOTA(inode)) return -EPERM; + + /* Lock against other parallel changes of flags */ + mutex_lock(&inode->i_mutex); + jfs_get_inode_flags(jfs_inode); oldflags = jfs_inode->mode2; @@ -92,8 +96,10 @@ int jfs_ioctl(struct inode * inode, stru if ((oldflags & JFS_IMMUTABLE_FL) || ((flags ^ oldflags) & (JFS_APPEND_FL | JFS_IMMUTABLE_FL))) { - if (!capable(CAP_LINUX_IMMUTABLE)) + if (!capable(CAP_LINUX_IMMUTABLE)) { + mutex_unlock(&inode->i_mutex); return -EPERM; + } } flags = flags & JFS_FL_USER_MODIFIABLE; @@ -101,6 +107,7 @@ int jfs_ioctl(struct inode * inode, stru jfs_inode->mode2 = flags; jfs_set_inode_flags(inode); + mutex_unlock(&inode->i_mutex); inode->i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); return 0; Index: linux/fs/jfs/file.c === --- linux.orig/fs/jfs/file.c +++ linux/fs/jfs/file.c @@ -112,5 +112,5 @@ const struct file_operations jfs_file_op .splice_write = generic_file_splice_write, .fsync = jfs_fsync, .release= jfs_release, - .ioctl = jfs_ioctl, + .unlocked_ioctl = jfs_ioctl, }; Index: linux/fs/jfs/jfs_inode.h === --- linux.orig/fs/jfs/jfs_inode.h +++ linux/fs/jfs/jfs_inode.h @@ -22,8 +22,7 @@ struct fid; extern struct inode *ialloc(struct inode *, umode_t); extern int jfs_fsync(struct file *, struct dentry *, int); -extern int jfs_ioctl(struct inode *, struct file *, - unsigned int, unsigned long); +extern long jfs_ioctl(struct file *, unsigned int, unsigned long); extern void jfs_read_inode(struct inode *); extern int jfs_commit_inode(struct inode *, int); extern int jfs_write_inode(struct inode*, int); Index: linux/fs/jfs/namei.c === --- linux.orig/fs/jfs/namei.c +++ linux/fs/jfs/namei.c @@ -1562,7 +1562,7 @@ const struct file_operations jfs_dir_ope .read = generic_read_dir, .readdir= jfs_readdir, .fsync = jfs_fsync, - .ioctl = jfs_ioctl, + .unlocked_ioctl = jfs_ioctl, }; static int jfs_ci_hash(struct dentry *dir, struct qstr *this) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [3/18] BKL-removal: Convert ext3 to use unlocked_ioctl
I checked ext3_ioctl and it looked largely safe to not be used without BKL. So convert it over to unlocked_ioctl. The only case where I wasn't quite sure was for the dynamic fs grow ioctls versus umounting -- I kept the BKL for those. Cc: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/ext3/dir.c |2 +- fs/ext3/file.c |2 +- fs/ext3/ioctl.c | 21 +++-- include/linux/ext3_fs.h |3 +-- 4 files changed, 14 insertions(+), 14 deletions(-) Index: linux/fs/ext3/dir.c === --- linux.orig/fs/ext3/dir.c +++ linux/fs/ext3/dir.c @@ -42,7 +42,7 @@ const struct file_operations ext3_dir_op .llseek = generic_file_llseek, .read = generic_read_dir, .readdir= ext3_readdir, /* we take BKL. needed?*/ - .ioctl = ext3_ioctl, /* BKL held */ + .unlocked_ioctl = ext3_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext3_compat_ioctl, #endif Index: linux/fs/ext3/file.c === --- linux.orig/fs/ext3/file.c +++ linux/fs/ext3/file.c @@ -112,7 +112,7 @@ const struct file_operations ext3_file_o .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = ext3_file_write, - .ioctl = ext3_ioctl, + .unlocked_ioctl = ext3_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext3_compat_ioctl, #endif Index: linux/fs/ext3/ioctl.c === --- linux.orig/fs/ext3/ioctl.c +++ linux/fs/ext3/ioctl.c @@ -17,9 +17,9 @@ #include #include -int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, - unsigned long arg) +long ext3_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { + struct inode *inode = filp->f_dentry->d_inode; struct ext3_inode_info *ei = EXT3_I(inode); unsigned int flags; unsigned short rsv_window_size; @@ -224,10 +224,14 @@ flags_err: if (get_user(n_blocks_count, (__u32 __user *)arg)) return -EFAULT; + /* AK: not sure the BKL is needed, but this might prevent +* races against umount */ + lock_kernel(); err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count); journal_lock_updates(EXT3_SB(sb)->s_journal); journal_flush(EXT3_SB(sb)->s_journal); journal_unlock_updates(EXT3_SB(sb)->s_journal); + unlock_kernel(); return err; } @@ -245,11 +249,14 @@ flags_err: if (copy_from_user(&input, (struct ext3_new_group_input __user *)arg, sizeof(input))) return -EFAULT; - + /* AK: not sure the BKL is needed, but this might prevent +* races against umount */ + lock_kernel(); err = ext3_group_add(sb, &input); journal_lock_updates(EXT3_SB(sb)->s_journal); journal_flush(EXT3_SB(sb)->s_journal); journal_unlock_updates(EXT3_SB(sb)->s_journal); + unlock_kernel(); return err; } @@ -263,9 +270,6 @@ flags_err: #ifdef CONFIG_COMPAT long ext3_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { - struct inode *inode = file->f_path.dentry->d_inode; - int ret; - /* These are just misnamed, they actually get/put from/to user an int */ switch (cmd) { case EXT3_IOC32_GETFLAGS: @@ -305,9 +309,6 @@ long ext3_compat_ioctl(struct file *file default: return -ENOIOCTLCMD; } - lock_kernel(); - ret = ext3_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg)); - unlock_kernel(); - return ret; + return ext3_ioctl(file, cmd, (unsigned long) compat_ptr(arg)); } #endif Index: linux/include/linux/ext3_fs.h === --- linux.orig/include/linux/ext3_fs.h +++ linux/include/linux/ext3_fs.h @@ -838,8 +838,7 @@ extern void ext3_get_inode_flags(struct extern void ext3_set_aops(struct inode *inode); /* ioctl.c */ -extern int ext3_ioctl (struct inode *, struct file *, unsigned int, - unsigned long); +extern long ext3_ioctl(struct file *, unsigned int, unsigned long); extern long ext3_compat_ioctl (struct file *, unsigned int, unsigned long); /* namei.c */ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [0/18] Implement some low hanging BKL removal fruit in fs/*
[Andrew: I believe this is -mm material for .25] - Convert some more file systems (generally those who don't use the BKL for anything except mount) to use unlocked_bkl. - Implement BKL less fasync (see patch for the rationale) This is currently a separate entry point, but since the number of fasync users in the tree is relatively small I hope the older entry point can be removed then in the not too far future [help from other people converting more fasync users to unlocked_fasync would be appreciated] - Implement BKL less remote_llseek - While I was at it I also added a few missing compat ioctl handlers - Fix a few comments This fixes a lot of relatively trivial BKL users in fs/*. The main remaining non legacy offenders are now locks.c, nfs/nfsd and reiserfs. I believe BKL removal for all of those is being worked on by other people. Also a lot of "legacy" file systems still use it, but converting those does not seem to be very pressing. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [1/18] BKL-removal: Convert ext2 over to use unlocked_ioctl
I checked ext2_ioctl and could not find anything in there that would need the BKL. So convert it over to use unlocked_ioctl Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/ext2/dir.c |2 +- fs/ext2/ext2.h |3 +-- fs/ext2/file.c |4 ++-- fs/ext2/ioctl.c | 12 +++- 4 files changed, 7 insertions(+), 14 deletions(-) Index: linux/fs/ext2/dir.c === --- linux.orig/fs/ext2/dir.c +++ linux/fs/ext2/dir.c @@ -703,7 +703,7 @@ const struct file_operations ext2_dir_op .llseek = generic_file_llseek, .read = generic_read_dir, .readdir= ext2_readdir, - .ioctl = ext2_ioctl, + .unlocked_ioctl = ext2_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext2_compat_ioctl, #endif Index: linux/fs/ext2/ext2.h === --- linux.orig/fs/ext2/ext2.h +++ linux/fs/ext2/ext2.h @@ -139,8 +139,7 @@ int __ext2_write_begin(struct file *file struct page **pagep, void **fsdata); /* ioctl.c */ -extern int ext2_ioctl (struct inode *, struct file *, unsigned int, - unsigned long); +extern long ext2_ioctl(struct file *, unsigned int, unsigned long); extern long ext2_compat_ioctl(struct file *, unsigned int, unsigned long); /* namei.c */ Index: linux/fs/ext2/file.c === --- linux.orig/fs/ext2/file.c +++ linux/fs/ext2/file.c @@ -48,7 +48,7 @@ const struct file_operations ext2_file_o .write = do_sync_write, .aio_read = generic_file_aio_read, .aio_write = generic_file_aio_write, - .ioctl = ext2_ioctl, + .unlocked_ioctl = ext2_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext2_compat_ioctl, #endif @@ -65,7 +65,7 @@ const struct file_operations ext2_xip_fi .llseek = generic_file_llseek, .read = xip_file_read, .write = xip_file_write, - .ioctl = ext2_ioctl, + .unlocked_ioctl = ext2_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext2_compat_ioctl, #endif Index: linux/fs/ext2/ioctl.c === --- linux.orig/fs/ext2/ioctl.c +++ linux/fs/ext2/ioctl.c @@ -17,9 +17,9 @@ #include -int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, - unsigned long arg) +long ext2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { + struct inode *inode = filp->f_dentry->d_inode; struct ext2_inode_info *ei = EXT2_I(inode); unsigned int flags; unsigned short rsv_window_size; @@ -141,9 +141,6 @@ int ext2_ioctl (struct inode * inode, st #ifdef CONFIG_COMPAT long ext2_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { - struct inode *inode = file->f_path.dentry->d_inode; - int ret; - /* These are just misnamed, they actually get/put from/to user an int */ switch (cmd) { case EXT2_IOC32_GETFLAGS: @@ -161,9 +158,6 @@ long ext2_compat_ioctl(struct file *file default: return -ENOIOCTLCMD; } - lock_kernel(); - ret = ext2_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg)); - unlock_kernel(); - return ret; + return ext2_ioctl(file, cmd, (unsigned long) compat_ptr(arg)); } #endif - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [2/18] BKL-removal: Remove incorrect BKL comment in ext2
No BKL used anywhere, so don't mention it. Signed-off-by: Andi Kleen <[EMAIL PROTECTED]> --- fs/ext2/inode.c |1 - 1 file changed, 1 deletion(-) Index: linux/fs/ext2/inode.c === --- linux.orig/fs/ext2/inode.c +++ linux/fs/ext2/inode.c @@ -569,7 +569,6 @@ static void ext2_splice_branch(struct in * * `handle' can be NULL if create == 0. * - * The BKL may not be held on entry here. Be sure to take it early. * return > 0, # of blocks mapped or allocated. * return = 0, if plain lookup failed. * return < 0, error case. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] fast file mapping for loop
Jens Axboe <[EMAIL PROTECTED]> writes: > > So how does it work? Instead of punting IO to a thread and passing it > through the page cache, we instead attempt to send the IO directly to the Great -- something like this was needed for a long time. > - The file block mappings must not change while loop is using the file. > This means that we have to ensure exclusive access to the file and > this is the bit that is currently missing in the implementation. It > would be nice if we could just do this via open(), ideas welcome... get_write_access()/put_write_access() will block other writers. But as pointed out by others that is not enough for this. I suppose you could use a white list like a special flag for file systems (like ext2/ext3) that do not reallocate blocks. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Remove BKL from fs/locks.c
> The only problem I can see from an NFS perspective is with NFSv2/v3 > locking: unfortunately the protocol provides no way for the server to > notify that a lock may not be granted after the client has been told to > block. You would therefore have to bend the protocol rules by simply > delaying replying to the client until the deadlock timeout occurred > instead of telling it to block. I'm not sure that all clients would be > able to cope... If the delay is short enough (let's say < 2 jiffies) that should be surely no problem? If they couldn't deal with that they couldn't deal with a congested network either. Otherwise lockd could just force a 0 timeout. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Remove BKL from fs/locks.c
Matthew Wilcox <[EMAIL PROTECTED]> writes: > > The blocked_list is a bit more complex since we need to check every lock > on the blocked list, and would need to acquire all the sb_file_lock_locks > to check this list consistently. I don't see a nice way to do this -- > particularly when you consider that we need to run this check every time > someone takes out a POSIX lock that blocks on another lock. Have you considered using a timeout approach? e.g. just start a timer when aquiring the lock and when you can't get it in some short (user configurable) time and only do then the expensive deadlock check. Timers are quite optimized and have per cpu state so they should be cheap enough. AFAIK that's a standard technique used in databases. Advantage is that it keeps all that out of the fast path. Disadvantage is that it takes at least timeout time to detect a deadlock, but they should be infrequent anyways I guess so it's hopefully not a problem (and if it was the user could reset the timeout to 0) -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Beagle and logging inotify events
"Jon Smirl" <[EMAIL PROTECTED]> writes: > On 11/14/07, Chuck Lever <[EMAIL PROTECTED]> wrote: >> On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote: >> > Is it feasible to do something like this in the linux file system >> > architecture? >> > >> > Beagle beats on my disk for an hour when I reboot. Of course I don't >> > like that and I shut Beagle off. >> >> Leopard, by the way, does exactly this: it has a daemon that starts >> at boot time and taps FSEvents then journals file system changes to a >> well-known file on local disk. > > Logging file systems have all of the needed info. Actually most journaling file systems in Linux use block logging and it would be probably hard to get specific file names out of a random collection of logged blocks. And even if you could they would hit a lot of false positives since everything is rounded up to block level. With intent logging like in XFS/JFS it would be easier, but even then costly :- e.g. they might log changes to the inode but there is no back pointer to the file name short of searching the whole directory tree. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SLUB performance regression vs SLAB
Jens Axboe <[EMAIL PROTECTED]> writes: > > Writing a small test module to exercise slub/slab in various ways > (allocating from all cpus freeing from one, as described) should not be > too hard. Perhaps that would be enough to find this performance > discrepancy between slab and slub? You could simulate that by just sending packets using unix sockets between threads bound to different CPUs. Sending a packet allocates; receiving deallocates. But it's not clear that will really simulate the cache bounce environment of the database test. I don't think all passing of data between CPUs using slub objects is slow. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/17] [RFC] Virtual Compound Page Support
Christoph Lameter <[EMAIL PROTECTED]> writes: It seems like a good idea simply because the same functionality is already open coded in a couple of places and unifying that would be a good thing. But ... > The patchset provides this functionality in stages. Stage 1 introduces > the basic fall back mechanism necessary to replace vmalloc allocations > with > > alloc_page(GFP_VFALLBACK, order, ) Is there a reason this needs to be a GFP flag versus a wrapper around alloc_page/free_page ? page_alloc.c is already too complicated and it's better to keep new features separated. The only drawback would be that free_pages would need a different call, but that doesn't seem like a big problem. Especially integrating it into slab would seem wrong to me. slab is already too complicated and for users who need that large areas page granuality rounding to pages is probably fine. Also such a wrapper could do the old alloc_page_exact() trick: instead of always rounding up to next order return the left over pages to the VM. In some cases this can save significant memory. I'm also a little dubious about your attempts to do vmalloc in interrupt context. Is that really needed? GFP_ATOMIC allocations of large areas seem to be extremly unreliable to me and not design. Even if it works sometimes free probably wouldn't work there due to the flushes, which is very nasty. It would be better to drop that. -Andi > which signifies to the page allocator that a higher order is to be found > but a virtual mapping may stand in if there is an issue with fragmentation. > > Stage 1 functionality does not allow allocation and freeing of virtual > mappings from interrupt contexts. > > The stage 1 series ends with the conversion of a few key uses of vmalloc > in the VM to alloc_pages() for the allocation of sparsemems memmap table > and the wait table in each zone. Other uses of vmalloc could be converted > in the same way. > > > Stage 2 functionality enhances the fallback even more allowing allocation > and frees in interrupt context. > > SLUB is then modified to use the virtual mappings for slab caches > that are marked with SLAB_VFALLBACK. If a slab cache is marked this way > then we drop all the restraints regarding page order and allocate > good large memory areas that fit lots of objects so that we rarely > have to use the slow paths. > > Two slab caches--the dentry cache and the buffer_heads--are then flagged > that way. Others could be converted in the same way. > > The patch set also provides a debugging aid through setting > > CONFIG_VFALLBACK_ALWAYS > > If set then all GFP_VFALLBACK allocations fall back to the virtual > mappings. This is useful for verification tests. The test of this > patch set was done by enabling that options and compiling a kernel. > > > Stage 3 functionality could be the adding of support for the large > buffer size patchset. Not done yet and not sure if it would be useful > to do. > > Much of this patchset may only be needed for special cases in which the > existing defragmentation methods fail for some reason. It may be better to > have the system operate without such a safety net and make sure that the > page allocator can return large orders in a reliable way. > > The initial idea for this patchset came from Nick Piggin's fsblock > and from his arguments about reliability and guarantees. Since his > fsblock uses the virtual mappings I think it is legitimate to > generalize the use of virtual mappings to support higher order > allocations in this way. The application of these ideas to the large > block size patchset etc are straightforward. If wanted I can base > the next rev of the largebuffer patchset on this one and implement > fallback. > > Contrary to Nick, I still doubt that any of this provides a "guarantee". > Have said that I have to deal with various failure scenarios in the VM > daily and I'd certainly like to see it work in a more reliable manner. > > IMHO getting rid of the various workarounds to deal with the small 4k > pages and avoiding additional layers that group these pages in subsystem > specific ways is something that can simplify the kernel and make the > kernel more reliable overall. > > If people feel that a virtual fall back is needed then so be it. Maybe > we can shed our security blanket later when the approaches to deal > with fragmentation have matured. > > The patch set is also available via git from the largeblock git tree via > > git pull > git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git > vcompound > > -- > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/2] 2.6.23-rc3: known regressions with patches
On Mon, Aug 13, 2007 at 10:24:52PM +0200, Michal Piotrowski wrote: > On 13/08/07, Andi Kleen <[EMAIL PROTECTED]> wrote: > > > Unclassified > > > > > > Subject : reset during bootup - 2.6.23-rc2 (git d23cf676) > > > > This is already fixed in mainline > > commit b8d3f2448b8f4ba24f301e23585547ba1acc1f04 > > > > > There is a real regression with failing builds on some old binutils on > > x86-64 know, but I don't know how to fix it. > > Did you mean this? Yes. Anyways it be likely a WONT-FIX -- aka document newer binutils version required. We haven't had any other reports so likely that version is not too widely used. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/2] 2.6.23-rc3: known regressions with patches
> Unclassified > > Subject : reset during bootup - 2.6.23-rc2 (git d23cf676) This is already fixed in mainline There is a real regression with failing builds on some old binutils on x86-64 know, but I don't know how to fix it. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/3] add the fsblock layer
On Sun, Jun 24, 2007 at 01:18:42PM -0700, Arjan van de Ven wrote: > > > Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the > > smallest > > possible type for each architecture. And a couple of ugly casts for set_bit > > et.al. > > but those could be also hidden in macros. Should be relatively easy to do. > > or make a "smallbit" type that is small/supported, so 64 bit if 32 bit > isn't supported, otherwise 32 That wouldn't handle the case where you only need e.g. 8 bits That's fine for x86 too. It only hates atomic accesses crossing cache line boundaries (but handles them too, just slow) -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/3] add the fsblock layer
Nick Piggin <[EMAIL PROTECTED]> writes: [haven't read everything, just commenting on something that caught my eye] > +struct fsblock { > + atomic_tcount; > + union { > + struct { > + unsigned long flags; /* XXX: flags could be int for > better packing */ int is not supported by many architectures, but works on x86 at least. Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the smallest possible type for each architecture. And a couple of ugly casts for set_bit et.al. but those could be also hidden in macros. Should be relatively easy to do. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
Nick Piggin <[EMAIL PROTECTED]> writes: > > - Structure packing. A page gets a number of buffer heads that are > allocated in a linked list. fsblocks are allocated contiguously, so > cacheline footprint is smaller in the above situation. It would be interesting to test if that makes a difference for database benchmarks running over file systems. Databases eat a lot of cache so in theory any cache improvements in the kernel which often runs cache cold then should be beneficial. But I guess it would need at least ext2 to test; Minix is probably not good enough. In general have you benchmarked the CPU overhead of old vs new code? e.g. when we went to BIO scalability went up, but CPU costs of a single request also went up. It would be nice to not continue or better reverse that trend. > - Large block support. I can mount and run an 8K block size minix3 fs on > my 4K page system and it didn't require anything special in the fs. We > can go up to about 32MB blocks now, and gigabyte+ blocks would only > require one more bit in the fsblock flags. fsblock_superpage blocks > are > PAGE_CACHE_SIZE, midpage ==, and subpage <. Can it be cleanly ifdefed or optimized away? Unless the fragmentation problem is not solved it would seem rather pointless to me. Also I personally still think the right way to approach this is larger softpage size. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] NFS: Make NFS root work again
Andrew Morton <[EMAIL PROTECTED]> writes: > > What is not working, and how does this patch fix it? FWIW i use nfs root regularly for testing kernels and it works for me. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NILFS version 2 now available
[EMAIL PROTECTED] writes: > NILFS (a New Implementation of a Log-structured Filesystem) Version 2 have > been available at the project website > > http://www.nilfs.org/ > > If you are interested, please visit to our website. Could you please give some information on the use cases for this file system. Is it another flash file system or optimized for block devices? What do you think are the primary use cases? Do you see it as a general purpose file system like ext3 or targetted at some specific niche? Is it supposed to be production ready eventually or more intended for research? Does it avoid the long mount time problem the other Linux log file system have? Thanks, -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] TileFS - a proposal for scalable integrity checking
Matt Mackall <[EMAIL PROTECTED]> writes: > This is a relatively simple scheme for making a filesystem with > incremental online consistency checks of both data and metadata. > Overhead can be well under 1% disk space and CPU overhead may also be > very small, while greatly improving filesystem integrity. Problem I see is that your scheme doesn't support metadata checksums only. IMHO those are the most interesting because they have the potential to be basically zero cost, unlike full data checksumming. And doing metadata checksums is enough to handle the fsck problem. I'm sure there are many cases where full checksumming makes sense too, but those shouldn't be forced on everybody because it will slow down some important workloads (like O_DIRECT) Metadata checksums would be best just put into the file systems data structures. Essentially every object (inode, extent, directory entry, super block) should have a checksum that can be incrementially updated. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6
On Tuesday 24 April 2007 12:40:24 Jan Kara wrote: > > One of my autoboot test clients gave me this during shutdown. It used > > reiserfs and autofs and NFS heavily. > Jeff has a fix for this bug so it should go away soon... Thanks for > report anyway :). Well I hit two more -- see other mails if you're interested. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
More reiserfs trouble in 2.6.21-rc5
FYI, This was a debugging kernel (preempt, slab debugging, lockdep etc. enabled) running autotest and some other load on a 4 core Opteron system There was also another lockdep warning before that which I'm sending separately. Looks like some memory corruption. Could be something else, but at least reiserfs is the messenger. BTW these kind of backtraces are a good example on why I want the dwarf2 unwinder back. -Andi [ cut here ] kernel BUG at /mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/mm/slab.c:2380! invalid opcode: [1] PREEMPT SMP CPU 0 Modules linked in: Pid: 12205, comm: find Not tainted 2.6.21-rc5-git6 #44 RIP: 0010:[] [] cache_alloc_refill+0xe6/0x22a RSP: 0018:810078be7a28 EFLAGS: 00010002 RAX: 0001 RBX: 0001 RCX: 8027b826 RDX: 0003 RSI: 81017d077000 RDI: 8100f7f7c540 RBP: 81017d077000 R08: 8100f7f7ec78 R09: 8101fa6cf178 R10: 810100116070 R11: R12: 810100116070 R13: 8100f7f7ec78 R14: 8100f7f7c540 R15: 000c FS: 2acc1ce646d0() GS:8074a000() knlGS:5b3f2b90 CS: 0010 DS: ES: CR0: 8005003b CR2: 43d45000 CR3: 88a9b000 CR4: 06e0 Process find (pid: 12205, threadinfo 810078be6000, task 81002a76d0c0) Stack: 000100d0 8101002b6248 8100f7f7c540 0246 00d0 802ce4cb 802c7590 8027bda3 8101002b6248 8100070a9988 8101002b6248 Call Trace: [] reiserfs_alloc_inode+0x15/0x2a [] reiserfs_find_actor+0x0/0x1b [] kmem_cache_alloc+0x8c/0xe2 [] reiserfs_alloc_inode+0x15/0x2a [] alloc_inode+0x12/0x13f [] iget5_locked+0x5a/0x18a [] reiserfs_init_locked_inode+0x0/0x12 [] reiserfs_iget+0x30/0x92 [] pathrelse+0x24/0x3c [] reiserfs_lookup+0xcf/0x138 [] _read_trylock+0x47/0x6b [] d_alloc+0x1c4/0x1d0 [] do_lookup+0xc4/0x1ae [] __link_path_walk+0x885/0xd2b [] link_path_walk+0x58/0xe0 [] do_path_lookup+0x1be/0x1e2 [] getname+0x152/0x196 [] __user_walk_fd+0x37/0x53 [] vfs_lstat_fd+0x18/0x47 [] sys_newlstat+0x19/0x31 [] trace_hardirqs_on_thunk+0x35/0x37 [] _atomic_dec_and_lock+0x39/0x58 [] system_call+0x7e/0x83 Code: 0f 0b eb fe 49 8b 86 70 03 00 00 49 ff 86 78 03 00 00 48 ff RIP [] cache_alloc_refill+0xe6/0x22a RSP note: find[12205] exited with preempt_count 1 BUG: scheduling while atomic: find/0x1001/12205 INFO: lockdep is turned off. UG: scheduling while atomic: find/0x1001/12205 INFO: lockdep is turned off. Call Trace: [] __sched_text_start+0x81/0x80b [] vt_console_print+0x21f/0x235 [] _spin_unlock_irqrestore+0x49/0x69 [] __cond_resched+0x13/0x32 [] cond_resched+0x2e/0x39 [] unmap_vmas+0x5e5/0x778 [] exit_mmap+0x80/0x117 [] mmput+0x2c/0x9e [] do_exit+0x21a/0x82e [] _spin_unlock_irqrestore+0x49/0x69 [] kernel_math_error+0x0/0x90 [] do_invalid_op+0xb2/0xbc [] cache_alloc_refill+0xe6/0x22a [] trace_hardirqs_on_thunk+0x35/0x37 [] restore_args+0x0/0x30 [] error_exit+0x0/0x96 [] cache_alloc_refill+0x6f/0x22a [] cache_alloc_refill+0xe6/0x22a [] cache_alloc_refill+0xcd/0x22a [] reiserfs_alloc_inode+0x15/0x2a [] reiserfs_find_actor+0x0/0x1b [] kmem_cache_alloc+0x8c/0xe2 [] reiserfs_alloc_inode+0x15/0x2a [] alloc_inode+0x12/0x13f [] iget5_locked+0x5a/0x18a [] reiserfs_init_locked_inode+0x0/0x12 [] reiserfs_iget+0x30/0x92 [] pathrelse+0x24/0x3c [] reiserfs_lookup+0xcf/0x138 [] _read_trylock+0x47/0x6b [] d_alloc+0x1c4/0x1d0 [] do_lookup+0xc4/0x1ae [] __link_path_walk+0x885/0xd2b [] link_path_walk+0x58/0xe0 [] do_path_lookup+0x1be/0x1e2 [] getname+0x152/0x196 [] __user_walk_fd+0x37/0x53 [] vfs_lstat_fd+0x18/0x47 [] sys_newlstat+0x19/0x31 [] trace_hardirqs_on_thunk+0x35/0x37 [] _atomic_dec_and_lock+0x39/0x58 [] system_call+0x7e/0x83 BUG: scheduling while atomic: find/0x1001/12205 INFO: lockdep is turned off. [... more similar messages ...] - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
reiserfs lockdep warning in 2.6.21-rc5
=== [ INFO: possible circular locking dependency detected ] 2.6.21-rc5-git6 #44 --- perl/7968 is trying to acquire lock: (&inode->i_mutex){--..}, at: [] reiserfs_file_release+0x109/0x2cc but task is already holding lock: (&mm->mmap_sem){}, at: [] sys_munmap+0x32/0x5a which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_sem){}: [] __lock_acquire+0x9e0/0xb79 [] lock_acquire+0x48/0x63 [] do_page_fault+0x3a4/0x7b2 [] down_read_trylock+0xe/0x3b [] down_read+0x21/0x2a [] do_page_fault+0x3a4/0x7b2 [] trace_hardirqs_on+0x11c/0x140 [] _read_unlock_irq+0x2f/0x4a [] find_lock_page+0x91/0x9d [] find_or_create_page+0x1e/0x75 [] error_exit+0x0/0x96 [] reiserfs_release_claimed_blocks+0x22/0x49 [] reiserfs_copy_from_user_to_file_region+0x7e/0xf3 [] reiserfs_file_write+0x15a1/0x1795 [] _spin_unlock_irqrestore+0x49/0x69 [] trace_hardirqs_on_thunk+0x35/0x37 [] _spin_unlock_irq+0x24/0x4a [] trace_hardirqs_on+0x11c/0x140 [] _spin_unlock_irq+0x2f/0x4a [] thread_return+0xee/0x135 [] _read_unlock_irq+0x24/0x4a [] trace_hardirqs_on+0x11c/0x140 [] _read_unlock_irq+0x2f/0x4a [] find_get_pages_tag+0x75/0x80 [] vfs_write+0xad/0x136 [] sys_pwrite64+0x50/0x70 [] ia32_sysret+0x0/0xa [] 0x -> #0 (&inode->i_mutex){--..}: [] print_circular_bug_header+0xcc/0xd3 [] __lock_acquire+0x8dc/0xb79 [] lock_acquire+0x48/0x63 [] reiserfs_file_release+0x109/0x2cc [] debug_mutex_lock_common+0x16/0x23 [] __mutex_lock_slowpath+0xe1/0x293 [] reiserfs_file_release+0x109/0x2cc [] __fput+0xa1/0x15e [] remove_vma+0x35/0x5c [] do_munmap+0x258/0x27a [] __down_write_nested+0x34/0x9e [] sys_munmap+0x40/0x5a [] system_call+0x7e/0x83 [] 0x other info that might help us debug this: 1 lock held by perl/7968: #0: (&mm->mmap_sem){}, at: [] sys_munmap+0x32/0x5a stack backtrace: Call Trace: [] print_circular_bug_tail+0x69/0x72 [] print_circular_bug_header+0xcc/0xd3 [] __lock_acquire+0x8dc/0xb79 [] lock_acquire+0x48/0x63 [] reiserfs_file_release+0x109/0x2cc [] debug_mutex_lock_common+0x16/0x23 [] __mutex_lock_slowpath+0xe1/0x293 [] reiserfs_file_release+0x109/0x2cc [] __fput+0xa1/0x15e [] remove_vma+0x35/0x5c [] do_munmap+0x258/0x27a [] __down_write_nested+0x34/0x9e [] sys_munmap+0x40/0x5a [] system_call+0x7e/0x83 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/8] allow unprivileged mounts
Andrew Morton <[EMAIL PROTECTED]> writes: > On Fri, 20 Apr 2007 12:25:39 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > > Define a new fs flag FS_SAFE, which denotes, that unprivileged > > mounting of this filesystem may not constitute a security problem. > > > > Since most filesystems haven't been designed with unprivileged > > mounting in mind, a thorough audit is needed before setting this flag. > > Practically speaking, is there any realistic likelihood that any filesystem > apart from FUSE will ever use this? If it worked for mount --bind for any fs I could see uses of this. I haven't thought through the security implications though, so it might not work. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
BUG: Dentry still in use during umount in 2.6.21-rc5-git6
One of my autoboot test clients gave me this during shutdown. It used reiserfs and autofs and NFS heavily. Unmounting file systems BUG: Dentry 8100f3693a40{i=2352220,n=xattrs} still in use (1) [unmount of reiserfs sda9] [ cut here ] kernel BUG at /mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/fs/dcache.c:623! invalid opcode: [1] SMP CPU 1 Modules linked in: Pid: 15791, comm: umount Not tainted 2.6.21-rc5-git6 #44 RIP: 0010:[] [] shrink_dcache_for_umount_subtree+0x178/0x250 RSP: 0018:8100f5f67e18 EFLAGS: 00010292 RAX: 0060 RBX: 8100f3693a40 RCX: 5207 RDX: RSI: 0046 RDI: 00014661 RBP: 8100f6dc9cc0 R08: 00a0 R09: 0005 R10: R11: R12: 8100f3693aa0 R13: 00014661 R14: 0050ea70 R15: 0050ead0 FS: 2adc863a86d0() GS:8100f7fdc1c0() knlGS:b7be38d0 CS: 0010 DS: ES: CR0: 8005003b CR2: 2adc8626a688 CR3: f628b000 CR4: 06e0 Process umount (pid: 15791, threadinfo 8100f5f66000, task 8100f7a08100) Stack: 810004dab218 810004dab000 80558860 810004dab000 8028815b 810004dab000 8027a1a5 8100f6c50980 806c1600 8027a2a4 Call Trace: [] shrink_dcache_for_umount+0x2f/0x3d [] generic_shutdown_super+0x19/0xf2 [] kill_block_super+0x26/0x3b [] deactivate_super+0x47/0x60 [] sys_umount+0x1f7/0x22a [] sys_newstat+0x19/0x31 [] system_call+0x7e/0x83 Code: 0f 0b eb fe 48 8b 6b 28 48 39 dd 75 04 31 ed eb 04 f0 ff 4d RIP [] shrink_dcache_for_umount_subtree+0x178/0x250 RSP /etc/init.d/boot.d/K14boot.localfs: line 93: 15791 Segmentation fault umount -avt noproc,nonfs,nonfs4,nosmbfs,nocifs,notmpfs -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AppArmor FAQ
> The vast majority of applications are not > modified to be SELinux aware - only a small handful of security aware > applications are modified. All applications that can edit /etc/resolv.conf? That's nearly everything. You yourself gave the example; I'm not making anything up. -Andi (sensing a loop in the thread -- things that already have been discussed come back from the dead.) P.S.: If you want to loop further please drop me from cc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AppArmor FAQ
> For SELinux to be effective it has to have a complete policy definition. > This would prevent the OpenOffice access (unless OpenOffice is in the > modify_resolv_conf_t domain) above. This would mean no fully functional root user anymore. My understanding is rather that at least in the Fedora default setup individual applications are confined with targetted policy and root left alone because normal system administrators get very unhappy when root becomes powerless. I was merely pointing out that in this setup path or namespace based access control are much easier to fit in than label based access because they normally don't require changing applications. John's original document also listed some other advantages that I don't need to repeat. In "i don't care if it looks like Unix anymore" security setups like you're describing that's undoubtedly different and labels might indeed work because you forbid just anybody changing them easily. If that makes the users happy is a different question though. I suppose it will keep security consultants employed at least @) Arguably the preserving label issue is not unique to SELinux but also applies to ACLs and other possible uses of EAs, but then people normally don't need to set any ACLs on /etc/resolv.conf. I personally don't like either too much. Path based access control is somewhat hackish and ugly and slow in the current implementation, but I haven't seen an similarly easy to configure solution yet. plan9 like limited namespaces for individual processes initially seem like a nice alternative, but in practice they're also too hard to use and suffer from many of the problems the EA label approach has. But easy to use security is probably better than complicated security because normal people will more likely use it. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AppArmor FAQ
On Tue, Apr 17, 2007 at 01:47:39PM -0400, James Morris wrote: > Normal applications need zero modification under SELinux. > > Some applications which manage security may need to be made SELinux-aware, Anything that can touch /etc/resolv.conf? That's potentially a lot of binaries if you consider anything scripts could do with it. > although this can often be done with PAM plugins, which is a standard way > to do this kind of thing in modern Unix & Linux OSs. PAM plugins in vi and emacs? Scary idea. And what do you do if someone decides to use OpenOffice to edit their /etc/resolv.conf? For a lot of people that's the only text editor they know. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AppArmor FAQ
Karl MacMillan <[EMAIL PROTECTED]> writes: > No - the real fix is to change the applications or to run under a policy > that confines all applications. Most of the problems with resolv.conf, > mtab, etc. stem from admin processes (e.g., editors or shell scripts) > all running under the same unconfined domain. > > In some cases applications need modification as only the application has > enough information to determine the correct label. Usually this means > preserving labels from input files or separating the output into > distinct directories so type transitions or label inheritance will work. > > restorecond is just a hack not a requirement or a sign that something is > wrong with the model. That is why it is a userspace application and not > integrated into the kernel mechanism. You nicely show one of the major disadvantages of the label model vs the path model here: it requires modification of a lot of applications. Maybe John can borrow your statement for new versions of his FAQ @) -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
> It's nice to check for consistency though, so we're adding that. Profile > loading is a trusted operation, at least so far, and so security wise we > don't actually have to care --- if loading an invalid profile can bring down > the system, then that's no worse than an arbitrary module that crashes the > machine. Not sure if there will ever be user loadable profiles; at least at > that point we had to care. A security system that allows to crash the kernel is a little weird though. It would be better to check. Not that a recursion check is particularly expensive. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
[EMAIL PROTECTED] writes: [didn't review code fully, just some stuff I noticed] > + > +struct aa_dfa { > + struct table_header *tables[YYTD_ID_NXT]; > +}; If that is passed in from user space you would need special compat code for 64bit kernels who support 32bit userland. Better to avoid pointers. > + > + /* get optional subprofiles */ > + if (aa_is_nameX(e, AA_LIST, "hats")) { > + while (!aa_is_nameX(e, AA_LISTEND, NULL)) { > + struct aa_profile *subprofile; > + subprofile = aa_unpack_profile(e); Is there any check that would guard the recursion from stack overflow on malicious input? > + /* > + * Replacement needs to allocate a new aa_task_context for each > + * task confined by old_profile. To do this the profile locks > + * are only held when the actual switch is done per task. While > + * looping to allocate a new aa_task_context the old_task list > + * may get shorter if tasks exit/change their profile but will > + * not get longer as new task will not use old_profile detecting > + * that is stale. > + */ > + do { > + new_cxt = aa_alloc_task_context(GFP_KERNEL | __GFP_NOFAIL); NOFAIL is usually a bad sign. It should be only used if there is no alternative. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take36 10/10] kevent: Kevent based generic AIO.
Evgeniy Polyakov <[EMAIL PROTECTED]> writes: > > aio_sendfile_path() is essentially aio_sendfile(), except that it takes > source filename as parameter, has a pointer to private header > and its size (which allows to send header and file's content in one syscall > instead of three (open, send, sendfile) and returns opened file descriptor. Are you sure this is a useful optimization? Do you have numbers vs open+aio_sendfile+close? Compared to the cost of sending a complete file three system calls should be quite in the noise. And Linux system calls are not that expensive (few hundred cycles normally) Adding such compound system calls would be a worrying precedent because I'm sure others would want them then for their favourite system call combo too. If they were really useful it might make more sense to have a batch() system call that works for arbitary calls, but I'm not convinced yet it's even needed. It would be certainly ugly. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix quadratic behavior of shrink_dcache_parent()
Miklos Szeredi <[EMAIL PROTECTED]> writes: > So the second part of the problem is to somehow limit the number of > dentries used. Not easy... OpenVZ has some existing work in this area to separate their virtual machines. I assume they will eventually submit it. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: GFS, what's remaining
Andrew Morton <[EMAIL PROTECTED]> writes: > > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > > possibly gain (or vice versa) > > > > > > - Relative merits of the two offerings > > > > You missed the important one - people actively use it and have been for > > some years. Same reason with have NTFS, HPFS, and all the others. On > > that alone it makes sense to include. > > Again, that's not a technical reason. It's _a_ reason, sure. But what are > the technical reasons for merging gfs[2], ocfs2, both or neither? There seems to be clearly a need for a shared-storage fs of some sort for HA clusters and virtualized usage (multiple guests sharing a partition). Shared storage can be more efficient than network file systems like NFS because the storage access is often more efficient than network access and it is more reliable because it doesn't have a single point of failure in form of the NFS server. It's also a logical extension of the "failover on failure" clusters many people run now - instead of only failing over the shared fs at failure and keeping one machine idle the load can be balanced between multiple machines at any time. One argument to merge both might be that nobody really knows yet which shared-storage file system (GFS or OCFS2) is better. The only way to find out would be to let the user base try out both, and that's most practical when they're merged. Personally I think ocfs2 has nicer&cleaner code than GFS. It seems to be more or less a 64bit ext3 with cluster support, while GFS seems to reinvent a lot more things and has somewhat uglier code. On the other hand GFS' cluster support seems to be more aimed at being a universal cluster service open for other usages too, which might be a good thing. OCFS2s cluster seems to be more aimed at only serving the file system. But which one works better in practice is really an open question. The only thing that should be probably resolved is a common API for at least the clustered lock manager. Having multiple incompatible user space APIs for that would be sad. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 06:03:07PM -0700, Jeremy Allison wrote: > The only thing we need to fix (IMHO) is the close(dup(fd)) "bug", > which I have yet to see any application treat as anything other > than a spec. bug that must be worked around. I don't think it's a bug right now; not dropping them and keeping them around forever adds a quite lot of new problems: just see how unnatural sysv shared memory segments and semaphores fit into the rest of the mostly elegant unix model. You need to have special administration tools for them, you need to clean up after crashing applications that keep their resources around, you need special global resource limits to avoid users running amok etc. Persistent locks have similar problems. They probably can all be solved, but it won't be pretty or easy. One way to work around this would be to have a special device where these locks are associated with an open descriptor. e.g. in samba you would open that device (it would work like the /dev/ptmx multiplexer and just give new instances for every open) and remove close-on-exec and inherit it to every child. It would even work for more distributed applications without clear child-parent relations, you could pass that fd around via unix sockets. Then all persistent locks would be associated with the fd. When samba dies all references to that lock manager instance would be closed and the system could clean up the persistent locks without administrator support. I think such a scheme would be much nicer to the rest of the system and avoid the problems the sysv stuff has. Comments? -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: File Locking in Linux 2.5
On Thu, May 03, 2001 at 08:36:13AM +0100, Matthew Wilcox wrote: > I'll get to it this weekend then. Should be a relatively simple patch. Just don't forget to add a per user ulimit for it and probably an admin tool like ipcs. > Are there any other semantics you want changing from the POSIX lock? > I'm thinking of removing deadlock detection, for one... If you're willing to delay deadlock detection a bit you can just do it without a global list: on waiting for the lock you do schedule_timeout for 5-10s; after that you put the lock onto a timeout queue and search the queue for deadlock and repeat that searching for some time in slow intervals. (That's the textbook algorithm out of the Grey/Reuter database book) Disadvantage is that the big timer list must be modified for every lock operation; but that one is already efficient and will in 2.5 probably be per CPU. Advantage is that it moves near completely out of the fast path. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: File Locking in 2.5
On Mon, Apr 30, 2001 at 12:39:23PM -0600, Matthew Wilcox wrote: > * All filesystems will fill in their ->lock method. Why when a common stub should work for 90% of them? Please keep global search-and-edit operation low when not absolutely possible. > * Local filesystems should all use local_lock() for this method, >unless they have a good reason to provide their own facilities. So this should be default for ->lock >A clustered filesystem might call out to the network and say `I want >to put this lock on this file'. Either some other node in the cluster >says `Denied', `Blocked' or `Granted' (ie handles the request), OR no >other node accepts responsibility for the lock, in which case we lock >it locally by calling local_lock(). It would be nice if it was possible to interface it to the distributed lock manager IBM has recently GPLed... >With the changes above, even this code can go away. The nfs client can >keep a per-fs list of locks, and reestablish them at server restart. >No need to interact with the local locking at all. It's still linear lists I guess? Any evidence that a hash table or a ternary search tree makes sense? > * Mandatory Locks. These are byte-range mandatory locks. To use >them, mount the filesystem with the `mand' option enabled, and set >the file mode to g-x g+s. POSIX locks applied to this file will >now be mandatory. Mandatory locks do not prevent accesses via >mmap(). You should not use Mandatory locks in new code. I would add a printk warning that fires 10 times, then you can get rid of it. >locks.c still runs almost entirely under the BKL. An earlier attempt >to move it to a different locking scheme was thwarted when the code >was integrated into 2.4.0-test9 while I was on holiday, and without me >submitting it to Linus. Grumble. I plan to move it to _one_ spinlock >to cover all lock-related structures, and I think that will be >possible with the plan described above (since this code will no longer >sleep). I never understood why you can't do it completely inode local protected by the inode lock; but then I don't claim that I understand fs/locks.c It would be a small drawback for your new global locks because these inodes could not be flushed (so you need a new resource management and perhaps per user quotas for it anyways), but probably not worse than the current state. >As soon as lockd no longer needs to keep its fingers inside locks.c, I >want to remove the global list of locks. It's also used by /proc/locks >-- which probably needs to go away anyway. So what's useful about >/proc/locks? I'd like to be able to see which locks my process has, Or just walk the inode hash ? -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Q: sb->s_bdev same device as sb->s_dev?
On Sun, Mar 11, 2001 at 04:55:24AM -0500, Alexander Viro wrote: > > > On Sun, 11 Mar 2001, Andi Kleen wrote: > > > On Sat, Mar 10, 2001 at 08:44:41PM -0500, Alexander Viro wrote: > > > Too many places are using ->i_dev right now to eliminate ->i_dev. > > > Yes, it should eventually go away (and icache should work by ->i_sb/->i_ino > > > instead of ->i_dev/->i_ino). ->i_dev should eventually go away, but that > > > was too late in 2.4 cycle to do that. > > > > And how would you implement stateless file handles for NFS then if i_dev > > was gone ? > > (->i_sb is not constant over reboots) > > ->i_sb->s_dev is not worse than ->i_dev. Yes, it's an obvious improvement ;) -Andi (who thinks VFS is turning more and more into pointer hell) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Q: sb->s_bdev same device as sb->s_dev?
On Sat, Mar 10, 2001 at 08:44:41PM -0500, Alexander Viro wrote: > Too many places are using ->i_dev right now to eliminate ->i_dev. > Yes, it should eventually go away (and icache should work by ->i_sb/->i_ino > instead of ->i_dev/->i_ino). ->i_dev should eventually go away, but that > was too late in 2.4 cycle to do that. And how would you implement stateless file handles for NFS then if i_dev was gone ? (->i_sb is not constant over reboots) -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: NULL f_ops
On Thu, Mar 08, 2001 at 03:48:08PM +, Matthew Wilcox wrote: > > Someone tell me if my chain of reasoning is wrong here... > > (1) The only way to get a `struct file' is to call get_empty_filp() There is code that creates private struct files without calling get_empty_filp() NFS comes to mind, but there is other too. Now of course these could be changed, but I think the change would cause much more work for debugging/ fixing than you think ;) [I think it causes performance problems there BTW -- it kills ext2 readahead because the file is not preserved] -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: block bitmap readahead for ext2
On Wed, Jan 24, 2001 at 11:00:41AM -0500, Jeff Darcy wrote: > From: "Andi Kleen" <[EMAIL PROTECTED]> > > Does this mean that unwriten extents are supported now in pagebuf? > Otherwise > > this ioctl would need to prezero the disks blocks to prevent old data > > from being leaked, right? > > I don't know specifically about the current Linux pagebuf code, but in > general this is not necessarily the case. If the interfaces are set up so > that one can know when the whole block is about to be overwritten with new > data, no zeroing is necessary for that block. This is what XFS does on Irix according to the design documentation on the XFS site. It has special unwriten extents that on read return zeroes. On write an unwriten extent is turned into a normal extent. In the Linux port the VM support needed for that just seems not to be implemented yet. Pagebuf in this case is a XFS datastructure to describe multi page objects that are continuous on disk in memory -- normal linux page cache only knows about allocated individual pages and doesn't support preallocation. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: block bitmap readahead for ext2
Quick question. On Wed, Jan 24, 2001 at 09:20:43AM -0600, Steve Lord wrote: > What XFS also has is an ioctl to preallocate disk space, there is very > little documentation on this (none), but if you look in the file > cmd/xfstests/src/randholes.c you will see an ioctl like this: > > struct flock64 fl; > fl.l_start = offset; > fl.l_len = blocksize; > fl.l_whence = 0; > > if (ioctl(fd, XFS_IOC_RESVSP64, &fl) < 0) { > > This is preallocating disk blocks for file data. Of course, it only > works if you know file size in advance. We should also have direct > I/O in the next month or so. Does this mean that unwriten extents are supported now in pagebuf? Otherwise this ioctl would need to prezero the disks blocks to prevent old data from being leaked, right? -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: [RFC] api for consistent lvm snapshots
On Wed, Oct 04, 2000 at 01:47:28PM -0400, Ken Hirsch wrote: > > The advantages I see are that database systems and other resource managers > would have much less complicated buffering logic, that the data would not be > double-buffered, that the LRU algorithm would be more accurate (there was a > discussion of this a few months ago with regard to multiple file systems). > Hopefully we could better performance without having to tune so much. > > It would also be a step toward a common open-source transaction manager. > > There are issues with regard to checkpointing, but other than that, what do > you think of this idea? The main motivation for that change seems to be simpler buffering logic in a user space log manager. How about simply adding a madvise() flag that tells the MM system that you don't want it to flush the page to the backing store yet. After the log manager did its job it could clear that flag and the system would flush it. The only nasty issue is that it could cause deadlock in the low memory case when the system cannot flush pages because they all have this flag set. A very simple workaround would be to write the page cache page to swap then instead of the underlying file [this would make swap accounting harder, but linux doesn't do that anyways]. Later the swapped page could be paged in again and then writen to the real file. It would be slow in the trashing case, but in this case you should not expect much performance anyways. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: Stupid Question
On Sun, Sep 10, 2000 at 01:18:40PM -0700, Ion Badulescu wrote: > In article <[EMAIL PROTECTED]> you wrote: > > Can I ask a stupid question: why do we pass a pointer to the file > > position along with the file when calling read/write methods as in: > > > > read(file, buf, count, &file->f_pos) > > > > Is there ever a case when we don't want to use the f_pos from the file > > struct? > > Yes: pread()/pwrite(). > > BTW, they've caused us quite a bit of trouble in the stacking fs code. > We need to detect if the VFS code calling us is read() or pread(), in > order to know how to call our own underlying fs method. The only -- > and very unclean -- way we can detect this is by comparing the fpos > pointer passed to us to the address of file->fpos... > > Passing an update_fpos boolean would be so much cleaner. An update_fpos boolean only would require that pread/pwrite create their own file structure to pass in the user offset. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]