Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Andi Kleen
On Mon, Feb 18, 2008 at 10:16:32AM -0500, Theodore Tso wrote:
> On Mon, Feb 18, 2008 at 04:02:36PM +0100, Tomasz Chmielewski wrote:
> > I tried to copy that filesystem once (when it was much smaller) with "rsync 
> > -a -H", but after 3 days, rsync was still building an index and didn't copy 
> > any file.
> 
> If you're going to copy the whole filesystem don't use rsync! 

Yes, I managed to kill systems (drive them really badly into oom and
get very long swap storms) with rsync -H in the past too. Something is very 
wrong with the rsync implementation of this.

> Use cp
> or a tar pipeline to move the files.

Are you sure cp handles hardlinks correctly? I know tar does,
but I have my doubts about cp.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Andi Kleen
On Mon, Feb 18, 2008 at 09:16:41AM -0500, Theodore Tso wrote:
> ext3 tries to keep inodes in the same block group as their containing
> directory.  If you have lots of hard links, obviously it can't really
> do that, especially since we don't have a good way at mkdir time to
> tell the filesystem, "Psst!  This is going to be a hard link clone of
> that directory over there, put it in the same block group".

Hmm, you think such a hint interface would be worth it?

> 
> > What has helped a bit was to recreate the file system with -O^dir_index
> > dir_index seems to cause more seeks.
> 
> Part of it may have simply been recreating the filesystem, not

Undoubtedly.

> necessarily removing the dir_index feature.  Dir_index speeds up
> individual lookups, but it slows down workloads that do a readdir

But only for large directories right? For kernel source like
directory sizes it seems to be a general loss.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Andi Kleen
Tomasz Chmielewski <[EMAIL PROTECTED]> writes:
>
> Is it normal to expect the write speed go down to only few dozens of
> kilobytes/s? Is it because of that many seeks? Can it be somehow
> optimized? 

I have similar problems on my linux source partition which also
has a lot of hard linked files (although probably not quite
as many as you do). It seems like hard linking prevents
some of the heuristics ext* uses to generate non fragmented
disk layouts and the resulting seeking makes things slow.

What has helped a bit was to recreate the file system with -O^dir_index
dir_index seems to cause more seeks.

Also keeping enough free space is also a good idea because that
allows the file system code better choices on where to place data.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8][for -mm] mem_notify v6: memory_pressure_notify() caller

2008-02-13 Thread Andi Kleen
KOSAKI Motohiro <[EMAIL PROTECTED]> writes:
>
> to be honest, I don't think at mem-cgroup until now.

There is not only mem-cgroup BTW, but also NUMA node restrictons from
NUMA memory policy. So this means a process might not be able to access
all memory.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen
On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > Also worse really fixing it would be a major change to the VFS 
> > because of the way ->read/write are defined :/
> 
> I don't see a problem there. ->read and ->write update the passed pointer
> which is not the real f_pos anyway. Just the copies need fixing. 

They are effectually doing a decoupled read/modify/write cycle. e.g.:

A   B

read fpos   

read fpos

fpos += A   fpos += B
write fpos


write fpos

So you get overlapping reads. Probably not good.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen
On Monday 28 January 2008 13:56:05 Alan Cox wrote:
> > > No specific spec, just general quality of implementation.
> > 
> > I completely agree.  If one thread writes A and another writes B then the
> > kernel should record either A or B, not ((A & 0x) | (B &
> > 0x))
> 
> Agree entirely: the spec doesn't allow for random scribbling in the wrong
> place. It doesn't cover which goes first or who "wins" the race but
> provides pwrite/pread for that situation. Writing somewhere unrelated is
> definitely not to spec 

Actually it would probably -- i guess it's undefined and in undefined
country such things can happen.

Also to be fair I think it's only a problem for the 4GB wrapping case
which is presumably rare (otherwise we would have heard about it)

Also worse really fixing it would be a major change to the VFS 
because of the way ->read/write are defined :/

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen

> I completely agree.  If one thread writes A and another writes B then the
> kernel should record either A or B, not ((A & 0x) | (B &
> 0x))

The problem is pretty nasty unfortunately. To solve it properly I think
the file_operations->read/write prototypes would need to be changed
because otherwise it is not possible to do atomic relative updates
of f_pos. Right now the actual update is burrowed deeply in the low level 
read/write implementation. But that would be a huge impact all over
the tree :/

Or maybe define a new read/write64 and keep the default as 32bit only-- i 
suppose most users don't really need 64bit. Still would be a nasty API 
change.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [3/18] BKL-removal: Convert ext3 to use unlocked_ioctl

2008-01-27 Thread Andi Kleen
On Monday 28 January 2008 06:33, Andrew Morton wrote:
> On Sun, 27 Jan 2008 03:17:09 +0100 (CET) Andi Kleen <[EMAIL PROTECTED]> wrote:
> > I checked ext3_ioctl and it looked largely safe to not be used
> > without BKL.  So convert it over to unlocked_ioctl.
> >
> > The only case where I wasn't quite sure was for the
> > dynamic fs grow ioctls versus umounting -- I kept the BKL for those.
>
> Please cpoy linux-ext4 on ext2/3/4 material.

Ok I'll resubmit those to tytso/ext4-devel (or perhaps he has already seen 
them) 

>
> I skippped a lot of these patches because I just got bored of fixing
> rejects.  Now is a very optimistic time to be raising patches against
> mainline.

JFS and CIFS are already taken care of by the maintainers. This leaves
remote_llseek which touches a couple of file systems. Could you
perhaps take that one only please? And perhaps Nick's minix 
patchkit which looks safe to me and is unlikely to cause conflicts.

> > +   /* AK: not sure the BKL is needed, but this might prevent
> > +* races against umount */
> > +   lock_kernel();
> > err = ext3_group_add(sb, &input);
> > journal_lock_updates(EXT3_SB(sb)->s_journal);
> > journal_flush(EXT3_SB(sb)->s_journal);
> > journal_unlock_updates(EXT3_SB(sb)->s_journal);
> > +   unlock_kernel();
>
> The ext3_ioctl() caller has an open fd against the fs - should be
> sufficient to keep unmount away?

True. I am still conservative because group_add is a lot of code
which I didn't fully check. But with the open fd it's likely safe
to not take the BKL because there is nothing else (except
readdir?) in ext* that takes it.

> It's all reached the stage of stupid.

I'll resubmit ->fasync_unlocked against -mm.

Also I wanted to recheck the ->f_flags locking. I found one bug in those 
already and I can extract the bug fix for that one. 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Andi Kleen
On Monday 28 January 2008 05:13:09 Trond Myklebust wrote:
> 
> On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote:
> > The problem is that it's not a race in who gets to do its thing first, but 
> > a 
> > parallel reader can actually see a corrupted value from the two independent 
> > words on 32bit (e.g. during a 4GB). And this could actually completely 
> > corrupt 
> > f_pos when it happens with two racing relative seeks or read/write()s
> > 
> > I would consider that a bug.
> 
> I disagree. The corruption occurs because this isn't a situation that is
> allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring
> to here?

No specific spec, just general quality of implementation. We normally don't have
non thread safe system calls even if it was in theory allowed by some 
specification.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [0/18] Implement some low hanging BKL removal fruit in fs/*

2008-01-27 Thread Andi Kleen

> BTW. here is a patch I did a while back for minix. I know it isn't
> a big deal, but the work is done so I guess I should send it along.

Looks safe, although I'm surprised it actually gets around with such
little locking in general.

-Andi



-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Andi Kleen
On Monday 28 January 2008 00:08:56 Trond Myklebust wrote:
> 
> On Sun, 2008-01-27 at 16:18 -0600, Steve French wrote:
> > If two seeks overlap, can't you end up with an f_pos value that is
> > different than what either thread seeked to? or if you have a seek and
> > a read overlap can't you end up with the read occurring in the midst
> > of an update of f_pos (which takes more than one instruction on
> > various architectures), e.g. reading an f_pos, which has only the
> > lower half of a 64 bit field updated?   I agree that you shouldn't
> > have seeks racing in parallel but I think it is preferable to get
> > either the updated f_pos or the earlier f_pos not something 1/2
> > updated.
> 
> Why? The threads are doing something inherently liable to corrupt data
> anyway. If they can race over the seek, why wouldn't they race over the
> read or write too?
> The race in lseek() should probably be the least of your worries in this
> case.

The problem is that it's not a race in who gets to do its thing first, but a 
parallel reader can actually see a corrupted value from the two independent 
words on 32bit (e.g. during a 4GB). And this could actually completely corrupt 
f_pos when it happens with two racing relative seeks or read/write()s

I would consider that a bug.

Fixes would be either to always take a spinlock to update this (nasty on 
platforms where spinlocks are expensive like P4) or define some architecture
specific way to read/write 64bit values consistently. In theory also some
lazy locking seqlock like mechanism could be used, but that had the disadvantage
of being theoretically starvable.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Andi Kleen
On Sunday 27 January 2008 23:18:26 Steve French wrote:
> If two seeks overlap, can't you end up with an f_pos value that is
> different than what either thread seeked to?

Yes you can on 32bit. Especially during the 4GB wrap

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Andi Kleen
On Sunday 27 January 2008 17:57:14 Steve French wrote:
> Don't you need to a spinlock/spinunlock(i_lock) or something similar
> (there isn't a spinlock in the file struct unfortunately) around the
> reads and writes from f_pos in fs/read_write.c in remote_llseek with
> your patch since the reads/writes from that field are not necessarily
> atomic and threads could be racing in seek on the same file struct?

Funny that you mention it. I actually noticed this too while working on this, 
but noticed that it is wrong everywhere (as in even plain sys_write/read gets 
it wrong). So I decided to not address it because it is already
broken. 

I did actually send email to a few people about this, but no answer
yet.

I agree it's probably all broken on 32bit platforms, but I'm not
sure how to best address this. When it is comprehensively addressed 
remote_llseek 
can use that new method too.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [14/18] BKL-removal: Add unlocked_fasync v2

2008-01-27 Thread Andi Kleen
> No goto if you use unlocked_fasync?

Indeed. There was another problem in the patch too. Here's an updated
patch that also fixes another latent bug.

The whole f_flags still seems to be somewhat fragile because
the checks tend to happen without any lock, but that has not
changed to the previous state.

---

Add unlocked_fasync v2

Add a new fops entry point to allow fasync without BKL. While it's arguably
unclear this entry point is called often enough for it really matters
it was still relatively easy to do. And there are far less async users
in the tree than ioctls so it's likely they can be all converted 
eventually and then the non unlocked async entry point could be dropped.

There was still the problem of the actual flags change being
protected against other setters of flags. Instead of using BKL
for this use the i_mutex now.

I also added a mutex_lock against one other flags change
that was lockless and could potentially lose updates.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 Documentation/filesystems/vfs.txt |5 -
 fs/fcntl.c|6 +-
 fs/ioctl.c|5 -
 include/linux/fs.h|1 +
 4 files changed, 14 insertions(+), 3 deletions(-)

Index: linux/fs/fcntl.c
===
--- linux.orig/fs/fcntl.c
+++ linux/fs/fcntl.c
@@ -238,18 +238,26 @@ static int setfl(int fd, struct file * f
if (error)
return error;
 
-   lock_kernel();
+   /* AK: potential race here. Since test is unlocked fasync could
+  be called several times in a row with on. We hope the drivers
+  deal with it. */
if ((arg ^ filp->f_flags) & FASYNC) {
-   if (filp->f_op && filp->f_op->fasync) {
-   error = filp->f_op->fasync(fd, filp, (arg & FASYNC) != 
0);
-   if (error < 0)
-   goto out;
+   int on = !!(arg & FASYNC);
+   if (filp->f_op && filp->f_op->unlocked_fasync)
+   error = filp->f_op->unlocked_fasync(fd, filp, on);
+   else if (filp->f_op && filp->f_op->fasync) {
+   lock_kernel();
+   error = filp->f_op->fasync(fd, filp, on);
+   unlock_kernel();
}
+   /* AK: no else error = -EINVAL here? */
+   if (error < 0)
+   return error;
}
 
+   mutex_lock(&filp->f_dentry->d_inode->i_mutex);
filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
- out:
-   unlock_kernel();
+   mutex_unlock(&filp->f_dentry->d_inode->i_mutex);
return error;
 }
 
Index: linux/fs/ioctl.c
===
--- linux.orig/fs/ioctl.c
+++ linux/fs/ioctl.c
@@ -105,10 +105,14 @@ int vfs_ioctl(struct file *filp, unsigne
if(O_NONBLOCK != O_NDELAY)
flag |= O_NDELAY;
 #endif
+
+   mutex_lock(&filp->f_dentry->d_inode->i_mutex);
if (on)
filp->f_flags |= flag;
else
filp->f_flags &= ~flag;
+
+   mutex_unlock(&filp->f_dentry->d_inode->i_mutex);
break;
 
case FIOASYNC:
@@ -117,8 +121,13 @@ int vfs_ioctl(struct file *filp, unsigne
flag = on ? FASYNC : 0;
 
/* Did FASYNC state change ? */
+   /* AK: potential race here: ->fasync could be called 
with on two times
+  in a row. We hope the drivers deal with it. */
if ((flag ^ filp->f_flags) & FASYNC) {
-   if (filp->f_op && filp->f_op->fasync) {
+   if (filp->f_op && filp->f_op->unlocked_fasync)
+   error = filp->f_op->unlocked_fasync(fd,
+   filp, on);
+   else if (filp->f_op && filp->f_op->fasync) {
lock_kernel();
error = filp->f_op->fasync(fd, filp, 
on);
unlock_kernel();
@@ -128,10 +137,12 @@ int vfs_ioctl(struct file *filp, unsigne
if (error != 0)
break;
 
+   mutex_lock(&filp->f_dentry->d_inode->i_mutex);
if (on)
filp

[PATCH] BKL-Removal: Convert pipe to use unlocked_ioctl too

2008-01-26 Thread Andi Kleen

Here's another patch that was missing in the previous BKL-removal series.

No BKL needed in pipe_ioctl

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

Index: linux/fs/pipe.c
===
--- linux.orig/fs/pipe.c
+++ linux/fs/pipe.c
@@ -576,9 +576,7 @@ bad_pipe_w(struct file *filp, const char
return -EBADF;
 }
 
-static int
-pipe_ioctl(struct inode *pino, struct file *filp,
-  unsigned int cmd, unsigned long arg)
+static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
struct inode *inode = filp->f_path.dentry->d_inode;
struct pipe_inode_info *pipe;
@@ -785,7 +783,7 @@ const struct file_operations read_fifo_f
.aio_read   = pipe_read,
.write  = bad_pipe_w,
.poll   = pipe_poll,
-   .ioctl  = pipe_ioctl,
+   .unlocked_ioctl = pipe_ioctl,
.open   = pipe_read_open,
.release= pipe_read_release,
.unlocked_fasync = pipe_read_fasync,
@@ -797,7 +795,7 @@ const struct file_operations write_fifo_
.write  = do_sync_write,
.aio_write  = pipe_write,
.poll   = pipe_poll,
-   .ioctl  = pipe_ioctl,
+   .unlocked_ioctl = pipe_ioctl,
.open   = pipe_write_open,
.release= pipe_write_release,
.unlocked_fasync = pipe_write_fasync,
@@ -810,7 +808,7 @@ const struct file_operations rdwr_fifo_f
.write  = do_sync_write,
.aio_write  = pipe_write,
.poll   = pipe_poll,
-   .ioctl  = pipe_ioctl,
+   .unlocked_ioctl = pipe_ioctl,
.open   = pipe_rdwr_open,
.release= pipe_rdwr_release,
.unlocked_fasync = pipe_rdwr_fasync,
@@ -822,7 +820,7 @@ static const struct file_operations read
.aio_read   = pipe_read,
.write  = bad_pipe_w,
.poll   = pipe_poll,
-   .ioctl  = pipe_ioctl,
+   .unlocked_ioctl = pipe_ioctl,
.open   = pipe_read_open,
.release= pipe_read_release,
.unlocked_fasync = pipe_read_fasync,
@@ -834,7 +832,7 @@ static const struct file_operations writ
.write  = do_sync_write,
.aio_write  = pipe_write,
.poll   = pipe_poll,
-   .ioctl  = pipe_ioctl,
+   .unlocked_ioctl = pipe_ioctl,
.open   = pipe_write_open,
.release= pipe_write_release,
.unlocked_fasync = pipe_write_fasync,
@@ -847,7 +845,7 @@ static const struct file_operations rdwr
.write  = do_sync_write,
.aio_write  = pipe_write,
.poll   = pipe_poll,
-   .ioctl  = pipe_ioctl,
+   .unlocked_ioctl = pipe_ioctl,
.open   = pipe_rdwr_open,
.release= pipe_rdwr_release,
.unlocked_fasync = pipe_rdwr_fasync,
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [6/18] BKL-removal: Convert ext4 to use unlocked_ioctl

2008-01-26 Thread Andi Kleen

I checked ext4_ioctl and it looked largely safe to not be used
without BKL.  So convert it over to unlocked_ioctl.

The only case where I wasn't quite sure was for the
dynamic fs grow ioctls versus umounting -- I kept the BKL for those. 

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/ext4/dir.c   |2 +-
 fs/ext4/file.c  |2 +-
 fs/ext4/ioctl.c |   20 +++-
 include/linux/ext4_fs.h |3 +--
 4 files changed, 14 insertions(+), 13 deletions(-)

Index: linux/fs/ext4/dir.c
===
--- linux.orig/fs/ext4/dir.c
+++ linux/fs/ext4/dir.c
@@ -42,7 +42,7 @@ const struct file_operations ext4_dir_op
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext4_readdir, /* we take BKL. needed?*/
-   .ioctl  = ext4_ioctl,   /* BKL held */
+   .unlocked_ioctl = ext4_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
Index: linux/fs/ext4/file.c
===
--- linux.orig/fs/ext4/file.c
+++ linux/fs/ext4/file.c
@@ -112,7 +112,7 @@ const struct file_operations ext4_file_o
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = ext4_file_write,
-   .ioctl  = ext4_ioctl,
+   .unlocked_ioctl = ext4_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
Index: linux/fs/ext4/ioctl.c
===
--- linux.orig/fs/ext4/ioctl.c
+++ linux/fs/ext4/ioctl.c
@@ -17,9 +17,9 @@
 #include 
 #include 
 
-int ext4_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
-   unsigned long arg)
+long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
+   struct inode *inode = filp->f_dentry->d_inode;
struct ext4_inode_info *ei = EXT4_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
@@ -224,10 +224,14 @@ flags_err:
if (get_user(n_blocks_count, (__u32 __user *)arg))
return -EFAULT;
 
+   /* AK: not sure the BKL is needed, but this might prevent
+* races against umount. */
+   lock_kernel();
err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
jbd2_journal_flush(EXT4_SB(sb)->s_journal);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+   unlock_kernel();
 
return err;
}
@@ -246,10 +250,14 @@ flags_err:
sizeof(input)))
return -EFAULT;
 
+   /* AK: not sure the BKL is needed, but this might prevent
+* races against umount. */
+   lock_kernel();
err = ext4_group_add(sb, &input);
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
jbd2_journal_flush(EXT4_SB(sb)->s_journal);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+   unlock_kernel();
 
return err;
}
@@ -262,9 +270,6 @@ flags_err:
 #ifdef CONFIG_COMPAT
 long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   int ret;
-
/* These are just misnamed, they actually get/put from/to user an int */
switch (cmd) {
case EXT4_IOC32_GETFLAGS:
@@ -304,9 +309,6 @@ long ext4_compat_ioctl(struct file *file
default:
return -ENOIOCTLCMD;
}
-   lock_kernel();
-   ret = ext4_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg));
-   unlock_kernel();
-   return ret;
+   return ext4_ioctl(file, cmd, (unsigned long) compat_ptr(arg));
 }
 #endif
Index: linux/include/linux/ext4_fs.h
===
--- linux.orig/include/linux/ext4_fs.h
+++ linux/include/linux/ext4_fs.h
@@ -939,8 +939,7 @@ extern int ext4_block_truncate_page(hand
struct address_space *mapping, loff_t from);
 
 /* ioctl.c */
-extern int ext4_ioctl (struct inode *, struct file *, unsigned int,
-  unsigned long);
+extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext4_compat_ioctl (struct file *, unsigned int, unsigned long);
 
 /* namei.c */
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [5/18] BKL-removal: Remove incorrect comment refering to lock_kernel() from jbd/jbd2

2008-01-26 Thread Andi Kleen

None of the callers of this function does actually take the BKL
as far as I can see. So remove the comment refering to the BKL.

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/jbd/recovery.c  |2 +-
 fs/jbd2/recovery.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux/fs/jbd/recovery.c
===
--- linux.orig/fs/jbd/recovery.c
+++ linux/fs/jbd/recovery.c
@@ -354,7 +354,7 @@ static int do_one_pass(journal_t *journa
struct buffer_head *obh;
struct buffer_head *nbh;
 
-   cond_resched(); /* We're under lock_kernel() */
+   cond_resched();
 
/* If we already know where to stop the log traversal,
 * check right now that we haven't gone past the end of
Index: linux/fs/jbd2/recovery.c
===
--- linux.orig/fs/jbd2/recovery.c
+++ linux/fs/jbd2/recovery.c
@@ -364,7 +364,7 @@ static int do_one_pass(journal_t *journa
struct buffer_head *obh;
struct buffer_head *nbh;
 
-   cond_resched(); /* We're under lock_kernel() */
+   cond_resched();
 
/* If we already know where to stop the log traversal,
 * check right now that we haven't gone past the end of
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [4/18] ext3: Remove incorrect BKL comment

2008-01-26 Thread Andi Kleen

There is no BKL held on entry in ->fsync nor in the low level ext3_sync_file.

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/ext3/dir.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/fs/ext3/dir.c
===
--- linux.orig/fs/ext3/dir.c
+++ linux/fs/ext3/dir.c
@@ -46,7 +46,7 @@ const struct file_operations ext3_dir_op
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext3_compat_ioctl,
 #endif
-   .fsync  = ext3_sync_file,   /* BKL held */
+   .fsync  = ext3_sync_file,
.release= ext3_release_dir,
 };
 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [12/18] BKL-removal: Convert CIFS over to unlocked_ioctl

2008-01-26 Thread Andi Kleen

cifs_ioctl doesn't seem to need the BKL for anything, so convert it over
to use unlocked_ioctl.

Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/cifs/cifsfs.c |   10 +-
 fs/cifs/cifsfs.h |4 ++--
 fs/cifs/ioctl.c  |4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

Index: linux/fs/cifs/cifsfs.c
===
--- linux.orig/fs/cifs/cifsfs.c
+++ linux/fs/cifs/cifsfs.c
@@ -625,7 +625,7 @@ const struct file_operations cifs_file_o
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_POSIX
-   .ioctl  = cifs_ioctl,
+   .unlocked_ioctl = cifs_ioctl,
 #endif /* CONFIG_CIFS_POSIX */
 
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -645,7 +645,7 @@ const struct file_operations cifs_file_d
.flush = cifs_flush,
.splice_read = generic_file_splice_read,
 #ifdef CONFIG_CIFS_POSIX
-   .ioctl  = cifs_ioctl,
+   .unlocked_ioctl = cifs_ioctl,
 #endif /* CONFIG_CIFS_POSIX */
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -665,7 +665,7 @@ const struct file_operations cifs_file_n
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_POSIX
-   .ioctl  = cifs_ioctl,
+   .unlocked_ioctl = cifs_ioctl,
 #endif /* CONFIG_CIFS_POSIX */
 
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -684,7 +684,7 @@ const struct file_operations cifs_file_d
.flush = cifs_flush,
.splice_read = generic_file_splice_read,
 #ifdef CONFIG_CIFS_POSIX
-   .ioctl  = cifs_ioctl,
+   .unlocked_ioctl = cifs_ioctl,
 #endif /* CONFIG_CIFS_POSIX */
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -699,7 +699,7 @@ const struct file_operations cifs_dir_op
 #ifdef CONFIG_CIFS_EXPERIMENTAL
.dir_notify = cifs_dir_notify,
 #endif /* CONFIG_CIFS_EXPERIMENTAL */
-   .ioctl  = cifs_ioctl,
+   .unlocked_ioctl = cifs_ioctl,
 };
 
 static void
Index: linux/fs/cifs/cifsfs.h
===
--- linux.orig/fs/cifs/cifsfs.h
+++ linux/fs/cifs/cifsfs.h
@@ -99,8 +99,8 @@ extern intcifs_setxattr(struct dentry 
size_t, int);
 extern ssize_t cifs_getxattr(struct dentry *, const char *, void *, size_t);
 extern ssize_t cifs_listxattr(struct dentry *, char *, size_t);
-extern int cifs_ioctl(struct inode *inode, struct file *filep,
-  unsigned int command, unsigned long arg);
+extern long cifs_ioctl(struct file *filep, unsigned int command,
+   unsigned long arg);
 
 #ifdef CONFIG_CIFS_EXPERIMENTAL
 extern const struct export_operations cifs_export_ops;
Index: linux/fs/cifs/ioctl.c
===
--- linux.orig/fs/cifs/ioctl.c
+++ linux/fs/cifs/ioctl.c
@@ -30,9 +30,9 @@
 
 #define CIFS_IOC_CHECKUMOUNT _IO(0xCF, 2)
 
-int cifs_ioctl (struct inode *inode, struct file *filep,
-   unsigned int command, unsigned long arg)
+long cifs_ioctl(struct file *filep, unsigned int command, unsigned long arg)
 {
+   struct inode *inode = filep->f_dentry->d_inode;
int rc = -ENOTTY; /* strange error - but the precedent */
int xid;
struct cifs_sb_info *cifs_sb;
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [11/18] BKL-removal: Convert ocfs2 over to unlocked_ioctl

2008-01-26 Thread Andi Kleen

As far as I can see there is nothing in ocfs2_ioctl that requires the BKL,
so use unlocked_ioctl

Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/ocfs2/file.c  |4 ++--
 fs/ocfs2/ioctl.c |   12 +++-
 fs/ocfs2/ioctl.h |3 +--
 3 files changed, 6 insertions(+), 13 deletions(-)

Index: linux/fs/ocfs2/ioctl.c
===
--- linux.orig/fs/ocfs2/ioctl.c
+++ linux/fs/ocfs2/ioctl.c
@@ -111,9 +111,9 @@ bail:
return status;
 }
 
-int ocfs2_ioctl(struct inode * inode, struct file * filp,
-   unsigned int cmd, unsigned long arg)
+long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
+   struct inode *inode = filp->f_path.dentry->d_inode;
unsigned int flags;
int status;
struct ocfs2_space_resv sr;
@@ -148,9 +148,6 @@ int ocfs2_ioctl(struct inode * inode, st
 #ifdef CONFIG_COMPAT
 long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   int ret;
-
switch (cmd) {
case OCFS2_IOC32_GETFLAGS:
cmd = OCFS2_IOC_GETFLAGS;
@@ -167,9 +164,6 @@ long ocfs2_compat_ioctl(struct file *fil
return -ENOIOCTLCMD;
}
 
-   lock_kernel();
-   ret = ocfs2_ioctl(inode, file, cmd, arg);
-   unlock_kernel();
-   return ret;
+   return ocfs2_ioctl(file, cmd, arg);
 }
 #endif
Index: linux/fs/ocfs2/ioctl.h
===
--- linux.orig/fs/ocfs2/ioctl.h
+++ linux/fs/ocfs2/ioctl.h
@@ -10,8 +10,7 @@
 #ifndef OCFS2_IOCTL_H
 #define OCFS2_IOCTL_H
 
-int ocfs2_ioctl(struct inode * inode, struct file * filp,
-   unsigned int cmd, unsigned long arg);
+long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
 long ocfs2_compat_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
 #endif /* OCFS2_IOCTL_H */
Index: linux/fs/ocfs2/file.c
===
--- linux.orig/fs/ocfs2/file.c
+++ linux/fs/ocfs2/file.c
@@ -2212,7 +2212,7 @@ const struct file_operations ocfs2_fops 
.open   = ocfs2_file_open,
.aio_read   = ocfs2_file_aio_read,
.aio_write  = ocfs2_file_aio_write,
-   .ioctl  = ocfs2_ioctl,
+   .unlocked_ioctl = ocfs2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
@@ -2224,7 +2224,7 @@ const struct file_operations ocfs2_dops 
.read   = generic_read_dir,
.readdir= ocfs2_readdir,
.fsync  = ocfs2_sync_file,
-   .ioctl  = ocfs2_ioctl,
+   .unlocked_ioctl = ocfs2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [13/18] BKL-removal: Add compat_ioctl for cifs

2008-01-26 Thread Andi Kleen

Similar to the compat handlers of other file systems. The ioctls
are compatible except that they have different numbers.

Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/cifs/cifsfs.c |   15 +++
 fs/cifs/cifsfs.h |2 ++
 fs/cifs/ioctl.c  |   19 +++
 3 files changed, 36 insertions(+)

Index: linux/fs/cifs/cifsfs.c
===
--- linux.orig/fs/cifs/cifsfs.c
+++ linux/fs/cifs/cifsfs.c
@@ -626,6 +626,9 @@ const struct file_operations cifs_file_o
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_POSIX
.unlocked_ioctl = cifs_ioctl,
+#ifdef CONFIG_COMPAT
+   .compat_ioctl = cifs_compat_ioctl,
+#endif
 #endif /* CONFIG_CIFS_POSIX */
 
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -646,6 +649,9 @@ const struct file_operations cifs_file_d
.splice_read = generic_file_splice_read,
 #ifdef CONFIG_CIFS_POSIX
.unlocked_ioctl = cifs_ioctl,
+#ifdef CONFIG_COMPAT
+   .compat_ioctl = cifs_compat_ioctl,
+#endif
 #endif /* CONFIG_CIFS_POSIX */
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -666,6 +672,9 @@ const struct file_operations cifs_file_n
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_POSIX
.unlocked_ioctl = cifs_ioctl,
+#ifdef CONFIG_COMPAT
+   .compat_ioctl = cifs_compat_ioctl,
+#endif
 #endif /* CONFIG_CIFS_POSIX */
 
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -685,6 +694,9 @@ const struct file_operations cifs_file_d
.splice_read = generic_file_splice_read,
 #ifdef CONFIG_CIFS_POSIX
.unlocked_ioctl = cifs_ioctl,
+#ifdef CONFIG_COMPAT
+   .compat_ioctl = cifs_compat_ioctl,
+#endif
 #endif /* CONFIG_CIFS_POSIX */
.llseek = cifs_llseek,
 #ifdef CONFIG_CIFS_EXPERIMENTAL
@@ -700,6 +712,9 @@ const struct file_operations cifs_dir_op
.dir_notify = cifs_dir_notify,
 #endif /* CONFIG_CIFS_EXPERIMENTAL */
.unlocked_ioctl = cifs_ioctl,
+#ifdef CONFIG_COMPAT
+   .compat_ioctl = cifs_compat_ioctl,
+#endif
 };
 
 static void
Index: linux/fs/cifs/cifsfs.h
===
--- linux.orig/fs/cifs/cifsfs.h
+++ linux/fs/cifs/cifsfs.h
@@ -101,6 +101,8 @@ extern ssize_t  cifs_getxattr(struct dent
 extern ssize_t cifs_listxattr(struct dentry *, char *, size_t);
 extern long cifs_ioctl(struct file *filep, unsigned int command,
unsigned long arg);
+extern long cifs_compat_ioctl(struct file *filep, unsigned int command,
+   unsigned long arg);
 
 #ifdef CONFIG_CIFS_EXPERIMENTAL
 extern const struct export_operations cifs_export_ops;
Index: linux/fs/cifs/ioctl.c
===
--- linux.orig/fs/cifs/ioctl.c
+++ linux/fs/cifs/ioctl.c
@@ -108,3 +108,22 @@ long cifs_ioctl(struct file *filep, unsi
FreeXid(xid);
return rc;
 }
+
+#ifdef CONFIG_COMPAT
+#define  FS_IOC_GETFLAGS32_IOR('f', 1, int)
+#define  FS_IOC_SETFLAGS32_IOR('f', 2, int)
+
+long cifs_compat_ioctl(struct file *f, unsigned int command, unsigned long arg)
+{
+   /* Native ioctl is just mistyped, the real type is always int */
+   switch (command) {
+   case FS_IOC_GETFLAGS32:
+   command = FS_IOC_GETFLAGS;
+   break;
+   case FS_IOC_SETFLAGS32:
+   command = FS_IOC_SETFLAGS;
+   break;
+   }
+   return cifs_ioctl(f, command, arg);
+}
+#endif
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [7/18] BKL-removal: Remove incorrect comments refering to BKL from ext4

2008-01-26 Thread Andi Kleen

BKL is not hold in any of those

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/ext4/dir.c   |2 +-
 fs/ext4/inode.c |1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

Index: linux/fs/ext4/dir.c
===
--- linux.orig/fs/ext4/dir.c
+++ linux/fs/ext4/dir.c
@@ -46,7 +46,7 @@ const struct file_operations ext4_dir_op
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
-   .fsync  = ext4_sync_file,   /* BKL held */
+   .fsync  = ext4_sync_file,
.release= ext4_release_dir,
 };
 
Index: linux/fs/ext4/inode.c
===
--- linux.orig/fs/ext4/inode.c
+++ linux/fs/ext4/inode.c
@@ -778,7 +778,6 @@ err_out:
  *
  * `handle' can be NULL if create == 0.
  *
- * The BKL may not be held on entry here.  Be sure to take it early.
  * return > 0, # of blocks mapped or allocated.
  * return = 0, if plain lookup failed.
  * return < 0, error case.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-26 Thread Andi Kleen

- Replace remote_llseek with remote_llseek_unlocked (to force compilation 
failures in all users)
- Change all users to either use remote_llseek directly or take the
BKL around. I changed the file systems who don't use the BKL
for anything (CIFS, GFS) to call it directly. NCPFS and SMBFS and NFS
take the BKL, but explicitely in their own source now.

I moved them all over in a single patch to avoid unbisectable sections.

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/cifs/cifsfs.c   |2 +-
 fs/gfs2/ops_file.c |4 ++--
 fs/ncpfs/file.c|   12 +++-
 fs/nfs/file.c  |6 +-
 fs/read_write.c|7 +++
 fs/smbfs/file.c|   11 ++-
 include/linux/fs.h |3 ++-
 7 files changed, 34 insertions(+), 11 deletions(-)

Index: linux/fs/cifs/cifsfs.c
===
--- linux.orig/fs/cifs/cifsfs.c
+++ linux/fs/cifs/cifsfs.c
@@ -549,7 +549,7 @@ static loff_t cifs_llseek(struct file *f
if (retval < 0)
return (loff_t)retval;
}
-   return remote_llseek(file, offset, origin);
+   return remote_llseek_unlocked(file, offset, origin);
 }
 
 static struct file_system_type cifs_fs_type = {
Index: linux/fs/gfs2/ops_file.c
===
--- linux.orig/fs/gfs2/ops_file.c
+++ linux/fs/gfs2/ops_file.c
@@ -107,11 +107,11 @@ static loff_t gfs2_llseek(struct file *f
error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY,
   &i_gh);
if (!error) {
-   error = remote_llseek(file, offset, origin);
+   error = remote_llseek_unlocked(file, offset, origin);
gfs2_glock_dq_uninit(&i_gh);
}
} else
-   error = remote_llseek(file, offset, origin);
+   error = remote_llseek_unlocked(file, offset, origin);
 
return error;
 }
Index: linux/fs/ncpfs/file.c
===
--- linux.orig/fs/ncpfs/file.c
+++ linux/fs/ncpfs/file.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include "ncplib_kernel.h"
@@ -281,9 +282,18 @@ static int ncp_release(struct inode *ino
return 0;
 }
 
+static loff_t ncp_remote_llseek(struct file *file, loff_t offset, int origin)
+{
+   loff_t ret;
+   lock_kernel();
+   ret = remote_llseek_unlocked(file, offset, origin);
+   unlock_kernel();
+   return ret;
+}
+
 const struct file_operations ncp_file_operations =
 {
-   .llseek = remote_llseek,
+   .llseek = ncp_remote_llseek,
.read   = ncp_file_read,
.write  = ncp_file_write,
.ioctl  = ncp_ioctl,
Index: linux/fs/read_write.c
===
--- linux.orig/fs/read_write.c
+++ linux/fs/read_write.c
@@ -58,11 +58,10 @@ loff_t generic_file_llseek(struct file *
 
 EXPORT_SYMBOL(generic_file_llseek);
 
-loff_t remote_llseek(struct file *file, loff_t offset, int origin)
+loff_t remote_llseek_unlocked(struct file *file, loff_t offset, int origin)
 {
long long retval;
 
-   lock_kernel();
switch (origin) {
case SEEK_END:
offset += i_size_read(file->f_path.dentry->d_inode);
@@ -73,15 +72,15 @@ loff_t remote_llseek(struct file *file, 
retval = -EINVAL;
if (offset>=0 && 
offset<=file->f_path.dentry->d_inode->i_sb->s_maxbytes) {
if (offset != file->f_pos) {
+   /* AK: do we need a lock for those? */
file->f_pos = offset;
file->f_version = 0;
}
retval = offset;
}
-   unlock_kernel();
return retval;
 }
-EXPORT_SYMBOL(remote_llseek);
+EXPORT_SYMBOL(remote_llseek_unlocked);
 
 loff_t no_llseek(struct file *file, loff_t offset, int origin)
 {
Index: linux/fs/smbfs/file.c
===
--- linux.orig/fs/smbfs/file.c
+++ linux/fs/smbfs/file.c
@@ -422,9 +422,18 @@ smb_file_permission(struct inode *inode,
return error;
 }
 
+static loff_t smb_remote_llseek(struct file *file, loff_t offset, int origin)
+{
+   loff_t ret;
+   lock_kernel();
+   ret = remote_llseek_unlocked(file, offset, origin);
+   unlock_kernel();
+   return ret;
+}
+
 const struct file_operations smb_file_operations =
 {
-   .llseek = remote_llseek,
+   .llseek = smb_remote_llseek,
.read   = do_sync_read,
.aio_read   = smb_file_aio_re

[PATCH] [16/18] BKL-removal: Convert socket fasync to unlocked_fasync

2008-01-26 Thread Andi Kleen

Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 net/socket.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/net/socket.c
===
--- linux.orig/net/socket.c
+++ linux/net/socket.c
@@ -131,7 +131,7 @@ static const struct file_operations sock
.mmap = sock_mmap,
.open = sock_no_open,   /* special open code to disallow open 
via /proc */
.release =  sock_close,
-   .fasync =   sock_fasync,
+   .unlocked_fasync = sock_fasync,
.sendpage = sock_sendpage,
.splice_write = generic_splice_sendpage,
 };
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [17/18] BKL-removal: Convert fuse to unlocked_fasync

2008-01-26 Thread Andi Kleen

Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/fuse/dev.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/fs/fuse/dev.c
===
--- linux.orig/fs/fuse/dev.c
+++ linux/fs/fuse/dev.c
@@ -1040,7 +1040,7 @@ const struct file_operations fuse_dev_op
.aio_write  = fuse_dev_write,
.poll   = fuse_dev_poll,
.release= fuse_dev_release,
-   .fasync = fuse_dev_fasync,
+   .unlocked_fasync = fuse_dev_fasync,
 };
 
 static struct miscdevice fuse_miscdevice = {
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [18/18] BKL-removal: Convert bad_inode to unlocked_fasync

2008-01-26 Thread Andi Kleen

Not that it matters much, but it was easy.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/bad_inode.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/fs/bad_inode.c
===
--- linux.orig/fs/bad_inode.c
+++ linux/fs/bad_inode.c
@@ -174,7 +174,7 @@ static const struct file_operations bad_
.release= bad_file_release,
.fsync  = bad_file_fsync,
.aio_fsync  = bad_file_aio_fsync,
-   .fasync = bad_file_fasync,
+   .unlocked_fasync = bad_file_fasync,
.lock   = bad_file_lock,
.sendpage   = bad_file_sendpage,
.get_unmapped_area = bad_file_get_unmapped_area,
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [14/18] BKL-removal: Add unlocked_fasync

2008-01-26 Thread Andi Kleen

Add a new fops entry point to allow fasync without BKL. While it's arguably
unclear this entry point is called often enough for it really matters
it was still relatively easy to do. And there are far less async users
in the tree than ioctls so it's likely they can be all converted 
eventually and then the non unlocked async entry point could be dropped.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 Documentation/filesystems/vfs.txt |5 -
 fs/fcntl.c|6 +-
 fs/ioctl.c|5 -
 include/linux/fs.h|1 +
 4 files changed, 14 insertions(+), 3 deletions(-)

Index: linux/fs/fcntl.c
===
--- linux.orig/fs/fcntl.c
+++ linux/fs/fcntl.c
@@ -240,11 +240,15 @@ static int setfl(int fd, struct file * f
 
lock_kernel();
if ((arg ^ filp->f_flags) & FASYNC) {
-   if (filp->f_op && filp->f_op->fasync) {
+   if (filp->f_op && filp->f_op->unlocked_fasync)
+   error = filp->f_op->unlocked_fasync(fd, filp,
+   !!(arg & FASYNC));
+   else if (filp->f_op && filp->f_op->fasync) {
error = filp->f_op->fasync(fd, filp, (arg & FASYNC) != 
0);
if (error < 0)
goto out;
}
+   /* AK: no else error = -EINVAL here? */
}
 
filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
Index: linux/fs/ioctl.c
===
--- linux.orig/fs/ioctl.c
+++ linux/fs/ioctl.c
@@ -118,7 +118,10 @@ int vfs_ioctl(struct file *filp, unsigne
 
/* Did FASYNC state change ? */
if ((flag ^ filp->f_flags) & FASYNC) {
-   if (filp->f_op && filp->f_op->fasync) {
+   if (filp->f_op && filp->f_op->unlocked_fasync)
+   error = filp->f_op->unlocked_fasync(fd,
+   filp, on);
+   else if (filp->f_op && filp->f_op->fasync) {
lock_kernel();
error = filp->f_op->fasync(fd, filp, 
on);
unlock_kernel();
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -1179,6 +1179,7 @@ struct file_operations {
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
+   int (*unlocked_fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t 
*, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, 
unsigned long, unsigned long, unsigned long);
Index: linux/Documentation/filesystems/vfs.txt
===
--- linux.orig/Documentation/filesystems/vfs.txt
+++ linux/Documentation/filesystems/vfs.txt
@@ -769,6 +769,7 @@ struct file_operations {
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
+   int (*unlocked_fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, 
loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, 
loff_t *);
@@ -828,7 +829,9 @@ otherwise noted.
   fsync: called by the fsync(2) system call
 
   fasync: called by the fcntl(2) system call when asynchronous
-   (non-blocking) mode is enabled for a file
+   (non-blocking) mode is enabled for a file. BKL hold
+
+  unlocked_fasync: like fasync, but without BKL
 
   lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
commands
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [15/18] BKL-removal: Convert pipe over to unlocked_fasync

2008-01-26 Thread Andi Kleen

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/pipe.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/fs/pipe.c
===
--- linux.orig/fs/pipe.c
+++ linux/fs/pipe.c
@@ -788,7 +788,7 @@ const struct file_operations read_fifo_f
.ioctl  = pipe_ioctl,
.open   = pipe_read_open,
.release= pipe_read_release,
-   .fasync = pipe_read_fasync,
+   .unlocked_fasync = pipe_read_fasync,
 };
 
 const struct file_operations write_fifo_fops = {
@@ -800,7 +800,7 @@ const struct file_operations write_fifo_
.ioctl  = pipe_ioctl,
.open   = pipe_write_open,
.release= pipe_write_release,
-   .fasync = pipe_write_fasync,
+   .unlocked_fasync = pipe_write_fasync,
 };
 
 const struct file_operations rdwr_fifo_fops = {
@@ -813,7 +813,7 @@ const struct file_operations rdwr_fifo_f
.ioctl  = pipe_ioctl,
.open   = pipe_rdwr_open,
.release= pipe_rdwr_release,
-   .fasync = pipe_rdwr_fasync,
+   .unlocked_fasync = pipe_rdwr_fasync,
 };
 
 static const struct file_operations read_pipe_fops = {
@@ -825,7 +825,7 @@ static const struct file_operations read
.ioctl  = pipe_ioctl,
.open   = pipe_read_open,
.release= pipe_read_release,
-   .fasync = pipe_read_fasync,
+   .unlocked_fasync = pipe_read_fasync,
 };
 
 static const struct file_operations write_pipe_fops = {
@@ -837,7 +837,7 @@ static const struct file_operations writ
.ioctl  = pipe_ioctl,
.open   = pipe_write_open,
.release= pipe_write_release,
-   .fasync = pipe_write_fasync,
+   .unlocked_fasync = pipe_write_fasync,
 };
 
 static const struct file_operations rdwr_pipe_fops = {
@@ -850,7 +850,7 @@ static const struct file_operations rdwr
.ioctl  = pipe_ioctl,
.open   = pipe_rdwr_open,
.release= pipe_rdwr_release,
-   .fasync = pipe_rdwr_fasync,
+   .unlocked_fasync = pipe_rdwr_fasync,
 };
 
 struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [10/18] BKL-removal: Implement a compat_ioctl handler for JFS

2008-01-26 Thread Andi Kleen

The ioctls were already compatible except for the actual values so this
was fairly easy to do.

Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/jfs/file.c   |3 +++
 fs/jfs/ioctl.c  |   18 ++
 fs/jfs/jfs_dinode.h |2 ++
 fs/jfs/jfs_inode.h  |1 +
 fs/jfs/namei.c  |3 +++
 5 files changed, 27 insertions(+)

Index: linux/fs/jfs/file.c
===
--- linux.orig/fs/jfs/file.c
+++ linux/fs/jfs/file.c
@@ -113,4 +113,7 @@ const struct file_operations jfs_file_op
.fsync  = jfs_fsync,
.release= jfs_release,
.unlocked_ioctl = jfs_ioctl,
+#ifdef CONFIG_COMPAT
+   .compat_ioctl   = jfs_compat_ioctl,
+#endif
 };
Index: linux/fs/jfs/ioctl.c
===
--- linux.orig/fs/jfs/ioctl.c
+++ linux/fs/jfs/ioctl.c
@@ -117,3 +117,21 @@ long jfs_ioctl(struct file *filp, unsign
}
 }
 
+#ifdef CONFIG_COMPAT
+long jfs_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+   /* While these ioctl numbers defined with 'long' and have different
+* numbers than the 64bit ABI,
+* the actual implementation only deals with ints and is compatible.
+*/
+   switch (cmd) {
+   case JFS_IOC_GETFLAGS32:
+   cmd = JFS_IOC_GETFLAGS;
+   break;
+   case JFS_IOC_SETFLAGS32:
+   cmd = JFS_IOC_SETFLAGS;
+   break;
+   }
+   return jfs_ioctl(filp, cmd, arg);
+}
+#endif
Index: linux/fs/jfs/jfs_inode.h
===
--- linux.orig/fs/jfs/jfs_inode.h
+++ linux/fs/jfs/jfs_inode.h
@@ -23,6 +23,7 @@ struct fid;
 extern struct inode *ialloc(struct inode *, umode_t);
 extern int jfs_fsync(struct file *, struct dentry *, int);
 extern long jfs_ioctl(struct file *, unsigned int, unsigned long);
+extern long jfs_compat_ioctl(struct file *, unsigned int, unsigned long);
 extern void jfs_read_inode(struct inode *);
 extern int jfs_commit_inode(struct inode *, int);
 extern int jfs_write_inode(struct inode*, int);
Index: linux/fs/jfs/namei.c
===
--- linux.orig/fs/jfs/namei.c
+++ linux/fs/jfs/namei.c
@@ -1563,6 +1563,9 @@ const struct file_operations jfs_dir_ope
.readdir= jfs_readdir,
.fsync  = jfs_fsync,
.unlocked_ioctl = jfs_ioctl,
+#ifdef CONFIG_COMPAT
+   .compat_ioctl   = jfs_compat_ioctl,
+#endif
 };
 
 static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
Index: linux/fs/jfs/jfs_dinode.h
===
--- linux.orig/fs/jfs/jfs_dinode.h
+++ linux/fs/jfs/jfs_dinode.h
@@ -170,5 +170,7 @@ struct dinode {
 #define JFS_IOC_GETFLAGS   _IOR('f', 1, long)
 #define JFS_IOC_SETFLAGS   _IOW('f', 2, long)
 
+#define JFS_IOC_GETFLAGS32 _IOR('f', 1, int)
+#define JFS_IOC_SETFLAGS32 _IOW('f', 2, int)
 
 #endif /*_H_JFS_DINODE */
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [9/18] BKL-removal: Use unlocked_ioctl for jfs

2008-01-26 Thread Andi Kleen

Convert jfs_ioctl over to not use the BKL. The only potential race
I could see was with two ioctls in parallel changing the flags
and losing the updates. Use the i_mutex to protect against this.

Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/jfs/file.c  |2 +-
 fs/jfs/ioctl.c |   13 ++---
 fs/jfs/jfs_inode.h |3 +--
 fs/jfs/namei.c |2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

Index: linux/fs/jfs/ioctl.c
===
--- linux.orig/fs/jfs/ioctl.c
+++ linux/fs/jfs/ioctl.c
@@ -51,9 +51,9 @@ static long jfs_map_ext2(unsigned long f
 }
 
 
-int jfs_ioctl(struct inode * inode, struct file * filp, unsigned int cmd,
-   unsigned long arg)
+long jfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
+   struct inode *inode = filp->f_dentry->d_inode;
struct jfs_inode_info *jfs_inode = JFS_IP(inode);
unsigned int flags;
 
@@ -82,6 +82,10 @@ int jfs_ioctl(struct inode * inode, stru
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode))
return -EPERM;
+
+   /* Lock against other parallel changes of flags */
+   mutex_lock(&inode->i_mutex);
+
jfs_get_inode_flags(jfs_inode);
oldflags = jfs_inode->mode2;
 
@@ -92,8 +96,10 @@ int jfs_ioctl(struct inode * inode, stru
if ((oldflags & JFS_IMMUTABLE_FL) ||
((flags ^ oldflags) &
(JFS_APPEND_FL | JFS_IMMUTABLE_FL))) {
-   if (!capable(CAP_LINUX_IMMUTABLE))
+   if (!capable(CAP_LINUX_IMMUTABLE)) {
+   mutex_unlock(&inode->i_mutex);
return -EPERM;
+   }
}
 
flags = flags & JFS_FL_USER_MODIFIABLE;
@@ -101,6 +107,7 @@ int jfs_ioctl(struct inode * inode, stru
jfs_inode->mode2 = flags;
 
jfs_set_inode_flags(inode);
+   mutex_unlock(&inode->i_mutex);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
return 0;
Index: linux/fs/jfs/file.c
===
--- linux.orig/fs/jfs/file.c
+++ linux/fs/jfs/file.c
@@ -112,5 +112,5 @@ const struct file_operations jfs_file_op
.splice_write   = generic_file_splice_write,
.fsync  = jfs_fsync,
.release= jfs_release,
-   .ioctl  = jfs_ioctl,
+   .unlocked_ioctl = jfs_ioctl,
 };
Index: linux/fs/jfs/jfs_inode.h
===
--- linux.orig/fs/jfs/jfs_inode.h
+++ linux/fs/jfs/jfs_inode.h
@@ -22,8 +22,7 @@ struct fid;
 
 extern struct inode *ialloc(struct inode *, umode_t);
 extern int jfs_fsync(struct file *, struct dentry *, int);
-extern int jfs_ioctl(struct inode *, struct file *,
-   unsigned int, unsigned long);
+extern long jfs_ioctl(struct file *, unsigned int, unsigned long);
 extern void jfs_read_inode(struct inode *);
 extern int jfs_commit_inode(struct inode *, int);
 extern int jfs_write_inode(struct inode*, int);
Index: linux/fs/jfs/namei.c
===
--- linux.orig/fs/jfs/namei.c
+++ linux/fs/jfs/namei.c
@@ -1562,7 +1562,7 @@ const struct file_operations jfs_dir_ope
.read   = generic_read_dir,
.readdir= jfs_readdir,
.fsync  = jfs_fsync,
-   .ioctl  = jfs_ioctl,
+   .unlocked_ioctl = jfs_ioctl,
 };
 
 static int jfs_ci_hash(struct dentry *dir, struct qstr *this)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [3/18] BKL-removal: Convert ext3 to use unlocked_ioctl

2008-01-26 Thread Andi Kleen

I checked ext3_ioctl and it looked largely safe to not be used
without BKL.  So convert it over to unlocked_ioctl.

The only case where I wasn't quite sure was for the
dynamic fs grow ioctls versus umounting -- I kept the BKL for those. 

Cc: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/ext3/dir.c   |2 +-
 fs/ext3/file.c  |2 +-
 fs/ext3/ioctl.c |   21 +++--
 include/linux/ext3_fs.h |3 +--
 4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux/fs/ext3/dir.c
===
--- linux.orig/fs/ext3/dir.c
+++ linux/fs/ext3/dir.c
@@ -42,7 +42,7 @@ const struct file_operations ext3_dir_op
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext3_readdir, /* we take BKL. needed?*/
-   .ioctl  = ext3_ioctl,   /* BKL held */
+   .unlocked_ioctl = ext3_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext3_compat_ioctl,
 #endif
Index: linux/fs/ext3/file.c
===
--- linux.orig/fs/ext3/file.c
+++ linux/fs/ext3/file.c
@@ -112,7 +112,7 @@ const struct file_operations ext3_file_o
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = ext3_file_write,
-   .ioctl  = ext3_ioctl,
+   .unlocked_ioctl = ext3_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext3_compat_ioctl,
 #endif
Index: linux/fs/ext3/ioctl.c
===
--- linux.orig/fs/ext3/ioctl.c
+++ linux/fs/ext3/ioctl.c
@@ -17,9 +17,9 @@
 #include 
 #include 
 
-int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
-   unsigned long arg)
+long ext3_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
+   struct inode *inode = filp->f_dentry->d_inode;
struct ext3_inode_info *ei = EXT3_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
@@ -224,10 +224,14 @@ flags_err:
if (get_user(n_blocks_count, (__u32 __user *)arg))
return -EFAULT;
 
+   /* AK: not sure the BKL is needed, but this might prevent
+* races against umount */
+   lock_kernel();
err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
journal_lock_updates(EXT3_SB(sb)->s_journal);
journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
+   unlock_kernel();
 
return err;
}
@@ -245,11 +249,14 @@ flags_err:
if (copy_from_user(&input, (struct ext3_new_group_input __user 
*)arg,
sizeof(input)))
return -EFAULT;
-
+   /* AK: not sure the BKL is needed, but this might prevent
+* races against umount */
+   lock_kernel();
err = ext3_group_add(sb, &input);
journal_lock_updates(EXT3_SB(sb)->s_journal);
journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
+   unlock_kernel();
 
return err;
}
@@ -263,9 +270,6 @@ flags_err:
 #ifdef CONFIG_COMPAT
 long ext3_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   int ret;
-
/* These are just misnamed, they actually get/put from/to user an int */
switch (cmd) {
case EXT3_IOC32_GETFLAGS:
@@ -305,9 +309,6 @@ long ext3_compat_ioctl(struct file *file
default:
return -ENOIOCTLCMD;
}
-   lock_kernel();
-   ret = ext3_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg));
-   unlock_kernel();
-   return ret;
+   return ext3_ioctl(file, cmd, (unsigned long) compat_ptr(arg));
 }
 #endif
Index: linux/include/linux/ext3_fs.h
===
--- linux.orig/include/linux/ext3_fs.h
+++ linux/include/linux/ext3_fs.h
@@ -838,8 +838,7 @@ extern void ext3_get_inode_flags(struct 
 extern void ext3_set_aops(struct inode *inode);
 
 /* ioctl.c */
-extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
-  unsigned long);
+extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext3_compat_ioctl (struct file *, unsigned int, unsigned long);
 
 /* namei.c */
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [0/18] Implement some low hanging BKL removal fruit in fs/*

2008-01-26 Thread Andi Kleen

[Andrew: I believe this is -mm material for .25]

- Convert some more file systems (generally those who don't use the BKL
for anything except mount) to use unlocked_bkl.
- Implement BKL less fasync (see patch for the rationale) 
This is currently a separate entry point, but since the number of fasync
users in the tree is relatively small I hope the older entry point can
be removed then in the not too far future
[help from other people converting more fasync users to unlocked_fasync
would be appreciated]
- Implement BKL less remote_llseek
- While I was at it I also added a few missing compat ioctl handlers
- Fix a few comments

This fixes a lot of relatively trivial BKL users in fs/*. The main 
remaining non legacy offenders are now locks.c, nfs/nfsd and reiserfs.
I believe BKL removal for all of those is being worked on by other people.
Also a lot of "legacy" file systems still use it, but converting those
does not seem to be very pressing.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [1/18] BKL-removal: Convert ext2 over to use unlocked_ioctl

2008-01-26 Thread Andi Kleen

I checked ext2_ioctl and could not find anything in there that would
need the BKL. So convert it over to use unlocked_ioctl

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/ext2/dir.c   |2 +-
 fs/ext2/ext2.h  |3 +--
 fs/ext2/file.c  |4 ++--
 fs/ext2/ioctl.c |   12 +++-
 4 files changed, 7 insertions(+), 14 deletions(-)

Index: linux/fs/ext2/dir.c
===
--- linux.orig/fs/ext2/dir.c
+++ linux/fs/ext2/dir.c
@@ -703,7 +703,7 @@ const struct file_operations ext2_dir_op
.llseek = generic_file_llseek,
.read   = generic_read_dir,
.readdir= ext2_readdir,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
Index: linux/fs/ext2/ext2.h
===
--- linux.orig/fs/ext2/ext2.h
+++ linux/fs/ext2/ext2.h
@@ -139,8 +139,7 @@ int __ext2_write_begin(struct file *file
struct page **pagep, void **fsdata);
 
 /* ioctl.c */
-extern int ext2_ioctl (struct inode *, struct file *, unsigned int,
-  unsigned long);
+extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext2_compat_ioctl(struct file *, unsigned int, unsigned long);
 
 /* namei.c */
Index: linux/fs/ext2/file.c
===
--- linux.orig/fs/ext2/file.c
+++ linux/fs/ext2/file.c
@@ -48,7 +48,7 @@ const struct file_operations ext2_file_o
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = generic_file_aio_write,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
@@ -65,7 +65,7 @@ const struct file_operations ext2_xip_fi
.llseek = generic_file_llseek,
.read   = xip_file_read,
.write  = xip_file_write,
-   .ioctl  = ext2_ioctl,
+   .unlocked_ioctl = ext2_ioctl,
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
Index: linux/fs/ext2/ioctl.c
===
--- linux.orig/fs/ext2/ioctl.c
+++ linux/fs/ext2/ioctl.c
@@ -17,9 +17,9 @@
 #include 
 
 
-int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
-   unsigned long arg)
+long ext2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
+   struct inode *inode = filp->f_dentry->d_inode;
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
@@ -141,9 +141,6 @@ int ext2_ioctl (struct inode * inode, st
 #ifdef CONFIG_COMPAT
 long ext2_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   int ret;
-
/* These are just misnamed, they actually get/put from/to user an int */
switch (cmd) {
case EXT2_IOC32_GETFLAGS:
@@ -161,9 +158,6 @@ long ext2_compat_ioctl(struct file *file
default:
return -ENOIOCTLCMD;
}
-   lock_kernel();
-   ret = ext2_ioctl(inode, file, cmd, (unsigned long) compat_ptr(arg));
-   unlock_kernel();
-   return ret;
+   return ext2_ioctl(file, cmd, (unsigned long) compat_ptr(arg));
 }
 #endif
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [2/18] BKL-removal: Remove incorrect BKL comment in ext2

2008-01-26 Thread Andi Kleen

No BKL used anywhere, so don't mention it.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

---
 fs/ext2/inode.c |1 -
 1 file changed, 1 deletion(-)

Index: linux/fs/ext2/inode.c
===
--- linux.orig/fs/ext2/inode.c
+++ linux/fs/ext2/inode.c
@@ -569,7 +569,6 @@ static void ext2_splice_branch(struct in
  *
  * `handle' can be NULL if create == 0.
  *
- * The BKL may not be held on entry here.  Be sure to take it early.
  * return > 0, # of blocks mapped or allocated.
  * return = 0, if plain lookup failed.
  * return < 0, error case.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Andi Kleen
Jens Axboe <[EMAIL PROTECTED]> writes:
>
> So how does it work? Instead of punting IO to a thread and passing it
> through the page cache, we instead attempt to send the IO directly to the

Great -- something like this was needed for a long time.

> - The file block mappings must not change while loop is using the file.
>   This means that we have to ensure exclusive access to the file and
>   this is the bit that is currently missing in the implementation. It
>   would be nice if we could just do this via open(), ideas welcome...

get_write_access()/put_write_access() will block other writers.

But as pointed out by others that is not enough for this.

I suppose you could use a white list like a special flag for file systems 
(like ext2/ext3) that do not reallocate blocks.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Remove BKL from fs/locks.c

2007-12-30 Thread Andi Kleen
> The only problem I can see from an NFS perspective is with NFSv2/v3
> locking: unfortunately the protocol provides no way for the server to
> notify that a lock may not be granted after the client has been told to
> block. You would therefore have to bend the protocol rules by simply
> delaying replying to the client until the deadlock timeout occurred
> instead of telling it to block. I'm not sure that all clients would be
> able to cope...

If the delay is short enough (let's say < 2 jiffies) that should be surely no 
problem? 
If they couldn't deal with that they couldn't deal with a congested network 
either.

Otherwise lockd could just force a 0 timeout.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Remove BKL from fs/locks.c

2007-12-30 Thread Andi Kleen
Matthew Wilcox <[EMAIL PROTECTED]> writes:
>
> The blocked_list is a bit more complex since we need to check every lock
> on the blocked list, and would need to acquire all the sb_file_lock_locks
> to check this list consistently.  I don't see a nice way to do this --
> particularly when you consider that we need to run this check every time
> someone takes out a POSIX lock that blocks on another lock.

Have you considered using a timeout approach? e.g. just start a timer
when aquiring the lock and when you can't get it in some short (user 
configurable) 
time and only do then the expensive deadlock check. Timers are quite optimized
and have per cpu state so they should be cheap enough.

AFAIK that's a standard technique used in databases. Advantage is that it 
keeps all that out of the fast path.

Disadvantage is that it takes at least timeout time to detect a deadlock, but 
they should be infrequent anyways I guess so it's hopefully not a problem
(and if it was the user could reset the timeout to 0) 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Beagle and logging inotify events

2007-11-14 Thread Andi Kleen
"Jon Smirl" <[EMAIL PROTECTED]> writes:

> On 11/14/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
>> On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:
>> > Is it feasible to do something like this in the linux file system
>> > architecture?
>> >
>> > Beagle beats on my disk for an hour when I reboot. Of course I don't
>> > like that and I shut Beagle off.
>>
>> Leopard, by the way, does exactly this: it has a daemon that starts
>> at boot time and taps FSEvents then journals file system changes to a
>> well-known file on local disk.
>
> Logging file systems have all of the needed info.

Actually most journaling file systems in Linux use block logging and
it would be probably hard to get specific file names out of a random
collection of logged blocks. And even if you could they would
hit a lot of false positives since everything is rounded up
to block level.

With intent logging like in XFS/JFS it would be easier, but even
then costly :- e.g. they might log changes to the inode but
there is no back pointer to the file name short of searching the
whole directory tree.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Andi Kleen
Jens Axboe <[EMAIL PROTECTED]> writes:
> 
> Writing a small test module to exercise slub/slab in various ways
> (allocating from all cpus freeing from one, as described) should not be
> too hard. Perhaps that would be enough to find this performance
> discrepancy between slab and slub?

You could simulate that by just sending packets using unix sockets 
between threads bound to different CPUs. Sending a packet allocates; receiving 
deallocates.

But it's not clear that will really simulate the cache bounce environment
of the database test. I don't think all passing of data between CPUs 
using slub objects is slow.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/17] [RFC] Virtual Compound Page Support

2007-09-19 Thread Andi Kleen
Christoph Lameter <[EMAIL PROTECTED]> writes:

It seems like a good idea simply because the same functionality
is already open coded in a couple of places and unifying
that would be a good thing. But ...
 
> The patchset provides this functionality in stages. Stage 1 introduces
> the basic fall back mechanism necessary to replace vmalloc allocations
> with
> 
>   alloc_page(GFP_VFALLBACK, order, )

Is there a reason this needs to be a GFP flag versus a wrapper
around alloc_page/free_page ?  page_alloc.c is already too complicated
and it's better to keep new features separated. The only drawback
would be that free_pages would need a different call, but that
doesn't seem like a big problem.

Especially integrating it into slab would seem wrong to me.
slab is already too complicated and for users who need that
large areas page granuality rounding to pages is probably fine.

Also such a wrapper could do the old alloc_page_exact() trick:
instead of always rounding up to next order return the left over
pages to the VM. In some cases this can save significant memory.

I'm also a little dubious about your attempts to do vmalloc in
interrupt context. Is that really needed? GFP_ATOMIC allocations of
large areas seem to be extremly unreliable to me and not design. Even
if it works sometimes free probably wouldn't work there due to the
flushes, which is very nasty. It would be better to drop that.

-Andi



> which signifies to the page allocator that a higher order is to be found
> but a virtual mapping may stand in if there is an issue with fragmentation.
> 
> Stage 1 functionality does not allow allocation and freeing of virtual
> mappings from interrupt contexts.
> 
> The stage 1 series ends with the conversion of a few key uses of vmalloc
> in the VM to alloc_pages() for the allocation of sparsemems memmap table
> and the wait table in each zone. Other uses of vmalloc could be converted
> in the same way.
> 
> 
> Stage 2 functionality enhances the fallback even more allowing allocation
> and frees in interrupt context.
> 
> SLUB is then modified to use the virtual mappings for slab caches
> that are marked with SLAB_VFALLBACK. If a slab cache is marked this way
> then we drop all the restraints regarding page order and allocate
> good large memory areas that fit lots of objects so that we rarely
> have to use the slow paths.
> 
> Two slab caches--the dentry cache and the buffer_heads--are then flagged
> that way. Others could be converted in the same way.
> 
> The patch set also provides a debugging aid through setting
> 
>   CONFIG_VFALLBACK_ALWAYS
> 
> If set then all GFP_VFALLBACK allocations fall back to the virtual
> mappings. This is useful for verification tests. The test of this
> patch set was done by enabling that options and compiling a kernel.
> 
> 
> Stage 3 functionality could be the adding of support for the large
> buffer size patchset. Not done yet and not sure if it would be useful
> to do.
> 
> Much of this patchset may only be needed for special cases in which the
> existing defragmentation methods fail for some reason. It may be better to
> have the system operate without such a safety net and make sure that the
> page allocator can return large orders in a reliable way.
> 
> The initial idea for this patchset came from Nick Piggin's fsblock
> and from his arguments about reliability and guarantees. Since his
> fsblock uses the virtual mappings I think it is legitimate to
> generalize the use of virtual mappings to support higher order
> allocations in this way. The application of these ideas to the large
> block size patchset etc are straightforward. If wanted I can base
> the next rev of the largebuffer patchset on this one and implement
> fallback.
> 
> Contrary to Nick, I still doubt that any of this provides a "guarantee".
> Have said that I have to deal with various failure scenarios in the VM
> daily and I'd certainly like to see it work in a more reliable manner.
> 
> IMHO getting rid of the various workarounds to deal with the small 4k
> pages and avoiding additional layers that group these pages in subsystem
> specific ways is something that can simplify the kernel and make the
> kernel more reliable overall.
> 
> If people feel that a virtual fall back is needed then so be it. Maybe
> we can shed our security blanket later when the approaches to deal
> with fragmentation have matured.
> 
> The patch set is also available via git from the largeblock git tree via
> 
> git pull
>   git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
> vcompound
> 
> -- 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/2] 2.6.23-rc3: known regressions with patches

2007-08-13 Thread Andi Kleen
On Mon, Aug 13, 2007 at 10:24:52PM +0200, Michal Piotrowski wrote:
> On 13/08/07, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > > Unclassified
> > >
> > > Subject : reset during bootup - 2.6.23-rc2 (git d23cf676)
> >
> > This is already fixed in mainline
> 
> commit b8d3f2448b8f4ba24f301e23585547ba1acc1f04
> 
> >
> > There is a real regression with failing builds on some old binutils on
> > x86-64 know, but I don't know how to fix it.
> 
> Did you mean this?

Yes. Anyways it be likely a WONT-FIX -- aka document newer binutils
version required. We haven't had any other reports so likely
that version is not too widely used.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/2] 2.6.23-rc3: known regressions with patches

2007-08-13 Thread Andi Kleen
> Unclassified
> 
> Subject : reset during bootup - 2.6.23-rc2 (git d23cf676)

This is already fixed in mainline

There is a real regression with failing builds on some old binutils on 
x86-64 know, but I don't know how to fix it.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/3] add the fsblock layer

2007-06-25 Thread Andi Kleen
On Sun, Jun 24, 2007 at 01:18:42PM -0700, Arjan van de Ven wrote:
> 
> > Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the 
> > smallest
> > possible type for each architecture. And a couple of ugly casts for set_bit 
> > et.al.
> > but those could be also hidden in macros. Should be relatively easy to do.
> 
> or make a "smallbit" type that is small/supported, so 64 bit if 32 bit
> isn't supported, otherwise 32

That wouldn't handle the case where you only need e.g. 8 bits
That's fine for x86 too. It only hates atomic accesses crossing cache line
boundaries (but handles them too, just slow) 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/3] add the fsblock layer

2007-06-24 Thread Andi Kleen
Nick Piggin <[EMAIL PROTECTED]> writes:


[haven't read everything, just commenting on something that caught my eye]

> +struct fsblock {
> + atomic_tcount;
> + union {
> + struct {
> + unsigned long   flags; /* XXX: flags could be int for 
> better packing */

int is not supported by many architectures, but works on x86 at least.

Hmm, could define a macro DECLARE_ATOMIC_BITMAP(maxbit) that expands to the 
smallest
possible type for each architecture. And a couple of ugly casts for set_bit 
et.al.
but those could be also hidden in macros. Should be relatively easy to do.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] fsblock

2007-06-24 Thread Andi Kleen
Nick Piggin <[EMAIL PROTECTED]> writes:
> 
> - Structure packing. A page gets a number of buffer heads that are
>   allocated in a linked list. fsblocks are allocated contiguously, so
>   cacheline footprint is smaller in the above situation.

It would be interesting to test if that makes a difference for 
database benchmarks running over file systems. Databases
eat a lot of cache so in theory any cache improvements
in the kernel which often runs cache cold then should be beneficial. 

But I guess it would need at least ext2 to test; Minix is probably not
good enough.

In general have you benchmarked the CPU overhead of old vs new code? 
e.g. when we went to BIO scalability went up, but CPU costs
of a single request also went up. It would be nice to not continue
or better reverse that trend.

> - Large block support. I can mount and run an 8K block size minix3 fs on
>   my 4K page system and it didn't require anything special in the fs. We
>   can go up to about 32MB blocks now, and gigabyte+ blocks would only
>   require  one more bit in the fsblock flags. fsblock_superpage blocks
>   are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

Can it be cleanly ifdefed or optimized away?  Unless the fragmentation
problem is not solved it would seem rather pointless to me. Also I personally
still think the right way to approach this is larger softpage size.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NFS: Make NFS root work again

2007-06-17 Thread Andi Kleen
Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> What is not working, and how does this patch fix it?

FWIW i use nfs root regularly for testing kernels and it works
for me.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NILFS version 2 now available

2007-06-17 Thread Andi Kleen
[EMAIL PROTECTED] writes:

> NILFS (a New Implementation of a Log-structured Filesystem) Version 2 have
> been available at the project website
> 
> http://www.nilfs.org/
> 
> If you are interested, please visit to our website.

Could you please give some information on the use cases
for this file system. Is it another flash file system
or optimized for block devices? What do you think are
the primary use cases? Do you see it as a general purpose
file system like ext3 or targetted at some specific niche?
Is it supposed to be production ready eventually or 
more intended for research?  Does it avoid the long mount 
time problem the other Linux log file system have?

Thanks,

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Andi Kleen
Matt Mackall <[EMAIL PROTECTED]> writes:

> This is a relatively simple scheme for making a filesystem with
> incremental online consistency checks of both data and metadata.
> Overhead can be well under 1% disk space and CPU overhead may also be
> very small, while greatly improving filesystem integrity.

Problem I see is that your scheme doesn't support metadata checksums only.
IMHO those are the most interesting because they have the potential
to be basically zero cost, unlike full data checksumming.
And doing metadata checksums is enough to handle the fsck
problem.

I'm sure there are many cases where full checksumming makes sense too,
but those shouldn't be forced on everybody because it will slow down
some important workloads (like O_DIRECT)

Metadata checksums would be best just put into the file systems
data structures. Essentially every object (inode, extent, directory entry,
super block) should have a checksum that can be incrementially updated.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6

2007-04-24 Thread Andi Kleen
On Tuesday 24 April 2007 12:40:24 Jan Kara wrote:
> > One of my autoboot test clients gave me this during shutdown. It used
> > reiserfs and autofs and NFS heavily.
>   Jeff has a fix for this bug so it should go away soon... Thanks for
> report anyway :).

Well I hit two more -- see other mails if you're interested.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


More reiserfs trouble in 2.6.21-rc5

2007-04-22 Thread Andi Kleen

FYI,

This was a debugging kernel (preempt, slab debugging, lockdep etc. enabled)
running autotest and some other load on a 4 core Opteron system

There was also another lockdep warning before that which I'm sending
separately.

Looks like some memory corruption. Could be something else, but at least
reiserfs is the messenger.

BTW these kind of backtraces are a good example on why I want the dwarf2
unwinder back.

-Andi

[ cut here ]
kernel BUG at /mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/mm/slab.c:2380!
invalid opcode:  [1] PREEMPT SMP
CPU 0
Modules linked in:
Pid: 12205, comm: find Not tainted 2.6.21-rc5-git6 #44
RIP: 0010:[]  [] 
cache_alloc_refill+0xe6/0x22a
RSP: 0018:810078be7a28  EFLAGS: 00010002
RAX: 0001 RBX: 0001 RCX: 8027b826
RDX: 0003 RSI: 81017d077000 RDI: 8100f7f7c540
RBP: 81017d077000 R08: 8100f7f7ec78 R09: 8101fa6cf178
R10: 810100116070 R11:  R12: 810100116070
R13: 8100f7f7ec78 R14: 8100f7f7c540 R15: 000c
FS:  2acc1ce646d0() GS:8074a000() knlGS:5b3f2b90
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 43d45000 CR3: 88a9b000 CR4: 06e0
Process find (pid: 12205, threadinfo 810078be6000, task 81002a76d0c0)
Stack:  000100d0 8101002b6248 8100f7f7c540 0246
 00d0 802ce4cb 802c7590 8027bda3
 8101002b6248  8100070a9988 8101002b6248
Call Trace:
 [] reiserfs_alloc_inode+0x15/0x2a
 [] reiserfs_find_actor+0x0/0x1b
 [] kmem_cache_alloc+0x8c/0xe2
 [] reiserfs_alloc_inode+0x15/0x2a
 [] alloc_inode+0x12/0x13f
 [] iget5_locked+0x5a/0x18a
 [] reiserfs_init_locked_inode+0x0/0x12
 [] reiserfs_iget+0x30/0x92
 [] pathrelse+0x24/0x3c
 [] reiserfs_lookup+0xcf/0x138
 [] _read_trylock+0x47/0x6b
 [] d_alloc+0x1c4/0x1d0
 [] do_lookup+0xc4/0x1ae
 [] __link_path_walk+0x885/0xd2b
 [] link_path_walk+0x58/0xe0
 [] do_path_lookup+0x1be/0x1e2
 [] getname+0x152/0x196
 [] __user_walk_fd+0x37/0x53
 [] vfs_lstat_fd+0x18/0x47
 [] sys_newlstat+0x19/0x31
 [] trace_hardirqs_on_thunk+0x35/0x37
 [] _atomic_dec_and_lock+0x39/0x58
 [] system_call+0x7e/0x83


Code: 0f 0b eb fe 49 8b 86 70 03 00 00 49 ff 86 78 03 00 00 48 ff
RIP  [] cache_alloc_refill+0xe6/0x22a
 RSP 
note: find[12205] exited with preempt_count 1
BUG: scheduling while atomic: find/0x1001/12205
INFO: lockdep is turned off.
UG: scheduling while atomic: find/0x1001/12205
INFO: lockdep is turned off.

Call Trace:
 [] __sched_text_start+0x81/0x80b
 [] vt_console_print+0x21f/0x235
 [] _spin_unlock_irqrestore+0x49/0x69
 [] __cond_resched+0x13/0x32
 [] cond_resched+0x2e/0x39
 [] unmap_vmas+0x5e5/0x778
 [] exit_mmap+0x80/0x117
 [] mmput+0x2c/0x9e
 [] do_exit+0x21a/0x82e
 [] _spin_unlock_irqrestore+0x49/0x69
 [] kernel_math_error+0x0/0x90
 [] do_invalid_op+0xb2/0xbc
 [] cache_alloc_refill+0xe6/0x22a
 [] trace_hardirqs_on_thunk+0x35/0x37
 [] restore_args+0x0/0x30
 [] error_exit+0x0/0x96
 [] cache_alloc_refill+0x6f/0x22a
 [] cache_alloc_refill+0xe6/0x22a
 [] cache_alloc_refill+0xcd/0x22a
 [] reiserfs_alloc_inode+0x15/0x2a
 [] reiserfs_find_actor+0x0/0x1b
 [] kmem_cache_alloc+0x8c/0xe2
 [] reiserfs_alloc_inode+0x15/0x2a
 [] alloc_inode+0x12/0x13f
 [] iget5_locked+0x5a/0x18a
 [] reiserfs_init_locked_inode+0x0/0x12
 [] reiserfs_iget+0x30/0x92
 [] pathrelse+0x24/0x3c
 [] reiserfs_lookup+0xcf/0x138
 [] _read_trylock+0x47/0x6b
 [] d_alloc+0x1c4/0x1d0
 [] do_lookup+0xc4/0x1ae
 [] __link_path_walk+0x885/0xd2b
 [] link_path_walk+0x58/0xe0
 [] do_path_lookup+0x1be/0x1e2
 [] getname+0x152/0x196
 [] __user_walk_fd+0x37/0x53
 [] vfs_lstat_fd+0x18/0x47
 [] sys_newlstat+0x19/0x31
 [] trace_hardirqs_on_thunk+0x35/0x37
 [] _atomic_dec_and_lock+0x39/0x58
 [] system_call+0x7e/0x83

BUG: scheduling while atomic: find/0x1001/12205
INFO: lockdep is turned off.

[... more similar messages ...]
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


reiserfs lockdep warning in 2.6.21-rc5

2007-04-22 Thread Andi Kleen


===
[ INFO: possible circular locking dependency detected ]
2.6.21-rc5-git6 #44
---
perl/7968 is trying to acquire lock:
 (&inode->i_mutex){--..}, at: [] 
reiserfs_file_release+0x109/0x2cc

but task is already holding lock:
 (&mm->mmap_sem){}, at: [] sys_munmap+0x32/0x5a

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&mm->mmap_sem){}:
   [] __lock_acquire+0x9e0/0xb79
   [] lock_acquire+0x48/0x63
   [] do_page_fault+0x3a4/0x7b2
   [] down_read_trylock+0xe/0x3b
   [] down_read+0x21/0x2a
   [] do_page_fault+0x3a4/0x7b2
   [] trace_hardirqs_on+0x11c/0x140
   [] _read_unlock_irq+0x2f/0x4a
   [] find_lock_page+0x91/0x9d
   [] find_or_create_page+0x1e/0x75
   [] error_exit+0x0/0x96
   [] reiserfs_release_claimed_blocks+0x22/0x49
   [] reiserfs_copy_from_user_to_file_region+0x7e/0xf3
   [] reiserfs_file_write+0x15a1/0x1795
   [] _spin_unlock_irqrestore+0x49/0x69
   [] trace_hardirqs_on_thunk+0x35/0x37
   [] _spin_unlock_irq+0x24/0x4a
   [] trace_hardirqs_on+0x11c/0x140
   [] _spin_unlock_irq+0x2f/0x4a
   [] thread_return+0xee/0x135
   [] _read_unlock_irq+0x24/0x4a
   [] trace_hardirqs_on+0x11c/0x140
   [] _read_unlock_irq+0x2f/0x4a
   [] find_get_pages_tag+0x75/0x80
   [] vfs_write+0xad/0x136
   [] sys_pwrite64+0x50/0x70
   [] ia32_sysret+0x0/0xa
   [] 0x

-> #0 (&inode->i_mutex){--..}:
   [] print_circular_bug_header+0xcc/0xd3
   [] __lock_acquire+0x8dc/0xb79
   [] lock_acquire+0x48/0x63
   [] reiserfs_file_release+0x109/0x2cc
   [] debug_mutex_lock_common+0x16/0x23
   [] __mutex_lock_slowpath+0xe1/0x293
   [] reiserfs_file_release+0x109/0x2cc
   [] __fput+0xa1/0x15e
   [] remove_vma+0x35/0x5c
   [] do_munmap+0x258/0x27a
   [] __down_write_nested+0x34/0x9e
   [] sys_munmap+0x40/0x5a
   [] system_call+0x7e/0x83
   [] 0x

other info that might help us debug this:

1 lock held by perl/7968:
 #0:  (&mm->mmap_sem){}, at: [] sys_munmap+0x32/0x5a

stack backtrace:
Call Trace:
 [] print_circular_bug_tail+0x69/0x72
 [] print_circular_bug_header+0xcc/0xd3
 [] __lock_acquire+0x8dc/0xb79
 [] lock_acquire+0x48/0x63
 [] reiserfs_file_release+0x109/0x2cc
 [] debug_mutex_lock_common+0x16/0x23
 [] __mutex_lock_slowpath+0xe1/0x293
 [] reiserfs_file_release+0x109/0x2cc
 [] __fput+0xa1/0x15e
 [] remove_vma+0x35/0x5c
 [] do_munmap+0x258/0x27a
 [] __down_write_nested+0x34/0x9e
 [] sys_munmap+0x40/0x5a
 [] system_call+0x7e/0x83
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/8] allow unprivileged mounts

2007-04-21 Thread Andi Kleen
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Fri, 20 Apr 2007 12:25:39 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> 
> > Define a new fs flag FS_SAFE, which denotes, that unprivileged
> > mounting of this filesystem may not constitute a security problem.
> > 
> > Since most filesystems haven't been designed with unprivileged
> > mounting in mind, a thorough audit is needed before setting this flag.
> 
> Practically speaking, is there any realistic likelihood that any filesystem
> apart from FUSE will ever use this?

If it worked for mount --bind for any fs I could see uses of this.  I haven't 
thought
through the security implications though, so it might not work.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


BUG: Dentry still in use during umount in 2.6.21-rc5-git6

2007-04-20 Thread Andi Kleen

One of my autoboot test clients gave me this during shutdown. It used
reiserfs and autofs and NFS heavily.

Unmounting file systems
BUG: Dentry 8100f3693a40{i=2352220,n=xattrs} still in use (1) [unmount of 
reiserfs sda9]
[ cut here ]
kernel BUG at 
/mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/fs/dcache.c:623!
invalid opcode:  [1] SMP 
CPU 1 
Modules linked in:
Pid: 15791, comm: umount Not tainted 2.6.21-rc5-git6 #44
RIP: 0010:[]  [] 
shrink_dcache_for_umount_subtree+0x178/0x250
RSP: 0018:8100f5f67e18  EFLAGS: 00010292
RAX: 0060 RBX: 8100f3693a40 RCX: 5207
RDX:  RSI: 0046 RDI: 00014661
RBP: 8100f6dc9cc0 R08: 00a0 R09: 0005
R10:  R11:  R12: 8100f3693aa0
R13: 00014661 R14: 0050ea70 R15: 0050ead0
FS:  2adc863a86d0() GS:8100f7fdc1c0() knlGS:b7be38d0
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 2adc8626a688 CR3: f628b000 CR4: 06e0
Process umount (pid: 15791, threadinfo 8100f5f66000, task 8100f7a08100)
Stack:  810004dab218 810004dab000 80558860 810004dab000
  8028815b 810004dab000 8027a1a5
  8100f6c50980 806c1600 8027a2a4
Call Trace:
 [] shrink_dcache_for_umount+0x2f/0x3d
 [] generic_shutdown_super+0x19/0xf2
 [] kill_block_super+0x26/0x3b
 [] deactivate_super+0x47/0x60
 [] sys_umount+0x1f7/0x22a
 [] sys_newstat+0x19/0x31
 [] system_call+0x7e/0x83


Code: 0f 0b eb fe 48 8b 6b 28 48 39 dd 75 04 31 ed eb 04 f0 ff 4d 
RIP  [] shrink_dcache_for_umount_subtree+0x178/0x250
 RSP 
/etc/init.d/boot.d/K14boot.localfs: line 93: 15791 Segmentation fault  
umount -avt noproc,nonfs,nonfs4,nosmbfs,nocifs,notmpfs



-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AppArmor FAQ

2007-04-17 Thread Andi Kleen
> The vast majority of applications are not
> modified to be SELinux aware - only a small handful of security aware
> applications are modified. 

All applications that can edit /etc/resolv.conf? That's nearly 
everything. You yourself gave the example; I'm not making anything up.

-Andi (sensing a loop in the thread -- things that already have been
discussed come back from the dead.)

P.S.: If you want to loop further please drop me from cc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AppArmor FAQ

2007-04-17 Thread Andi Kleen
> For SELinux to be effective it has to have a complete policy definition.
> This would prevent the OpenOffice access (unless OpenOffice is in the
> modify_resolv_conf_t domain) above.

This would mean no fully functional root user anymore. My understanding 
is rather that at least in the Fedora default setup individual applications
are confined with targetted policy and root left alone because normal system 
administrators get very unhappy when root becomes powerless.

I was merely pointing out that in this setup path or namespace based access 
control are much easier to fit in than label based access because they normally 
don't require changing applications. John's original document also
listed some other advantages that I don't need to repeat.

In "i don't care if it looks like Unix anymore" security setups like
you're describing that's undoubtedly different and labels might indeed
work because you forbid just anybody changing them easily. If that makes
the users happy is a different question though. I suppose it will
keep security consultants employed at least @)

Arguably the preserving label issue is not unique to SELinux but 
also applies to ACLs and other possible uses of EAs, but then people normally 
don't need to set any ACLs on /etc/resolv.conf.

I personally don't like either too much. Path based access control
is somewhat hackish and ugly and slow in the current implementation, 
but I haven't seen an similarly easy to configure solution yet.

plan9 like limited namespaces for individual processes initially seem 
like a nice alternative, but in practice they're also too hard to use and 
suffer from many of the problems the EA label approach has.

But easy to use security is probably better than complicated security
because normal people will more likely use it.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AppArmor FAQ

2007-04-17 Thread Andi Kleen
On Tue, Apr 17, 2007 at 01:47:39PM -0400, James Morris wrote:
> Normal applications need zero modification under SELinux.
> 
> Some applications which manage security may need to be made SELinux-aware, 

Anything that can touch /etc/resolv.conf? That's potentially a lot of binaries
if you consider anything scripts could do with it.

> although this can often be done with PAM plugins, which is a standard way 
> to do this kind of thing in modern Unix & Linux OSs.

PAM plugins in vi and emacs? Scary idea.

And what do you do if someone decides to use OpenOffice to edit their
/etc/resolv.conf? For a lot of people that's the only text editor 
they know.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AppArmor FAQ

2007-04-17 Thread Andi Kleen
Karl MacMillan <[EMAIL PROTECTED]> writes:
 
> No - the real fix is to change the applications or to run under a policy
> that confines all applications. Most of the problems with resolv.conf,
> mtab, etc. stem from admin processes (e.g., editors or shell scripts)
> all running under the same unconfined domain.
> 
> In some cases applications need modification as only the application has
> enough information to determine the correct label. Usually this means
> preserving labels from input files or separating the output into
> distinct directories so type transitions or label inheritance will work.
> 
> restorecond is just a hack not a requirement or a sign that something is
> wrong with the model. That is why it is a userspace application and not
> integrated into the kernel mechanism.

You nicely show one of the major disadvantages of the label model vs the path 
model here: it requires modification of a lot of applications. 

Maybe John can borrow your statement for new versions of his FAQ @)

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching

2007-04-15 Thread Andi Kleen
> It's nice to check for consistency though, so we're adding that. Profile 
> loading is a trusted operation, at least so far, and so security wise we 
> don't actually have to care --- if loading an invalid profile can bring down 
> the system, then that's no worse than an arbitrary module that crashes the 
> machine. Not sure if there will ever be user loadable profiles; at least at 
> that point we had to care.

A security system that allows to crash the kernel is a little weird 
though. It would be better to check. Not that a recursion check
is particularly expensive.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching

2007-04-12 Thread Andi Kleen
[EMAIL PROTECTED] writes:

[didn't review code fully, just some stuff I noticed]

> +
> +struct aa_dfa {
> + struct table_header *tables[YYTD_ID_NXT];
> +};

If that is passed in from user space you would need special compat
code for 64bit kernels who support 32bit userland.
Better to avoid pointers.

> +
> + /* get optional subprofiles */
> + if (aa_is_nameX(e, AA_LIST, "hats")) {
> + while (!aa_is_nameX(e, AA_LISTEND, NULL)) {
> + struct aa_profile *subprofile;
> + subprofile = aa_unpack_profile(e);

Is there any check that would guard the recursion from stack
overflow on malicious input?  

> + /*
> +  * Replacement needs to allocate a new aa_task_context for each
> +  * task confined by old_profile.  To do this the profile locks
> +  * are only held when the actual switch is done per task.  While
> +  * looping to allocate a new aa_task_context the old_task list
> +  * may get shorter if tasks exit/change their profile but will
> +  * not get longer as new task will not use old_profile detecting
> +  * that is stale.
> +  */
> + do {
> + new_cxt = aa_alloc_task_context(GFP_KERNEL | __GFP_NOFAIL);

NOFAIL is usually a bad sign. It should be only used if there is no
alternative.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [take36 10/10] kevent: Kevent based generic AIO.

2007-02-12 Thread Andi Kleen
Evgeniy Polyakov <[EMAIL PROTECTED]> writes:
> 
> aio_sendfile_path() is essentially aio_sendfile(), except that it takes
> source filename as parameter, has a pointer to private header
> and its size (which allows to send header and file's content in one syscall
> instead of three (open, send, sendfile) and returns opened file descriptor.

Are you sure this is a useful optimization? Do you have numbers vs 
open+aio_sendfile+close? 

Compared to the cost of sending a complete file three system calls should be 
quite in the noise. 
And Linux system calls are not that expensive (few hundred cycles normally) 

Adding such compound system calls would be a worrying precedent because
I'm sure others would want them then for their favourite system call combo
too. If they were really useful it might make more sense to have a batch() 
system call that works for arbitary calls, but I'm not convinced yet
it's even needed. It would be certainly ugly.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix quadratic behavior of shrink_dcache_parent()

2007-02-10 Thread Andi Kleen
Miklos Szeredi <[EMAIL PROTECTED]> writes:
 
> So the second part of the problem is to somehow limit the number of
> dentries used.  Not easy...

OpenVZ has some existing work in this area to separate their virtual machines. 
I assume they will eventually submit it.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-02 Thread Andi Kleen
Andrew Morton <[EMAIL PROTECTED]> writes:

> 
> > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > >   possibly gain (or vice versa)
> > > 
> > > - Relative merits of the two offerings
> > 
> > You missed the important one - people actively use it and have been for
> > some years. Same reason with have NTFS, HPFS, and all the others. On
> > that alone it makes sense to include.
>  
> Again, that's not a technical reason.  It's _a_ reason, sure.  But what are
> the technical reasons for merging gfs[2], ocfs2, both or neither?

There seems to be clearly a need for a shared-storage fs of some sort
for HA clusters and virtualized usage (multiple guests sharing a
partition).  Shared storage can be more efficient than network file
systems like NFS because the storage access is often more efficient
than network access  and it is more reliable because it doesn't have a
single point of failure in form of the NFS server.

It's also a logical extension of the "failover on failure" clusters
many people run now - instead of only failing over the shared fs at
failure and keeping one machine idle the load can be balanced between
multiple machines at any time.

One argument to merge both might be that nobody really knows yet which
shared-storage file system (GFS or OCFS2) is better. The only way to
find out would be to let the user base try out both, and that's most
practical when they're merged.

Personally I think ocfs2 has nicer&cleaner code than GFS.
It seems to be more or less a 64bit ext3 with cluster support, while
GFS seems to reinvent a lot more things and has somewhat uglier code.
On the other hand GFS' cluster support seems to be more aimed
at being a universal cluster service open for other usages too,
which might be a good thing. OCFS2s cluster seems to be more 
aimed at only serving the file system.

But which one works better in practice is really an open question.

The only thing that should be probably resolved is a common API
for at least the clustered lock manager. Having multiple
incompatible user space APIs for that would be sad.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: File Locking in Linux 2.5

2001-05-04 Thread Andi Kleen

On Thu, May 03, 2001 at 06:03:07PM -0700, Jeremy Allison wrote:
> The only thing we need to fix (IMHO) is the close(dup(fd)) "bug",
> which I have yet to see any application treat as anything other
> than a spec. bug that must be worked around.

I don't think it's a bug right now; not dropping them and keeping them 
around forever adds a quite lot of new problems: just see how unnatural
sysv shared memory segments and semaphores fit into the rest of the mostly
elegant unix model. You need to have special administration tools for them,
you need to clean up after crashing applications that keep their resources
around, you need special global resource limits to avoid users running amok
etc. Persistent locks have similar problems. They probably can all be solved,
but it won't be pretty or easy.

One way to work around this would be to have a special device where these locks
are associated with an open descriptor. e.g. in samba you would open that
device (it would work like the /dev/ptmx multiplexer and just give new 
instances for every open)
and remove close-on-exec and inherit it to every child. It would even work
for more distributed applications without clear child-parent relations, you 
could pass that fd around via unix sockets. Then all persistent
locks would be associated with the fd. When samba dies all
references to that lock manager instance would be closed and the system could
clean up the persistent locks without administrator support. I think such
a scheme would be much nicer to the rest of the system and avoid the problems
the sysv stuff has.

Comments? 


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: File Locking in Linux 2.5

2001-05-03 Thread Andi Kleen

On Thu, May 03, 2001 at 08:36:13AM +0100, Matthew Wilcox wrote:
> I'll get to it this weekend then.  Should be a relatively simple patch.

Just don't forget to add a per user ulimit for it and probably an admin
tool like ipcs.

> Are there any other semantics you want changing from the POSIX lock?
> I'm thinking of removing deadlock detection, for one...

If you're willing to delay deadlock detection a bit  you can just do 
it without a global list:
on waiting for the lock you do schedule_timeout for 5-10s; after that 
you put the lock onto a timeout queue and search the queue for deadlock and
repeat that searching for some time in slow intervals. 
(That's the textbook algorithm out of the Grey/Reuter database book)

Disadvantage is that the big timer list must be modified for every
lock operation; but that one is already efficient and will in 2.5 
probably be per CPU.

Advantage is that it moves near completely out of the fast path. 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: File Locking in 2.5

2001-05-02 Thread Andi Kleen

On Mon, Apr 30, 2001 at 12:39:23PM -0600, Matthew Wilcox wrote:
>  * All filesystems will fill in their ->lock method.

Why when a common stub should work for 90% of them?  Please keep
global search-and-edit operation low when not absolutely possible.

>  * Local filesystems should all use local_lock() for this method,
>unless they have a good reason to provide their own facilities.


So this should be default for ->lock

>A clustered filesystem might call out to the network and say `I want
>to put this lock on this file'. Either some other node in the cluster
>says `Denied', `Blocked' or `Granted' (ie handles the request), OR no
>other node accepts responsibility for the lock, in which case we lock
>it locally by calling local_lock().

It would be nice if it was possible to interface it to the distributed lock
manager IBM has recently GPLed...

>With the changes above, even this code can go away. The nfs client can
>keep a per-fs list of locks, and reestablish them at server restart.
>No need to interact with the local locking at all.

It's still linear lists I guess? Any evidence that a hash table or a ternary
search tree makes sense?
>  * Mandatory Locks. These are byte-range mandatory locks. To use
>them, mount the filesystem with the `mand' option enabled, and set
>the file mode to g-x g+s. POSIX locks applied to this file will
>now be mandatory. Mandatory locks do not prevent accesses via
>mmap(). You should not use Mandatory locks in new code.

I would add a printk warning that fires 10 times, then you can get rid
of it.

>locks.c still runs almost entirely under the BKL. An earlier attempt
>to move it to a different locking scheme was thwarted when the code
>was integrated into 2.4.0-test9 while I was on holiday, and without me
>submitting it to Linus. Grumble. I plan to move it to _one_ spinlock
>to cover all lock-related structures, and I think that will be
>possible with the plan described above (since this code will no longer
>sleep).

I never understood why you can't do it completely inode local protected
by the inode lock; but then I don't claim that I understand fs/locks.c

It would be a small drawback for your new global locks because these inodes
could not be flushed (so you need a new resource management and perhaps
per user quotas for it anyways), but probably not worse than the current state.

>As soon as lockd no longer needs to keep its fingers inside locks.c, I
>want to remove the global list of locks. It's also used by /proc/locks
>-- which probably needs to go away anyway. So what's useful about
>/proc/locks? I'd like to be able to see which locks my process has,

Or just walk the inode hash ? 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Q: sb->s_bdev same device as sb->s_dev?

2001-03-11 Thread Andi Kleen

On Sun, Mar 11, 2001 at 04:55:24AM -0500, Alexander Viro wrote:
> 
> 
> On Sun, 11 Mar 2001, Andi Kleen wrote:
> 
> > On Sat, Mar 10, 2001 at 08:44:41PM -0500, Alexander Viro wrote:
> > > Too many places are using ->i_dev right now to eliminate ->i_dev.
> > > Yes, it should eventually go away (and icache should work by ->i_sb/->i_ino
> > > instead of ->i_dev/->i_ino).  ->i_dev should eventually go away, but that
> > > was too late in 2.4 cycle to do that.
> > 
> > And how would you implement stateless file handles for NFS then if i_dev
> > was gone ?
> > (->i_sb is not constant over reboots) 
> 
> ->i_sb->s_dev is not worse than ->i_dev.

Yes, it's an obvious improvement ;)


-Andi (who thinks VFS is turning more and more into pointer hell)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Q: sb->s_bdev same device as sb->s_dev?

2001-03-11 Thread Andi Kleen

On Sat, Mar 10, 2001 at 08:44:41PM -0500, Alexander Viro wrote:
> Too many places are using ->i_dev right now to eliminate ->i_dev.
> Yes, it should eventually go away (and icache should work by ->i_sb/->i_ino
> instead of ->i_dev/->i_ino).  ->i_dev should eventually go away, but that
> was too late in 2.4 cycle to do that.

And how would you implement stateless file handles for NFS then if i_dev
was gone ?
(->i_sb is not constant over reboots) 


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: NULL f_ops

2001-03-08 Thread Andi Kleen

On Thu, Mar 08, 2001 at 03:48:08PM +, Matthew Wilcox wrote:
> 
> Someone tell me if my chain of reasoning is wrong here...
> 
>  (1) The only way to get a `struct file' is to call get_empty_filp()

There is code that creates private struct files without calling get_empty_filp()
NFS comes to mind, but there is other too.  Now of course these could
be changed, but I think the change would cause much more work for debugging/
fixing than you think ;)

[I think it causes performance problems there BTW -- it kills ext2 readahead
because the file is not preserved] 


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: block bitmap readahead for ext2

2001-01-24 Thread Andi Kleen

On Wed, Jan 24, 2001 at 11:00:41AM -0500, Jeff Darcy wrote:
> From: "Andi Kleen" <[EMAIL PROTECTED]>
> > Does this mean that unwriten extents are supported now in pagebuf?
> Otherwise
> > this ioctl would need to prezero the disks blocks to prevent old data
> > from being leaked, right?
> 
> I don't know specifically about the current Linux pagebuf code, but in
> general this is not necessarily the case.  If the interfaces are set up so
> that one can know when the whole block is about to be overwritten with new
> data, no zeroing is necessary for that block.

This is what XFS does on Irix according to the design documentation on the
XFS site.  It has special unwriten extents that on read return zeroes.
On write an unwriten extent is turned into a normal extent.
In the Linux port the VM support needed for that just seems not to be 
implemented yet. Pagebuf in this case is a XFS datastructure to describe 
multi page objects that are continuous on disk in memory -- normal linux page 
cache only knows about allocated individual pages and doesn't support
preallocation. 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: block bitmap readahead for ext2

2001-01-24 Thread Andi Kleen

Quick question.

On Wed, Jan 24, 2001 at 09:20:43AM -0600, Steve Lord wrote:
> What XFS also has is an ioctl to preallocate disk space, there is very
> little documentation on this (none), but if you look in the file
> cmd/xfstests/src/randholes.c you will see an ioctl like this:
> 
> struct flock64 fl;
> fl.l_start = offset;
> fl.l_len = blocksize;
> fl.l_whence = 0;
> 
> if (ioctl(fd, XFS_IOC_RESVSP64, &fl) < 0) {
> 
> This is preallocating disk blocks for file data. Of course, it only
> works if you know file size in advance. We should also have direct
> I/O in the next month or so.

Does this mean that unwriten extents are supported now in pagebuf? Otherwise
this ioctl would need to prezero the disks blocks to prevent old data
from being leaked, right?


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFC] api for consistent lvm snapshots

2000-10-04 Thread Andi Kleen

On Wed, Oct 04, 2000 at 01:47:28PM -0400, Ken Hirsch wrote:
> 
> The advantages I see are that database systems and other resource managers
> would have much less complicated buffering logic, that the data would not be
> double-buffered, that the LRU algorithm would be more accurate (there was a
> discussion of this a few months ago with regard to multiple file systems).
> Hopefully we could better performance without having to tune so much.
> 
> It would also be a step toward a common open-source transaction manager.
> 
> There are issues with regard to checkpointing, but other than that, what do
> you think of this idea?

The main motivation for that change seems to be simpler buffering logic
in a user space log manager. How about simply adding a madvise() flag
that tells the MM system that you don't want it to flush the page to 
the backing store yet. After the log manager did its job it could clear
that flag and the system would flush it.

The only nasty issue is that it could cause deadlock in the low memory 
case when the system cannot flush pages because they all have this flag
set. A very simple workaround would be to write the page cache page to
swap then instead of the underlying file [this would make swap accounting
harder, but linux doesn't do that anyways]. Later the swapped page could be
paged in again and then writen to the real file.   It would be slow in the
trashing case, but in this case you should not expect much performance
anyways.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Stupid Question

2000-09-10 Thread Andi Kleen

On Sun, Sep 10, 2000 at 01:18:40PM -0700, Ion Badulescu wrote:
> In article <[EMAIL PROTECTED]> you wrote:
> > Can I ask a stupid question: why do we pass a pointer to the file
> > position along with the file when calling read/write methods as in:
> > 
> >   read(file, buf, count, &file->f_pos)
> > 
> > Is there ever a case when we don't want to use the f_pos from the file
> > struct?
> 
> Yes: pread()/pwrite().
> 
> BTW, they've caused us quite a bit of trouble in the stacking fs code.
> We need to detect if the VFS code calling us is read() or pread(), in
> order to know how to call our own underlying fs method. The only --
> and very unclean -- way we can detect this is by comparing the fpos
> pointer passed to us to the address of file->fpos...
> 
> Passing an update_fpos boolean would be so much cleaner.

An update_fpos boolean only would require that pread/pwrite create their own
file structure to pass in the user offset.


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]