Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen

 I completely agree.  If one thread writes A and another writes B then the
 kernel should record either A or B, not ((A  0x) | (B 
 0x))

The problem is pretty nasty unfortunately. To solve it properly I think
the file_operations-read/write prototypes would need to be changed
because otherwise it is not possible to do atomic relative updates
of f_pos. Right now the actual update is burrowed deeply in the low level 
read/write implementation. But that would be a huge impact all over
the tree :/

Or maybe define a new read/write64 and keep the default as 32bit only-- i 
suppose most users don't really need 64bit. Still would be a nasty API 
change.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Bodo Eggert
Trond Myklebust [EMAIL PROTECTED] wrote:
 On Mon, 2008-01-28 at 05:38 +0100, Andi Kleen wrote:
 On Monday 28 January 2008 05:13:09 Trond Myklebust wrote:
  On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote:

   The problem is that it's not a race in who gets to do its thing first,
   but a parallel reader can actually see a corrupted value from the two
   independent words on 32bit (e.g. during a 4GB). And this could actually
   completely corrupt f_pos when it happens with two racing relative seeks
   or read/write()s
   
   I would consider that a bug.
  
  I disagree. The corruption occurs because this isn't a situation that is
  allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring
  to here?
 
 No specific spec, just general quality of implementation. We normally don't
 have non thread safe system calls even if it was in theory allowed by some
 specification.
 
 We've had the existing implementation for quite some time. The arguments
 against changing it have been the same all along: if your application
 wants to share files between threads, the portability argument implies
 that you should either use pread/pwrite or use a mutex or some other
 form of synchronisation primitive in order to ensure that
 lseek()/read()/write() do not overlap.

Does anything in the kernel depend on f_pos being valid?
E.g. is it possible to read beyond the EOF using this race, or to have files
larger than the ulimit?

If not, update the manpage and be done. ¢¢

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen
On Monday 28 January 2008 13:56:05 Alan Cox wrote:
   No specific spec, just general quality of implementation.
  
  I completely agree.  If one thread writes A and another writes B then the
  kernel should record either A or B, not ((A  0x) | (B 
  0x))
 
 Agree entirely: the spec doesn't allow for random scribbling in the wrong
 place. It doesn't cover which goes first or who wins the race but
 provides pwrite/pread for that situation. Writing somewhere unrelated is
 definitely not to spec 

Actually it would probably -- i guess it's undefined and in undefined
country such things can happen.

Also to be fair I think it's only a problem for the 4GB wrapping case
which is presumably rare (otherwise we would have heard about it)

Also worse really fixing it would be a major change to the VFS 
because of the way -read/write are defined :/

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox
  No specific spec, just general quality of implementation.
 
 I completely agree.  If one thread writes A and another writes B then the
 kernel should record either A or B, not ((A  0x) | (B 
 0x))

Agree entirely: the spec doesn't allow for random scribbling in the wrong
place. It doesn't cover which goes first or who wins the race but
provides pwrite/pread for that situation. Writing somewhere unrelated is
definitely not to spec and not good.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox
On Mon, 28 Jan 2008 15:10:34 +0100
Andi Kleen [EMAIL PROTECTED] wrote:

 On Monday 28 January 2008 14:38:57 Alan Cox wrote:
   Also worse really fixing it would be a major change to the VFS 
   because of the way -read/write are defined :/
  
  I don't see a problem there. -read and -write update the passed pointer
  which is not the real f_pos anyway. Just the copies need fixing. 
 
 They are effectually doing a decoupled read/modify/write cycle. e.g.:
 
 A   B
 
 read fpos   
 
 read fpos
 
 fpos += A   fpos += B
 write fpos
 
 
 write fpos
 
 So you get overlapping reads. Probably not good.

No unix system I'm aware of cares about the read/write positioning during
parallel simultaneous reads or writes, with the exception of O_APPEND
which is strictly defined. The problem case is getting fpos != either
valid value.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Diego Calleja
El Mon, 28 Jan 2008 15:10:34 +0100, Andi Kleen [EMAIL PROTECTED] escribió:

 So you get overlapping reads. Probably not good.

This was discussed in the past i think -

http://lkml.org/lkml/2006/4/13/124
http://lkml.org/lkml/2006/4/13/130
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Steve French
On Jan 28, 2008 2:17 AM, Andi Kleen [EMAIL PROTECTED] wrote:
  I completely agree.  If one thread writes A and another writes B then the
  kernel should record either A or B, not ((A  0x) | (B 
  0x))

 The problem is pretty nasty unfortunately. To solve it properly I think
 the file_operations-read/write prototypes would need to be changed
 because otherwise it is not possible to do atomic relative updates
 of f_pos. Right now the actual update is burrowed deeply in the low level
 read/write implementation. But that would be a huge impact all over
 the tree :/

If there were a wrapper around reads and writes of f_pos as there is
for i_size e.g. it would hit a lot of code, but not as many as I had
originally thought.  the most important ones are in the vfs itself, where
there are only 59 uses of the field (not all need to be changed).   ext3
has fewer (25), and cifs only 12 uses.


-- 
Thanks,

Steve
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Dave Kleikamp

On Mon, 2008-01-28 at 12:33 -0600, Steve French wrote:
 On Jan 28, 2008 2:17 AM, Andi Kleen [EMAIL PROTECTED] wrote:
   I completely agree.  If one thread writes A and another writes B then the
   kernel should record either A or B, not ((A  0x) | (B 
   0x))
 
  The problem is pretty nasty unfortunately. To solve it properly I think
  the file_operations-read/write prototypes would need to be changed
  because otherwise it is not possible to do atomic relative updates
  of f_pos. Right now the actual update is burrowed deeply in the low level
  read/write implementation. But that would be a huge impact all over
  the tree :/
 
 If there were a wrapper around reads and writes of f_pos as there is
 for i_size e.g. it would hit a lot of code, but not as many as I had
 originally thought.  the most important ones are in the vfs itself, where
 there are only 59 uses of the field (not all need to be changed).   ext3
 has fewer (25), and cifs only 12 uses.

Most of the uses in ext3 and cifs deal with a directory's f_pos in
readdir, which is protected by i_mutex, so I don't think we need to
worry about them at all.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Andi Kleen
On Monday 28 January 2008 14:38:57 Alan Cox wrote:
  Also worse really fixing it would be a major change to the VFS 
  because of the way -read/write are defined :/
 
 I don't see a problem there. -read and -write update the passed pointer
 which is not the real f_pos anyway. Just the copies need fixing. 

They are effectually doing a decoupled read/modify/write cycle. e.g.:

A   B

read fpos   

read fpos

fpos += A   fpos += B
write fpos


write fpos

So you get overlapping reads. Probably not good.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-28 Thread Alan Cox
 Also worse really fixing it would be a major change to the VFS 
 because of the way -read/write are defined :/

I don't see a problem there. -read and -write update the passed pointer
which is not the real f_pos anyway. Just the copies need fixing.

Alan
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Steve French
Don't you need to a spinlock/spinunlock(i_lock) or something similar
(there isn't a spinlock in the file struct unfortunately) around the
reads and writes from f_pos in fs/read_write.c in remote_llseek with
your patch since the reads/writes from that field are not necessarily
atomic and threads could be racing in seek on the same file struct?

On Jan 26, 2008 8:17 PM, Andi Kleen [EMAIL PROTECTED] wrote:

 - Replace remote_llseek with remote_llseek_unlocked (to force compilation
 failures in all users)
 - Change all users to either use remote_llseek directly or take the
 BKL around. I changed the file systems who don't use the BKL
 for anything (CIFS, GFS) to call it directly. NCPFS and SMBFS and NFS
 take the BKL, but explicitely in their own source now.

 I moved them all over in a single patch to avoid unbisectable sections.

 Cc: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]

 Signed-off-by: Andi Kleen [EMAIL PROTECTED]

 ---
  fs/cifs/cifsfs.c   |2 +-
  fs/gfs2/ops_file.c |4 ++--
  fs/ncpfs/file.c|   12 +++-
  fs/nfs/file.c  |6 +-
  fs/read_write.c|7 +++
  fs/smbfs/file.c|   11 ++-
  include/linux/fs.h |3 ++-
  7 files changed, 34 insertions(+), 11 deletions(-)

 Index: linux/fs/cifs/cifsfs.c
 ===
 --- linux.orig/fs/cifs/cifsfs.c
 +++ linux/fs/cifs/cifsfs.c
 @@ -549,7 +549,7 @@ static loff_t cifs_llseek(struct file *f
 if (retval  0)
 return (loff_t)retval;
 }
 -   return remote_llseek(file, offset, origin);
 +   return remote_llseek_unlocked(file, offset, origin);
  }

  static struct file_system_type cifs_fs_type = {
 Index: linux/fs/gfs2/ops_file.c
 ===
 --- linux.orig/fs/gfs2/ops_file.c
 +++ linux/fs/gfs2/ops_file.c
 @@ -107,11 +107,11 @@ static loff_t gfs2_llseek(struct file *f
 error = gfs2_glock_nq_init(ip-i_gl, LM_ST_SHARED, 
 LM_FLAG_ANY,
i_gh);
 if (!error) {
 -   error = remote_llseek(file, offset, origin);
 +   error = remote_llseek_unlocked(file, offset, origin);
 gfs2_glock_dq_uninit(i_gh);
 }
 } else
 -   error = remote_llseek(file, offset, origin);
 +   error = remote_llseek_unlocked(file, offset, origin);

 return error;
  }
 Index: linux/fs/ncpfs/file.c
 ===
 --- linux.orig/fs/ncpfs/file.c
 +++ linux/fs/ncpfs/file.c
 @@ -18,6 +18,7 @@
  #include linux/slab.h
  #include linux/vmalloc.h
  #include linux/sched.h
 +#include linux/smp_lock.h

  #include linux/ncp_fs.h
  #include ncplib_kernel.h
 @@ -281,9 +282,18 @@ static int ncp_release(struct inode *ino
 return 0;
  }

 +static loff_t ncp_remote_llseek(struct file *file, loff_t offset, int origin)
 +{
 +   loff_t ret;
 +   lock_kernel();
 +   ret = remote_llseek_unlocked(file, offset, origin);
 +   unlock_kernel();
 +   return ret;
 +}
 +
  const struct file_operations ncp_file_operations =
  {
 -   .llseek = remote_llseek,
 +   .llseek = ncp_remote_llseek,
 .read   = ncp_file_read,
 .write  = ncp_file_write,
 .ioctl  = ncp_ioctl,
 Index: linux/fs/read_write.c
 ===
 --- linux.orig/fs/read_write.c
 +++ linux/fs/read_write.c
 @@ -58,11 +58,10 @@ loff_t generic_file_llseek(struct file *

  EXPORT_SYMBOL(generic_file_llseek);

 -loff_t remote_llseek(struct file *file, loff_t offset, int origin)
 +loff_t remote_llseek_unlocked(struct file *file, loff_t offset, int origin)
  {
 long long retval;

 -   lock_kernel();
 switch (origin) {
 case SEEK_END:
 offset += i_size_read(file-f_path.dentry-d_inode);
 @@ -73,15 +72,15 @@ loff_t remote_llseek(struct file *file,
 retval = -EINVAL;
 if (offset=0  
 offset=file-f_path.dentry-d_inode-i_sb-s_maxbytes) {
 if (offset != file-f_pos) {
 +   /* AK: do we need a lock for those? */
 file-f_pos = offset;
 file-f_version = 0;
 }
 retval = offset;
 }
 -   unlock_kernel();
 return retval;
  }
 -EXPORT_SYMBOL(remote_llseek);
 +EXPORT_SYMBOL(remote_llseek_unlocked);

  loff_t no_llseek(struct file *file, loff_t offset, int origin)
  {
 Index: linux/fs/smbfs/file.c
 ===
 --- linux.orig/fs/smbfs/file.c
 +++ linux/fs/smbfs/file.c
 @@ -422,9 +422,18 @@ smb_file_permission(struct inode *inode,
 return error;
  }

 +static loff_t 

Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Trond Myklebust

On Sun, 2008-01-27 at 10:57 -0600, Steve French wrote:
 Don't you need to a spinlock/spinunlock(i_lock) or something similar
 (there isn't a spinlock in the file struct unfortunately) around the
 reads and writes from f_pos in fs/read_write.c in remote_llseek with
 your patch since the reads/writes from that field are not necessarily
 atomic and threads could be racing in seek on the same file struct?

Where does is state in POSIX or SUS that we need to cater to that kind
of application?
In any case, the current behaviour of f_pos if two threads are sharing
the file struct is undefined no matter whether you spinlock or not,
since there is no special locking around sys_read() or sys_write().

Trond
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Steve French
If two seeks overlap, can't you end up with an f_pos value that is
different than what either thread seeked to? or if you have a seek and
a read overlap can't you end up with the read occurring in the midst
of an update of f_pos (which takes more than one instruction on
various architectures), e.g. reading an f_pos, which has only the
lower half of a 64 bit field updated?   I agree that you shouldn't
have seeks racing in parallel but I think it is preferable to get
either the updated f_pos or the earlier f_pos not something 1/2
updated.

On Jan 27, 2008 11:56 AM, Trond Myklebust [EMAIL PROTECTED] wrote:

 On Sun, 2008-01-27 at 10:57 -0600, Steve French wrote:
  Don't you need to a spinlock/spinunlock(i_lock) or something similar
  (there isn't a spinlock in the file struct unfortunately) around the
  reads and writes from f_pos in fs/read_write.c in remote_llseek with
  your patch since the reads/writes from that field are not necessarily
  atomic and threads could be racing in seek on the same file struct?

 Where does is state in POSIX or SUS that we need to cater to that kind
 of application?
 In any case, the current behaviour of f_pos if two threads are sharing
 the file struct is undefined no matter whether you spinlock or not,
 since there is no special locking around sys_read() or sys_write().

 Trond




-- 
Thanks,

Steve
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Trond Myklebust

On Sun, 2008-01-27 at 16:18 -0600, Steve French wrote:
 If two seeks overlap, can't you end up with an f_pos value that is
 different than what either thread seeked to? or if you have a seek and
 a read overlap can't you end up with the read occurring in the midst
 of an update of f_pos (which takes more than one instruction on
 various architectures), e.g. reading an f_pos, which has only the
 lower half of a 64 bit field updated?   I agree that you shouldn't
 have seeks racing in parallel but I think it is preferable to get
 either the updated f_pos or the earlier f_pos not something 1/2
 updated.

Why? The threads are doing something inherently liable to corrupt data
anyway. If they can race over the seek, why wouldn't they race over the
read or write too?
The race in lseek() should probably be the least of your worries in this
case.

Trond
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Andi Kleen
On Sunday 27 January 2008 17:57:14 Steve French wrote:
 Don't you need to a spinlock/spinunlock(i_lock) or something similar
 (there isn't a spinlock in the file struct unfortunately) around the
 reads and writes from f_pos in fs/read_write.c in remote_llseek with
 your patch since the reads/writes from that field are not necessarily
 atomic and threads could be racing in seek on the same file struct?

Funny that you mention it. I actually noticed this too while working on this, 
but noticed that it is wrong everywhere (as in even plain sys_write/read gets 
it wrong). So I decided to not address it because it is already
broken. 

I did actually send email to a few people about this, but no answer
yet.

I agree it's probably all broken on 32bit platforms, but I'm not
sure how to best address this. When it is comprehensively addressed 
remote_llseek 
can use that new method too.

-Andi

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek

2008-01-27 Thread Trond Myklebust

On Mon, 2008-01-28 at 03:58 +0100, Andi Kleen wrote:
 The problem is that it's not a race in who gets to do its thing first, but a 
 parallel reader can actually see a corrupted value from the two independent 
 words on 32bit (e.g. during a 4GB). And this could actually completely 
 corrupt 
 f_pos when it happens with two racing relative seeks or read/write()s
 
 I would consider that a bug.

I disagree. The corruption occurs because this isn't a situation that is
allowed by either POSIX or SUSv2/v3. Exactly what spec are you referring
to here?

Trond
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html