Re: 2.6.22.6: kernel BUG at fs/locks.c:171

2007-09-14 Thread Soeren Sonnenburg
On Thu, 2007-09-13 at 09:51 +1000, Nick Piggin wrote:
> On Thursday 13 September 2007 19:20, Soeren Sonnenburg wrote:
> > Dear all,
> >
> > I've just seen this in dmesg on an AMD K7 / kernel 2.6.22.6 machine
> > (config attached).
> >
> > Any ideas? What further information is needed?
>
> Thanks for the report. Is it reproducible? It seems like the
> locks_free_lock call that's oopsing is coming from __posix_lock_file.
> The actual function looks fine, but the lock being freed could have
> been corrupted if there was slab corruption, or a hardware corruption.
>
> You could: try running memtest86+ overnight. And try the following
> patch and turn on slab debugging then try to reproduce the problem.

OK, so far I've run memtest86+ 1.40 from FreeDOS for 8 hrs (v1.70 hung on
startup) and it found nothing.

Could this corruption be caused by a PCI card/driver? I am asking because I
am using a new DVB-T card (Asus P7131), and the oops happened after 5 or
6 days of uptime, about a day after watching a movie (very bad
reception/lots of errors).

However, this machine used to have uptimes of months before the DVB card
was put in and the kernel was upgraded (I don't know which version
that was...).

Anyway, I am not sure whether this is reproducible, but I will keep memtest
running today and then proceed as you said...

Thanks,
Soeren
-- 
Sometimes, there's a moment as you're waking, when you become aware of
the real world around you, but you're still dreaming.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change

2007-09-14 Thread Greg Banks
On Tue, Sep 04, 2007 at 10:37:04AM -0400, Jeff Layton wrote:
> If the ATTR_KILL_S*ID bits are set then any mode change is only for
> clearing the setuid/setgid bits. For NFS skip the mode change and
> let the server handle it.

You're assuming the server will remove setuid and setgid bits on WRITE?
I don't see that behaviour specified in the RFC, at least for v3.
The RFC specifies a behaviour for the mtime attribute as a side
effect of WRITE, but says nothing about mode.  This means server
implementations are free to clobber setuid or not.  A quick experiment
shows that at least the Irix server will *NOT* clobber those bits.
So with an Irix server you've now lost this Linux-specific security
feature.
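
To make it concrete, the Linux-side behaviour I mean is roughly the
following.  This is only a userspace sketch of the convention as I
understand it, not the kernel code; mode_after_write() is a name made
up for illustration.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Mode a file ends up with after a write by an unprivileged user under
 * the Linux convention: setuid is always dropped, setgid is dropped only
 * when group-execute is also set (setgid without group-execute marks
 * mandatory locking and is left alone).
 */
static mode_t mode_after_write(mode_t mode)
{
	mode_t kill = 0;

	if (mode & S_ISUID)
		kill |= S_ISUID;
	if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP))
		kill |= S_ISGID;

	return mode & ~kill;
}
```

A server that does not do the equivalent on WRITE leaves a mode such as
04755 untouched, which is the case I'm worried about.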

I'm curious about the reasons behind this change.  You mention
credential issues; how exactly is it that you have the correct creds
to perform a WRITE rpc but not a SETATTR rpc?

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


[PATCH 00/19] export operations rewrite

2007-09-14 Thread hch
This patchset is a medium-scale rewrite of the export operations
interface.  The goal is to make the interface less complex and
easier to understand from the filesystem side, as well as preparing
generic support for exporting 64-bit inode numbers.

This touches all NFS-exporting filesystems, and I've tested
all of the filesystems I have here locally (xfs, ext2, ext3, reiserfs,
jfs).

Compared to the last version I've fixed the whitespace issues that
checkpatch.pl complained about.

Note that this patch series is against mainline.  There will be some
xfs changes landing in -mm soon that revamp lots of the code touched
here.  They should hopefully include the first patch in the series so
it can simply be dropped, but the xfs conversion will need some smaller
updates.  I will send this update as soon as the xfs tree updates get
pulled into -mm.



[PATCH 02/19] exportfs: add fid type

2007-09-14 Thread hch
Add a structured fid type so that we don't have to pass an array
of u32 values around everywhere.  It's a union of possible layouts.

As a start there's only the u32 array and the traditional 32bit
inode format, but there will be more in one of my next patchsets
when I start to document the various filehandle formats we have
in low-level filesystems better.

Also add an enum that gives the various filehandle types human-
readable names.

Note:  Some people might think the struct containing an anonymous
union is ugly, but I didn't want to pass around a raw union type.
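
For illustration only, here is a userspace stand-in for the type (using
uint32_t instead of u32/__u32) showing that the named fields and the raw
array are two views of the same words.  fill_fid() is a made-up helper
and not part of the patch.

```c
#include <stdint.h>

/* Userspace stand-in for the proposed struct fid. */
struct fid {
	union {
		struct {
			uint32_t ino;
			uint32_t gen;
			uint32_t parent_ino;
			uint32_t parent_gen;
		} i32;
		uint32_t raw[6];
	};
};

/*
 * Fill a handle through the named fields and return the number of
 * 32-bit words used (hypothetical helper for this example).
 */
static int fill_fid(struct fid *fid, uint32_t ino, uint32_t gen)
{
	fid->i32.ino = ino;
	fid->i32.gen = gen;
	return 2;
}
```

After fill_fid(&f, 42, 7), f.raw[0] is 42 and f.raw[1] is 7, so code
that still wants the plain u32 array can keep using fid->raw.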


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/include/linux/exportfs.h
===
--- linux-2.6.orig/include/linux/exportfs.h	2007-09-13 15:10:59.0 +0200
+++ linux-2.6/include/linux/exportfs.h	2007-09-13 15:11:11.0 +0200
@@ -7,6 +7,44 @@ struct dentry;
 struct super_block;
 struct vfsmount;
 
+/*
+ * The fileid_type identifies how the file within the filesystem is encoded.
+ * In theory this is freely set and parsed by the filesystem, but we try to
+ * stick to conventions so we can share some generic code and don't confuse
+ * sniffers like ethereal/wireshark.
+ *
+ * The filesystem must not use the value '0' or '0xff'.
+ */
+enum fid_type {
+	/*
+	 * The root, or export point, of the filesystem.
+	 * (Never actually passed down to the filesystem.)
+	 */
+	FILEID_ROOT = 0,
+
+	/*
+	 * 32bit inode number, 32 bit generation number.
+	 */
+	FILEID_INO32_GEN = 1,
+
+	/*
+	 * 32bit inode number, 32 bit generation number,
+	 * 32 bit parent directory inode number.
+	 */
+	FILEID_INO32_GEN_PARENT = 2,
+};
+
+struct fid {
+	union {
+		struct {
+			u32 ino;
+			u32 gen;
+			u32 parent_ino;
+			u32 parent_gen;
+		} i32;
+		__u32 raw[6];
+	};
+};
 
 /**
  * struct export_operations - for nfsd to communicate with file systems
@@ -117,9 +155,9 @@ extern struct dentry *find_exported_dent
 	void *parent, int (*acceptable)(void *context, struct dentry *de),
 	void *context);
 
-extern int exportfs_encode_fh(struct dentry *dentry, __u32 *fh, int *max_len,
-	int connectable);
-extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, __u32 *fh,
+extern int exportfs_encode_fh(struct dentry *dentry, struct fid *fid,
+	int *max_len, int connectable);
+extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 	int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
 	void *context);
 
Index: linux-2.6/fs/nfsd/nfsfh.c
===
--- linux-2.6.orig/fs/nfsd/nfsfh.c  2007-09-13 15:10:59.0 +0200
+++ linux-2.6/fs/nfsd/nfsfh.c   2007-09-13 15:13:32.0 +0200
@@ -115,8 +115,7 @@ fh_verify(struct svc_rqst *rqstp, struct
 	dprintk("nfsd: fh_verify(%s)\n", SVCFH_fmt(fhp));
 
 	if (!fhp->fh_dentry) {
-		__u32 *datap=NULL;
-		__u32 tfh[3];		/* filehandle fragment for oldstyle filehandles */
+		struct fid *fid = NULL, sfid;
 		int fileid_type;
 		int data_left = fh->fh_size/4;
 
@@ -128,7 +127,6 @@ fh_verify(struct svc_rqst *rqstp, struct
 
 		if (fh->fh_version == 1) {
 			int len;
-			datap = fh->fh_auth;
 			if (--data_left<0) goto out;
 			switch (fh->fh_auth_type) {
 			case 0: break;
@@ -144,9 +142,11 @@ fh_verify(struct svc_rqst *rqstp, struct
 				fh->fh_fsid[1] = fh->fh_fsid[2];
 			}
 			if ((data_left -= len)<0) goto out;
-			exp = rqst_exp_find(rqstp, fh->fh_fsid_type, datap);
-			datap += len;
+			exp = rqst_exp_find(rqstp, fh->fh_fsid_type,
+					    fh->fh_auth);
+			fid = (struct fid *)(fh->fh_auth + len);
 		} else {
+			__u32 tfh[2];
 			dev_t xdev;
 			ino_t xino;
 			if (fh->fh_size != NFS_FHSIZE)
@@ -190,22 +190,22 @@ fh_verify(struct svc_rqst *rqstp, struct
 			error = nfserr_badhandle;
 
 		if (fh->fh_version != 1) {
-			tfh[0] = fh->ofh_ino;
-			tfh[1] = fh->ofh_generation;
-			tfh[2] = fh->ofh_dirino;
-			datap = tfh;
+			sfid.i32.ino = fh->ofh_ino;
+			sfid.i32.gen = fh->ofh_generation;
+			sfid.i32.parent_ino = fh->ofh_dirino;
+			fid = &sfid;
  

[PATCH 03/19] exportfs: add new methods

2007-09-14 Thread hch
Add the guts for the new filesystem API to exportfs.

There's now a fh_to_dentry method that returns a dentry for the 
object looked for given a filehandle fragment, and a fh_to_parent
operation that returns the dentry for the encoded parent directory
in case the file handle contains it.

There are default implementations for these methods that only take
a callback for an nfs-enhanced iget variant and implement the
rest of the semantics.
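
To give an idea of the division of labour, here is a userspace mock of
what such a default implementation could look like.  Everything in it
(the mock_ names, the inode table, the exact length and type checks) is
an assumption made for the example and not the code from this patch;
only the idea of passing a get_inode callback is taken from the
description above.

```c
#include <stddef.h>
#include <stdint.h>

struct mock_inode {
	uint64_t ino;
	uint32_t gen;
};

struct fid {
	union {
		struct {
			uint32_t ino, gen, parent_ino, parent_gen;
		} i32;
		uint32_t raw[6];
	};
};

typedef struct mock_inode *(*get_inode_fn)(uint64_t ino, uint32_t gen);

/* Toy "filesystem": two inodes, looked up by number and generation. */
static struct mock_inode inode_table[] = { { 2, 9 }, { 5, 1 } };

static struct mock_inode *mock_get_inode(uint64_t ino, uint32_t gen)
{
	size_t i;

	for (i = 0; i < sizeof(inode_table) / sizeof(inode_table[0]); i++) {
		struct mock_inode *inode = &inode_table[i];

		/* A generation of 0 means "don't check". */
		if (inode->ino == ino && (gen == 0 || inode->gen == gen))
			return inode;
	}
	return NULL;
}

/* Types 1 and 2 both carry ino/gen in the first two words. */
static struct mock_inode *mock_fh_to_inode(struct fid *fid, int fh_len,
		int fh_type, get_inode_fn get_inode)
{
	if (fh_len < 2 || (fh_type != 1 && fh_type != 2))
		return NULL;
	return get_inode(fid->i32.ino, fid->i32.gen);
}

/* Only type 2 carries parent information. */
static struct mock_inode *mock_fh_to_parent(struct fid *fid, int fh_len,
		int fh_type, get_inode_fn get_inode)
{
	if (fh_len <= 2 || fh_type != 2)
		return NULL;
	return get_inode(fid->i32.parent_ino,
			 fh_len > 3 ? fid->i32.parent_gen : 0);
}
```

The filesystem only has to supply the lookup callback; the handle
parsing stays in one place.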


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/include/linux/exportfs.h
===
--- linux-2.6.orig/include/linux/exportfs.h	2007-09-13 15:11:11.0 +0200
+++ linux-2.6/include/linux/exportfs.h	2007-09-13 15:13:57.0 +0200
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 
 struct dentry;
+struct inode;
 struct super_block;
 struct vfsmount;
 
@@ -101,6 +102,21 @@ struct fid {
  *    the filehandle fragment.  encode_fh() should return the number of bytes
  *    stored or a negative error code such as %-ENOSPC
  *
+ * fh_to_dentry:
+ *    @fh_to_dentry is given a &struct super_block (@sb) and a file handle
+ *    fragment (@fh, @fh_len). It should return a &struct dentry which refers
+ *    to the same file that the file handle fragment refers to.  If it cannot,
+ *    it should return a %NULL pointer if the file was found but no acceptable
+ *    &dentries were available, or an %ERR_PTR error code indicating why it
+ *    couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
+ *    returned including, if necessary, a new dentry created with d_alloc_root.
+ *    The caller can then find any other extant dentries by following the
+ *    d_alias links.
+ *
+ * fh_to_parent:
+ *    Same as @fh_to_dentry, except that it returns a pointer to the parent
+ *    dentry if it was encoded into the filehandle fragment by @encode_fh.
+ *
  * get_name:
  *@get_name should find a name for the given @child in the given @parent
  *directory.  The name should be stored in the @name (with the
@@ -139,6 +155,10 @@ struct export_operations {
void *context);
int (*encode_fh)(struct dentry *de, __u32 *fh, int *max_len,
int connectable);
+   struct dentry * (*fh_to_dentry)(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
+   struct dentry * (*fh_to_parent)(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
int (*get_name)(struct dentry *parent, char *name,
struct dentry *child);
struct dentry * (*get_parent)(struct dentry *child);
@@ -161,4 +181,14 @@ extern struct dentry *exportfs_decode_fh
int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
void *context);
 
+/*
+ * Generic helpers for filesystems.
+ */
+extern struct dentry *generic_fh_to_dentry(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type,
+   struct inode *(*get_inode) (struct super_block *sb, u64 ino, u32 gen));
+extern struct dentry *generic_fh_to_parent(struct super_block *sb,
+   struct fid *fid, int fh_len, int fh_type,
+   struct inode *(*get_inode) (struct super_block *sb, u64 ino, u32 gen));
+
 #endif /* LINUX_EXPORTFS_H */
Index: linux-2.6/fs/exportfs/expfs.c
===
--- linux-2.6.orig/fs/exportfs/expfs.c  2007-09-13 15:13:02.0 +0200
+++ linux-2.6/fs/exportfs/expfs.c   2007-09-13 15:14:42.0 +0200
@@ -514,17 +514,141 @@ struct dentry *exportfs_decode_fh(struct
 		int (*acceptable)(void *, struct dentry *), void *context)
 {
 	struct export_operations *nop = mnt->mnt_sb->s_export_op;
-	struct dentry *result;
+	struct dentry *result, *alias;
+	int err;
 
-	if (nop->decode_fh) {
-		result = nop->decode_fh(mnt->mnt_sb, fid->raw, fh_len,
+	/*
+	 * Old way of doing things.  Will go away soon.
+	 */
+	if (!nop->fh_to_dentry) {
+		if (nop->decode_fh) {
+			return nop->decode_fh(mnt->mnt_sb, fid->raw, fh_len,
 					fileid_type, acceptable, context);
+		} else {
+			return export_decode_fh(mnt->mnt_sb, fid->raw, fh_len,
+					fileid_type, acceptable, context);
+		}
+	}
+
+	/*
+	 * Try to get any dentry for the given file handle from the filesystem.
+	 */
+	result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
+	if (!result)
+		result = ERR_PTR(-ESTALE);
+	if (IS_ERR(result))
+		return result;
+
+	if (S_ISDIR(result->d_inode->i_mode)) {
+		/*
+		 * This request is for a directory.
+		 *
+		 * On the positive side there is only one dentry for each
+   

[PATCH 11/19] fat: new export ops

2007-09-14 Thread hch
Very little changes here, fat had a mostly no op decode_fh before
and does not store any parent information.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c   2007-03-13 19:22:40.0 +0100
+++ linux-2.6/fs/fat/inode.c	2007-03-13 19:23:13.0 +0100
@@ -651,24 +651,15 @@ static const struct super_operations fat
  * of i_logstart is used to store the directory entry offset.
  */
 
-static struct dentry *
-fat_decode_fh(struct super_block *sb, __u32 *fh, int len, int fhtype,
-	      int (*acceptable)(void *context, struct dentry *de),
-	      void *context)
-{
-	if (fhtype != 3)
-		return ERR_PTR(-ESTALE);
-	if (len < 5)
-		return ERR_PTR(-ESTALE);
-
-	return sb->s_export_op->find_exported_dentry(sb, fh, NULL, acceptable, context);
-}
-
-static struct dentry *fat_get_dentry(struct super_block *sb, void *inump)
+static struct dentry *fat_fh_to_dentry(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
 {
 	struct inode *inode = NULL;
 	struct dentry *result;
-	__u32 *fh = inump;
+	u32 *fh = fid->raw;
+
+	if (fh_len < 5 || fh_type != 3)
+		return NULL;
 
 	inode = iget(sb, fh[0]);
 	if (!inode || is_bad_inode(inode) || inode->i_generation != fh[1]) {
@@ -782,9 +773,8 @@ out:
 }
 
 static struct export_operations fat_export_ops = {
-	.decode_fh	= fat_decode_fh,
 	.encode_fh	= fat_encode_fh,
-	.get_dentry	= fat_get_dentry,
+	.fh_to_dentry	= fat_fh_to_dentry,
 	.get_parent	= fat_get_parent,
 };
 



[PATCH 12/19] isofs: new export ops

2007-09-14 Thread hch
Nice little cleanup by consolidating things a little and using   
a structure for the special file handle format.
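
As a sanity check on the layout, the sketch below (userspace, uint32_t
and uint16_t instead of u32 and u16) shows that the structure fields
land on the same words the old code picked out by hand: fh32[0],
fh16[2], fh16[3], fh32[2], fh32[3] and fh32[4].  isofs_fid_from_raw()
is a hypothetical helper for the example; it copies instead of casting
so that it stays valid C outside the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct isofs_fid {
	uint32_t block;
	uint16_t offset;
	uint16_t parent_offset;
	uint32_t generation;
	uint32_t parent_block;
	uint32_t parent_generation;
};

/* offset and parent_offset are the old fh16[2] and fh16[3]. */
_Static_assert(offsetof(struct isofs_fid, offset) == 2 * sizeof(uint16_t),
	       "offset must overlay fh16[2]");
_Static_assert(offsetof(struct isofs_fid, parent_offset) == 3 * sizeof(uint16_t),
	       "parent_offset must overlay fh16[3]");
/* generation, parent_block, parent_generation are fh32[2..4]. */
_Static_assert(offsetof(struct isofs_fid, generation) == 2 * sizeof(uint32_t),
	       "generation must overlay fh32[2]");
_Static_assert(offsetof(struct isofs_fid, parent_generation) == 4 * sizeof(uint32_t),
	       "parent_generation must overlay fh32[4]");

/* Copy five raw handle words into the structured view. */
static struct isofs_fid isofs_fid_from_raw(const uint32_t *raw)
{
	struct isofs_fid ifid;

	memcpy(&ifid, raw, sizeof(ifid));
	return ifid;
}
```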


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/isofs/export.c
===
--- linux-2.6.orig/fs/isofs/export.c	2007-02-11 10:31:19.0 +0100
+++ linux-2.6/fs/isofs/export.c	2007-02-11 10:45:25.0 +0100
@@ -42,16 +42,6 @@
return result;
 }
 
-static struct dentry *
-isofs_export_get_dentry(struct super_block *sb, void *vobjp)
-{
-   __u32 *objp = vobjp;
-   unsigned long block = objp[0];
-   unsigned long offset = objp[1];
-   __u32 generation = objp[2];
-   return isofs_export_iget(sb, block, offset, generation);
-}
-
 /* This function is surprisingly simple.  The trick is understanding
  * that child is always a directory. So, to find its parent, you
  * simply need to find its .. entry, normalize its block and offset,
@@ -182,43 +172,44 @@
return type;
 }
 
+struct isofs_fid {
+	u32 block;
+	u16 offset;
+	u16 parent_offset;
+	u32 generation;
+	u32 parent_block;
+	u32 parent_generation;
+};
 
-static struct dentry *
-isofs_export_decode_fh(struct super_block *sb,
-		       __u32 *fh32,
-		       int fh_len,
-		       int fileid_type,
-		       int (*acceptable)(void *context, struct dentry *de),
-		       void *context)
+static struct dentry *isofs_fh_to_dentry(struct super_block *sb,
+	struct fid *fid, int fh_len, int fh_type)
 {
-	__u16 *fh16 = (__u16*)fh32;
-	__u32 child[3];   /* The child is what triggered all this. */
-	__u32 parent[3];  /* The parent is just along for the ride. */
+	struct isofs_fid *ifid = (struct isofs_fid *)fid;
 
-	if (fh_len < 3 || fileid_type > 2)
+	if (fh_len < 3 || fh_type > 2)
 		return NULL;
 
-	child[0] = fh32[0];
-	child[1] = fh16[2];  /* fh16 [sic] */
-	child[2] = fh32[2];
-
-	parent[0] = 0;
-	parent[1] = 0;
-	parent[2] = 0;
-	if (fileid_type == 2) {
-		if (fh_len > 2) parent[0] = fh32[3];
-		parent[1] = fh16[3];  /* fh16 [sic] */
-		if (fh_len > 4) parent[2] = fh32[4];
-	}
-
-	return sb->s_export_op->find_exported_dentry(sb, child, parent,
-						     acceptable, context);
+	return isofs_export_iget(sb, ifid->block, ifid->offset,
+			ifid->generation);
 }
 
+static struct dentry *isofs_fh_to_parent(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
+{
+	struct isofs_fid *ifid = (struct isofs_fid *)fid;
+
+	if (fh_type != 2)
+		return NULL;
+
+	return isofs_export_iget(sb,
+			fh_len > 2 ? ifid->parent_block : 0,
+			ifid->parent_offset,
+			fh_len > 4 ? ifid->parent_generation : 0);
+}
 
 struct export_operations isofs_export_ops = {
-   .decode_fh  = isofs_export_decode_fh,
.encode_fh  = isofs_export_encode_fh,
-   .get_dentry = isofs_export_get_dentry,
+   .fh_to_dentry   = isofs_fh_to_dentry,
+   .fh_to_parent   = isofs_fh_to_parent,
.get_parent = isofs_export_get_parent,
 };



[PATCH 13/19] shmem: new export ops

2007-09-14 Thread hch
I'm not sure what people were thinking when adding support to
export tmpfs, but here's the conversion anyway:
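
For readers who don't know the handle layout: the 64-bit inode number
is split over two 32-bit words of the handle, high half in word 2 and
low half in word 1, with the generation in word 0.  A userspace sketch
of the reassembly follows; shmem_fh_inum() is a name made up for the
example, the word positions are the ones used in the diff.

```c
#include <stdint.h>

/* Rebuild the 64-bit inode number from handle words 2 (high) and 1 (low). */
static uint64_t shmem_fh_inum(const uint32_t *raw)
{
	uint64_t inum = raw[2];

	inum = (inum << 32) | raw[1];
	return inum;
}
```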


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/mm/shmem.c
===
--- linux-2.6.orig/mm/shmem.c   2007-02-11 10:46:30.0 +0100
+++ linux-2.6/mm/shmem.c	2007-02-11 10:53:12.0 +0100
@@ -1977,33 +1977,25 @@
 	return ino->i_ino == inum && fh[0] == ino->i_generation;
 }
 
-static struct dentry *shmem_get_dentry(struct super_block *sb, void *vfh)
+static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
 {
-	struct dentry *de = NULL;
 	struct inode *inode;
-	__u32 *fh = vfh;
-	__u64 inum = fh[2];
-	inum = (inum << 32) | fh[1];
+	struct dentry *dentry = NULL;
+	u64 inum = fid->raw[2];
+	inum = (inum << 32) | fid->raw[1];
 
-	inode = ilookup5(sb, (unsigned long)(inum+fh[0]), shmem_match, vfh);
+	if (fh_len < 3)
+		return NULL;
+
+	inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]),
+			shmem_match, fid->raw);
 	if (inode) {
-		de = d_find_alias(inode);
+		dentry = d_find_alias(inode);
 		iput(inode);
 	}
 
-	return de? de: ERR_PTR(-ESTALE);
-}
-
-static struct dentry *shmem_decode_fh(struct super_block *sb, __u32 *fh,
-		int len, int type,
-		int (*acceptable)(void *context, struct dentry *de),
-		void *context)
-{
-	if (len < 3)
-		return ERR_PTR(-ESTALE);
-
-	return sb->s_export_op->find_exported_dentry(sb, fh, NULL, acceptable,
-							context);
+	return dentry;
 }
 
 static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
@@ -2038,9 +2030,8 @@
 
 static struct export_operations shmem_export_ops = {
.get_parent = shmem_get_parent,
-   .get_dentry = shmem_get_dentry,
.encode_fh  = shmem_encode_fh,
-   .decode_fh  = shmem_decode_fh,
+   .fh_to_dentry   = shmem_fh_to_dentry,
 };
 
 static int shmem_parse_options(char *options, int *mode, uid_t *uid,



[PATCH 14/19] reiserfs: new export ops

2007-09-14 Thread hch
Another nice little cleanup by using the new methods.
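
The handle type decides which words hold what, which is easier to see
in isolation.  Below is a userspace sketch of the mapping the two new
methods use, following the indices in the diff; struct rfs_ids and the
two decode helpers are made-up names for the example.

```c
#include <stdint.h>

struct rfs_ids {
	uint32_t objectid;
	uint32_t dir_id;
	uint32_t generation;	/* 0 means "not encoded" */
};

/* Object ids are always words 0 and 1; word 2 is a generation for types 3, 5, 6. */
static void rfs_decode_object(const uint32_t *raw, int fh_type,
			      struct rfs_ids *out)
{
	out->objectid = raw[0];
	out->dir_id = raw[1];
	out->generation = (fh_type == 3 || fh_type >= 5) ? raw[2] : 0;
}

/*
 * Parent ids start at word 2 for type 4 and at word 3 for types 5 and 6;
 * only type 6 carries a parent generation.  Returns 0 if the handle has
 * no parent information at all.
 */
static int rfs_decode_parent(const uint32_t *raw, int fh_type,
			     struct rfs_ids *out)
{
	if (fh_type < 4)
		return 0;

	out->objectid = (fh_type >= 5) ? raw[3] : raw[2];
	out->dir_id = (fh_type >= 5) ? raw[4] : raw[3];
	out->generation = (fh_type == 6) ? raw[5] : 0;
	return 1;
}
```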


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/reiserfs/inode.c
===
--- linux-2.6.orig/fs/reiserfs/inode.c  2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/reiserfs/inode.c   2007-09-13 15:21:12.0 +0200
@@ -1514,19 +1514,20 @@ struct inode *reiserfs_iget(struct super
 	return inode;
 }
 
-struct dentry *reiserfs_get_dentry(struct super_block *sb, void *vobjp)
+static struct dentry *reiserfs_get_dentry(struct super_block *sb,
+	u32 objectid, u32 dir_id, u32 generation)
+
 {
-	__u32 *data = vobjp;
 	struct cpu_key key;
 	struct dentry *result;
 	struct inode *inode;
 
-	key.on_disk_key.k_objectid = data[0];
-	key.on_disk_key.k_dir_id = data[1];
+	key.on_disk_key.k_objectid = objectid;
+	key.on_disk_key.k_dir_id = dir_id;
 	reiserfs_write_lock(sb);
 	inode = reiserfs_iget(sb, &key);
-	if (inode && !IS_ERR(inode) && data[2] != 0 &&
-	    data[2] != inode->i_generation) {
+	if (inode && !IS_ERR(inode) && generation != 0 &&
+	    generation != inode->i_generation) {
 		iput(inode);
 		inode = NULL;
 	}
@@ -1543,14 +1544,9 @@ struct dentry *reiserfs_get_dentry(struc
 	return result;
 }
 
-struct dentry *reiserfs_decode_fh(struct super_block *sb, __u32 * data,
-				  int len, int fhtype,
-				  int (*acceptable) (void *contect,
-						     struct dentry * de),
-				  void *context)
+struct dentry *reiserfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
 {
-	__u32 obj[3], parent[3];
-
 	/* fhtype happens to reflect the number of u32s encoded.
 	 * due to a bug in earlier code, fhtype might indicate there
 	 * are more u32s then actually fitted.
@@ -1563,32 +1559,28 @@ struct dentry *reiserfs_decode_fh(struct
 	 *   6 - as above plus generation of directory
 	 * 6 does not fit in NFSv2 handles
 	 */
-	if (fhtype > len) {
-		if (fhtype != 6 || len != 5)
+	if (fh_type > fh_len) {
+		if (fh_type != 6 || fh_len != 5)
 			reiserfs_warning(sb,
-					 "nfsd/reiserfs, fhtype=%d, len=%d - odd",
-					 fhtype, len);
-		fhtype = 5;
+				"nfsd/reiserfs, fhtype=%d, len=%d - odd",
+				fh_type, fh_len);
+		fh_type = 5;
 	}
 
-	obj[0] = data[0];
-	obj[1] = data[1];
-	if (fhtype == 3 || fhtype >= 5)
-		obj[2] = data[2];
-	else
-		obj[2] = 0;	/* generation number */
+	return reiserfs_get_dentry(sb, fid->raw[0], fid->raw[1],
+		(fh_type == 3 || fh_type >= 5) ? fid->raw[2] : 0);
+}
 
-	if (fhtype >= 4) {
-		parent[0] = data[fhtype >= 5 ? 3 : 2];
-		parent[1] = data[fhtype >= 5 ? 4 : 3];
-		if (fhtype == 6)
-			parent[2] = data[5];
-		else
-			parent[2] = 0;
-	}
-	return sb->s_export_op->find_exported_dentry(sb, obj,
-						     fhtype < 4 ? NULL : parent,
-						     acceptable, context);
+struct dentry *reiserfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	if (fh_type < 4)
+		return NULL;
+
+	return reiserfs_get_dentry(sb,
+		(fh_type >= 5) ? fid->raw[3] : fid->raw[2],
+		(fh_type >= 5) ? fid->raw[4] : fid->raw[3],
+		(fh_type == 6) ? fid->raw[5] : 0);
 }
 
 int reiserfs_encode_fh(struct dentry *dentry, __u32 * data, int *lenp,
Index: linux-2.6/fs/reiserfs/super.c
===
--- linux-2.6.orig/fs/reiserfs/super.c  2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/reiserfs/super.c   2007-09-13 15:21:12.0 +0200
@@ -651,9 +651,9 @@ static struct quotactl_ops reiserfs_qctl
 
 static struct export_operations reiserfs_export_ops = {
.encode_fh = reiserfs_encode_fh,
-   .decode_fh = reiserfs_decode_fh,
+   .fh_to_dentry = reiserfs_fh_to_dentry,
+   .fh_to_parent = reiserfs_fh_to_parent,
.get_parent = reiserfs_get_parent,
-   .get_dentry = reiserfs_get_dentry,
 };
 
 /* this struct is used in reiserfs_getopt () for containing the value for those
Index: linux-2.6/include/linux/reiserfs_fs.h
===
--- linux-2.6.orig/include/linux/reiserfs_fs.h	2007-09-13 15:10:45.0 +0200
+++ linux-2.6/include/linux/reiserfs_fs.h   

[PATCH 15/19] gfs2: new export ops

2007-09-14 Thread hch
Convert gfs2 to the new ops.  Uses a similar structure to the generic
helpers, but gfs2 has its own file handle formats.
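
The gfs2 handle stores each 64-bit value as two big-endian 32-bit
words, high word first.  A userspace sketch of that decoding, with
ntohl() standing in for the kernel's be32_to_cpu() and a made-up helper
name:

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Combine two big-endian 32-bit handle words into one host-order u64. */
static uint64_t be32_pair_to_u64(const uint32_t *fh)
{
	return ((uint64_t)ntohl(fh[0]) << 32) | ntohl(fh[1]);
}
```

In the diff, words 0-1 and 2-3 are decoded this way for the object
(no_formal_ino and no_addr) and words 4-5 and 6-7 for the parent.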


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]


Index: linux-2.6/fs/gfs2/ops_export.c
===
--- linux-2.6.orig/fs/gfs2/ops_export.c 2007-07-19 15:56:46.0 +0200
+++ linux-2.6/fs/gfs2/ops_export.c  2007-07-20 19:58:06.0 +0200
@@ -31,40 +31,6 @@
 #define GFS2_LARGE_FH_SIZE 8
 #define GFS2_OLD_FH_SIZE 10
 
-static struct dentry *gfs2_decode_fh(struct super_block *sb,
-				     __u32 *p,
-				     int fh_len,
-				     int fh_type,
-				     int (*acceptable)(void *context,
-						       struct dentry *dentry),
-				     void *context)
-{
-	__be32 *fh = (__force __be32 *)p;
-	struct gfs2_inum_host inum, parent;
-
-	memset(&parent, 0, sizeof(struct gfs2_inum));
-
-	switch (fh_len) {
-	case GFS2_LARGE_FH_SIZE:
-	case GFS2_OLD_FH_SIZE:
-		parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
-		parent.no_formal_ino |= be32_to_cpu(fh[5]);
-		parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
-		parent.no_addr |= be32_to_cpu(fh[7]);
-	case GFS2_SMALL_FH_SIZE:
-		inum.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
-		inum.no_formal_ino |= be32_to_cpu(fh[1]);
-		inum.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
-		inum.no_addr |= be32_to_cpu(fh[3]);
-		break;
-	default:
-		return NULL;
-	}
-
-	return gfs2_export_ops.find_exported_dentry(sb, &inum, &parent,
-						    acceptable, context);
-}
-
 static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
 			  int connectable)
 {
@@ -189,10 +155,10 @@ static struct dentry *gfs2_get_parent(st
 	return dentry;
 }
 
-static struct dentry *gfs2_get_dentry(struct super_block *sb, void *inum_obj)
+static struct dentry *gfs2_get_dentry(struct super_block *sb,
+		struct gfs2_inum_host *inum)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
-	struct gfs2_inum_host *inum = inum_obj;
 	struct gfs2_holder i_gh, ri_gh, rgd_gh;
 	struct gfs2_rgrpd *rgd;
 	struct inode *inode;
@@ -289,11 +255,50 @@ fail:
 	return ERR_PTR(error);
 }
 
+static struct dentry *gfs2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	struct gfs2_inum_host this;
+	__be32 *fh = (__force __be32 *)fid->raw;
+
+	switch (fh_type) {
+	case GFS2_SMALL_FH_SIZE:
+	case GFS2_LARGE_FH_SIZE:
+	case GFS2_OLD_FH_SIZE:
+		this.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
+		this.no_formal_ino |= be32_to_cpu(fh[1]);
+		this.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
+		this.no_addr |= be32_to_cpu(fh[3]);
+		return gfs2_get_dentry(sb, &this);
+	default:
+		return NULL;
+	}
+}
+
+static struct dentry *gfs2_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	struct gfs2_inum_host parent;
+	__be32 *fh = (__force __be32 *)fid->raw;
+
+	switch (fh_type) {
+	case GFS2_LARGE_FH_SIZE:
+	case GFS2_OLD_FH_SIZE:
+		parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
+		parent.no_formal_ino |= be32_to_cpu(fh[5]);
+		parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
+		parent.no_addr |= be32_to_cpu(fh[7]);
+		return gfs2_get_dentry(sb, &parent);
+	default:
+		return NULL;
+	}
+}
+
 struct export_operations gfs2_export_ops = {
-	.decode_fh = gfs2_decode_fh,
 	.encode_fh = gfs2_encode_fh,
+	.fh_to_dentry = gfs2_fh_to_dentry,
+	.fh_to_parent = gfs2_fh_to_parent,
 	.get_name = gfs2_get_name,
 	.get_parent = gfs2_get_parent,
-	.get_dentry = gfs2_get_dentry,
 };
 



[PATCH 16/19] ocfs2: new export ops

2007-09-14 Thread hch
OCFS2 has its own 64bit-friendly filehandle format so we can't use
the generic helpers here.  I'll add a struct for the types later.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]


Index: linux-2.6/fs/ocfs2/export.c
===
--- linux-2.6.orig/fs/ocfs2/export.c	2007-05-06 13:51:17.0 +0200
+++ linux-2.6/fs/ocfs2/export.c	2007-06-12 15:54:44.0 +0200
@@ -45,9 +45,9 @@ struct ocfs2_inode_handle
 	u32 ih_generation;
 };
 
-static struct dentry *ocfs2_get_dentry(struct super_block *sb, void *vobjp)
+static struct dentry *ocfs2_get_dentry(struct super_block *sb,
+		struct ocfs2_inode_handle *handle)
 {
-	struct ocfs2_inode_handle *handle = vobjp;
 	struct inode *inode;
 	struct dentry *result;
 
@@ -200,54 +200,37 @@ bail:
 	return type;
 }
 
-static struct dentry *ocfs2_decode_fh(struct super_block *sb, u32 *fh_in,
-				      int fh_len, int fileid_type,
-				      int (*acceptable)(void *context,
-							struct dentry *de),
-				      void *context)
+static struct dentry *ocfs2_fh_to_dentry(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
 {
-	struct ocfs2_inode_handle handle, parent;
-	struct dentry *ret = NULL;
-	__le32 *fh = (__force __le32 *) fh_in;
-
-	mlog_entry("(0x%p, 0x%p, %d, %d, 0x%p, 0x%p)\n",
-		   sb, fh, fh_len, fileid_type, acceptable, context);
+	struct ocfs2_inode_handle handle;
 
-	if (fh_len < 3 || fileid_type > 2)
-		goto bail;
+	if (fh_len < 3 || fh_type > 2)
+		return NULL;
 
-	if (fileid_type == 2) {
-		if (fh_len < 6)
-			goto bail;
-
-		parent.ih_blkno = (u64)le32_to_cpu(fh[3]) << 32;
-		parent.ih_blkno |= (u64)le32_to_cpu(fh[4]);
-		parent.ih_generation = le32_to_cpu(fh[5]);
-
-		mlog(0, "Decoding parent: blkno: %llu, generation: %u\n",
-		     (unsigned long long)parent.ih_blkno,
-		     parent.ih_generation);
-	}
-
-	handle.ih_blkno = (u64)le32_to_cpu(fh[0]) << 32;
-	handle.ih_blkno |= (u64)le32_to_cpu(fh[1]);
-	handle.ih_generation = le32_to_cpu(fh[2]);
+	handle.ih_blkno = (u64)le32_to_cpu(fid->raw[0]) << 32;
+	handle.ih_blkno |= (u64)le32_to_cpu(fid->raw[1]);
+	handle.ih_generation = le32_to_cpu(fid->raw[2]);
+	return ocfs2_get_dentry(sb, &handle);
+}
 
-	mlog(0, "Encoding fh: blkno: %llu, generation: %u\n",
-	     (unsigned long long)handle.ih_blkno, handle.ih_generation);
+static struct dentry *ocfs2_fh_to_parent(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
+{
+	struct ocfs2_inode_handle parent;
 
-	ret = ocfs2_export_ops.find_exported_dentry(sb, &handle, &parent,
-						    acceptable, context);
+	if (fh_type != 2 || fh_len < 6)
+		return NULL;
 
-bail:
-	mlog_exit_ptr(ret);
-	return ret;
+	parent.ih_blkno = (u64)le32_to_cpu(fid->raw[3]) << 32;
+	parent.ih_blkno |= (u64)le32_to_cpu(fid->raw[4]);
+	parent.ih_generation = le32_to_cpu(fid->raw[5]);
+	return ocfs2_get_dentry(sb, &parent);
 }
 
 struct export_operations ocfs2_export_ops = {
-	.decode_fh	= ocfs2_decode_fh,
 	.encode_fh	= ocfs2_encode_fh,
-
+	.fh_to_dentry	= ocfs2_fh_to_dentry,
+	.fh_to_parent	= ocfs2_fh_to_parent,
 	.get_parent	= ocfs2_get_parent,
-	.get_dentry	= ocfs2_get_dentry,
 };



[PATCH 17/19] exportfs: remove old methods

2007-09-14 Thread hch
Now that all filesystems are converted remove support for the
old methods.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/exportfs/expfs.c
===
--- linux-2.6.orig/fs/exportfs/expfs.c  2007-08-29 13:52:01.0 +0200
+++ linux-2.6/fs/exportfs/expfs.c   2007-08-29 14:02:41.0 +0200
@@ -13,19 +13,6 @@ static int get_name(struct dentry *dentr
struct dentry *child);
 
 
-static struct dentry *exportfs_get_dentry(struct super_block *sb, void *obj)
-{
-	struct dentry *result = ERR_PTR(-ESTALE);
-
-	if (sb->s_export_op->get_dentry) {
-		result = sb->s_export_op->get_dentry(sb, obj);
-		if (!result)
-			result = ERR_PTR(-ESTALE);
-	}
-
-	return result;
-}
-
 static int exportfs_get_name(struct dentry *dir, char *name,
struct dentry *child)
 {
@@ -214,125 +201,6 @@ reconnect_path(struct super_block *sb, s
return 0;
 }
 
-/**
- * find_exported_dentry - helper routine to implement export_operations->decode_fh
- * @sb:		The &super_block identifying the filesystem
- * @obj:	An opaque identifier of the object to be found - passed to
- *		get_inode
- * @parent:	An optional opqaue identifier of the parent of the object.
- * @acceptable:	A function used to test possible &dentries to see if they are
- *		acceptable
- * @context:	A parameter to @acceptable so that it knows on what basis to
- *		judge.
- *
- * find_exported_dentry is the central helper routine to enable file systems
- * to provide the decode_fh() export_operation.  It's main task is to take
- * an &inode, find or create an appropriate &dentry structure, and possibly
- * splice this into the dcache in the correct place.
- *
- * The decode_fh() operation provided by the filesystem should call
- * find_exported_dentry() with the same parameters that it received except
- * that instead of the file handle fragment, pointers to opaque identifiers
- * for the object and optionally its parent are passed.  The default decode_fh
- * routine passes one pointer to the start of the filehandle fragment, and
- * one 8 bytes into the fragment.  It is expected that most filesystems will
- * take this approach, though the offset to the parent identifier may well be
- * different.
- *
- * find_exported_dentry() will call get_dentry to get an dentry pointer from
- * the file system.  If any &dentry in the d_alias list is acceptable, it will
- * be returned.  Otherwise find_exported_dentry() will attempt to splice a new
- * &dentry into the dcache using get_name() and get_parent() to find the
- * appropriate place.
- */
-
-struct dentry *
-find_exported_dentry(struct super_block *sb, void *obj, void *parent,
-int (*acceptable)(void *context, struct dentry *de),
-void *context)
-{
-   struct dentry *result, *alias;
-   int err = -ESTALE;
-
-   /*
-* Attempt to find the inode.
-*/
-   result = exportfs_get_dentry(sb, obj);
-   if (IS_ERR(result))
-   return result;
-
-   if (S_ISDIR(result->d_inode->i_mode)) {
-   if (!(result->d_flags & DCACHE_DISCONNECTED)) {
-   if (acceptable(context, result))
-   return result;
-   err = -EACCES;
-   goto err_result;
-   }
-
-   err = reconnect_path(sb, result);
-   if (err)
-   goto err_result;
-   } else {
-   struct dentry *target_dir, *nresult;
-   char nbuf[NAME_MAX+1];
-
-   alias = find_acceptable_alias(result, acceptable, context);
-   if (alias)
-   return alias;
-
-   if (parent == NULL)
-   goto err_result;
-
-   target_dir = exportfs_get_dentry(sb,parent);
-   if (IS_ERR(target_dir)) {
-   err = PTR_ERR(target_dir);
-   goto err_result;
-   }
-
-   err = reconnect_path(sb, target_dir);
-   if (err) {
-   dput(target_dir);
-   goto err_result;
-   }
-
-   /*
-* As we weren't after a directory, have one more step to go.
-*/
-   err = exportfs_get_name(target_dir, nbuf, result);
-   if (!err) {
-   mutex_lock(&target_dir->d_inode->i_mutex);
-   nresult = lookup_one_len(nbuf, target_dir,
-strlen(nbuf));
-   mutex_unlock(&target_dir->d_inode->i_mutex);
-   if (!IS_ERR(nresult)) {
-   if (nresult->d_inode) {
-   dput(result);
-   

[PATCH 18/19] exportfs: make struct export_operations const

2007-09-14 Thread hch
Now that nfsd has stopped writing to the find_exported_dentry member
we can mark the export_operations const.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/efs/super.c
===
--- linux-2.6.orig/fs/efs/super.c   2007-09-13 15:06:51.0 +0200
+++ linux-2.6/fs/efs/super.c2007-09-13 15:09:05.0 +0200
@@ -113,7 +113,7 @@ static const struct super_operations efs
.remount_fs = efs_remount,
 };
 
-static struct export_operations efs_export_ops = {
+static const struct export_operations efs_export_ops = {
.fh_to_dentry   = efs_fh_to_dentry,
.fh_to_parent   = efs_fh_to_parent,
.get_parent = efs_get_parent,
Index: linux-2.6/fs/ext2/super.c
===
--- linux-2.6.orig/fs/ext2/super.c  2007-09-13 15:05:26.0 +0200
+++ linux-2.6/fs/ext2/super.c   2007-09-13 15:09:05.0 +0200
@@ -292,7 +292,7 @@ static struct dentry *ext2_fh_to_parent(
  * systems, but can be improved upon.
  * Currently only get_parent is required.
  */
-static struct export_operations ext2_export_ops = {
+static const struct export_operations ext2_export_ops = {
.fh_to_dentry = ext2_fh_to_dentry,
.fh_to_parent = ext2_fh_to_parent,
.get_parent = ext2_get_parent,
Index: linux-2.6/fs/ext3/super.c
===
--- linux-2.6.orig/fs/ext3/super.c  2007-09-13 15:06:39.0 +0200
+++ linux-2.6/fs/ext3/super.c   2007-09-13 15:09:05.0 +0200
@@ -670,7 +670,7 @@ static const struct super_operations ext
 #endif
 };
 
-static struct export_operations ext3_export_ops = {
+static const struct export_operations ext3_export_ops = {
.fh_to_dentry = ext3_fh_to_dentry,
.fh_to_parent = ext3_fh_to_parent,
.get_parent = ext3_get_parent,
Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2007-09-13 15:06:44.0 +0200
+++ linux-2.6/fs/ext4/super.c   2007-09-13 15:09:05.0 +0200
@@ -721,7 +721,7 @@ static const struct super_operations ext
 #endif
 };
 
-static struct export_operations ext4_export_ops = {
+static const struct export_operations ext4_export_ops = {
.fh_to_dentry = ext4_fh_to_dentry,
.fh_to_parent = ext4_fh_to_parent,
.get_parent = ext4_get_parent,
Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c   2007-09-13 15:08:12.0 +0200
+++ linux-2.6/fs/fat/inode.c2007-09-13 15:09:05.0 +0200
@@ -769,7 +769,7 @@ out:
return parent;
 }
 
-static struct export_operations fat_export_ops = {
+static const struct export_operations fat_export_ops = {
.encode_fh  = fat_encode_fh,
.fh_to_dentry   = fat_fh_to_dentry,
.get_parent = fat_get_parent,
Index: linux-2.6/fs/gfs2/ops_export.c
===
--- linux-2.6.orig/fs/gfs2/ops_export.c 2007-09-13 15:08:53.0 +0200
+++ linux-2.6/fs/gfs2/ops_export.c  2007-09-13 15:09:05.0 +0200
@@ -294,7 +294,7 @@ static struct dentry *gfs2_fh_to_parent(
}
 }
 
-struct export_operations gfs2_export_ops = {
+const struct export_operations gfs2_export_ops = {
.encode_fh = gfs2_encode_fh,
.fh_to_dentry = gfs2_fh_to_dentry,
.fh_to_parent = gfs2_fh_to_parent,
Index: linux-2.6/fs/isofs/export.c
===
--- linux-2.6.orig/fs/isofs/export.c2007-09-13 15:08:18.0 +0200
+++ linux-2.6/fs/isofs/export.c 2007-09-13 15:09:05.0 +0200
@@ -207,7 +207,7 @@ static struct dentry *isofs_fh_to_parent
fh_len > 4 ? ifid->parent_generation : 0);
 }
 
-struct export_operations isofs_export_ops = {
+const struct export_operations isofs_export_ops = {
.encode_fh  = isofs_export_encode_fh,
.fh_to_dentry   = isofs_fh_to_dentry,
.fh_to_parent   = isofs_fh_to_parent,
Index: linux-2.6/fs/isofs/isofs.h
===
--- linux-2.6.orig/fs/isofs/isofs.h 2007-09-11 16:23:34.0 +0200
+++ linux-2.6/fs/isofs/isofs.h  2007-09-13 15:09:05.0 +0200
@@ -178,4 +178,4 @@ isofs_normalize_block_and_offset(struct 
 extern const struct inode_operations isofs_dir_inode_operations;
 extern const struct file_operations isofs_dir_operations;
 extern const struct address_space_operations isofs_symlink_aops;
-extern struct export_operations isofs_export_ops;
+extern const struct export_operations isofs_export_ops;
Index: linux-2.6/fs/jfs/super.c
===
--- linux-2.6.orig/fs/jfs/super.c   2007-09-13 15:07:17.0 +0200
+++ 

[PATCH 19/19] exportfs: update documentation

2007-09-14 Thread hch
Update documentation to the current state of affairs.  Remove duplicated
method descriptions in exportfs.h and point to Documentation/filesystems/
Exporting instead.  Add a little file header comment in expfs.c describing
what's going on and mentioning Neil's and my copyright [1].

[1] Neil, in case you want a different/additional attribution just change
the patch in your queue to reflect the preferred version.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/Documentation/filesystems/Exporting
===
--- linux-2.6.orig/Documentation/filesystems/Exporting  2007-03-16 15:10:54.0 +0100
+++ linux-2.6/Documentation/filesystems/Exporting   2007-03-16 17:11:50.0 +0100
@@ -2,7 +2,10 @@
 Making Filesystems Exportable
 =
 
-Most filesystem operations require a dentry (or two) as a starting
+Overview
+
+
+All filesystem operations require a dentry (or two) as a starting
 point.  Local applications have a reference-counted hold on suitable
 dentrys via open file descriptors or cwd/root.  However remote
 applications that access a filesystem via a remote filesystem protocol
@@ -89,11 +92,9 @@ For a filesystem to be exportable it mus
1/ provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add
   when ->lookup finds an inode for a given parent and name.
-  Typically the ->lookup routine will end:
-   if (inode)
-   return d_splice(inode, dentry);
-   d_add(dentry, inode);
-   return NULL;
+  Typically the ->lookup routine will end with a:
+
+   return d_splice_alias(inode, dentry);
}
 
 
@@ -101,67 +102,39 @@ For a filesystem to be exportable it mus
   A file system implementation declares that instances of the filesystem
 are exportable by setting the s_export_op field in the struct
 super_block.  This field must point to a struct export_operations
-struct which could potentially be full of NULLs, though normally at
-least get_parent will be set.
+struct which has the following members:
 
- The primary operations are decode_fh and encode_fh.  
-decode_fh takes a filehandle fragment and tries to find or create a
-dentry for the object referred to by the filehandle.
-encode_fh takes a dentry and creates a filehandle fragment which can
-later be used to find/create a dentry for the same object.
-
-decode_fh will probably make use of find_exported_dentry.
-This function lives in the exportfs module which a filesystem does
-not need unless it is being exported.  So rather that calling
-find_exported_dentry directly, each filesystem should call it through
-the find_exported_dentry pointer in it's export_operations table.
-This field is set correctly by the exporting agent (e.g. nfsd) when a
-filesystem is exported, and before any export operations are called.
-
-find_exported_dentry needs three support functions from the
-filesystem:
-  get_name.  When given a parent dentry and a child dentry, this
-should find a name in the directory identified by the parent
-dentry, which leads to the object identified by the child dentry.
-If no get_name function is supplied, a default implementation is
-provided which uses vfs_readdir to find potential names, and
-matches inode numbers to find the correct match.
-
-  get_parent.  When given a dentry for a directory, this should return 
-a dentry for the parent.  Quite possibly the parent dentry will
-have been allocated by d_alloc_anon.  
-The default get_parent function just returns an error so any
-filehandle lookup that requires finding a parent will fail.
-->lookup("..") is *not* used as a default as it can leave ".."
-entries in the dcache which are too messy to work with.
-
-  get_dentry.  When given an opaque datum, this should find the
-implied object and create a dentry for it (possibly with
-d_alloc_anon). 
-The opaque datum is whatever is passed down by the decode_fh
-function, and is often simply a fragment of the filehandle
-fragment.
-decode_fh passes two datums through find_exported_dentry.  One that 
-should be used to identify the target object, and one that can be
-used to identify the object's parent, should that be necessary.
-The default get_dentry function assumes that the datum contains an
-inode number and a generation number, and it attempts to get the
-inode using iget and check it's validity by matching the
-generation number.  A filesystem should only depend on the default
-if iget can safely be used this way.
-
-If decode_fh and/or encode_fh are left as NULL, then default
-implementations are used.  These defaults are suitable for ext2 and 
-extremely similar filesystems (like ext3).
-
-The default encode_fh creates a filehandle fragment from the inode
-number and generation 

[PATCH 06/19] ext4: new export ops

2007-09-14 Thread hch
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext4/super.c   2007-09-13 15:18:21.0 +0200
@@ -613,13 +613,10 @@ static int ext4_show_options(struct seq_
 }
 
 
-static struct dentry *ext4_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext4_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino < EXT4_FIRST_INO(sb) && ino != EXT4_ROOT_INO)
return ERR_PTR(-ESTALE);
@@ -642,15 +639,22 @@ static struct dentry *ext4_get_dentry(st
iput(inode);
return ERR_PTR(-ESTALE);
}
-   /* now to find a dentry.
-* If possible, get a well-connected one
-*/
-   result = d_alloc_anon(inode);
-   if (!result) {
-   iput(inode);
-   return ERR_PTR(-ENOMEM);
-   }
-   return result;
+
+   return inode;
+}
+
+static struct dentry *ext4_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ext4_nfs_get_inode);
+}
+
+static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ext4_nfs_get_inode);
 }
 
 #ifdef CONFIG_QUOTA
@@ -720,8 +724,9 @@ static const struct super_operations ext
 };
 
 static struct export_operations ext4_export_ops = {
+   .fh_to_dentry = ext4_fh_to_dentry,
+   .fh_to_parent = ext4_fh_to_parent,
.get_parent = ext4_get_parent,
-   .get_dentry = ext4_get_dentry,
 };
 
 enum {
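
As a rough illustration of the split these conversions rely on, here is a
minimal userspace sketch in C.  It is not the kernel code: struct fid_model,
struct inode_model, model_fh_to_inode(), demo_get_inode() and the
"fh_len < 2" length check are hypothetical stand-ins chosen for this example.
The filesystem supplies only a lookup callback taking an inode number and a
generation (as ext4_nfs_get_inode does in the patch above), while a generic
helper unpacks the file handle and calls it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an in-core inode. */
struct inode_model {
    uint64_t ino;
    uint32_t generation;
};

/* Hypothetical stand-in for struct fid: 32-bit inode number + generation. */
struct fid_model {
    uint32_t ino;
    uint32_t gen;
};

/* Per-filesystem callback, shaped like the *_nfs_get_inode functions. */
typedef const struct inode_model *(*get_inode_fn)(uint64_t ino,
                                                  uint32_t generation);

/*
 * Generic part: validate the handle length (counted in 32-bit words,
 * an assumption of this sketch) and hand the decoded fields to the
 * filesystem callback.  Returns NULL where the kernel would return an
 * error pointer.
 */
const struct inode_model *model_fh_to_inode(const struct fid_model *fid,
                                            int fh_len,
                                            get_inode_fn get_inode)
{
    assert(fid != NULL && get_inode != NULL);

    if (fh_len < 2)
        return NULL;
    return get_inode(fid->ino, fid->gen);
}

/* Filesystem-specific part: a tiny fixed "inode table" for the demo. */
static const struct inode_model demo_table[] = {
    { 2, 1 },
    { 12, 7 },
};

const struct inode_model *demo_get_inode(uint64_t ino, uint32_t generation)
{
    for (size_t i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++) {
        if (demo_table[i].ino != ino)
            continue;
        /* A generation of 0 means "do not check", as in the jfs test. */
        if (generation && demo_table[i].generation != generation)
            return NULL;    /* stale handle */
        return &demo_table[i];
    }
    return NULL;            /* no such inode */
}
```

Where the kernel functions return ERR_PTR(-ESTALE), the sketch simply returns
NULL; a handle with a mismatched generation or a too-short length is rejected
before any dentry would be looked up.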



[PATCH 04/19] ext2: new export ops

2007-09-14 Thread hch
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ext2/super.c
===
--- linux-2.6.orig/fs/ext2/super.c  2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext2/super.c   2007-09-13 15:16:25.0 +0200
@@ -248,13 +248,10 @@ static const struct super_operations ext
 #endif
 };
 
-static struct dentry *ext2_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext2_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino < EXT2_FIRST_INO(sb) && ino != EXT2_ROOT_INO)
return ERR_PTR(-ESTALE);
@@ -275,15 +272,21 @@ static struct dentry *ext2_get_dentry(st
iput(inode);
return ERR_PTR(-ESTALE);
}
-   /* now to find a dentry.
-* If possible, get a well-connected one
-*/
-   result = d_alloc_anon(inode);
-   if (!result) {
-   iput(inode);
-   return ERR_PTR(-ENOMEM);
-   }
-   return result;
+   return inode;
+}
+
+static struct dentry *ext2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ext2_nfs_get_inode);
+}
+
+static struct dentry *ext2_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ext2_nfs_get_inode);
 }
 
 /* Yes, most of these are left as NULL!!
@@ -292,8 +295,9 @@ static struct dentry *ext2_get_dentry(st
  * Currently only get_parent is required.
  */
 static struct export_operations ext2_export_ops = {
+   .fh_to_dentry = ext2_fh_to_dentry,
+   .fh_to_parent = ext2_fh_to_parent,
.get_parent = ext2_get_parent,
-   .get_dentry = ext2_get_dentry,
 };
 
 static unsigned long get_sb_block(void **data)



[PATCH 05/19] ext3: new export ops

2007-09-14 Thread hch
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ext3/super.c
===
--- linux-2.6.orig/fs/ext3/super.c  2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext3/super.c   2007-09-13 15:16:57.0 +0200
@@ -562,13 +562,10 @@ static int ext3_show_options(struct seq_
 }
 
 
-static struct dentry *ext3_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext3_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino < EXT3_FIRST_INO(sb) && ino != EXT3_ROOT_INO)
return ERR_PTR(-ESTALE);
@@ -591,15 +588,22 @@ static struct dentry *ext3_get_dentry(st
iput(inode);
return ERR_PTR(-ESTALE);
}
-   /* now to find a dentry.
-* If possible, get a well-connected one
-*/
-   result = d_alloc_anon(inode);
-   if (!result) {
-   iput(inode);
-   return ERR_PTR(-ENOMEM);
-   }
-   return result;
+
+   return inode;
+}
+
+static struct dentry *ext3_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ext3_nfs_get_inode);
+}
+
+static struct dentry *ext3_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ext3_nfs_get_inode);
 }
 
 #ifdef CONFIG_QUOTA
@@ -669,8 +673,9 @@ static const struct super_operations ext
 };
 
 static struct export_operations ext3_export_ops = {
+   .fh_to_dentry = ext3_fh_to_dentry,
+   .fh_to_parent = ext3_fh_to_parent,
.get_parent = ext3_get_parent,
-   .get_dentry = ext3_get_dentry,
 };
 
 enum {



[PATCH 08/19] jfs: new export ops

2007-09-14 Thread hch
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/jfs/jfs_inode.h
===
--- linux-2.6.orig/fs/jfs/jfs_inode.h   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/jfs_inode.h2007-09-13 15:19:06.0 +0200
@@ -18,6 +18,8 @@
 #ifndef _H_JFS_INODE
 #define _H_JFS_INODE
 
+struct fid;
+
 extern struct inode *ialloc(struct inode *, umode_t);
 extern int jfs_fsync(struct file *, struct dentry *, int);
 extern int jfs_ioctl(struct inode *, struct file *,
@@ -32,7 +34,10 @@ extern void jfs_truncate_nolock(struct i
 extern void jfs_free_zero_link(struct inode *);
 extern struct dentry *jfs_get_parent(struct dentry *dentry);
 extern void jfs_get_inode_flags(struct jfs_inode_info *);
-extern struct dentry *jfs_get_dentry(struct super_block *sb, void *vobjp);
+extern struct dentry *jfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
+extern struct dentry *jfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type);
 extern void jfs_set_inode_flags(struct inode *);
 extern int jfs_get_block(struct inode *, sector_t, struct buffer_head *, int);
 
Index: linux-2.6/fs/jfs/namei.c
===
--- linux-2.6.orig/fs/jfs/namei.c   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/namei.c2007-09-13 15:19:42.0 +0200
@@ -20,6 +20,7 @@
 #include <linux/fs.h>
 #include <linux/ctype.h>
 #include <linux/quotaops.h>
+#include <linux/exportfs.h>
 #include "jfs_incore.h"
 #include "jfs_superblock.h"
 #include "jfs_inode.h"
@@ -1477,13 +1478,10 @@ static struct dentry *jfs_lookup(struct 
return dentry;
 }
 
-struct dentry *jfs_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *jfs_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   __u32 *objp = vobjp;
-   unsigned long ino = objp[0];
-   __u32 generation = objp[1];
struct inode *inode;
-   struct dentry *result;
 
if (ino == 0)
return ERR_PTR(-ESTALE);
@@ -1493,20 +1491,25 @@ struct dentry *jfs_get_dentry(struct sup
 
if (is_bad_inode(inode) ||
(generation && inode->i_generation != generation)) {
-   result = ERR_PTR(-ESTALE);
-   goto out_iput;
+   iput(inode);
+   return ERR_PTR(-ESTALE);
}
 
-   result = d_alloc_anon(inode);
-   if (!result) {
-   result = ERR_PTR(-ENOMEM);
-   goto out_iput;
-   }
-   return result;
+   return inode;
+}
 
- out_iput:
-   iput(inode);
-   return result;
+struct dentry *jfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   jfs_nfs_get_inode);
+}
+
+struct dentry *jfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   jfs_nfs_get_inode);
 }
 
 struct dentry *jfs_get_parent(struct dentry *dentry)
Index: linux-2.6/fs/jfs/super.c
===
--- linux-2.6.orig/fs/jfs/super.c   2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/super.c2007-09-13 15:19:06.0 +0200
@@ -738,7 +738,8 @@ static const struct super_operations jfs
 };
 
 static struct export_operations jfs_export_operations = {
-   .get_dentry = jfs_get_dentry,
+   .fh_to_dentry   = jfs_fh_to_dentry,
+   .fh_to_parent   = jfs_fh_to_parent,
.get_parent = jfs_get_parent,
 };
 



[PATCH 09/19] ntfs: new export ops

2007-09-14 Thread hch
Trivial switch over to the new generic helpers.


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]


Index: linux-2.6/fs/ntfs/namei.c
===
--- linux-2.6.orig/fs/ntfs/namei.c  2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/ntfs/namei.c   2007-09-13 15:20:12.0 +0200
@@ -450,58 +450,40 @@ try_next:
return parent_dent;
 }
 
-/**
- * ntfs_get_dentry - find a dentry for the inode from a file handle sub-fragment
- * @sb:super block identifying the mounted ntfs volume
- * @fh:the file handle sub-fragment
- *
- * Find a dentry for the inode given a file handle sub-fragment.  This function
- * is called from fs/exportfs/expfs.c::find_exported_dentry() which in turn is
- * called from the default ->decode_fh() which is export_decode_fh() in the
- * same file.  The code is closely based on the default ->get_dentry() helper
- * fs/exportfs/expfs.c::get_object().
- *
- * The @fh contains two 32-bit unsigned values, the first one is the inode
- * number and the second one is the inode generation.
- *
- * Return the dentry on success or the error code on error (IS_ERR() is true).
- */
-static struct dentry *ntfs_get_dentry(struct super_block *sb, void *fh)
+static struct inode *ntfs_nfs_get_inode(struct super_block *sb,
+   u64 ino, u32 generation)
 {
-   struct inode *vi;
-   struct dentry *dent;
-   unsigned long ino = ((u32 *)fh)[0];
-   u32 gen = ((u32 *)fh)[1];
-
-   ntfs_debug("Entering for inode 0x%lx, generation 0x%x.", ino, gen);
-   vi = ntfs_iget(sb, ino);
-   if (IS_ERR(vi)) {
-   ntfs_error(sb, "Failed to get inode 0x%lx.", ino);
-   return (struct dentry *)vi;
-   }
-   if (unlikely(is_bad_inode(vi) || vi->i_generation != gen)) {
-   /* We didn't find the right inode. */
-   ntfs_error(sb, "Inode 0x%lx, bad count: %d %d or version 0x%x "
-   "0x%x.", vi->i_ino, vi->i_nlink,
-   atomic_read(&vi->i_count), vi->i_generation,
-   gen);
-   iput(vi);
-   return ERR_PTR(-ESTALE);
-   }
-   /* Now find a dentry.  If possible, get a well-connected one. */
-   dent = d_alloc_anon(vi);
-   if (unlikely(!dent)) {
-   iput(vi);
-   return ERR_PTR(-ENOMEM);
+   struct inode *inode;
+
+   inode = ntfs_iget(sb, ino);
+   if (!IS_ERR(inode)) {
+   if (is_bad_inode(inode) || inode->i_generation != generation) {
+   iput(inode);
+   inode = ERR_PTR(-ESTALE);
+   }
}
-   ntfs_debug("Done for inode 0x%lx, generation 0x%x.", ino, gen);
-   return dent;
+
+   return inode;
+}
+
+static struct dentry *ntfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+   ntfs_nfs_get_inode);
+}
+
+static struct dentry *ntfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+   int fh_len, int fh_type)
+{
+   return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+   ntfs_nfs_get_inode);
 }
 
 /**
  * Export operations allowing NFS exporting of mounted NTFS partitions.
  *
- * We use the default ->decode_fh() and ->encode_fh() for now.  Note that they
+ * We use the default ->encode_fh() for now.  Note that they
  * use 32 bits to store the inode number which is an unsigned long so on 64-bit
  * architectures is usually 64 bits so it would all fail horribly on huge
  * volumes.  I guess we need to define our own encode and decode fh functions
@@ -520,7 +502,6 @@ static struct dentry *ntfs_get_dentry(st
 struct export_operations ntfs_export_ops = {
.get_parent = ntfs_get_parent,  /* Find the parent of a given
   directory. */
-   .get_dentry = ntfs_get_dentry,  /* Find a dentry for the inode
-  given a file handle
-  sub-fragment. */
+   .fh_to_dentry   = ntfs_fh_to_dentry,
+   .fh_to_parent   = ntfs_fh_to_parent,
 };



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Nick Piggin
On Thursday 13 September 2007 09:06, Christoph Lameter wrote:
 On Wed, 12 Sep 2007, Nick Piggin wrote:

  So lumpy reclaim does not change my formula nor significantly help
  against a fragmentation attack. AFAIKS.

 Lumpy reclaim improves the situation significantly because the
 overwhelming majority of allocations during the lifetime of a system are
 movable and thus it is able to opportunistically restore the availability
 of higher order pages by reclaiming neighboring pages.

I'm talking about non movable allocations.


  [*] ok, this isn't quite true because if you can actually put a hard
  limit on unmovable allocations then anti-frag will fundamentally help --
  get back to me on that when you get patches to move most of the obvious 
  ones.

 We have this hard limit using ZONE_MOVABLE in 2.6.23.

So we're back to 2nd class support.


  Sure, and I pointed out the theoretical figure for 64K pages as well. Is
  that figure not problematic to you? Where do you draw the limit for what
  is acceptable? Why? What happens with tiny memory machines where a
  reserve or even the anti-frag patches may not be acceptable and/or work
  very well? When do you require reserve pools? Why are reserve pools
  acceptable for first-class support of filesystems when it has very
  loudly been made a known policy decision by Linus in the past (and for
  some valid reasons) that we should not put limits on the sizes of caches
  in the kernel.

 64K pages may be problematic because it is above the PAGE_ORDER_COSTLY in
 2.6.23. 32K is currently much safer because lumpy reclaim can restore
 these and does so on my systems. I expect the situation for 64K pages to
 improve when more of Mel's patches go in. We have long term experience
 with 32k sized allocation through Andrew's tree.

 Reserve pools as handled (by the not yet available) large page pool
 patches (which again has altogether another purpose) are not a limit. The
 reserve pools are used to provide a minimum of higher order pages that is
 not broken down in order to ensure that a minimum number of the desired
 order of pages is even available in your worst case scenario. Mainly I
 think that is needed during the period when memory defragmentation is
 still under development.

fsblock doesn't need any of those hacks, of course.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Nick Piggin
On Thursday 13 September 2007 12:01, Nick Piggin wrote:
 On Thursday 13 September 2007 23:03, David Chinner wrote:
  Then just do operations on directories with lots of files in them
  (tens of thousands). Every directory operation will require at
  least one vmap in this situation - e.g. a traversal will result in
  lots and lots of blocks being read that will require vmap() for every
  directory block read from disk and an unmap almost immediately
  afterwards when the reference is dropped

 Ah, wow, thanks: I can reproduce it.

OK, the vunmap batching code wipes your TLB flushing and IPIs off
the table. Diffstat below, but the TLB portions are here (besides that
_everything_ is probably lower due to fewer TLB misses caused by the
TLB flushing):

  -170   -99.4% sn2_send_IPI
  -343  -100.0% sn_send_IPI_phys
-17911   -99.9% smp_call_function


Total performance went up by 30% on a 64-way system (248 seconds to
172 seconds to run parallel finds over different huge directories).

   23012  54790.5%  _read_lock
    9427    329.0%  __get_vm_area_node
    5792      0.0%  __find_vm_area
    1590  53000.0%  __vunmap
     107     26.0%  _spin_lock
      74    119.4%  _xfs_buf_find
      58      0.0%  __unmap_kernel_range
      53     36.6%  kmem_zone_alloc
    -129   -100.0%  pio_phys_write_mmr
    -144   -100.0%  unmap_kernel_range
    -170    -99.4%  sn2_send_IPI
    -233    -59.1%  kfree
    -266   -100.0%  find_next_bit
    -343   -100.0%  sn_send_IPI_phys
    -564    -19.9%  xfs_iget_core
   -1946   -100.0%  remove_vm_area
  -17911    -99.9%  smp_call_function
  -62726     -7.2%  _write_lock
 -438360    -64.2%  default_idle
 -482631    -30.4%  total
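
For reference, the wall-clock figures quoted above can be turned into
percentages with a few lines of C.  This is only a sketch of the arithmetic;
the function names are made up for this example.  Going from 248 seconds to
172 seconds is a reduction in run time of roughly 30.6%, in line with the
-30.4% total in the profile, or equivalently a throughput gain of roughly
1.44x.

```c
#include <assert.h>

/* Fractional reduction in run time: (before - after) / before. */
double time_reduction(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return (before_s - after_s) / before_s;
}

/* Throughput speedup factor: before / after. */
double speedup(double before_s, double after_s)
{
    assert(after_s > 0.0);
    return before_s / after_s;
}
```

Calling time_reduction(248.0, 172.0) and speedup(248.0, 172.0) gives the two
figures mentioned; which one is meant by "went up by 30%" is a matter of
reading, so both are shown.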

Next I have some patches to scale the vmap locks and data
structures better, but they're not quite ready yet. This looks like it
should result in a further speedup of several times when combined
with the TLB flushing reductions here...


Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change

2007-09-14 Thread Greg Banks
On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
 On Fri, 14 Sep 2007 20:25:45 +1000
 Greg Banks [EMAIL PROTECTED] wrote:
 
  I'm curious about the reasons behind this change.  You mention
  credential issues; how exactly is it that you have the correct creds
  to perform a WRITE rpc but not a SETATTR rpc?
  
 
 Consider this case. user1 and user2 are both members of group
 allusers:
 
 user1$ echo foo > foo
 user1$ chgrp allusers foo
 user1$ chmod 04770 foo
 user2$ echo bar > foo
 
 On most local filesystems, this would work correctly. The end result
 would be a file with mode 0770 and the expected contents. On NFS
 though, the write by user2 fails. When the write is attempted, the
 kernel tries to squash the setuid bit using the credentials of user2,
 who's not allowed to change the mode. The write then fails because the
 setattr fails.

Ok, I ran an experiment and I see this failure mode.

So the SETATTR rpc is really a side effect of the client kernel's
behaviour and not an operation directly requested by the user process
on the client.  Is there any reason why that rpc needs to have user2's
creds?  Why not do the rpc with a fake set of creds with uid and gid
set to the uid and gid of the file, in this case user1/allusers ?
That way the rpc will most likely pass the server's permission check.
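
The failure mode and the suggested workaround can be modelled in a few lines
of userspace C.  This is a deliberately simplified sketch, not NFS or VFS
code: the cred_model and file_model structs, the function names and the
numeric ids are hypothetical, and root and ACLs are ignored.  It shows that a
group member may write the file but may not change its mode, while a request
carrying the file's own uid and gid passes the owner check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified credential model (not a kernel type). */
struct cred_model {
    unsigned uid;
    const unsigned *groups;     /* group memberships */
    size_t ngroups;
};

/* Hypothetical, simplified file model. */
struct file_model {
    unsigned uid;
    unsigned gid;
    unsigned mode;              /* e.g. 04770 */
};

static bool in_group(const struct cred_model *c, unsigned gid)
{
    for (size_t i = 0; i < c->ngroups; i++)
        if (c->groups[i] == gid)
            return true;
    return false;
}

/* Classic owner/group/other write check. */
bool may_write(const struct cred_model *c, const struct file_model *f)
{
    assert(c != NULL && f != NULL);
    if (c->uid == f->uid)
        return (f->mode & 0200) != 0;
    if (in_group(c, f->gid))
        return (f->mode & 0020) != 0;
    return (f->mode & 0002) != 0;
}

/* Only the owner may change the mode in this model. */
bool may_setattr_mode(const struct cred_model *c, const struct file_model *f)
{
    assert(c != NULL && f != NULL);
    return c->uid == f->uid;
}

/* The "fake" creds suggested above: the file's own uid and gid. */
struct cred_model file_owner_cred(const struct file_model *f)
{
    struct cred_model c = { f->uid, &f->gid, 1 };
    return c;
}
```

With foo owned by user1, group allusers and mode 04770, may_write() succeeds
for user2 while may_setattr_mode() fails; the same setattr check succeeds for
the creds returned by file_owner_cred().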

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Nick Piggin
On Thursday 13 September 2007 09:17, Christoph Lameter wrote:
 On Wed, 12 Sep 2007, Nick Piggin wrote:
  I will still argue that my approach is the better technical solution for
  large block support than yours, I don't think we made progress on that.
  And I'm quite sure we agreed at the VM summit not to rely on your patches
  for VM or IO scalability.

 The approach has already been tried (see the XFS layer) and found lacking.

It is lacking because our vmap algorithms are simplistic to the point
of being utterly inadequate for the new requirements. There has not
been any fundamental problem revealed (like the fragmentation 
approach has).

However fsblock can do everything that higher order pagecache can
do in terms of avoiding vmap and giving contiguous memory to block
devices by opportunistically allocating higher orders of pages, and falling
back to vmap if they cannot be satisfied.
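
The allocation policy described here can be sketched as follows.  This is a
toy userspace model, not fsblock code: struct mem_model, map_block() and the
MAP_* values are hypothetical names, and fragmentation is reduced to a single
"largest free order" number.  The point is only the order of attempts:
contiguous first, vmap of single pages second, failure only when there are
not enough pages at all.

```c
#include <assert.h>
#include <stddef.h>

enum block_mapping {
    MAP_CONTIGUOUS,     /* one higher-order allocation succeeded */
    MAP_VMAP,           /* single pages stitched together virtually */
    MAP_FAILED          /* not enough memory at all */
};

/* Hypothetical, heavily simplified view of free memory. */
struct mem_model {
    unsigned free_pages;            /* total order-0 pages free */
    unsigned largest_free_order;    /* biggest contiguous run, as an order */
};

/*
 * Try a contiguous order-N allocation first; if fragmentation prevents
 * it, fall back to 2^N single pages that would be mapped with vmap.
 * The model does not update largest_free_order after an allocation.
 */
enum block_mapping map_block(struct mem_model *m, unsigned order)
{
    assert(m != NULL && order < 31);
    unsigned npages = 1u << order;

    if (npages > m->free_pages)
        return MAP_FAILED;

    m->free_pages -= npages;
    if (order <= m->largest_free_order)
        return MAP_CONTIGUOUS;
    return MAP_VMAP;
}
```

In this model a fragmented system (largest_free_order of 0) with 64 free
pages still satisfies an order-4 block through the vmap path, whereas a
scheme that requires a contiguous allocation would have to fail it.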

So if you argue that vmap is a downside, then please tell me how you
consider the -ENOMEM of your approach to be better?


 Having a fake linear block through vmalloc means that a special software
 layer must be introduced and we may face special casing in the block / fs
 layer to check if we have one of these strange vmalloc blocks.

I guess you're a little confused. There is nothing fake about the linear
address. Filesystems of course need changes (the block layer needs
none -- why would it?). But changing APIs to accommodate a better
solution is what Linux is about.

If, by special software layer, you mean the vmap/vunmap support in
fsblock, let's see... that's probably all of a hundred or two lines.
Contrast that with anti-fragmentation, lumpy reclaim, higher order
pagecache and its new special mmap layer... Hmm, seems like a no
brainer to me. You really still want to pursue the extra layer
argument as a point against fsblock here?


  But you just showed in two emails that you don't understand what the
  problem is. To reiterate: lumpy reclaim does *not* invalidate my
  formulae; and antifrag does *not* isolate the issue.

 I do understand what the problem is. I just do not get what your problem
 with this is and why you have this drive to demand perfection. We are

Oh. I don't think I could explain that if you still don't understand by now.
But that's not the main issue: all that I ask is you consider fsblock on
technical grounds. 


 working on a variety of approaches on the (potential) issue but you
 categorically state that it cannot be solved.

Of course I wouldn't state that. On the contrary, I categorically state that
I have already solved it :)


  But what do you say about viable alternatives that do not have to
  worry about these unlikely scenarios, full stop? So, why should we
  not use fs block for higher order page support?

 Because it has already been rejected in another form and adds more

You have rejected it. But they are bogus reasons, as I showed above.

You also describe some other real (if lesser) issues like number of page
structs to manage in the pagecache. But this is hardly enough to reject
my patch now... for every downside you can point out in my approach, I
can point out one in yours.

- fsblock doesn't require changes to virtual memory layer
- fsblock can retain cache of just 4K in a >4K block size file

How about those? I know very well how Linus feels about both of them.


 Maybe we could get to something like a hybrid that avoids some of these
 issues? Add support so something like a virtual compound page can be
 handled transparently in the filesystem layer with special casing if
 such a beast reaches the block layer?

That's conceptually much worse, IMO.

And practically worse as well: vmap space is limited on 32-bit; fsblock
approach can avoid vmap completely in many cases; for two reasons.

The fsblock data accessor APIs aren't _that_ bad changes. They change
zero conceptually in the filesystem, are arguably cleaner, and can be
essentially nooped if we wanted to stay with a b_data type approach
(but they give you that flexibility to replace it with any implementation).


Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change

2007-09-14 Thread Jeff Layton
On Fri, 14 Sep 2007 23:09:24 +1000
Greg Banks [EMAIL PROTECTED] wrote:

 On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
  On Fri, 14 Sep 2007 20:25:45 +1000
  Greg Banks [EMAIL PROTECTED] wrote:
  
   I'm curious about the reasons behind this change.  You mention
   credential issues; how exactly is it that you have the correct creds
   to perform a WRITE rpc but not a SETATTR rpc?
   
  
  Consider this case. user1 and user2 are both members of group
  allusers:
  
  user1$ echo foo > foo
  user1$ chgrp allusers foo
  user1$ chmod 04770 foo
  user2$ echo bar > foo
  
  On most local filesystems, this would work correctly. The end result
  would be a file with mode 0770 and the expected contents. On NFS
  though, the write by user2 fails. When the write is attempted, the
  kernel tries to squash the setuid bit using the credentials of user2,
  who's not allowed to change the mode. The write then fails because the
  setattr fails.
 
 Ok, I ran an experiment and I see this failure mode.
 
 So the SETATTR rpc is really a side effect of the client kernel's
 behaviour and not an operation directly requested by the user process
 on the client.  Is there any reason why that rpc needs to have user2's
 creds?  Why not do the rpc with a fake set of creds with uid and gid
 set to the uid and gid of the file, in this case user1/allusers ?
 That way the rpc will most likely pass the server's permission check.
 

That might work in some cases, but there are many where it wouldn't...

Suppose user1 here is root and all of the user1 operations are being
done on the server. If the server has root squashing enabled, then
user2's operation would still fail.

Another problem:

Suppose we're using gssapi. There's no guarantee that the client will
have the proper credentials to fake up a call as user1 (you might need
user1 krb5 tickets, etc).
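
The root-squashing case above can be illustrated with a toy model of
the server-side check. This is a simplification under stated
assumptions, not nfsd code: squash(), setattr_mode_allowed() and the
anonymous uid 65534 are made up for the sketch, and the permission
rule is reduced to "owner or unsquashed root may change the mode".

```c
#include <stdbool.h>

#define UID_ROOT   0u
#define UID_NOBODY 65534u   /* assumed anonymous uid for squashed root */

typedef unsigned int uid32;

/* Server side: map the uid carried in the RPC before any permission check. */
uid32 squash(uid32 rpc_uid, bool root_squash)
{
    return (root_squash && rpc_uid == UID_ROOT) ? UID_NOBODY : rpc_uid;
}

/* Simplified rule: only the file owner or (unsquashed) root may chmod. */
bool setattr_mode_allowed(uid32 file_owner, uid32 rpc_uid, bool root_squash)
{
    uid32 effective = squash(rpc_uid, root_squash);

    return effective == UID_ROOT || effective == file_owner;
}
```

In this model a SETATTR sent with the file owner's uid passes for an
ordinary owner, but setattr_mode_allowed(0, 0, true) is false: a
root-owned file plus faked root creds is squashed to the anonymous uid
and the mode change is refused.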

-- 
Jeff Layton [EMAIL PROTECTED]


Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change

2007-09-14 Thread Greg Banks
On Fri, Sep 14, 2007 at 09:38:46AM -0400, Jeff Layton wrote:
 On Fri, 14 Sep 2007 23:09:24 +1000
 Greg Banks [EMAIL PROTECTED] wrote:
 
  On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
   On Fri, 14 Sep 2007 20:25:45 +1000
   Greg Banks [EMAIL PROTECTED] wrote:
   
I'm curious about the reasons behind this change.  You mention
credential issues; how exactly is it that you have the correct creds
to perform a WRITE rpc but not a SETATTR rpc?

   
   Consider this case. user1 and user2 are both members of group
   allusers:
   
   user1$ echo foo > foo
   user1$ chgrp allusers foo
   user1$ chmod 04770 foo
   user2$ echo bar > foo
   
   On most local filesystems, this would work correctly. The end result
   would be a file with mode 0770 and the expected contents. On NFS
   though, the write by user2 fails. When the write is attempted, the
   kernel tries to squash the setuid bit using the credentials of user2,
   who's not allowed to change the mode. The write then fails because the
   setattr fails.
  
  Ok, I ran an experiment and I see this failure mode.
  
  So the SETATTR rpc is really a side effect of the client kernel's
  behaviour and not an operation directly requested by the user process
  on the client.  Is there any reason why that rpc needs to have user2's
  creds?  Why not do the rpc with a fake set of creds with uid and gid
  set to the uid and gid of the file, in this case user1/allusers ?
  That way the rpc will most likely pass the server's permission check.
  
 
 That might work in some cases, but there are many where it wouldn't...
 
 Suppose user1 here is root and all of the user1 operations are being
 done on the server. If the server has root squashing enabled, then
 user2's operation would still fail.

In that case, user1's operations would also fail, which is even more
serious a problem.  Also arguably you actually *want* writes by a
nonroot user to a setuid root executable to fail ;-)

 Another problem:
 
 Suppose we're using gssapi. There's no guarantee that the client will
 have the proper credentials to fake up a call as user1 (you might need
 user1 krb5 tickets, etc).

Yes, good point.  You could use the root creds, except for root squashing.
Ok, you convinced me.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Re: [NFS] [PATH 04/19] ext2: new export ops

2007-09-14 Thread Greg Banks
On Thu, Aug 30, 2007 at 03:16:09PM +0200, Christoph Hellwig wrote:
 +
 +static struct dentry *ext2_fh_to_dentry(struct super_block *sb, struct fid 
 *fid,
 + int fh_len, int fh_type)
 +{
 + return generic_fh_to_dentry(sb, fid, fh_len, fh_type, 
 ext2_nfs_get_inode);
 +}
 +
 +static struct dentry *ext2_fh_to_parent(struct super_block *sb, struct fid 
 *fid,
 + int fh_len, int fh_type)
 +{
 + return generic_fh_to_parent(sb, fid, fh_len, fh_type, 
 ext2_nfs_get_inode);
  }
  

These patches look good, and cleanup in this area is certainly a
good thing.  One small comment: the easy filesystems (ext[234], efs,
ntfs) might be cleaner if the per-fs get_inode function were a member
of export_ops instead of an extra argument to generic_fh_to_dentry().
That way you wouldn't need these two little helper functions in each
filesystem, because you could point export_ops.fh_to_dentry directly
at generic_fh_to_dentry.
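
To make the two layouts concrete, here is a stripped-down userspace
sketch. All types and names (struct export_ops_sketch, the toy_*
functions, the single unsigned long standing in for a file handle) are
invented for illustration and do not match the real exportfs
interfaces.

```c
#include <stddef.h>

struct super_block;
struct inode { unsigned long ino; };

typedef struct inode *(*lookup_fn)(struct super_block *sb, unsigned long ino);

/* Hypothetical ops table; get_inode is only used by variant 2. */
struct export_ops_sketch {
    lookup_fn fh_to_inode;
    lookup_fn get_inode;
};

struct super_block { const struct export_ops_sketch *s_export_op; };

/* Variant 1 (as in the patch): the per-fs lookup is an extra argument. */
struct inode *generic_lookup_cb(struct super_block *sb, unsigned long ino,
                                lookup_fn get_inode)
{
    return get_inode(sb, ino);
}

/* Variant 2 (the suggestion): the generic helper finds it in the ops table. */
struct inode *generic_lookup_ops(struct super_block *sb, unsigned long ino)
{
    return sb->s_export_op->get_inode(sb, ino);
}

/* A toy filesystem with exactly one inode. */
static struct inode toy_inode = { 42 };

struct inode *toy_get_inode(struct super_block *sb, unsigned long ino)
{
    (void)sb;
    return ino == toy_inode.ino ? &toy_inode : NULL;
}

/* Variant 1 needs this small wrapper in every filesystem ... */
struct inode *toy_fh_to_inode(struct super_block *sb, unsigned long ino)
{
    return generic_lookup_cb(sb, ino, toy_get_inode);
}

/* ... variant 2 points the table straight at the generic helper. */
const struct export_ops_sketch toy_ops_v2 = {
    .fh_to_inode = generic_lookup_ops,
    .get_inode   = toy_get_inode,
};

struct super_block toy_sb = { &toy_ops_v2 };
```

Both paths resolve the same inode; the difference is only whether the
filesystem carries a wrapper function or an extra ops member.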

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change

2007-09-14 Thread Jeff Layton
On Sat, 15 Sep 2007 00:40:33 +1000
Greg Banks [EMAIL PROTECTED] wrote:

 On Fri, Sep 14, 2007 at 09:38:46AM -0400, Jeff Layton wrote:
  On Fri, 14 Sep 2007 23:09:24 +1000
  Greg Banks [EMAIL PROTECTED] wrote:
  
   On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
On Fri, 14 Sep 2007 20:25:45 +1000
Greg Banks [EMAIL PROTECTED] wrote:

 I'm curious about the reasons behind this change.  You mention
 credential issues; how exactly is it that you have the correct creds
 to perform a WRITE rpc but not a SETATTR rpc?
 

Consider this case. user1 and user2 are both members of group
allusers:

user1$ echo foo > foo
user1$ chgrp allusers foo
user1$ chmod 04770 foo
user2$ echo bar > foo

On most local filesystems, this would work correctly. The end result
would be a file with mode 0770 and the expected contents. On NFS
though, the write by user2 fails. When the write is attempted, the
kernel tries to squash the setuid bit using the credentials of user2,
who's not allowed to change the mode. The write then fails because the
setattr fails.
   
   Ok, I ran an experiment and I see this failure mode.
   
   So the SETATTR rpc is really a side effect of the client kernel's
   behaviour and not an operation directly requested by the user process
   on the client.  Is there any reason why that rpc needs to have user2's
   creds?  Why not do the rpc with a fake set of creds with uid and gid
   set to the uid and gid of the file, in this case user1/allusers ?
   That way the rpc will most likely pass the server's permission check.
   
  
  That might work in some cases, but there are many where it wouldn't...
  
  Suppose user1 here is root and all of the user1 operations are being
  done on the server. If the server has root squashing enabled, then
  user2's operation would still fail.
 
 In that case, user1's operations would also fail, which is even more
 serious a problem.  Also arguably you actually *want* writes by a
 nonroot user to a setuid root executable to fail ;-)
 

Well, user1's operations would fail if done from the client, which is
why I mentioned that they would have to be done on the server.

The second point is a good one, but POSIX says that it should be allowed
if the permissions allow for it. The whole situation is somewhat
contrived anyway, I can't think of a place where this is something
you'd really want to do, but I think we need to try to follow the spec
as best as possible...

  Another problem:
  
  Suppose we're using gssapi. There's no guarantee that the client will
  have the proper credentials to fake up a call as user1 (you might need
  user1 krb5 tickets, etc).
 
 Yes, good point.  You could use the root creds, except for root squashing.
 Ok, you convinced me.
 

Right. When I was first looking at this, I considered some similar
approaches, but hit roadblocks with all of them. The only real option
seems to be to leave this to the server, but that does assume that the
server handles this properly.

Servers that don't are broken, IMO. If Irix isn't clearing these bits
on a write then it might be good to see if they can fix that...

-- 
Jeff Layton [EMAIL PROTECTED]


Re: [NFS] [PATH 04/19] ext2: new export ops

2007-09-14 Thread Christoph Hellwig
On Sat, Sep 15, 2007 at 12:58:03AM +1000, Greg Banks wrote:
 These patches look good, and cleanup in this area is certainly a
 good thing.  One small comment: the easy filesystems (ext[234], efs,
 ntfs) might be cleaner if the per-fs get_inode function were a member
 of export_ops instead of an extra argument to generic_fh_to_dentry().
 That way you wouldn't need these two little helper functions in each
 filesystem, because you could point export_ops.fh_to_dentry directly
 at generic_fh_to_dentry.

I was pondering this, but in the end I prefer the cleaner layering of
the callback version.  This also mirrors what we do in other areas
(e.g. address_space operations)


Re: [PATCH 15/19] gfs2: new export ops

2007-09-14 Thread Steven Whitehouse
Hi,

On Fri, 2007-09-14 at 13:49 +0200, [EMAIL PROTECTED] wrote:
 plain text document attachment (gfs2-implement-fh_to_dentry)
 Convert gfs2 to the new ops.   Uses a similar structure to the generic
 helpers, but gfs2 has its own file handle formats.
 
 
 Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]
 
This looks good from a GFS2 point of view:

Acked-by: Steven Whitehouse [EMAIL PROTECTED]
Acked-by: Wendy Cheng [EMAIL PROTECTED]

Steve.

 
 Index: linux-2.6/fs/gfs2/ops_export.c
 ===
 --- linux-2.6.orig/fs/gfs2/ops_export.c   2007-07-19 15:56:46.0 
 +0200
 +++ linux-2.6/fs/gfs2/ops_export.c2007-07-20 19:58:06.0 +0200
 @@ -31,40 +31,6 @@
  #define GFS2_LARGE_FH_SIZE 8
  #define GFS2_OLD_FH_SIZE 10
  
 -static struct dentry *gfs2_decode_fh(struct super_block *sb,
 -  __u32 *p,
 -  int fh_len,
 -  int fh_type,
 -  int (*acceptable)(void *context,
 -struct dentry *dentry),
 -  void *context)
 -{
 - __be32 *fh = (__force __be32 *)p;
 - struct gfs2_inum_host inum, parent;
 -
 - memset(&parent, 0, sizeof(struct gfs2_inum));
 -
 - switch (fh_len) {
 - case GFS2_LARGE_FH_SIZE:
 - case GFS2_OLD_FH_SIZE:
 - parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
 - parent.no_formal_ino |= be32_to_cpu(fh[5]);
 - parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
 - parent.no_addr |= be32_to_cpu(fh[7]);
 - case GFS2_SMALL_FH_SIZE:
 - inum.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
 - inum.no_formal_ino |= be32_to_cpu(fh[1]);
 - inum.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
 - inum.no_addr |= be32_to_cpu(fh[3]);
 - break;
 - default:
 - return NULL;
 - }
 -
 - return gfs2_export_ops.find_exported_dentry(sb, &inum, &parent,
 - acceptable, context);
 -}
 -
  static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
 int connectable)
  {
 @@ -189,10 +155,10 @@ static struct dentry *gfs2_get_parent(st
   return dentry;
  }
  
 -static struct dentry *gfs2_get_dentry(struct super_block *sb, void *inum_obj)
 +static struct dentry *gfs2_get_dentry(struct super_block *sb,
 + struct gfs2_inum_host *inum)
  {
   struct gfs2_sbd *sdp = sb->s_fs_info;
 - struct gfs2_inum_host *inum = inum_obj;
   struct gfs2_holder i_gh, ri_gh, rgd_gh;
   struct gfs2_rgrpd *rgd;
   struct inode *inode;
 @@ -289,11 +255,50 @@ fail:
   return ERR_PTR(error);
  }
  
 +static struct dentry *gfs2_fh_to_dentry(struct super_block *sb, struct fid 
 *fid,
 + int fh_len, int fh_type)
 +{
 + struct gfs2_inum_host this;
 + __be32 *fh = (__force __be32 *)fid->raw;
 +
 + switch (fh_type) {
 + case GFS2_SMALL_FH_SIZE:
 + case GFS2_LARGE_FH_SIZE:
 + case GFS2_OLD_FH_SIZE:
 + this.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
 + this.no_formal_ino |= be32_to_cpu(fh[1]);
 + this.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
 + this.no_addr |= be32_to_cpu(fh[3]);
 + return gfs2_get_dentry(sb, &this);
 + default:
 + return NULL;
 + }
 +}
 +
 +static struct dentry *gfs2_fh_to_parent(struct super_block *sb, struct fid 
 *fid,
 + int fh_len, int fh_type)
 +{
 + struct gfs2_inum_host parent;
 + __be32 *fh = (__force __be32 *)fid->raw;
 +
 + switch (fh_type) {
 + case GFS2_LARGE_FH_SIZE:
 + case GFS2_OLD_FH_SIZE:
 + parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
 + parent.no_formal_ino |= be32_to_cpu(fh[5]);
 + parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
 + parent.no_addr |= be32_to_cpu(fh[7]);
 + return gfs2_get_dentry(sb, &parent);
 + default:
 + return NULL;
 + }
 +}
 +
  struct export_operations gfs2_export_ops = {
 - .decode_fh = gfs2_decode_fh,
   .encode_fh = gfs2_encode_fh,
 + .fh_to_dentry = gfs2_fh_to_dentry,
 + .fh_to_parent = gfs2_fh_to_parent,
   .get_name = gfs2_get_name,
   .get_parent = gfs2_get_parent,
 - .get_dentry = gfs2_get_dentry,
  };
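
For readers following the handle layout used in the patch above: each
64-bit field is stored as two consecutive big-endian 32-bit words, high
word first. A portable userspace sketch of that decoding might look
like this; be32_at() and fh_u64() are hypothetical helpers written for
illustration, not part of gfs2.

```c
#include <stdint.h>

/* Read one big-endian 32-bit word from raw bytes, independent of host order. */
static uint32_t be32_at(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Combine handle words [word] (high half) and [word + 1] (low half). */
uint64_t fh_u64(const unsigned char *fh, int word)
{
    uint64_t hi = be32_at(fh + 4 * word);
    uint64_t lo = be32_at(fh + 4 * (word + 1));

    return (hi << 32) | lo;
}
```

With this layout, fh_u64(fh, 0) and fh_u64(fh, 2) would correspond to
the no_formal_ino and no_addr fields of the object itself, and words 4
to 7 to the same two fields of the parent.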
  
 
 --



Re: [NFS] [PATH 18/19] exportfs: update documentation

2007-09-14 Thread Greg Banks
On Thu, Aug 30, 2007 at 03:17:24PM +0200, Christoph Hellwig wrote:
 + encode_fh  (optinonal)
^
 +Takes a dentry and creates a filehandle fragment which can later be used
 +to find/create a dentry for the same object.  The default implementation
 +creates a filehandle fragment that encodes a 32bit inode and generation
 +number for the inode encoded, and if nessecary the same information for
^
 +the parent.
 +
 +  fh_to_dentry (mandatory)
 +Given a filehandle fragment, this should find the implied object and
 +create a dentry for it (possibly with d_alloc_anon).
 +
 +  fh_to_parent (optional but strongly recommended)
 +Given a filehandle fragment, this should find the parent of the
 +implied object and create a dentry for it (possibly with d_alloc_anon).
 +May simplify fail if the filehandle fragment is too small.
   

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change

2007-09-14 Thread Greg Banks
On Fri, Sep 14, 2007 at 10:58:38AM -0400, Jeff Layton wrote:
 On Sat, 15 Sep 2007 00:40:33 +1000
 Greg Banks [EMAIL PROTECTED] wrote:
 
 
  Ok, you convinced me.
 
 Right. When I was first looking at this, I considered some similar
 approaches, but hit roadblocks with all of them. The only real option
 seems to be to leave this to the server, but that does assume that the
 server handles this properly.
 
 Servers that don't are broken, IMO.

According to what spec?  A quick trip around the machine room shows
that neither Solaris 10 nor Darwin 7.9.0 clobber setuid on write
either.

 If Irix isn't clearing these bits
 on a write then it might be good to see if they can fix that...

I think first you'd have to mount a serious argument that it's broken,
more serious than "it works differently from Linux".

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Re: [NFS] [PATH 10/19] xfs: new export ops

2007-09-14 Thread Christoph Hellwig
On Sat, Sep 15, 2007 at 01:22:16AM +1000, Greg Banks wrote:
 Not really a comment on your patches, but I got the original logic
 wrong here.  The VFS_32BITINODES flag only affects newly allocated
 inodes and is no guarantee that any particular inode is < 2^32-1.
 It's possible for an unlucky user to perform a sequence of mounts
 and IO which results in large inode numbers despite the presence of
 that flag; we recently saw this happen by accident on a customer site.
 So the right thing to do is probably to check the inode number against
 (u32)~0.  Unfortunately, given the current encoding scheme, you have to
 check both the inode and the parent inode, which complicates the logic.

I'll see if we can do anything later on.  But for now I'll leave it
as-is because this file will be merge hell anyway when both vfs removal
and exporting changes hit the tree..
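
For reference, the check Greg describes could look roughly like the
following. This is a sketch only: the function names are invented, and
the has_parent flag is my assumption about how a handle without parent
information would be signalled.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the inode number can be stored in a 32-bit handle field. */
bool ino_fits_u32(uint64_t ino)
{
    return ino <= (uint32_t)~0u;
}

/*
 * A 32-bit handle is only usable if every inode number it has to carry
 * fits: the inode itself and, when the handle is connectable, its parent.
 */
bool can_use_32bit_fh(uint64_t ino, uint64_t parent_ino, bool has_parent)
{
    if (!ino_fits_u32(ino))
        return false;
    return !has_parent || ino_fits_u32(parent_ino);
}
```

The point is that the decision depends on the actual inode numbers
rather than on a mount flag.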



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Goswin von Brederlow
Hi,


Nick Piggin [EMAIL PROTECTED] writes:

 In my attack, I cause the kernel to allocate lots of unmovable allocations
 and deplete movable groups. I theoretically then only need to keep a
 small number (1/2^N) of these allocations around in order to DoS a
 page allocation of order N.

I'm assuming that when an unmovable allocation hijacks a movable group
any further unmovable alloc will evict movable objects out of that
group before hijacking another one. right?

 And it doesn't even have to be a DoS. The natural fragmentation
 that occurs in a kernel today has the possibility to slowly push out
 the movable groups and give you the same situation.

How would you cause that? Say you do want to purposefully place one
unmovable 4k page into every 64k compound page. So you allocate
4K. First 64k page locked. But now, to get 4K into the second 64K page
you have to first use up all the rest of the first 64k page. Meaning
one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then
will a new 64k chunk be broken and become locked.

So to get the last 64k chunk used all previous 32k chunks need to be
blocked and you need to allocate 32k (or less if more is blocked). For
all previous 32k chunks to be blocked every second 16k needs to be
blocked. To block the last of those 16k chunks all previous 8k chunks
need to be blocked and you need to allocate 8k. For all previous 8k
chunks to be blocked every second 4k page needs to be used. To alloc
the last of those 4k pages all previous 4k pages need to be used.

So to construct a situation where no contiguous 64k chunk is free you
have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or
thereabouts) of memory first. Only then could you free memory again while
still keeping every 64k page blocked. Does that occur naturally given
enough ram to start with?



To see how bad fragmentation could be I wrote a little program to
simulate allocations with the following simplified algorithm:

Memory management:
- Free pages are kept in buckets, one per order, and sorted by address.
- alloc() the front page (smallest address) out of the bucket of the
  right order or recursively splits the next higher bucket.
- free() recursively tries to merge a page with its neighbour and puts
  the result back into the proper bucket (sorted by address).
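
A minimal sketch of the memory management rules listed above, for
anyone who wants to experiment. This is not the program behind the
pictures linked in this mail; NPAGES, MAX_ORDER and the array
representation of the buckets are assumptions chosen to keep the
example short.

```c
#define MAX_ORDER 4     /* orders 0..4, i.e. 4k .. 64k blocks */
#define NPAGES    64    /* simulated memory in 4k pages, multiple of 16 */

/* block_order[p] == o: a free block of order o starts at page p; else -1. */
static int block_order[NPAGES];

/* Start with all memory free as maximal (64k) blocks. */
void buddy_init(void)
{
    for (int p = 0; p < NPAGES; p++)
        block_order[p] = (p % (1 << MAX_ORDER) == 0) ? MAX_ORDER : -1;
}

/* Front of a bucket: lowest-addressed free block of exactly this order. */
static int bucket_front(int order)
{
    for (int p = 0; p < NPAGES; p += 1 << order)
        if (block_order[p] == order)
            return p;
    return -1;
}

/* Returns the first page of the allocated block, or -1 if none is left. */
int buddy_alloc(int order)
{
    if (order > MAX_ORDER)
        return -1;

    int p = bucket_front(order);
    if (p >= 0) {
        block_order[p] = -1;            /* take it out of its bucket */
        return p;
    }

    /* Bucket empty: recursively split a block of the next higher order. */
    p = buddy_alloc(order + 1);
    if (p < 0)
        return -1;
    block_order[p + (1 << order)] = order;  /* upper half stays free */
    return p;                               /* lower half is handed out */
}

/* Free a block and merge it with its buddy as long as the buddy is free. */
void buddy_free(int p, int order)
{
    while (order < MAX_ORDER) {
        int buddy = p ^ (1 << order);

        if (block_order[buddy] != order)
            break;
        block_order[buddy] = -1;
        if (buddy < p)
            p = buddy;
        order++;
    }
    block_order[p] = order;
}
```

Starting from buddy_init(), the calls buddy_alloc(0), buddy_alloc(0),
buddy_alloc(1), buddy_alloc(0) return pages 0, 1, 2 and 4: the first
64k block keeps being carved up before a second one is touched, which
is the behaviour the argument earlier in this mail relies on.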

Allocation and lifetime:
- Every tick a new page is allocated with random order.
- The order is a triangle distribution with max at 0 (throw 2 dice,
  add the eyes, subtract 7, abs() the number).
- The page is scheduled to be freed after X ticks. Where X is nearly
  a Gaussian curve centered at 0 and maximum at total num pages * 1.5.
  (What I actually do is throw 8 dice and sum them up and shift the
   result.)

Display:
I start with a white window. Every page allocation draws a black box
from the address of the page and as wide as the page is big (-1 pixel to
give a separation to the next page). Every page free draws a yellow
box in place of the black one. Yellow to show where a page was in use
at one point while white means the page was never used.

As the time ticks the memory fills up. Quickly at first and then comes
to a stop around 80% filled. And then something interesting
happens. The yellow regions (previously used but now free) start
drifting up. Small pages tend to end up in the lower addresses and big
pages at the higher addresses. The memory defragments itself to some
degree.

http://mrvn.homeip.net/fragment/

Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k,
295296 16k, 176647 32k and 59064 64k allocations you get this:
http://mrvn.homeip.net/fragment/256mb.png

Simulating 1GB ram and after 5881185 ticks  and 2116671 4k, 1645957
8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
http://mrvn.homeip.net/fragment/1gb.png

MfG
Goswin


Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change

2007-09-14 Thread Jeff Layton
On Sat, 15 Sep 2007 01:43:45 +1000
Greg Banks [EMAIL PROTECTED] wrote:

 On Fri, Sep 14, 2007 at 10:58:38AM -0400, Jeff Layton wrote:
  On Sat, 15 Sep 2007 00:40:33 +1000
  Greg Banks [EMAIL PROTECTED] wrote:
  
  
   Ok, you convinced me.
  
  Right. When I was first looking at this, I considered some similar
  approaches, but hit roadblocks with all of them. The only real option
  seems to be to leave this to the server, but that does assume that the
  server handles this properly.
  
  Servers that don't are broken, IMO.
 
 According to what spec?  A quick trip around the machine room shows
 that neither Solaris 10 nor Darwin 7.9.0 clobber setuid on write
 either.
 

Hmm, last time I checked Solaris, I thought it did, but that was
Solaris 11. I'll plan to fire up my solaris qemu image and test
it again...

  If Irix isn't clearing these bits
  on a write then it might be good to see if they can fix that...
 
 I think first you'd have to mount a serious argument that it's broken,
 more serious than "it works differently from Linux".
 

Good point. POSIX is frustratingly ambiguous on this:

 Upon successful completion, where nbyte is greater than 0, write()
 shall mark for update the st_ctime and st_mtime fields of the file,
 and if the file is a regular file, the S_ISUID and S_ISGID bits of
 the file mode may be cleared.

...the "may" in that last sentence makes it optional, I suppose. Even if
it weren't then I guess there's also an argument that a write that comes
in via a nfs server may not be subject to the same semantics as the
write() syscall.
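
For comparison, the behaviour that (as far as I know) Linux applies to
writes on local filesystems can be sketched as below. This is a
userspace approximation, not kernel code: mode_after_write() is a
made-up name, the is_regular flag replaces the file type check, and
the privilege check that exempts suitably privileged writers is left
out.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Mode bits a regular file would be left with after a successful write. */
mode_t mode_after_write(mode_t mode, bool is_regular)
{
    if (!is_regular)
        return mode;

    /* setuid is always dropped */
    mode &= ~(mode_t)S_ISUID;

    /*
     * setgid is only dropped together with group-execute; without it
     * the bit marks mandatory locking rather than a setgid executable.
     */
    if (mode & S_IXGRP)
        mode &= ~(mode_t)S_ISGID;

    return mode;
}
```

For the example from earlier in this thread, mode 04770 becomes 0770.
A server that leaves the bits alone is still within the "may be
cleared" wording quoted above.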

In any case, "broken" is probably too strong a term :-)

-- 
Jeff Layton [EMAIL PROTECTED]


[PATCH] exportfs: fix doc types

2007-09-14 Thread Christoph Hellwig
And here's a patch to fix the typos Greg found:


Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/Documentation/filesystems/Exporting
===
--- linux-2.6.orig/Documentation/filesystems/Exporting  2007-09-14 
17:59:35.0 +0200
+++ linux-2.6/Documentation/filesystems/Exporting   2007-09-14 
18:01:41.0 +0200
@@ -104,7 +104,7 @@ are exportable by setting the s_export_o
 super_block.  This field must point to a struct export_operations
 struct which has the following members:
 
- encode_fh  (optinonal)
+ encode_fh  (optional)
 Takes a dentry and creates a filehandle fragment which can later be used
 to find/create a dentry for the same object.  The default implementation
 creates a filehandle fragment that encodes a 32bit inode and generation
@@ -118,7 +118,7 @@ struct which has the following members:
   fh_to_parent (optional but strongly recommended)
 Given a filehandle fragment, this should find the parent of the
 implied object and create a dentry for it (possibly with d_alloc_anon).
-May simplify fail if the filehandle fragment is too small.
+May fail if the filehandle fragment is too small.
 
   get_parent (optional but strongly recommended)
 When given a dentry for a directory, this should return  a dentry for


[RFC][PATCH] 9p: add readahead support for loose mode

2007-09-14 Thread Eric Van Hensbergen
This patch adds readpages support in support of readahead when using loose
cache mode.  It substantially increases performance for certain workloads.

Signed-off-by: Eric Van Hensbergen [EMAIL PROTECTED]
---
 fs/9p/v9fs.c|2 +-
 fs/9p/vfs_addr.c|   98 ++
 include/net/9p/client.h |3 +-
 net/9p/client.c |   82 +--
 4 files changed, 143 insertions(+), 42 deletions(-)

diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c
index 89ee0ba..ca97404 100644
--- a/fs/9p/v9fs.c
+++ b/fs/9p/v9fs.c
@@ -131,7 +131,7 @@ static void v9fs_parse_options(struct v9fs_session_info 
*v9ses)
char *s, *e;
 
/* setup defaults */
-   v9ses->maxdata = 8192;
+   v9ses->maxdata = (64*1024);
v9ses->afid = ~0;
v9ses->debug = 0;
v9ses->cache = 0;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 6248f0e..86c6e0d 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -31,8 +31,11 @@
 #include <linux/string.h>
 #include <linux/inet.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/idr.h>
 #include <linux/sched.h>
+#include <linux/uio.h>
+#include <linux/task_io_accounting_ops.h>
 #include <net/9p/9p.h>
 #include <net/9p/client.h>
 
@@ -50,31 +53,108 @@
 
 static int v9fs_vfs_readpage(struct file *filp, struct page *page)
 {
-   int retval;
loff_t offset;
char *buffer;
struct p9_fid *fid;
+   int retval = 0;
+   int total = 0;
+   int count = PAGE_SIZE;
 
P9_DPRINTK(P9_DEBUG_VFS, "\n");
fid = filp->private_data;
buffer = kmap(page);
offset = page_offset(page);
 
-   retval = p9_client_readn(fid, buffer, offset, PAGE_CACHE_SIZE);
-   if (retval < 0)
-   goto done;
+   while (count) {
+   struct kvec kv = {buffer+offset, PAGE_SIZE-count};
+   retval = p9_client_readv(fid, &kv, offset, 1);
+   if (retval <= 0)
+   break;
 
-   memset(buffer + retval, 0, PAGE_CACHE_SIZE - retval);
-   flush_dcache_page(page);
-   SetPageUptodate(page);
-   retval = 0;
+   buffer += retval;
+   offset += retval;
+   count -= retval;
+   total += retval;
+   }
+
+   if (retval >= 0) {
+   flush_dcache_page(page);
+   SetPageUptodate(page);
+   retval = 0;
+   }
 
-done:
kunmap(page);
unlock_page(page);
return retval;
 }
 
+/* large chunks copied and adapted from fs/cifs/file.c */
+static int v9fs_vfs_readpages(struct file *file, struct address_space *mapping,
+   struct list_head *page_list, unsigned num_pages)
+{
+   struct page *tmp_page;
+   loff_t offset;
+   struct pagevec lru_pvec;
+   struct p9_fid *fid;
+   u32 read_size;
+   int retval = 0;
+   unsigned int count = 0;
+   struct list_head *p, *n;
+
+   struct kvec *kv = kmalloc(sizeof(struct kvec)*num_pages, GFP_KERNEL);
+
+   P9_DPRINTK(P9_DEBUG_VFS, "%d pages\n", num_pages);
+
+   if (!kv)
+   return -ENOMEM;
+
+   if (list_empty(page_list))
+   goto free_vec;
+
+   pagevec_init(&lru_pvec, 0);
+
+   fid = file->private_data;
+   tmp_page = list_entry(page_list->prev, struct page, lru);
+   offset = (loff_t)tmp_page->index << PAGE_CACHE_SHIFT;
+
+   list_for_each_entry_reverse(tmp_page, page_list, lru) {
+   BUG_ON(count > num_pages);
+   if (add_to_page_cache(tmp_page, mapping,
+   tmp_page->index, GFP_KERNEL)) {
+   page_cache_release(tmp_page);
+   continue;
+   }
+
+   kv[count].iov_base = kmap(tmp_page);
+   kv[count].iov_len = PAGE_CACHE_SIZE;
+   count++;
+   }
+
+   read_size = count * PAGE_CACHE_SIZE;
+   if (!read_size)
+   goto cleanup;
+
+   retval = p9_client_readv(fid, kv, offset, count);
+
+cleanup:
+   list_for_each_safe(p, n, page_list) {
+   tmp_page = list_entry(p, struct page, lru);
+   list_del(&tmp_page->lru);
+   if (!pagevec_add(&lru_pvec, tmp_page))
+   __pagevec_lru_add(&lru_pvec);
+   kunmap(tmp_page);
+   flush_dcache_page(tmp_page);
+   SetPageUptodate(tmp_page);
+   unlock_page(tmp_page);
+   }
+   pagevec_lru_add(&lru_pvec);
+
+free_vec:
+   kfree(kv);
+   return retval;
+}
+
 const struct address_space_operations v9fs_addr_operations = {
   .readpage = v9fs_vfs_readpage,
+  .readpages = v9fs_vfs_readpages,
 };
diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 9b9221a..6f17d0a 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -67,8 +67,7 @@ int p9_client_fcreate(struct p9_fid *fid, char *name, u32 
perm, int mode,
  

Re: [NFS] [PATCH] exportfs: fix doc types

2007-09-14 Thread J. Bruce Fields
On Fri, Sep 14, 2007 at 06:03:01PM +0200, Christoph Hellwig wrote:
 And here's a patch to fix the typos Greg found:

Thanks.  I already had a couple of those in a separate patch, so I've
folded this in.

--b.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Mel Gorman
On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
 Nick Piggin [EMAIL PROTECTED] writes:
 
  In my attack, I cause the kernel to allocate lots of unmovable allocations
  and deplete movable groups. I theoretically then only need to keep a
  small number (1/2^N) of these allocations around in order to DoS a
  page allocation of order N.
 
 I'm assuming that when an unmovable allocation hijacks a movable group
 any further unmovable alloc will evict movable objects out of that
 group before hijacking another one. right?
 

No eviction takes place. If an unmovable allocation gets placed in a
movable group, then steps are taken to ensure that future unmovable
allocations will take place in the same range (these decisions take
place in __rmqueue_fallback()). When choosing a movable block to
pollute, it will also choose the lowest possible block in PFN terms to
steal so that fragmentation pollution will be as confined as possible.
Evicting the unmovable pages would be one of those expensive steps that
have been avoided to date.

  And it doesn't even have to be a DoS. The natural fragmentation
  that occurs in a kernel today has the possibility to slowly push out
  the movable groups and give you the same situation.
 
 How would you cause that? Say you do want to purposefully place one
 unmovable 4k page into every 64k compound page. So you allocate
 4K. First 64k page locked. But now, to get 4K into the second 64K page
 you have to first use up all the rest of the first 64k page. Meaning
 one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then
 will a new 64k chunk be broken and become locked.
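
The split order described above can be checked with a toy buddy allocator. This is a minimal userspace sketch, not kernel code: the names buddy_init, buddy_alloc, buddy_free and first_free, the 32-page pool and the is_free table are all assumptions made for illustration. It follows the simplified rule of taking the lowest-addressed free block of the smallest sufficient order and splitting it.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER 4     /* largest block: 2^4 pages of 4K = 64K */
#define NPAGES    32    /* pool of two 64K blocks, counted in 4K pages */

/* is_free[o][p] is true when a free block of order o starts at page p. */
bool is_free[MAX_ORDER + 1][NPAGES];

/* Reset the pool so that it consists only of free 64K blocks. */
void buddy_init(void)
{
    for (int o = 0; o <= MAX_ORDER; o++)
        for (int p = 0; p < NPAGES; p++)
            is_free[o][p] = false;
    for (int p = 0; p < NPAGES; p += 1 << MAX_ORDER)
        is_free[MAX_ORDER][p] = true;
}

/* Lowest-addressed free block of exactly this order, or -1 if none. */
int first_free(int order)
{
    for (int p = 0; p < NPAGES; p += 1 << order)
        if (is_free[order][p])
            return p;
    return -1;
}

/* Returns the first page of the allocated block, or -1 if nothing fits. */
int buddy_alloc(int order)
{
    int o, p = -1;

    /* Take the smallest order that has a free block available. */
    for (o = order; o <= MAX_ORDER; o++) {
        p = first_free(o);
        if (p >= 0)
            break;
    }
    if (p < 0)
        return -1;
    is_free[o][p] = false;

    /* Split down: keep the lower half, put the upper half back as free. */
    while (o > order) {
        o--;
        is_free[o][p + (1 << o)] = true;
    }
    return p;
}

/* Free a block and merge it with its buddy for as long as the buddy is free. */
void buddy_free(int p, int order)
{
    assert(order >= 0 && order <= MAX_ORDER);

    while (order < MAX_ORDER) {
        int buddy = p ^ (1 << order);

        if (!is_free[order][buddy])
            break;
        is_free[order][buddy] = false;
        if (buddy < p)
            p = buddy;
        order++;
    }
    is_free[order][p] = true;
}
```

Starting from two free 64k blocks, requests for 4k, 4k, 8k, 16k and 32k come back at page offsets 0, 1, 2, 4 and 8, all inside the first 64k block. Only the next 4k request splits the second 64k block (page 16), which matches the sequence described in the quoted text.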

It would be easier early in the boot to mmap a large area and fault it
in in virtual address order then mlock a page every 64K. Early in
the system's lifetime, there will be a rough correlation between physical
and virtual memory.

Without mlock(), the most successful attack will likely mmap() a 60K
region and fault it in as an attempt to get pagetable pages placed in
every 64K region. This strategy would not work with grouping pages by
mobility though as it would group the pagetable pages together.

Targeted attacks on grouping pages by mobility are not very easy and
not that interesting either. As Nick suggests, the natural fragmentation
over long periods of time is what is interesting.

 So to get the last 64k chunk used all previous 32k chunks need to be
 blocked and you need to allocate 32k (or less if more is blocked). For
 all previous 32k chunks to be blocked every second 16k needs to be
 blocked. To block the last of those 16k chunks all previous 8k chunks
 need to be blocked and you need to allocate 8k. For all previous 8k
 chunks to be blocked every second 4k page needs to be used. To alloc
 the last of those 4k pages all previous 4k pages need to be used.
 
 So to construct a situation where no contiguous 64k chunk is free you
 have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or there
 about) of memory first. Only then could you free memory again while
 still keeping every 64k page blocked. Does that occur naturally given
 enough ram to start with?
 

I believe it's very difficult to craft an attack that will work in a
short period of time. An attack that worked on 2.6.22 may well have
no success on 2.6.23-rc4-mm1 for example as grouping pages by mobility
does make it exceedingly hard to craft an attack unless the attacker
can mlock large amounts of memory.

 
 To see how bad fragmentation could be I wrote a little program to
 simulate allocations with the following simplified algorithm:
 
 Memory management:
 - Free pages are kept in buckets, one per order, and sorted by address.
 - alloc() the front page (smallest address) out of the bucket of the
   right order or recursively splits the next higher bucket.
 - free() recursively tries to merge a page with its neighbour and puts
   the result back into the proper bucket (sorted by address).
 
 Allocation and lifetime:
 - Every tick a new page is allocated with random order.

This step in itself is not representative of what happens in the kernel.
The vast vast majority of allocations are order-0. It's a fun analysis
but I'm not sure we can draw any conclusions from it.

Statistical analyses of the buddy algorithm have implied that it doesn't
suffer that badly from external fragmentation but we know in practice
that things are different. A model is hard because minimally the
lifetime of pages varies widely.

 - The order is a triangle distribution with max at 0 (throw 2 dice,
   add the eyes, subtract 7, abs() the number).
 - The page is scheduled to be freed after X ticks. Where X is nearly
   a Gaussian curve centered at 0 and maximum at total num pages * 1.5.
   (What I actually do is throw 8 dice and sum them up and shift the
   result.)
 

I doubt this is how the kernel behaves either.
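
The order distribution quoted above can be enumerated exactly, since two dice have only 36 outcomes. The sketch below is illustrative only: the name order_histogram and the constant SIM_MAX_ORDER are assumptions, not taken from the simulator (whose source is not shown), and it reads the description literally as order = abs(d1 + d2 - 7).

```c
#include <stdlib.h>

#define SIM_MAX_ORDER 5   /* abs(2 - 7) == abs(12 - 7) == 5 */

/* Count how many of the 36 two-dice outcomes map to each order. */
void order_histogram(int counts[SIM_MAX_ORDER + 1])
{
    for (int o = 0; o <= SIM_MAX_ORDER; o++)
        counts[o] = 0;

    for (int a = 1; a <= 6; a++)
        for (int b = 1; b <= 6; b++)
            counts[abs(a + b - 7)]++;
}
```

Under that reading the 36 outcomes split 6, 10, 8, 6, 4, 2 across orders 0 to 5. Only one allocation in six is order-0, and order 1 is the most common because the sums 6 and 8 both fold onto it. That is a long way from a workload where nearly every allocation is order-0.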

 Display:
 I start with a white window. Every page allocation draws a black box
 from the address of the page and as wide as 

Re: [NFS] [PATCH 00/19] export operations rewrite

2007-09-14 Thread J. Bruce Fields
On Fri, Sep 14, 2007 at 01:47:46PM +0200, [EMAIL PROTECTED] wrote:
 This patchset is a medium scale rewrite of the export operations
 interface.  The goal is to make the interface less complex, and
 easier to understand from the filesystem side, as well as preparing
 generic support for exporting of 64bit inode numbers.
 
 This touches all nfs exporting filesystems, and I've done testing
 on all of the filesystems I have here locally (xfs, ext2, ext3, reiserfs,
 jfs)
 
 Compared to the last version I've fixed the white space issues that
 checkpatch.pl complained about.

OK, thanks again.  Everything I have now is in

git://linux-nfs.org/~bfields/linux.git  for-mm

I'm hoping Neil can take a quick look as well (and make a response to
the comment on patch #1 along the way).

 Note that this patch series is against mainline.  There will be some
 xfs changes landing in -mm soon that revamp lots of the code touched
 here.  They should hopefully include the first patch in the series so
 it can be simply dropped, but the xfs conversion will need some smaller
 updates.  I will send this update as soon as the xfs tree updates get
 pulled into -mm.

OK.  Let me know if you need me to do anything when that happens.

--b.


Re: [NFS] [PATH 10/19] xfs: new export ops

2007-09-14 Thread Greg Banks
On Fri, Sep 14, 2007 at 06:03:49PM +0200, Christoph Hellwig wrote:
 On Sat, Sep 15, 2007 at 01:22:16AM +1000, Greg Banks wrote:
  Not really a comment on your patches, but I got the original logic
  wrong here.  The VFS_32BITINODES flag only affects newly allocated
  inodes and is no guarantee that any particular inode is < 2^32-1.
  It's possible for an unlucky user to perform a sequence of mounts
  and IO which results in large inode numbers despite the presence of
  that flag; we recently saw this happen by accident on a customer site.
  So the right thing to do is probably to check the inode number against
  (u32)~0.  Unfortunately, given the current encoding scheme, you have to
  check both the inode and the parent inode, which complicates the logic.
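
A check along those lines might look like the following sketch. It is plain userspace C with a hypothetical helper name (fh_fits_32bit) and stand-in integer types; the real XFS export code and its filehandle encoding are not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when both inode numbers can be stored in 32-bit filehandle fields.
 * Both values are tested because the encoding carries the parent as well.
 */
bool fh_fits_32bit(uint64_t ino, uint64_t parent_ino)
{
    const uint64_t max32 = (uint32_t)~0;   /* 0xffffffff */

    return ino <= max32 && parent_ino <= max32;
}
```

The point of testing the actual numbers rather than a mount flag is that the result stays correct even when earlier mounts have already produced large inode numbers.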
 
 I'll see if we can do anything later on.  But for now I'll leave it
 as-is because this file will be merge hell anyway when both vfs removal
 and exporting changes hit the tree..

Fair 'nuff.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Re: [2/4] 2.6.23-rc6: known regressions

2007-09-14 Thread Kamalesh Babulal

Michal Piotrowski wrote:

Hi all,

Here is a list of some known regressions in 2.6.23-rc6.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name                    Regressions fixed since 21-Jun-2007
Adrian Bunk             10
Linus Torvalds          6
Alan Stern              5
Andi Kleen              5
Hugh Dickins            5
Trond Myklebust         5
Andrew Morton           4
Al Viro                 3
Alexey Starikovskiy     3
Cornelia Huck           3
David S. Miller         3
Jens Axboe              3
Stephen Hemminger       3
Tejun Heo               3



ACPI

Subject : 2.6.23-rc5 hangs on boot, apparently when initializing the EC
References  : http://lkml.org/lkml/2007/9/11/369
Last known good : ?
Submitter   : Chuck Ebbert [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown



CPUFREQ

Subject : ide problems: 2.6.22-git17 working, 2.6.23-rc1* is not
References  : http://lkml.org/lkml/2007/7/27/298
  http://lkml.org/lkml/2007/7/29/371
Last known good : ?
Submitter   : dth [EMAIL PROTECTED]
Caused-By   : Len Brown [EMAIL PROTECTED]
  commit f79e3185dd0f8650022518d7624c876d8929061b
Handled-By  : Len Brown [EMAIL PROTECTED]
Status  : problem is being debugged



FS

Subject : hanging ext3 dbench tests
References  : http://lkml.org/lkml/2007/9/11/176
Last known good : ?
Submitter   : Andy Whitcroft [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown

Subject : umount triggers a warning in jfs and takes almost a minute
References  : http://lkml.org/lkml/2007/9/4/73
Last known good : ?
Submitter   : Oliver Neukum [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown

Subject : [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall
References  : http://lkml.org/lkml/2007/9/4/53
  http://bugzilla.kernel.org/show_bug.cgi?id=9003
Last known good : ?
Submitter   : Daniel J Blueman [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Workaround  : http://bugzilla.kernel.org/attachment.cgi?id=12797
Status  : unknown

Subject : [NFSD OOPS] 2.6.23-rc1-git10
References  : http://lkml.org/lkml/2007/8/2/462
Last known good : ?
Submitter   : Andrew Clayton [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown



Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
  

Hi Michal,

The NFSV4 [BUG] 2.6.23-rc5 kernel BUG at fs/nfs/nfs4xdr.c:945 is again
seen in 2.6.23-rc6. Can this bug be added as a known regression for
2.6.23-rc6? Reference: http://lkml.org/lkml/2007/9/7/27

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.



Re: [NFS] [PATCH 00/19] export operations rewrite

2007-09-14 Thread Christoph Hellwig
On Fri, Sep 14, 2007 at 12:43:55PM -0400, J. Bruce Fields wrote:
 I'm hoping Neil can take a quick look as well (and make a response to
 the comment on patch #1 along the way).

The exportfs_d_alloc one?  I have a patch for that one, I just need
to run it through my testing setup before sending it out.



Re: [NFS] [PATCH 00/19] export operations rewrite

2007-09-14 Thread J. Bruce Fields
On Fri, Sep 14, 2007 at 07:00:25PM +0200, Christoph Hellwig wrote:
 On Fri, Sep 14, 2007 at 12:43:55PM -0400, J. Bruce Fields wrote:
  I'm hoping Neil can take a quick look as well (and make a response to
  the comment on patch #1 along the way).
 
 The exportfs_d_alloc one?  I have a patch for that one, I just need
 to run it through my testing setup before sending it out.

Oops, I meant #19, not #1--there's still a note addressed to Neil
sitting in there.

Typing faster than I was thinking, sorry--b.


Re: [2/4] 2.6.23-rc6: known regressions

2007-09-14 Thread J. Bruce Fields
On Wed, Sep 12, 2007 at 06:58:54PM +0200, Michal Piotrowski wrote:
 Subject : [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall
 References  : http://lkml.org/lkml/2007/9/4/53
   http://bugzilla.kernel.org/show_bug.cgi?id=9003
 Last known good : ?
 Submitter   : Daniel J Blueman [EMAIL PROTECTED]
 Caused-By   : ?
 Handled-By  : ?
 Workaround  : http://bugzilla.kernel.org/attachment.cgi?id=12797
 Status  : unknown

I have patches which fix this, which we're testing.  (See bugzilla.)
It's a long-standing bug, not a regression.

 Subject : [NFSD OOPS] 2.6.23-rc1-git10
 References  : http://lkml.org/lkml/2007/8/2/462
 Last known good : ?
 Submitter   : Andrew Clayton [EMAIL PROTECTED]
 Caused-By   : ?
 Handled-By  : ?
 Status  : unknown

Neil's working on this.  Also not a regression.

--b.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Christoph Lameter
On Fri, 14 Sep 2007, Nick Piggin wrote:

   [*] ok, this isn't quite true because if you can actually put a hard
   limit on unmovable allocations then anti-frag will fundamentally help --
   get back to me on that when you get patches to move most of the obvious 
   ones.
 
  We have this hard limit using ZONE_MOVABLE in 2.6.23.
 
 So we're back to 2nd class support.

2nd class support for me means a feature that is not enabled by default 
but that can be enabled in order to increase performance. The 2nd class 
support is there because we are not yet sure about the maturity of the 
memory allocation methods.

 
  Reserve pools as handled (by the not yet available) large page pool
  patches (which again has altogether another purpose) are not a limit. The
  reserve pools are used to provide a minimum of higher order pages that is
  not broken down in order to ensure that a minimum number of the desired
  order of pages is even available in your worst case scenario. Mainly I
  think that is needed during the period when memory defragmentation is
  still under development.
 
 fsblock doesn't need any of those hacks, of course.

Nor does mine for the low orders that we are considering. For order > 
MAX_ORDER this is unavoidable since the page allocator cannot manage such 
large pages. It can be used for lower order if there are issues (that I 
have not seen yet).



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Christoph Lameter
On Fri, 14 Sep 2007, Nick Piggin wrote:

 However fsblock can do everything that higher order pagecache can
 do in terms of avoiding vmap and giving contiguous memory to block
 devices by opportunistically allocating higher orders of pages, and falling
 back to vmap if they cannot be satisfied.

fsblock is restricted to the page cache and cannot be used in other 
contexts where subsystems can benefit from larger linear memory.

 So if you argue that vmap is a downside, then please tell me how you
 consider the -ENOMEM of your approach to be better?

That is again pretty undifferentiated. Are we talking about low page 
orders? There we will reclaim all of the reclaimable memory before getting 
an -ENOMEM. Given the quantities of pages on today's machines--a 1 G machine 
has 256 million 4k pages--and the unmovable ratios we see today it 
would require a very strange setup to get an allocation failure while 
still being able to allocate order 0 pages.

With the ZONE_MOVABLE you can remove the unmovable objects into a defined 
pool then higher order success rates become reasonable.

 If, by special software layer, you mean the vmap/vunmap support in
 fsblock, let's see... that's probably all of a hundred or two lines.
 Contrast that with anti-fragmentation, lumpy reclaim, higher order
 pagecache and its new special mmap layer... Hmm, seems like a no
 brainer to me. You really still want to pursue the extra layer
 argument as a point against fsblock here?

Yes sure. Your code could not live without these approaches. Without the 
antifragmentation measures your fsblock code would not be very successful 
in getting the larger contiguous segments you need to improve performance.

(There is no new mmap layer, the higher order pagecache is simply the old 
API with set_blocksize expanded).
 

 Of course I wouldn't state that. On the contrary, I categorically state that
 I have already solved it :)

Well then I guess that you have not read the requirements...

  Because it has already been rejected in another form and adds more
 
 You have rejected it. But they are bogus reasons, as I showed above.

That's not me. I am working on this because many of the filesystem people 
have repeatedly asked me to do this. I am no expert on filesystems.

 You also describe some other real (if lesser) issues like number of page
 structs to manage in the pagecache. But this is hardly enough to reject
 my patch now... for every downside you can point out in my approach, I
 can point out one in yours.
 
 - fsblock doesn't require changes to virtual memory layer

Therefore it is not a generic change but special to the block layer. So 
other subsystems still have to deal with the single page issues on 
their own.

  Maybe we could get to something like a hybrid that avoids some of these
  issues? Add support so something like a virtual compound page can be
  handled transparently in the filesystem layer with special casing if
  such a beast reaches the block layer?
 
 That's conceptually much worse, IMO.

Why? It is the same approach that you use. If it is barely ever used and 
satisfies your concern then I am fine with it.

 And practically worse as well: vmap space is limited on 32-bit; fsblock
 approach can avoid vmap completely in many cases; for two reasons.
 
 The fsblock data accessor APIs aren't _that_ bad changes. They change
 zero conceptually in the filesystem, are arguably cleaner, and can be
 essentially nooped if we wanted to stay with a b_data type approach
 (but they give you that flexibility to replace it with any implementation).

The largeblock changes are generic. They improve general handling of 
compound pages, they make the existing APIs work for large units of 
memory, they are not adding additional new API layers.
 


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Christoph Lameter
On Fri, 14 Sep 2007, Christoph Lameter wrote:

 an -ENOMEM. Given the quantities of pages on today's machines--a 1 G machine 

s/1G/1T/ Sigh.

 has 256 million 4k pages--and the unmovable ratios we see today it 

256k for 1G.
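
The corrected figures are easy to verify. A minimal sketch, assuming 4 KiB pages and binary units (pages_in is a made-up helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole pages of page_size bytes contained in mem_bytes. */
uint64_t pages_in(uint64_t mem_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return mem_bytes / page_size;
}
```

With these assumptions, 1 GiB holds 262144 pages (256k) and 1 TiB holds 268435456 pages (roughly 256 million), which agrees with both corrections above.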



[PATCH] JBD slab cleanups

2007-09-14 Thread Mingming Cao
jbd/jbd2: Replace slab allocations with page cache allocations

From: Christoph Lameter [EMAIL PROTECTED]

JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.

Tested on 2.6.23-rc6 with fsx runs fine.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/jbd/checkpoint.c   |2 
 fs/jbd/commit.c   |6 +-
 fs/jbd/journal.c  |  107 -
 fs/jbd/transaction.c  |   10 ++--
 fs/jbd2/checkpoint.c  |2 
 fs/jbd2/commit.c  |6 +-
 fs/jbd2/journal.c |  109 --
 fs/jbd2/transaction.c |   18 
 include/linux/jbd.h   |   23 +-
 include/linux/jbd2.h  |   28 ++--
 10 files changed, 83 insertions(+), 228 deletions(-)

Index: linux-2.6.23-rc5/fs/jbd/journal.c
===
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c  2007-09-13 13:37:57.0 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c   2007-09-13 13:45:39.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;
 
jbd_unlock_bh_state(bh_in);
-   tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+   tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
-   jbd_slab_free(tmp, bh_in->b_size);
+   jbd_free(tmp, bh_in->b_size);
goto repeat;
}
 
@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
/* Set up a default-sized revoke table for the new mount. */
err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
if (err) {
-   kfree(journal);
+   jbd_kfree(journal);
goto fail;
}
return journal;
@@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
-   kfree(journal);
+   jbd_kfree(journal);
journal = NULL;
goto out;
}
@@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
-   kfree(journal);
+   jbd_kfree(journal);
return NULL;
}
 
@@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
if (err) {
printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
   __FUNCTION__);
-   kfree(journal);
+   jbd_kfree(journal);
return NULL;
}
 
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}
 
-   /*
-* Create a slab for this blocksize
-*/
-   err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-   if (err)
-   return err;
-
/* Let the recovery code check whether it needs to recover any
 * data from the journal. */
if (journal_recover(journal))
@@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
if (journal->j_revoke)
journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
-   kfree(journal);
+   jbd_kfree(journal);
 }
 
 
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-   return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-   "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
-   int i;
-
-   for (i = 0; i < JBD_MAX_SLABS; i++) {
-   if (jbd_slab[i])
-   kmem_cache_destroy(jbd_slab[i]);
-   jbd_slab[i] = NULL;
-   }
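
For reference, the index macro removed in the hunk above explains the NULL hole in jbd_slab_names. A small userspace sketch, assuming the macro shifts the block size right by 11 bits as in mainline fs/jbd/journal.c of that era (jbd_slab_index is a hypothetical function standing in for the JBD_SLAB_INDEX macro):

```c
#include <stddef.h>

/* Stand-in for the removed JBD_SLAB_INDEX(size) macro: size >> 11. */
size_t jbd_slab_index(size_t size)
{
    return size >> 11;
}
```

Block sizes of 1k, 2k, 4k and 8k map to slots 0, 1, 2 and 4, so slot 3 of the five-entry table is never used and its name is NULL.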

Re: [PATCH] JBD slab cleanups

2007-09-14 Thread Christoph Lameter
Thanks Mingming.




Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

Evgeniy Polyakov wrote:

Hi.

I'm pleased to announce fourth release of the distributed storage
subsystem, which allows to form a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes new configuration interface (kernel connector over
netlink socket) and number of fixes of various bugs found during move 
to it (in error path).


Further TODO list includes:
* implement optional saving of mirroring/linear information on the remote
nodes (simple)
* new redundancy algorithm (complex)
* some thoughts about distributed filesystem tightly connected to DST
(far-far plans so far)

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


My thoughts.  But first a disclaimer:   Perhaps you will recall me as 
one of the people who really reads all your patches, and examines your 
code and proposals closely.  So, with that in mind...


I question the value of distributed block services (DBS), whether it's 
your version or the others out there.  DBS are not very useful, because 
they still rely on a useful filesystem sitting on top of the DBS.  It 
devolves into one of two cases:  (1) multi-path much like today's SCSI, 
with distributed filesystem arbitration to ensure coherency, or (2) the 
filesystem running on top of the DBS is on a single host, and thus, a 
single point of failure (SPOF).


It is quite logical to extend the concepts of RAID across the network, 
but ultimately you are still bound by the inflexibility and simplicity 
of the block device.


In contrast, a distributed filesystem offers far more scalability, 
eliminates single points of failure, and offers more room for 
optimization and redundancy across the cluster.


A distributed filesystem is also much more complex, which is why 
distributed block devices are so appealing :)


With a redundant, distributed filesystem, you simply do not need any 
complexity at all at the block device level.  You don't even need RAID.


It is my hope that you will put your skills towards a distributed 
filesystem :)  Of the current solutions, GFS (currently in kernel) 
scales poorly, and NFS v4.1 is amazingly bloated and overly complex.


I've been waiting for years for a smart person to come along and write a 
POSIX-only distributed filesystem.


Jeff





Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Al Boldi
Jeff Garzik wrote:
 Evgeniy Polyakov wrote:
  Hi.
 
  I'm pleased to announce fourth release of the distributed storage
  subsystem, which allows to form a storage on top of remote and local
  nodes, which in turn can be exported to another storage as a node to
  form tree-like storages.
 
  This release includes new configuration interface (kernel connector over
  netlink socket) and number of fixes of various bugs found during move
  to it (in error path).
 
  Further TODO list includes:
  * implement optional saving of mirroring/linear information on the
  remote nodes (simple)
  * new redundancy algorithm (complex)
  * some thoughts about distributed filesystem tightly connected to DST
  (far-far plans so far)
 
  Homepage:
  http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
 
  Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

 My thoughts.  But first a disclaimer:   Perhaps you will recall me as
 one of the people who really reads all your patches, and examines your
 code and proposals closely.  So, with that in mind...

 I question the value of distributed block services (DBS), whether it's
 your version or the others out there.  DBS are not very useful, because
 they still rely on a useful filesystem sitting on top of the DBS.  It
 devolves into one of two cases:  (1) multi-path much like today's SCSI,
 with distributed filesystem arbitration to ensure coherency, or (2) the
 filesystem running on top of the DBS is on a single host, and thus, a
 single point of failure (SPOF).

 It is quite logical to extend the concepts of RAID across the network,
 but ultimately you are still bound by the inflexibility and simplicity
 of the block device.

 In contrast, a distributed filesystem offers far more scalability,
 eliminates single points of failure, and offers more room for
 optimization and redundancy across the cluster.

 A distributed filesystem is also much more complex, which is why
 distributed block devices are so appealing :)

 With a redundant, distributed filesystem, you simply do not need any
 complexity at all at the block device level.  You don't even need RAID.

 It is my hope that you will put your skills towards a distributed
 filesystem :)  Of the current solutions, GFS (currently in kernel)
 scales poorly, and NFS v4.1 is amazingly bloated and overly complex.

 I've been waiting for years for a smart person to come along and write a
 POSIX-only distributed filesystem.

This http://lkml.org/lkml/2007/8/12/159 may provide a fast-path to reaching 
that goal.


Thanks!

--
Al


Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread J. Bruce Fields
On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
 J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
 I've been waiting for years for a smart person to come along and write a 
 POSIX-only distributed filesystem.

 What exactly do you mean by POSIX-only?

 Don't bother supporting attributes, file modes, and other details not 
 supported by POSIX.  The prime example being NFSv4, which is larded down 
 with Windows features.

I am sympathetic.  Cutting those out may still leave you with
something pretty complicated, though.

 NFSv4.1 adds to the fun, by throwing interoperability completely out the 
 window.

What parts are you worried about in particular?

--b.


Re: [PATCH 19/19] exportfs: update documentation

2007-09-14 Thread Randy Dunlap
On Fri, 14 Sep 2007 13:49:28 +0200 [EMAIL PROTECTED] wrote:

typos only below:


 Update deocumentation to the current state of affairs.  Remove duplicated
 method descruptions in exportfs.h and point to Documentation/filesystems/
 Exporting instead.  Add a little file header comment in expfs.c describing
 what's going on and mentioning Neils and my copyright [1].
 
 [1] Neil, in case you want a different/additional attribution just change
 the patch in your queue to reflect the preferred version.
 
 
 Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]
 
 Index: linux-2.6/Documentation/filesystems/Exporting
 ===
 --- linux-2.6.orig/Documentation/filesystems/Exporting2007-03-16 
 15:10:54.0 +0100
 +++ linux-2.6/Documentation/filesystems/Exporting 2007-03-16 
 17:11:50.0 +0100
 + encode_fh  (optinonal)

   (optional)

 +Takes a dentry and creates a filehandle fragment which can later be used
 +to find/create a dentry for the same object.  The default implementation
 +creates a filehandle fragment that encodes a 32bit inode and generation
 +number for the inode encoded, and if nessecary the same information for

necessary

 +the parent.
 +
 +  fh_to_dentry (mandatory)
 +Given a filehandle fragment, this should find the implied object and
 +create a dentry for it (possibly with d_alloc_anon).
 +
 +  fh_to_parent (optional but strongly recommended)
 +Given a filehandle fragment, this should find the parent of the
 +implied object and create a dentry for it (possibly with d_alloc_anon).
 +May simplify fail if the filehandle fragment is too small.\

   simply  (?)

 +
 +  get_parent (optional but strongly recommended)
 +When given a dentry for a directory, this should return  a dentry for
 +the parent.  Quite possibly the parent dentry will have been allocated
 +by d_alloc_anon.  The default get_parent function just returns an error
 +so any filehandle lookup that requires finding a parent will fail.
 +->lookup("..") is *not* used as a default as it can leave ".." entries
 +in the dcache which are too messy to work with.
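
The filehandle fragment described in the documentation hunk above can be modelled in a few lines. This is only a sketch in Python, not kernel code: the little-endian byte order, the `FH_PAIR` layout and the function names (which merely echo the method names in the patch) are assumptions made for illustration.

```python
import struct

# Assumed layout: one 32-bit inode number followed by one 32-bit generation.
FH_PAIR = struct.Struct("<II")


def encode_fh(ino, gen, parent=None):
    """Build a fragment for the object; `parent` is an optional (ino, gen) pair."""
    frag = FH_PAIR.pack(ino, gen)
    if parent is not None:
        frag += FH_PAIR.pack(*parent)
    return frag


def fh_to_object(frag):
    """Recover (ino, gen) of the object the fragment refers to."""
    if len(frag) < FH_PAIR.size:
        raise ValueError("filehandle fragment too small")
    return FH_PAIR.unpack_from(frag, 0)


def fh_to_parent(frag):
    """Recover (ino, gen) of the parent, failing if it was never encoded."""
    if len(frag) < 2 * FH_PAIR.size:
        raise ValueError("filehandle fragment too small for parent")
    return FH_PAIR.unpack_from(frag, FH_PAIR.size)
```

A fragment built without parent information decodes to the object but makes `fh_to_parent` fail, which mirrors the "may simply fail if the filehandle fragment is too small" wording.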

 Index: linux-2.6/fs/exportfs/expfs.c
 ===
 --- linux-2.6.orig/fs/exportfs/expfs.c2007-03-16 17:06:10.0 
 +0100
 +++ linux-2.6/fs/exportfs/expfs.c 2007-03-16 23:45:14.0 +0100
 @@ -1,4 +1,13 @@
 -
 +/*
 + * Copyright (C) Neil Brown 2002
 + * Copyright (C) Christoph Hellwig 2007
 + *
 + * This file contains the code mapping from inodes to NFS file handles,
 + * and for mapping back from file handles to dentries.
 + *
 + * For details on why we doo all the strange and hairy things in here

do

 + * take a look at Documentation/filesystems/Exporting.
 + */
  #include <linux/exportfs.h>
  #include <linux/fs.h>
  #include <linux/file.h>


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
 I've been waiting for years for a smart person to come along and write a 
 POSIX-only distributed filesystem.

 What exactly do you mean by POSIX-only?

Don't bother supporting attributes, file modes, and other details not 
supported by POSIX.  The prime example being NFSv4, which is larded down 
with Windows features.

NFSv4.1 adds to the fun, by throwing interoperability completely out the 
window.


Jeff





Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread J. Bruce Fields
On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
 My thoughts.  But first a disclaimer:   Perhaps you will recall me as one 
 of the people who really reads all your patches, and examines your code and 
 proposals closely.  So, with that in mind...

 I question the value of distributed block services (DBS), whether it's your 
 version or the others out there.  DBS are not very useful, because they still 
 rely on a useful filesystem sitting on top of the DBS.  It devolves into 
 one of two cases:  (1) multi-path much like today's SCSI, with distributed 
 filesystem arbitration to ensure coherency, or (2) the filesystem running 
 on top of the DBS is on a single host, and thus, a single point of failure 
 (SPOF).

 It is quite logical to extend the concepts of RAID across the network, but 
 ultimately you are still bound by the inflexibility and simplicity of the 
 block device.

 In contrast, a distributed filesystem offers far more scalability, 
 eliminates single points of failure, and offers more room for optimization 
 and redundancy across the cluster.

 A distributed filesystem is also much more complex, which is why 
 distributed block devices are so appealing :)

 With a redundant, distributed filesystem, you simply do not need any 
 complexity at all at the block device level.  You don't even need RAID.

 It is my hope that you will put your skills towards a distributed 
 filesystem :)  Of the current solutions, GFS (currently in kernel) scales 
 poorly, and NFS v4.1 is amazingly bloated and overly complex.

 I've been waiting for years for a smart person to come along and write a 
 POSIX-only distributed filesystem.

What exactly do you mean by POSIX-only?

--b.


Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
 J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
 I've been waiting for years for a smart person to come along and write a 
 POSIX-only distributed filesystem.
 What exactly do you mean by POSIX-only?
 Don't bother supporting attributes, file modes, and other details not 
 supported by POSIX.  The prime example being NFSv4, which is larded down 
 with Windows features.

 I am sympathetic.  Cutting those out may still leave you with
 something pretty complicated, though.

Far less complicated than NFSv4.1 though (which is easy :))

 NFSv4.1 adds to the fun, by throwing interoperability completely out the 
 window.

 What parts are you worried about in particular?


I'm not worried; I'm stating facts as they exist today (draft 13):

NFS v4.1 does something completely without precedent in the history of 
NFS:  the specification is defined such that interoperability is 
-impossible- to guarantee.


pNFS permits private and unspecified layout types.  This means it is 
impossible to guarantee that one NFSv4.1 implementation will be able to 
talk to another NFSv4.1 implementation.


Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13 
anyway), there is no guarantee at all that Linux will be able to store 
and retrieve data, since it's entirely possible that a proprietary 
protocol is required to access your data.


NFSv4.1 is no longer a completely open architecture.

Jeff






Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread J. Bruce Fields
On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote:
 J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
 J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote:
 I've been waiting for years for a smart person to come along and write 
 a POSIX-only distributed filesystem.
 What exactly do you mean by POSIX-only?
 Don't bother supporting attributes, file modes, and other details not 
 supported by POSIX.  The prime example being NFSv4, which is larded down 
 with Windows features.
 I am sympathetic.  Cutting those out may still leave you with
 something pretty complicated, though.

 Far less complicated than NFSv4.1 though (which is easy :))

One would hope so.

 NFSv4.1 adds to the fun, by throwing interoperability completely out the 
 window.
 What parts are you worried about in particular?

 I'm not worried; I'm stating facts as they exist today (draft 13):

 NFS v4.1 does something completely without precedent in the history of NFS: 
  the specification is defined such that interoperability is -impossible- to 
 guarantee.

 pNFS permits private and unspecified layout types.  This means it is 
 impossible to guarantee that one NFSv4.1 implementation will be able to 
 talk to another NFSv4.1 implementation.

No, servers are required to support ordinary nfs operations to the
metadata server.

At least, that's the way it was last I heard, which was a while ago.  I
agree that it'd stink (for any number of reasons) if you ever *had* to
get a layout to access some file.

Was that your main concern?

--b.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Goswin von Brederlow
Mel Gorman [EMAIL PROTECTED] writes:

 On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
 Nick Piggin [EMAIL PROTECTED] writes:
 
  In my attack, I cause the kernel to allocate lots of unmovable allocations
  and deplete movable groups. I theoretically then only need to keep a
  small number (1/2^N) of these allocations around in order to DoS a
  page allocation of order N.
 
 I'm assuming that when an unmovable allocation hijacks a movable group
 any further unmovable alloc will evict movable objects out of that
 group before hijacking another one. right?
 

 No eviction takes place. If an unmovable allocation gets placed in a
 movable group, then steps are taken to ensure that future unmovable
 allocations will take place in the same range (these decisions take
 place in __rmqueue_fallback()). When choosing a movable block to
 pollute, it will also choose the lowest possible block in PFN terms to
 steal so that fragmentation pollution will be as confined as possible.
 Evicting the unmovable pages would be one of those expensive steps that
 have been avoided to date.

But then you can have all blocks filled with movable data, free 4K in
one group, allocate 4K unmovable to take over the group, free 4k in
the next group, take that group and so on. You can end with 4k
unmovable in every 64k easily by accident.
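
The scenario can be sketched as a toy model. This is not the kernel's allocator: the 16-page groups, the tiny memory size, and the rule "an unmovable allocation takes the lowest free page and never evicts anything" are simplifying assumptions, and every name below is made up for the illustration.

```python
PAGES_PER_GROUP = 16   # one 64K group of 4K pages
NUM_GROUPS = 8         # tiny machine, chosen only for illustration

FREE, MOVABLE, UNMOVABLE = ".", "m", "U"


def alloc_unmovable(memory):
    """Put an unmovable page into the lowest free page; nothing is evicted."""
    pfn = memory.index(FREE)
    memory[pfn] = UNMOVABLE
    return pfn


def free_groups(memory):
    """Count 64K groups that are entirely free, i.e. usable for an order-4 allocation."""
    count = 0
    for start in range(0, len(memory), PAGES_PER_GROUP):
        if all(page == FREE for page in memory[start:start + PAGES_PER_GROUP]):
            count += 1
    return count


def run_scenario():
    # Every page starts out holding movable data.
    memory = [MOVABLE] * (PAGES_PER_GROUP * NUM_GROUPS)
    for group in range(NUM_GROUPS):
        # One movable 4K page is freed in this group ...
        memory[group * PAGES_PER_GROUP + 5] = FREE
        # ... and the next unmovable allocation has nowhere else to go.
        alloc_unmovable(memory)
    # Later on, all of the movable data is released.
    return [FREE if page == MOVABLE else page for page in memory]


memory = run_scenario()
free_pages = memory.count(FREE)
print(free_pages, "of", len(memory), "pages free;",
      free_groups(memory), "free 64K groups")
```

In this model 120 of 128 pages end up free, yet not a single 64K group is, because each group keeps exactly one unmovable page.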

There should be a lot of pressure for movable objects to vacate a
mixed group or you do get fragmentation catastrophes. Looking at my
little test program evicting movable objects from a mixed group should
not be that expensive as it doesn't happen often. The cost of it
should be freeing some pages (or finding free ones in a movable group)
and then memcpy. With my simplified simulation it never happens so I
expect it to only happen when the work set changes.

  And it doesn't even have to be a DoS. The natural fragmentation
  that occurs in a kernel today has the possibility to slowly push out
  the movable groups and give you the same situation.
 
 How would you cause that? Say you do want to purposefully place one
 unmovable 4k page into every 64k compound page. So you allocate
 4K. First 64k page locked. But now, to get 4K into the second 64K page
 you have to first use up all the rest of the first 64k page. Meaning
 one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then
 will a new 64k chunk be broken and become locked.

 It would be easier early in the boot to mmap a large area and fault it
 in in virtual address order then mlock a page every 64K. Early in
 the system's lifetime, there will be a rough correlation between physical
 and virtual memory.

 Without mlock(), the most successful attack will likely mmap() a 60K
 region and fault it in as an attempt to get pagetable pages placed in
 every 64K region. This strategy would not work with grouping pages by
 mobility though as it would group the pagetable pages together.

But even with mlock the virtual pages should still be movable. So if
you evict movable objects from a mixed group when needed, all the
pagetable pages would end up in the same mixed group, slowly taking it
over completely. No fragmentation at all. See how essential that
feature is. :)

 Targeted attacks on grouping pages by mobility are not very easy and
 not that interesting either. As Nick suggests, the natural fragmentation
 over long periods of time is what is interesting.

 So to get the last 64k chunk used all previous 32k chunks need to be
 blocked and you need to allocate 32k (or less if more is blocked). For
 all previous 32k chunks to be blocked every second 16k needs to be
 blocked. To block the last of those 16k chunks all previous 8k chunks
 need to be blocked and you need to allocate 8k. For all previous 8k
 chunks to be blocked every second 4k page needs to be used. To alloc
 the last of those 4k pages all previous 4k pages need to be used.
 
 So to construct a situation where no contiguous 64k chunk is free you
 have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or there
 about) of memory first. Only then could you free memory again while
 still keeping every 64k page blocked. Does that occur naturally given
 enough ram to start with?
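
The argument above can be checked with a small buddy-allocator model. It is a sketch under simplifying assumptions: only 4K (order-0) allocations are made, nothing is freed, free chunks are kept per order and sorted by address, and the `Buddy` class and its sizes are invented for the illustration.

```python
import bisect

MAX_ORDER = 4   # order 4 = 16 pages = 64K with 4K pages


class Buddy:
    def __init__(self, num_blocks):
        # free[order] is a sorted list of start page numbers of free chunks.
        self.free = [[] for _ in range(MAX_ORDER + 1)]
        self.free[MAX_ORDER] = [b << MAX_ORDER for b in range(num_blocks)]

    def alloc(self, order):
        """Take the lowest-address chunk of this order, splitting a larger one if needed."""
        if order > MAX_ORDER:
            raise MemoryError("out of memory")
        if self.free[order]:
            return self.free[order].pop(0)
        start = self.alloc(order + 1)
        # Keep the lower half, hand the upper half back to this order's bucket.
        bisect.insort(self.free[order], start + (1 << order))
        return start


def pages_until_all_blocks_touched(num_blocks):
    """Count 4K allocations needed before every 64K block holds an allocated page."""
    buddy = Buddy(num_blocks)
    touched = set()
    allocated = 0
    while len(touched) < num_blocks:
        page = buddy.alloc(0)
        touched.add(page >> MAX_ORDER)
        allocated += 1
    return allocated


print(pages_until_all_blocks_touched(8))
```

With 8 blocks of 64K the model needs 7 * 16 + 1 = 113 page allocations, i.e. all earlier blocks must be completely used before the last block is split.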
 

 I believe it's very difficult to craft an attack that will work in a
 short period of time. An attack that worked on 2.6.22 as well may have
 no success on 2.6.23-rc4-mm1 for example as grouping pages by mobility
 does make it exceedingly hard to craft an attack unless the attacker
 can mlock large amounts of memory.

 
 To see how bad fragmentation could be I wrote a little program to
 simulate allocations with the following simplified algorithm:
 
 Memory management:
 - Free pages are kept in buckets, one per order, and sorted by address.
 - alloc() takes the front page (smallest address) out of the bucket of the
   right order or recursively splits the next higher bucket.
 - free() recursively tries to merge a page with its neighbour and puts
   the result back into the proper bucket (sorted 

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-14 Thread Goswin von Brederlow
Christoph Lameter [EMAIL PROTECTED] writes:

 On Fri, 14 Sep 2007, Christoph Lameter wrote:

 an -ENOMEM. Given the quantities of pages on today's machines--a 1 G machine 

 s/1G/1T/ Sigh.

 has 256 million 4k pages--and the unmovable ratios we see today it 

 256k for 1G.

256k == 64 pages for 1GB ram or 256k pages == 1Mb?
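
The page counts in question can be worked out directly. This is plain arithmetic, assuming 4 KiB pages and binary units (1G = 2^30 bytes, 1T = 2^40 bytes); the names are only for the illustration.

```python
PAGE_SIZE = 4 * 1024   # 4 KiB pages
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def pages_in(ram_bytes):
    """Number of 4 KiB pages needed to cover the given amount of RAM."""
    return ram_bytes // PAGE_SIZE


pages_256k = pages_in(256 * KIB)   # 256 KiB of RAM
pages_1g = pages_in(GIB)           # 1 GiB of RAM
pages_1t = pages_in(TIB)           # 1 TiB of RAM
print(pages_256k, pages_1g, pages_1t)
```

So 256 KiB of RAM is 64 pages, a 1G machine has 262144 (256k) pages, and a 1T machine has 268435456 (roughly 256 million) pages.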

MfG
Goswin


Re: [2/3] 2.6.23-rc6: known regressions v2

2007-09-14 Thread Michal Piotrowski
Hi all,

Here is a list of some known regressions in 2.6.23-rc6.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name                     Regressions fixed since 21-Jun-2007
Adrian Bunk              10
Andi Kleen               7
Linus Torvalds           6
Alan Stern               5
Hugh Dickins             5
Trond Myklebust          5
Andrew Morton            4
David S. Miller          4
Al Viro                  3
Alexey Starikovskiy      3
Cornelia Huck            3
Jens Axboe               3
Stephen Hemminger        3
Tejun Heo                3



FS

Subject : hanging ext3 dbench tests
References  : http://lkml.org/lkml/2007/9/11/176
Last known good : ?
Submitter   : Andy Whitcroft [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : under test -- unreproducible at present

Subject : umount triggers a warning in jfs and takes almost a minute
References  : http://lkml.org/lkml/2007/9/4/73
Last known good : ?
Submitter   : Oliver Neukum [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown



Networking

Subject : build #301 failed for 2.6.23-rc6-g0d4cbb5 in linux/drivers/net/wireless/libertas/
References  : http://lkml.org/lkml/2007/9/11/150
Last known good : ?
Submitter   : Toralf Förster [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown

Subject : zd1211rw regression, device does not enumerate
References  : http://marc.info/?l=linux-usb-devel&m=118854967709322&w=2
  http://bugzilla.kernel.org/show_bug.cgi?id=8972
Last known good : ?
Submitter   : Oliver Neukum [EMAIL PROTECTED]
Caused-By   : Daniel Drake [EMAIL PROTECTED]
  commit 74553aedd46b3a2cae986f909cf2a3f99369decc
Handled-By  : ?
Status  : unknown

Subject : NETDEV WATCHDOG: eth0: transmit timed out
References  : http://lkml.org/lkml/2007/8/13/737
Last known good : ?
Submitter   : Karl Meyer [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : Francois Romieu [EMAIL PROTECTED]
Status  : problem is being debugged

Subject : Weird network problems with 2.6.23-rc2
References  : http://lkml.org/lkml/2007/8/11/40
Last known good : ?
Submitter   : Shish [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown



Farewell!
Michal

--
LOGOUT
http://www.stardust.webpages.pl/


Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Mike Snitzer
On 9/14/07, Jeff Garzik [EMAIL PROTECTED] wrote:
 Evgeniy Polyakov wrote:
  Hi.
 
  I'm pleased to announce fourth release of the distributed storage
  subsystem, which allows to form a storage on top of remote and local
  nodes, which in turn can be exported to another storage as a node to
  form tree-like storages.
 
  This release includes new configuration interface (kernel connector over
  netlink socket) and number of fixes of various bugs found during move
  to it (in error path).
 
  Further TODO list includes:
  * implement optional saving of mirroring/linear information on the remote
    nodes (simple)
  * new redundancy algorithm (complex)
  * some thoughts about distributed filesystem tightly connected to DST
    (far-far plans so far)
 
  Homepage:
  http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
 
  Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

 My thoughts.  But first a disclaimer:   Perhaps you will recall me as
 one of the people who really reads all your patches, and examines your
 code and proposals closely.  So, with that in mind...

 I question the value of distributed block services (DBS), whether it's
 your version or the others out there.  DBS are not very useful, because
 they still rely on a useful filesystem sitting on top of the DBS.  It
 devolves into one of two cases:  (1) multi-path much like today's SCSI,
 with distributed filesystem arbitration to ensure coherency, or (2) the
 filesystem running on top of the DBS is on a single host, and thus, a
 single point of failure (SPOF).

This distributed storage is very much needed; even if it were to act
as a more capable/performant replacement for NBD (or MD+NBD) in the
near term.  Many high availability applications don't _need_ all the
additional complexity of a full distributed filesystem.  So given
that, it's discouraging to see you trying to gently push Evgeniy away
from all the promising work he has published.

Evgeniy, please continue your current work.

Mike


Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread Jeff Garzik

J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote:
 J. Bruce Fields wrote:
 On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote:
 NFSv4.1 adds to the fun, by throwing interoperability completely out the 
 window.

 What parts are you worried about in particular?

 I'm not worried; I'm stating facts as they exist today (draft 13):

 NFS v4.1 does something completely without precedent in the history of NFS: 
 the specification is defined such that interoperability is -impossible- to 
 guarantee.

 pNFS permits private and unspecified layout types.  This means it is 
 impossible to guarantee that one NFSv4.1 implementation will be able to 
 talk to another NFSv4.1 implementation.

 No, servers are required to support ordinary nfs operations to the
 metadata server.

 At least, that's the way it was last I heard, which was a while ago.  I
 agree that it'd stink (for any number of reasons) if you ever *had* to
 get a layout to access some file.

 Was that your main concern?


I just sorta assumed you could fall back to the NFSv4.0 mode of 
operation, going through the metadata server for all data accesses.


But look at that choice in practice:  you can either ditch pNFS 
completely, or use a proprietary solution.  The market incentives are 
CLEARLY tilted in favor of makers of proprietary solutions.  But it's a 
poor choice (really little choice at all).


Overall, my main concern is that NFSv4.1 is no longer an open 
architecture solution.  The no-pNFS or proprietary platform choice 
merely illustrates one of many negative aspects of this architecture.


One of NFS's biggest value propositions is its interoperability.  To 
quote some Wall Street guys, "NFS is like crack.  It Just Works.  We 
love it."


Now, for the first time in NFS's history (AFAIK), the protocol is no 
longer completely specified, completely known.  No longer a closed 
loop.  Private layout types mean that it is _highly_ unlikely that any 
OS or appliance or implementation will be able to claim full NFS 
compatibility.


And when the proprietary portion of the spec involves something as basic 
as accessing one's own data, I consider that a fundamental flaw.  NFS is 
no longer completely open.


Jeff





Re: Distributed storage. Move away from char device ioctls.

2007-09-14 Thread J. Bruce Fields
On Sat, Sep 15, 2007 at 12:08:42AM -0400, Jeff Garzik wrote:
 J. Bruce Fields wrote:
 No, servers are required to support ordinary nfs operations to the
 metadata server.
 At least, that's the way it was last I heard, which was a while ago.  I
 agree that it'd stink (for any number of reasons) if you ever *had* to
 get a layout to access some file.
 Was that your main concern?

 I just sorta assumed you could fall back to the NFSv4.0 mode of operation, 
 going through the metadata server for all data accesses.

Right.  So any two pNFS implementations *will* be able to talk to each
other; they just may not be able to use the (possibly higher-bandwidth)
read/write path that pNFS gives them.
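
The fallback described here can be sketched as client-side selection logic. This is a hypothetical model, not the NFSv4.1 wire protocol or any real client API: the layout-type strings, `SUPPORTED_LAYOUT_TYPES` and `choose_io_path` are names invented for the illustration.

```python
# Layout types this imaginary client knows how to drive (assumed for the example).
SUPPORTED_LAYOUT_TYPES = {"files"}


def choose_io_path(server_layout_types):
    """Decide how READ/WRITE traffic is sent for a given server."""
    usable = SUPPORTED_LAYOUT_TYPES & set(server_layout_types)
    if usable:
        # A common layout type exists: use the direct data path.
        return "layout:" + sorted(usable)[0]
    # No common layout type: fall back to ordinary NFS operations
    # sent through the metadata server.
    return "metadata-server"


print(choose_io_path(["files"]))
print(choose_io_path(["vendor-private"]))
```

A server that offers only a layout type the client does not understand still gets served, just through the metadata server rather than the possibly higher-bandwidth path.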

 But look at that choice in practice:  you can either ditch pNFS completely, 
 or use a proprietary solution.  The market incentives are CLEARLY tilted in 
 favor of makers of proprietary solutions.

I doubt somebody would go to all the trouble to implement pNFS and then
present their customers with that kind of choice.  But maybe I'm missing
something.  What market incentives do you see that would make that more
attractive than either 1) using a standard fully-specified layout type,
or 2) just implementing your own proprietary protocol instead of pNFS?

 Overall, my main concern is that NFSv4.1 is no longer an open architecture 
 solution.  The no-pNFS or proprietary platform choice merely illustrates 
 one of many negative aspects of this architecture.

It's always been possible to extend NFS in various ways if you want.
You could use sideband protocols with v2 and v3, for example.  People
have done that.  Some of them have been standardized and widely
implemented, some haven't.  You could probably add your own compound ops
to v4 if you wanted, I guess.

And there are advantages to experimenting with extensions first and then
standardizing when you figure out what works.  I wish it happened that
way more often.

 Now, for the first time in NFS's history (AFAIK), the protocol is no longer 
 completely specified, completely known.  No longer a closed loop.  
 Private layout types mean that it is _highly_ unlikely that any OS or 
 appliance or implementation will be able to claim full NFS compatibility.

Do you know of any such private layout types?

This is kind of a boring argument, isn't it?  I'd rather hear whatever
ideas you have for a new distributed filesystem protocol.

--b.