Re: 2.6.22.6: kernel BUG at fs/locks.c:171
On Thu, 2007-09-13 at 09:51 +1000, Nick Piggin wrote:
> On Thursday 13 September 2007 19:20, Soeren Sonnenburg wrote:
> > Dear all,
> >
> > I've just seen this in dmesg on a AMD K7 / kernel 2.6.22.6 machine
> > (config attached).
> >
> > Any ideas / which further information needed ?
>
> Thanks for the report. Is it reproduceable? It seems like the
> locks_free_lock call that's oopsing is coming from __posix_lock_file.
> The actual function looks fine, but the lock being freed could have
> been corrupted if there was slab corruption, or a hardware corruption.
>
> You could: try running memtest86+ overnight. And try the following
> patch and turn on slab debugging then try to reproduce the problem.

OK so far I've run memtest86+ 1.40 from freedos for 8 hrs (v1.70 hung on
startup) - nothing. Could this corruption be caused by a pci card/driver?
I am asking as I am using a new dvb-t card (asus p7131) and the oops
happened after 5 or 6 days of uptime just about a day after watching some
movie (very bad reception/lots of errors). However this machine used to
have uptimes of months before the dvb card was in there and the kernel
version upgrade (don't know which version that was...).

Anyway I am not sure if this is reproducible, but I will keep memtest
running today and then proceed as you said...

Thanks,
Soeren
-- 
Sometimes, there's a moment as you're waking, when you become aware of
the real world around you, but you're still dreaming.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change
On Tue, Sep 04, 2007 at 10:37:04AM -0400, Jeff Layton wrote:
> If the ATTR_KILL_S*ID bits are set then any mode change is only for
> clearing the setuid/setgid bits. For NFS skip the mode change and
> let the server handle it.

You're assuming the server will remove setuid and setgid bits on WRITE?
I don't see that behaviour specified in the RFC, at least for v3. The
RFC specifies a behaviour for the mtime attribute as a side effect of
WRITE, but says nothing about mode. This means server implementations
are free to clobber setuid or not.

A quick experiment shows that at least the Irix server will *NOT*
clobber those bits. So with an Irix server you've now lost this
Linux-specific security feature.

I'm curious about the reasons behind this change. You mention
credential issues; how exactly is it that you have the correct creds
to perform a WRITE rpc but not a SETATTR rpc?

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere. Which MPHG character are you?
I don't speak for SGI.
[PATCH 00/19] export operations rewrite
This patchset is a medium scale rewrite of the export operations
interface. The goal is to make the interface less complex, and easier
to understand from the filesystem side, as well as preparing generic
support for exporting of 64bit inode numbers.

This touches all nfs exporting filesystems, and I've done testing on
all of the filesystems I have here locally (xfs, ext2, ext3, reiserfs,
jfs).

Compared to the last version I've fixed the white space issues that
checkpatch.pl complained about.

Note that this patch series is against mainline. There will be some
xfs changes landing in -mm soon that revamp lots of the code touched
here. They should hopefully include the first patch in the series so
it can be simply dropped, but the xfs conversion will need some smaller
updates. I will send this update as soon as the xfs tree updates get
pulled into -mm.
[PATCH 02/19] exportfs: add fid type
Add a structured fid type so that we don't have to pass an array of u32
values around everywhere. It's a union of possible layouts.

As a start there's only the u32 array and the traditional 32bit inode
format, but there will be more in one of my next patchsets when I start
to document the various filehandle formats we have in lowlevel
filesystems better.

Also add an enum that gives the various filehandle types human-readable
names.

Note: Some people might think the struct containing an anonymous union
is ugly, but I didn't want to pass around a raw union type.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/include/linux/exportfs.h
===================================================================
--- linux-2.6.orig/include/linux/exportfs.h	2007-09-13 15:10:59.0 +0200
+++ linux-2.6/include/linux/exportfs.h	2007-09-13 15:11:11.0 +0200
@@ -7,6 +7,44 @@ struct dentry;
 struct super_block;
 struct vfsmount;
 
+/*
+ * The fileid_type identifies how the file within the filesystem is encoded.
+ * In theory this is freely set and parsed by the filesystem, but we try to
+ * stick to conventions so we can share some generic code and don't confuse
+ * sniffers like ethereal/wireshark.
+ *
+ * The filesystem must not use the value '0' or '0xff'.
+ */
+enum fid_type {
+	/*
+	 * The root, or export point, of the filesystem.
+	 * (Never actually passed down to the filesystem.
+	 */
+	FILEID_ROOT = 0,
+
+	/*
+	 * 32bit inode number, 32 bit generation number.
+	 */
+	FILEID_INO32_GEN = 1,
+
+	/*
+	 * 32bit inode number, 32 bit generation number,
+	 * 32 bit parent directory inode number.
+	 */
+	FILEID_INO32_GEN_PARENT = 2,
+};
+
+struct fid {
+	union {
+		struct {
+			u32 ino;
+			u32 gen;
+			u32 parent_ino;
+			u32 parent_gen;
+		} i32;
+		__u32 raw[6];
+	};
+};
 
 /**
  * struct export_operations - for nfsd to communicate with file systems
@@ -117,9 +155,9 @@ extern struct dentry *find_exported_dent
 		void *parent,
 		int (*acceptable)(void *context, struct dentry *de),
 		void *context);
-extern int exportfs_encode_fh(struct dentry *dentry, __u32 *fh, int *max_len,
-	int connectable);
-extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, __u32 *fh,
+extern int exportfs_encode_fh(struct dentry *dentry, struct fid *fid,
+	int *max_len, int connectable);
+extern struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 	int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
 	void *context);
 
Index: linux-2.6/fs/nfsd/nfsfh.c
===================================================================
--- linux-2.6.orig/fs/nfsd/nfsfh.c	2007-09-13 15:10:59.0 +0200
+++ linux-2.6/fs/nfsd/nfsfh.c	2007-09-13 15:13:32.0 +0200
@@ -115,8 +115,7 @@ fh_verify(struct svc_rqst *rqstp, struct
 	dprintk("nfsd: fh_verify(%s)\n", SVCFH_fmt(fhp));
 
 	if (!fhp->fh_dentry) {
-		__u32 *datap=NULL;
-		__u32 tfh[3];		/* filehandle fragment for oldstyle filehandles */
+		struct fid *fid = NULL, sfid;
 		int fileid_type;
 		int data_left = fh->fh_size/4;
 
@@ -128,7 +127,6 @@ fh_verify(struct svc_rqst *rqstp, struct
 
 		if (fh->fh_version == 1) {
 			int len;
-			datap = fh->fh_auth;
 			if (--data_left<0) goto out;
 			switch (fh->fh_auth_type) {
 			case 0: break;
@@ -144,9 +142,11 @@ fh_verify(struct svc_rqst *rqstp, struct
 				fh->fh_fsid[1] = fh->fh_fsid[2];
 			}
 			if ((data_left -= len)<0) goto out;
-			exp = rqst_exp_find(rqstp, fh->fh_fsid_type, datap);
-			datap += len;
+			exp = rqst_exp_find(rqstp, fh->fh_fsid_type,
+					    fh->fh_auth);
+			fid = (struct fid *)(fh->fh_auth + len);
 		} else {
+			__u32 tfh[2];
 			dev_t xdev;
 			ino_t xino;
 			if (fh->fh_size != NFS_FHSIZE)
@@ -190,22 +190,22 @@ fh_verify(struct svc_rqst *rqstp, struct
 			error = nfserr_badhandle;
 
 		if (fh->fh_version != 1) {
-			tfh[0] = fh->ofh_ino;
-			tfh[1] = fh->ofh_generation;
-			tfh[2] = fh->ofh_dirino;
-			datap = tfh;
+			sfid.i32.ino = fh->ofh_ino;
+			sfid.i32.gen = fh->ofh_generation;
+			sfid.i32.parent_ino = fh->ofh_dirino;
+			fid = &sfid;
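For anyone wanting to poke at the union outside the kernel, here is a minimal userspace sketch of the struct fid overlay. It only shows that the i32 view and the raw[] view alias the same words. uint32_t stands in for u32/__u32, and fid_from_old_fh() is a hypothetical helper named here for illustration; it is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for struct fid from the patch (uint32_t for u32). */
struct fid {
	union {
		struct {
			uint32_t ino;
			uint32_t gen;
			uint32_t parent_ino;
			uint32_t parent_gen;
		} i32;
		uint32_t raw[6];
	};
};

/*
 * Hypothetical helper, not in the patch: fill a fid from the three
 * fields that the old-style filehandle branch of fh_verify() copies.
 */
static void fid_from_old_fh(struct fid *fid, uint32_t ino, uint32_t gen,
			    uint32_t dirino)
{
	fid->i32.ino = ino;
	fid->i32.gen = gen;
	fid->i32.parent_ino = dirino;
}
```

Code that still indexes raw[0..2] should see the same values a structured user stored through i32, which is presumably why the conversion can proceed one filesystem at a time.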
[PATCH 03/19] exportfs: add new methods
Add the guts for the new filesystem API to exportfs.

There's now a fh_to_dentry method that returns a dentry for the object
looked for given a filehandle fragment, and a fh_to_parent operation
that returns the dentry for the encoded parent directory in case the
file handle contains it.

There are default implementations for these methods that only take
a callback for an nfs-enhanced iget variant and implement the rest of
the semantics.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/include/linux/exportfs.h
===================================================================
--- linux-2.6.orig/include/linux/exportfs.h	2007-09-13 15:11:11.0 +0200
+++ linux-2.6/include/linux/exportfs.h	2007-09-13 15:13:57.0 +0200
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 
 struct dentry;
+struct inode;
 struct super_block;
 struct vfsmount;
 
@@ -101,6 +102,21 @@ struct fid {
  *    the filehandle fragment.  encode_fh() should return the number of bytes
  *    stored or a negative error code such as %-ENOSPC
  *
+ * fh_to_dentry:
+ *    @fh_to_dentry is given a &struct super_block (@sb) and a file handle
+ *    fragment (@fh, @fh_len). It should return a &struct dentry which refers
+ *    to the same file that the file handle fragment refers to.  If it cannot,
+ *    it should return a %NULL pointer if the file was found but no acceptable
+ *    &dentries were available, or an %ERR_PTR error code indicating why it
+ *    couldn't be found (e.g. %ENOENT or %ENOMEM).  Any suitable dentry can be
+ *    returned including, if necessary, a new dentry created with d_alloc_root.
+ *    The caller can then find any other extant dentries by following the
+ *    d_alias links.
+ *
+ * fh_to_parent:
+ *    Same as @fh_to_dentry, except that it returns a pointer to the parent
+ *    dentry if it was encoded into the filehandle fragment by @encode_fh.
+ *
  * get_name:
  *    @get_name should find a name for the given @child in the given @parent
  *    directory.  The name should be stored in the @name (with the
@@ -139,6 +155,10 @@ struct export_operations {
 			void *context);
 	int (*encode_fh)(struct dentry *de, __u32 *fh, int *max_len,
 			int connectable);
+	struct dentry * (*fh_to_dentry)(struct super_block *sb, struct fid *fid,
+			int fh_len, int fh_type);
+	struct dentry * (*fh_to_parent)(struct super_block *sb, struct fid *fid,
+			int fh_len, int fh_type);
 	int (*get_name)(struct dentry *parent, char *name,
 			struct dentry *child);
 	struct dentry * (*get_parent)(struct dentry *child);
@@ -161,4 +181,14 @@ extern struct dentry *exportfs_decode_fh
 	int fh_len, int fileid_type, int (*acceptable)(void *, struct dentry *),
 	void *context);
 
+/*
+ * Generic helpers for filesystems.
+ */
+extern struct dentry *generic_fh_to_dentry(struct super_block *sb,
+	struct fid *fid, int fh_len, int fh_type,
+	struct inode *(*get_inode) (struct super_block *sb, u64 ino, u32 gen));
+extern struct dentry *generic_fh_to_parent(struct super_block *sb,
+	struct fid *fid, int fh_len, int fh_type,
+	struct inode *(*get_inode) (struct super_block *sb, u64 ino, u32 gen));
+
 #endif /* LINUX_EXPORTFS_H */
Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c	2007-09-13 15:13:02.0 +0200
+++ linux-2.6/fs/exportfs/expfs.c	2007-09-13 15:14:42.0 +0200
@@ -514,17 +514,141 @@ struct dentry *exportfs_decode_fh(struct
 		int (*acceptable)(void *, struct dentry *), void *context)
 {
 	struct export_operations *nop = mnt->mnt_sb->s_export_op;
-	struct dentry *result;
+	struct dentry *result, *alias;
+	int err;
 
-	if (nop->decode_fh) {
-		result = nop->decode_fh(mnt->mnt_sb, fid->raw, fh_len,
+	/*
+	 * Old way of doing things.  Will go away soon.
+	 */
+	if (!nop->fh_to_dentry) {
+		if (nop->decode_fh) {
+			return nop->decode_fh(mnt->mnt_sb, fid->raw, fh_len,
 					fileid_type, acceptable, context);
+		} else {
+			return export_decode_fh(mnt->mnt_sb, fid->raw, fh_len,
+					fileid_type, acceptable, context);
+		}
+	}
+
+	/*
+	 * Try to get any dentry for the given file handle from the filesystem.
+	 */
+	result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
+	if (!result)
+		result = ERR_PTR(-ESTALE);
+	if (IS_ERR(result))
+		return result;
+
+	if (S_ISDIR(result->d_inode->i_mode)) {
+		/*
+		 * This request is for a directory.
+		 *
+		 * On the positive side there is only one dentry for each
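The return convention that exportfs_decode_fh() applies to ->fh_to_dentry (NULL becomes -ESTALE, ERR_PTR values pass through) can be tried out in userspace. This is a rough sketch under stated assumptions: ERR_PTR/PTR_ERR/IS_ERR below are simplified imitations of the kernel macros, struct dentry is a placeholder, and decode() and toy_fh_to_dentry() are hypothetical names, not code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace imitations of the kernel's error-pointer helpers. */
static void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct dentry { int ino; };	/* placeholder, not the kernel struct */

typedef struct dentry *(*fh_to_dentry_t)(const uint32_t *raw, int fh_len,
					 int fh_type);

/* Hypothetical caller: maps a NULL result to -ESTALE like the patch does. */
static struct dentry *decode(fh_to_dentry_t op, const uint32_t *raw,
			     int fh_len, int fh_type)
{
	struct dentry *result = op(raw, fh_len, fh_type);

	if (!result)
		result = ERR_PTR(-ESTALE);
	return result;
}

static struct dentry toy_dentry = { 42 };

/* Toy filesystem method: knows exactly one inode, number 42. */
static struct dentry *toy_fh_to_dentry(const uint32_t *raw, int fh_len,
				       int fh_type)
{
	if (fh_len < 2 || fh_type != 1)
		return NULL;		/* handle format not understood */
	if (raw[0] != 42)
		return ERR_PTR(-ENOENT);
	return &toy_dentry;
}
```

With this convention a filesystem method can simply return NULL for a handle it does not understand and leave the choice of error code to the caller.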
[PATCH 11/19] fat: new export ops
Very little changes here, fat had a mostly no op decode_fh before and
does not store any parent information.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2007-03-13 19:22:40.0 +0100
+++ linux-2.6/fs/fat/inode.c	2007-03-13 19:23:13.0 +0100
@@ -651,24 +651,15 @@ static const struct super_operations fat
  * of i_logstart is used to store the directory entry offset.
  */
 
-static struct dentry *
-fat_decode_fh(struct super_block *sb, __u32 *fh, int len, int fhtype,
-	      int (*acceptable)(void *context, struct dentry *de),
-	      void *context)
-{
-	if (fhtype != 3)
-		return ERR_PTR(-ESTALE);
-	if (len < 5)
-		return ERR_PTR(-ESTALE);
-
-	return sb->s_export_op->find_exported_dentry(sb, fh, NULL, acceptable, context);
-}
-
-static struct dentry *fat_get_dentry(struct super_block *sb, void *inump)
+static struct dentry *fat_fh_to_dentry(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
 {
 	struct inode *inode = NULL;
 	struct dentry *result;
-	__u32 *fh = inump;
+	u32 *fh = fid->raw;
+
+	if (fh_len < 5 || fh_type != 3)
+		return NULL;
 
 	inode = iget(sb, fh[0]);
 	if (!inode || is_bad_inode(inode) || inode->i_generation != fh[1]) {
@@ -782,9 +773,8 @@ out:
 }
 
 static struct export_operations fat_export_ops = {
-	.decode_fh	= fat_decode_fh,
 	.encode_fh	= fat_encode_fh,
-	.get_dentry	= fat_get_dentry,
+	.fh_to_dentry	= fat_fh_to_dentry,
 	.get_parent	= fat_get_parent,
 };
[PATCH 12/19] isofs: new export ops
Nice little cleanup by consolidating things a little and using a
structure for the special file handle format.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/isofs/export.c
===================================================================
--- linux-2.6.orig/fs/isofs/export.c	2007-02-11 10:31:19.0 +0100
+++ linux-2.6/fs/isofs/export.c	2007-02-11 10:45:25.0 +0100
@@ -42,16 +42,6 @@
 	return result;
 }
 
-static struct dentry *
-isofs_export_get_dentry(struct super_block *sb, void *vobjp)
-{
-	__u32 *objp = vobjp;
-	unsigned long block = objp[0];
-	unsigned long offset = objp[1];
-	__u32 generation = objp[2];
-	return isofs_export_iget(sb, block, offset, generation);
-}
-
 /* This function is surprisingly simple.  The trick is understanding
  * that child is always a directory. So, to find its parent, you
  * simply need to find its ".." entry, normalize its block and offset,
@@ -182,43 +172,44 @@
 	return type;
 }
 
+struct isofs_fid {
+	u32 block;
+	u16 offset;
+	u16 parent_offset;
+	u32 generation;
+	u32 parent_block;
+	u32 parent_generation;
+};
 
-static struct dentry *
-isofs_export_decode_fh(struct super_block *sb,
-		       __u32 *fh32,
-		       int fh_len,
-		       int fileid_type,
-		       int (*acceptable)(void *context, struct dentry *de),
-		       void *context)
+static struct dentry *isofs_fh_to_dentry(struct super_block *sb,
+	struct fid *fid, int fh_len, int fh_type)
 {
-	__u16 *fh16 = (__u16*)fh32;
-	__u32 child[3];   /* The child is what triggered all this. */
-	__u32 parent[3];  /* The parent is just along for the ride. */
+	struct isofs_fid *ifid = (struct isofs_fid *)fid;
 
-	if (fh_len < 3 || fileid_type > 2)
+	if (fh_len < 3 || fh_type > 2)
 		return NULL;
 
-	child[0] = fh32[0];
-	child[1] = fh16[2];  /* fh16 [sic] */
-	child[2] = fh32[2];
-
-	parent[0] = 0;
-	parent[1] = 0;
-	parent[2] = 0;
-	if (fileid_type == 2) {
-		if (fh_len > 2) parent[0] = fh32[3];
-		parent[1] = fh16[3];  /* fh16 [sic] */
-		if (fh_len > 4) parent[2] = fh32[4];
-	}
-
-	return sb->s_export_op->find_exported_dentry(sb, child, parent,
-						     acceptable, context);
+	return isofs_export_iget(sb, ifid->block, ifid->offset,
+			ifid->generation);
 }
 
+static struct dentry *isofs_fh_to_parent(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
+{
+	struct isofs_fid *ifid = (struct isofs_fid *)fid;
+
+	if (fh_type != 2)
+		return NULL;
+
+	return isofs_export_iget(sb,
+			fh_len > 2 ? ifid->parent_block : 0,
+			ifid->parent_offset,
+			fh_len > 4 ? ifid->parent_generation : 0);
+}
 
 struct export_operations isofs_export_ops = {
-	.decode_fh	= isofs_export_decode_fh,
 	.encode_fh	= isofs_export_encode_fh,
-	.get_dentry	= isofs_export_get_dentry,
+	.fh_to_dentry	= isofs_fh_to_dentry,
+	.fh_to_parent	= isofs_fh_to_parent,
 	.get_parent	= isofs_export_get_parent,
 };
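The new struct isofs_fid replaces the mixed fh32[]/fh16[] indexing of the old decode function, so it is worth checking that every field sits at the offset the old code read from. A small userspace sketch, assuming the usual natural alignment for 16- and 32-bit types (uintN_t stands in for uN; WORD_AT and isofs_fid_matches_old_indexing() are hypothetical names used only for this check):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the struct from the patch. */
struct isofs_fid {
	uint32_t block;			/* old code read fh32[0] */
	uint16_t offset;		/* old code read fh16[2] */
	uint16_t parent_offset;		/* old code read fh16[3] */
	uint32_t generation;		/* old code read fh32[2] */
	uint32_t parent_block;		/* old code read fh32[3] */
	uint32_t parent_generation;	/* old code read fh32[4] */
};

/* Byte offset of element idx in an array of elem_size-byte words. */
#define WORD_AT(idx, elem_size) ((size_t)(idx) * (size_t)(elem_size))

/* Returns 1 if each field lines up with the old array indexing. */
static int isofs_fid_matches_old_indexing(void)
{
	return offsetof(struct isofs_fid, block) == WORD_AT(0, 4) &&
	       offsetof(struct isofs_fid, offset) == WORD_AT(2, 2) &&
	       offsetof(struct isofs_fid, parent_offset) == WORD_AT(3, 2) &&
	       offsetof(struct isofs_fid, generation) == WORD_AT(2, 4) &&
	       offsetof(struct isofs_fid, parent_block) == WORD_AT(3, 4) &&
	       offsetof(struct isofs_fid, parent_generation) == WORD_AT(4, 4);
}
```

The two 16-bit offsets share the second 32-bit word, which is what the old "fh16 [sic]" comments were pointing at.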
[PATCH 13/19] shmem: new export ops
I'm not sure what people were thinking when adding support to export
tmpfs, but here's the conversion anyway:

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2007-02-11 10:46:30.0 +0100
+++ linux-2.6/mm/shmem.c	2007-02-11 10:53:12.0 +0100
@@ -1977,33 +1977,25 @@
 	return ino->i_ino == inum && fh[0] == ino->i_generation;
 }
 
-static struct dentry *shmem_get_dentry(struct super_block *sb, void *vfh)
+static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
 {
-	struct dentry *de = NULL;
 	struct inode *inode;
-	__u32 *fh = vfh;
-	__u64 inum = fh[2];
-	inum = (inum << 32) | fh[1];
+	struct dentry *dentry = NULL;
+	u64 inum = fid->raw[2];
+	inum = (inum << 32) | fid->raw[1];
 
-	inode = ilookup5(sb, (unsigned long)(inum+fh[0]), shmem_match, vfh);
+	if (fh_len < 3)
+		return NULL;
+
+	inode = ilookup5(sb, (unsigned long)(inum + fid->raw[0]),
+			shmem_match, fid->raw);
 	if (inode) {
-		de = d_find_alias(inode);
+		dentry = d_find_alias(inode);
 		iput(inode);
 	}
 
-	return de? de: ERR_PTR(-ESTALE);
-}
-
-static struct dentry *shmem_decode_fh(struct super_block *sb, __u32 *fh,
-		int len, int type,
-		int (*acceptable)(void *context, struct dentry *de),
-		void *context)
-{
-	if (len < 3)
-		return ERR_PTR(-ESTALE);
-
-	return sb->s_export_op->find_exported_dentry(sb, fh, NULL, acceptable,
-							context);
+	return dentry;
 }
 
 static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
@@ -2038,9 +2030,8 @@
 
 static struct export_operations shmem_export_ops = {
 	.get_parent     = shmem_get_parent,
-	.get_dentry     = shmem_get_dentry,
 	.encode_fh      = shmem_encode_fh,
-	.decode_fh      = shmem_decode_fh,
+	.fh_to_dentry	= shmem_fh_to_dentry,
 };
 
 static int shmem_parse_options(char *options, int *mode, uid_t *uid,
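The tmpfs handle carries a 64-bit inode number split over two 32-bit words, with the generation in word 0. Below is a minimal userspace sketch of the reassembly done in shmem_fh_to_dentry. shmem_fh_inum() and split_inum() are hypothetical names; the split direction is an assumption inferred from the decode side shown in the patch, not a copy of shmem_encode_fh.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild the 64-bit inode number: raw[2] is the high half, raw[1] the low. */
static uint64_t shmem_fh_inum(const uint32_t *raw)
{
	uint64_t inum = raw[2];

	inum = (inum << 32) | raw[1];
	return inum;
}

/* Hypothetical inverse, assumed from the decode side above. */
static void split_inum(uint64_t inum, uint32_t gen, uint32_t *raw)
{
	raw[0] = gen;				/* generation */
	raw[1] = (uint32_t)inum;		/* low 32 bits */
	raw[2] = (uint32_t)(inum >> 32);	/* high 32 bits */
}
```

Widening raw[2] to 64 bits before the shift matters; shifting a 32-bit value left by 32 would be undefined in C.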
[PATCH 14/19] reiserfs: new export ops
Another nice little cleanup by using the new methods.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2007-09-13 15:21:12.0 +0200
@@ -1514,19 +1514,20 @@ struct inode *reiserfs_iget(struct super
 	return inode;
 }
 
-struct dentry *reiserfs_get_dentry(struct super_block *sb, void *vobjp)
+static struct dentry *reiserfs_get_dentry(struct super_block *sb,
+	u32 objectid, u32 dir_id, u32 generation)
+
 {
-	__u32 *data = vobjp;
 	struct cpu_key key;
 	struct dentry *result;
 	struct inode *inode;
 
-	key.on_disk_key.k_objectid = data[0];
-	key.on_disk_key.k_dir_id = data[1];
+	key.on_disk_key.k_objectid = objectid;
+	key.on_disk_key.k_dir_id = dir_id;
 	reiserfs_write_lock(sb);
 	inode = reiserfs_iget(sb, &key);
-	if (inode && !IS_ERR(inode) && data[2] != 0 &&
-	    data[2] != inode->i_generation) {
+	if (inode && !IS_ERR(inode) && generation != 0 &&
+	    generation != inode->i_generation) {
 		iput(inode);
 		inode = NULL;
 	}
@@ -1543,14 +1544,9 @@ struct dentry *reiserfs_get_dentry(struc
 	return result;
 }
 
-struct dentry *reiserfs_decode_fh(struct super_block *sb, __u32 * data,
-				  int len, int fhtype,
-				  int (*acceptable) (void *contect,
-						     struct dentry * de),
-				  void *context)
+struct dentry *reiserfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
 {
-	__u32 obj[3], parent[3];
-
 	/* fhtype happens to reflect the number of u32s encoded.
 	 * due to a bug in earlier code, fhtype might indicate there
 	 * are more u32s then actually fitted.
@@ -1563,32 +1559,28 @@ struct dentry *reiserfs_decode_fh(struct
 	 * 6 - as above plus generation of directory
 	 * 6 does not fit in NFSv2 handles
 	 */
-	if (fhtype > len) {
-		if (fhtype != 6 || len != 5)
+	if (fh_type > fh_len) {
+		if (fh_type != 6 || fh_len != 5)
 			reiserfs_warning(sb,
-					 "nfsd/reiserfs, fhtype=%d, len=%d - odd",
-					 fhtype, len);
-		fhtype = 5;
+				"nfsd/reiserfs, fhtype=%d, len=%d - odd",
+				fh_type, fh_len);
+		fh_type = 5;
 	}
 
-	obj[0] = data[0];
-	obj[1] = data[1];
-	if (fhtype == 3 || fhtype >= 5)
-		obj[2] = data[2];
-	else
-		obj[2] = 0;	/* generation number */
+	return reiserfs_get_dentry(sb, fid->raw[0], fid->raw[1],
+		(fh_type == 3 || fh_type >= 5) ? fid->raw[2] : 0);
+}
 
-	if (fhtype >= 4) {
-		parent[0] = data[fhtype >= 5 ? 3 : 2];
-		parent[1] = data[fhtype >= 5 ? 4 : 3];
-		if (fhtype == 6)
-			parent[2] = data[5];
-		else
-			parent[2] = 0;
-	}
-	return sb->s_export_op->find_exported_dentry(sb, obj,
-						     fhtype < 4 ? NULL : parent,
-						     acceptable, context);
+struct dentry *reiserfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	if (fh_type < 4)
+		return NULL;
+
+	return reiserfs_get_dentry(sb,
+		(fh_type >= 5) ? fid->raw[3] : fid->raw[2],
+		(fh_type >= 5) ? fid->raw[4] : fid->raw[3],
+		(fh_type == 6) ? fid->raw[5] : 0);
 }
 
 int reiserfs_encode_fh(struct dentry *dentry, __u32 * data, int *lenp,
Index: linux-2.6/fs/reiserfs/super.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/super.c	2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/reiserfs/super.c	2007-09-13 15:21:12.0 +0200
@@ -651,9 +651,9 @@ static struct quotactl_ops reiserfs_qctl
 
 static struct export_operations reiserfs_export_ops = {
 	.encode_fh = reiserfs_encode_fh,
-	.decode_fh = reiserfs_decode_fh,
+	.fh_to_dentry = reiserfs_fh_to_dentry,
+	.fh_to_parent = reiserfs_fh_to_parent,
 	.get_parent = reiserfs_get_parent,
-	.get_dentry = reiserfs_get_dentry,
 };
 
 /* this struct is used in reiserfs_getopt () for containing the value for those
Index: linux-2.6/include/linux/reiserfs_fs.h
===================================================================
--- linux-2.6.orig/include/linux/reiserfs_fs.h	2007-09-13 15:10:45.0 +0200
+++ linux-2.6/include/linux/reiserfs_fs.h
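Since fh_type doubles as the number of u32s in a reiserfs handle, the slot that holds each field depends on it. The userspace sketch below mirrors the index selection in the two new methods. struct key3, clamp_fh_type(), object_key() and parent_key() are hypothetical names for illustration; note that in the patch as posted only reiserfs_fh_to_dentry applies the fh_type > fh_len clamp.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical triple standing in for the arguments of reiserfs_get_dentry(). */
struct key3 {
	uint32_t objectid;
	uint32_t dir_id;
	uint32_t generation;
};

/* A type larger than the handle length is treated as type 5. */
static int clamp_fh_type(int fh_type, int fh_len)
{
	return fh_type > fh_len ? 5 : fh_type;
}

/* The object itself: generation is only present for types 3, 5 and 6. */
static struct key3 object_key(const uint32_t *raw, int fh_type)
{
	struct key3 k;

	k.objectid = raw[0];
	k.dir_id = raw[1];
	k.generation = (fh_type == 3 || fh_type >= 5) ? raw[2] : 0;
	return k;
}

/* The parent: returns 0 when the handle carries none (fh_type < 4). */
static int parent_key(const uint32_t *raw, int fh_type, struct key3 *out)
{
	if (fh_type < 4)
		return 0;
	out->objectid = (fh_type >= 5) ? raw[3] : raw[2];
	out->dir_id = (fh_type >= 5) ? raw[4] : raw[3];
	out->generation = (fh_type == 6) ? raw[5] : 0;
	return 1;
}
```

Type 4 handles have no generation word, so the parent fields start one slot earlier than in types 5 and 6.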
[PATCH 15/19] gfs2: new export ops
Convert gfs2 to the new ops. Uses a similar structure to the generic
helpers, but gfs2 has its own file handle formats.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/gfs2/ops_export.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_export.c	2007-07-19 15:56:46.0 +0200
+++ linux-2.6/fs/gfs2/ops_export.c	2007-07-20 19:58:06.0 +0200
@@ -31,40 +31,6 @@
 #define GFS2_LARGE_FH_SIZE 8
 #define GFS2_OLD_FH_SIZE 10
 
-static struct dentry *gfs2_decode_fh(struct super_block *sb,
-				     __u32 *p,
-				     int fh_len,
-				     int fh_type,
-				     int (*acceptable)(void *context,
-						       struct dentry *dentry),
-				     void *context)
-{
-	__be32 *fh = (__force __be32 *)p;
-	struct gfs2_inum_host inum, parent;
-
-	memset(&parent, 0, sizeof(struct gfs2_inum));
-
-	switch (fh_len) {
-	case GFS2_LARGE_FH_SIZE:
-	case GFS2_OLD_FH_SIZE:
-		parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
-		parent.no_formal_ino |= be32_to_cpu(fh[5]);
-		parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
-		parent.no_addr |= be32_to_cpu(fh[7]);
-	case GFS2_SMALL_FH_SIZE:
-		inum.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
-		inum.no_formal_ino |= be32_to_cpu(fh[1]);
-		inum.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
-		inum.no_addr |= be32_to_cpu(fh[3]);
-		break;
-	default:
-		return NULL;
-	}
-
-	return gfs2_export_ops.find_exported_dentry(sb, &inum, &parent,
-						    acceptable, context);
-}
-
 static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
 			  int connectable)
 {
@@ -189,10 +155,10 @@ static struct dentry *gfs2_get_parent(st
 	return dentry;
 }
 
-static struct dentry *gfs2_get_dentry(struct super_block *sb, void *inum_obj)
+static struct dentry *gfs2_get_dentry(struct super_block *sb,
+		struct gfs2_inum_host *inum)
 {
 	struct gfs2_sbd *sdp = sb->s_fs_info;
-	struct gfs2_inum_host *inum = inum_obj;
 	struct gfs2_holder i_gh, ri_gh, rgd_gh;
 	struct gfs2_rgrpd *rgd;
 	struct inode *inode;
@@ -289,11 +255,50 @@ fail:
 	return ERR_PTR(error);
 }
 
+static struct dentry *gfs2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	struct gfs2_inum_host this;
+	__be32 *fh = (__force __be32 *)fid->raw;
+
+	switch (fh_type) {
+	case GFS2_SMALL_FH_SIZE:
+	case GFS2_LARGE_FH_SIZE:
+	case GFS2_OLD_FH_SIZE:
+		this.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
+		this.no_formal_ino |= be32_to_cpu(fh[1]);
+		this.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
+		this.no_addr |= be32_to_cpu(fh[3]);
+		return gfs2_get_dentry(sb, &this);
+	default:
+		return NULL;
+	}
+}
+
+static struct dentry *gfs2_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	struct gfs2_inum_host parent;
+	__be32 *fh = (__force __be32 *)fid->raw;
+
+	switch (fh_type) {
+	case GFS2_LARGE_FH_SIZE:
+	case GFS2_OLD_FH_SIZE:
+		parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
+		parent.no_formal_ino |= be32_to_cpu(fh[5]);
+		parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
+		parent.no_addr |= be32_to_cpu(fh[7]);
+		return gfs2_get_dentry(sb, &parent);
+	default:
+		return NULL;
+	}
+}
+
 struct export_operations gfs2_export_ops = {
-	.decode_fh = gfs2_decode_fh,
 	.encode_fh = gfs2_encode_fh,
+	.fh_to_dentry = gfs2_fh_to_dentry,
+	.fh_to_parent = gfs2_fh_to_parent,
 	.get_name = gfs2_get_name,
 	.get_parent = gfs2_get_parent,
-	.get_dentry = gfs2_get_dentry,
 };
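The gfs2 handle stores each 64-bit field as two big-endian 32-bit words: words 0-3 for the object and words 4-7 for the parent. Here is a userspace sketch of that decode, reading bytes explicitly so it gives the same answer on any host byte order. load_be32(), struct inum_host and decode_inum() are hypothetical stand-ins for be32_to_cpu() and struct gfs2_inum_host, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Read the 32-bit big-endian word number idx from a byte buffer. */
static uint32_t load_be32(const uint8_t *buf, int idx)
{
	const uint8_t *p = buf + 4 * idx;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical stand-in for struct gfs2_inum_host. */
struct inum_host {
	uint64_t no_formal_ino;
	uint64_t no_addr;
};

/* Decode four consecutive words starting at first_word (0 or 4). */
static struct inum_host decode_inum(const uint8_t *buf, int first_word)
{
	struct inum_host in;

	in.no_formal_ino = ((uint64_t)load_be32(buf, first_word)) << 32;
	in.no_formal_ino |= load_be32(buf, first_word + 1);
	in.no_addr = ((uint64_t)load_be32(buf, first_word + 2)) << 32;
	in.no_addr |= load_be32(buf, first_word + 3);
	return in;
}
```

Calling decode_inum(buf, 0) corresponds to the fh_to_dentry case and decode_inum(buf, 4) to fh_to_parent, which is only meaningful for the larger handle sizes.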
[PATCH 16/19] ocfs2: new export ops
OCFS2 has its own 64bit-friendly filehandle format so we can't use the
generic helpers here. I'll add a struct for the types later.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ocfs2/export.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/export.c	2007-05-06 13:51:17.0 +0200
+++ linux-2.6/fs/ocfs2/export.c	2007-06-12 15:54:44.0 +0200
@@ -45,9 +45,9 @@ struct ocfs2_inode_handle
 	u32 ih_generation;
 };
 
-static struct dentry *ocfs2_get_dentry(struct super_block *sb, void *vobjp)
+static struct dentry *ocfs2_get_dentry(struct super_block *sb,
+		struct ocfs2_inode_handle *handle)
 {
-	struct ocfs2_inode_handle *handle = vobjp;
 	struct inode *inode;
 	struct dentry *result;
 
@@ -200,54 +200,37 @@ bail:
 	return type;
 }
 
-static struct dentry *ocfs2_decode_fh(struct super_block *sb, u32 *fh_in,
-				      int fh_len, int fileid_type,
-				      int (*acceptable)(void *context,
-						        struct dentry *de),
-				      void *context)
+static struct dentry *ocfs2_fh_to_dentry(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
 {
-	struct ocfs2_inode_handle handle, parent;
-	struct dentry *ret = NULL;
-	__le32 *fh = (__force __le32 *) fh_in;
-
-	mlog_entry("(0x%p, 0x%p, %d, %d, 0x%p, 0x%p)\n",
-		   sb, fh, fh_len, fileid_type, acceptable, context);
+	struct ocfs2_inode_handle handle;
 
-	if (fh_len < 3 || fileid_type > 2)
-		goto bail;
+	if (fh_len < 3 || fh_type > 2)
+		return NULL;
 
-	if (fileid_type == 2) {
-		if (fh_len < 6)
-			goto bail;
-
-		parent.ih_blkno = (u64)le32_to_cpu(fh[3]) << 32;
-		parent.ih_blkno |= (u64)le32_to_cpu(fh[4]);
-		parent.ih_generation = le32_to_cpu(fh[5]);
-
-		mlog(0, "Decoding parent: blkno: %llu, generation: %u\n",
-		     (unsigned long long)parent.ih_blkno,
-		     parent.ih_generation);
-	}
-
-	handle.ih_blkno = (u64)le32_to_cpu(fh[0]) << 32;
-	handle.ih_blkno |= (u64)le32_to_cpu(fh[1]);
-	handle.ih_generation = le32_to_cpu(fh[2]);
+	handle.ih_blkno = (u64)le32_to_cpu(fid->raw[0]) << 32;
+	handle.ih_blkno |= (u64)le32_to_cpu(fid->raw[1]);
+	handle.ih_generation = le32_to_cpu(fid->raw[2]);
+	return ocfs2_get_dentry(sb, &handle);
+}
 
-	mlog(0, "Encoding fh: blkno: %llu, generation: %u\n",
-	     (unsigned long long)handle.ih_blkno, handle.ih_generation);
+static struct dentry *ocfs2_fh_to_parent(struct super_block *sb,
+		struct fid *fid, int fh_len, int fh_type)
+{
+	struct ocfs2_inode_handle parent;
 
-	ret = ocfs2_export_ops.find_exported_dentry(sb, &handle, &parent,
-						    acceptable, context);
+	if (fh_type != 2 || fh_len < 6)
+		return NULL;
 
-bail:
-	mlog_exit_ptr(ret);
-	return ret;
+	parent.ih_blkno = (u64)le32_to_cpu(fid->raw[3]) << 32;
+	parent.ih_blkno |= (u64)le32_to_cpu(fid->raw[4]);
+	parent.ih_generation = le32_to_cpu(fid->raw[5]);
+	return ocfs2_get_dentry(sb, &parent);
 }
 
 struct export_operations ocfs2_export_ops = {
-	.decode_fh	= ocfs2_decode_fh,
 	.encode_fh	= ocfs2_encode_fh,
-
+	.fh_to_dentry	= ocfs2_fh_to_dentry,
+	.fh_to_parent	= ocfs2_fh_to_parent,
 	.get_parent	= ocfs2_get_parent,
-	.get_dentry	= ocfs2_get_dentry,
 };
[PATCH 17/19] exportfs: remove old methods
Now that all filesystems are converted remove support for the old methods.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/exportfs/expfs.c
===================================================================
--- linux-2.6.orig/fs/exportfs/expfs.c	2007-08-29 13:52:01.0 +0200
+++ linux-2.6/fs/exportfs/expfs.c	2007-08-29 14:02:41.0 +0200
@@ -13,19 +13,6 @@ static int get_name(struct dentry *dentr
 		struct dentry *child);
 
 
-static struct dentry *exportfs_get_dentry(struct super_block *sb, void *obj)
-{
-	struct dentry *result = ERR_PTR(-ESTALE);
-
-	if (sb->s_export_op->get_dentry) {
-		result = sb->s_export_op->get_dentry(sb, obj);
-		if (!result)
-			result = ERR_PTR(-ESTALE);
-	}
-
-	return result;
-}
-
 static int exportfs_get_name(struct dentry *dir, char *name,
 		struct dentry *child)
 {
@@ -214,125 +201,6 @@ reconnect_path(struct super_block *sb, s
 	return 0;
 }
 
-/**
- * find_exported_dentry - helper routine to implement export_operations->decode_fh
- * @sb:		The &super_block identifying the filesystem
- * @obj:	An opaque identifier of the object to be found - passed to
- *		get_inode
- * @parent:	An optional opqaue identifier of the parent of the object.
- * @acceptable:	A function used to test possible &dentries to see if they are
- *		acceptable
- * @context:	A parameter to @acceptable so that it knows on what basis to
- *		judge.
- *
- * find_exported_dentry is the central helper routine to enable file systems
- * to provide the decode_fh() export_operation.  It's main task is to take
- * an &inode, find or create an appropriate &dentry structure, and possibly
- * splice this into the dcache in the correct place.
- *
- * The decode_fh() operation provided by the filesystem should call
- * find_exported_dentry() with the same parameters that it received except
- * that instead of the file handle fragment, pointers to opaque identifiers
- * for the object and optionally its parent are passed.  The default decode_fh
- * routine passes one pointer to the start of the filehandle fragment, and
- * one 8 bytes into the fragment.  It is expected that most filesystems will
- * take this approach, though the offset to the parent identifier may well be
- * different.
- *
- * find_exported_dentry() will call get_dentry to get an dentry pointer from
- * the file system.  If any &dentry in the d_alias list is acceptable, it will
- * be returned.  Otherwise find_exported_dentry() will attempt to splice a new
- * &dentry into the dcache using get_name() and get_parent() to find the
- * appropriate place.
- */
-
-struct dentry *
-find_exported_dentry(struct super_block *sb, void *obj, void *parent,
-		     int (*acceptable)(void *context, struct dentry *de),
-		     void *context)
-{
-	struct dentry *result, *alias;
-	int err = -ESTALE;
-
-	/*
-	 * Attempt to find the inode.
-	 */
-	result = exportfs_get_dentry(sb, obj);
-	if (IS_ERR(result))
-		return result;
-
-	if (S_ISDIR(result->d_inode->i_mode)) {
-		if (!(result->d_flags & DCACHE_DISCONNECTED)) {
-			if (acceptable(context, result))
-				return result;
-			err = -EACCES;
-			goto err_result;
-		}
-
-		err = reconnect_path(sb, result);
-		if (err)
-			goto err_result;
-	} else {
-		struct dentry *target_dir, *nresult;
-		char nbuf[NAME_MAX+1];
-
-		alias = find_acceptable_alias(result, acceptable, context);
-		if (alias)
-			return alias;
-
-		if (parent == NULL)
-			goto err_result;
-
-		target_dir = exportfs_get_dentry(sb,parent);
-		if (IS_ERR(target_dir)) {
-			err = PTR_ERR(target_dir);
-			goto err_result;
-		}
-
-		err = reconnect_path(sb, target_dir);
-		if (err) {
-			dput(target_dir);
-			goto err_result;
-		}
-
-		/*
-		 * As we weren't after a directory, have one more step to go.
-		 */
-		err = exportfs_get_name(target_dir, nbuf, result);
-		if (!err) {
-			mutex_lock(&target_dir->d_inode->i_mutex);
-			nresult = lookup_one_len(nbuf, target_dir,
-						 strlen(nbuf));
-			mutex_unlock(&target_dir->d_inode->i_mutex);
-			if (!IS_ERR(nresult)) {
-				if (nresult->d_inode) {
-					dput(result);
[PATCH 18/19] exportfs: make struct export_operations const
Now that nfsd has stopped writing to the find_exported_dentry member we can
mark the export_operations const.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/efs/super.c
===
--- linux-2.6.orig/fs/efs/super.c	2007-09-13 15:06:51.0 +0200
+++ linux-2.6/fs/efs/super.c	2007-09-13 15:09:05.0 +0200
@@ -113,7 +113,7 @@ static const struct super_operations efs
 	.remount_fs = efs_remount,
 };
 
-static struct export_operations efs_export_ops = {
+static const struct export_operations efs_export_ops = {
 	.fh_to_dentry = efs_fh_to_dentry,
 	.fh_to_parent = efs_fh_to_parent,
 	.get_parent = efs_get_parent,
Index: linux-2.6/fs/ext2/super.c
===
--- linux-2.6.orig/fs/ext2/super.c	2007-09-13 15:05:26.0 +0200
+++ linux-2.6/fs/ext2/super.c	2007-09-13 15:09:05.0 +0200
@@ -292,7 +292,7 @@ static struct dentry *ext2_fh_to_parent(
  * systems, but can be improved upon.
  * Currently only get_parent is required.
  */
-static struct export_operations ext2_export_ops = {
+static const struct export_operations ext2_export_ops = {
 	.fh_to_dentry = ext2_fh_to_dentry,
 	.fh_to_parent = ext2_fh_to_parent,
 	.get_parent = ext2_get_parent,
Index: linux-2.6/fs/ext3/super.c
===
--- linux-2.6.orig/fs/ext3/super.c	2007-09-13 15:06:39.0 +0200
+++ linux-2.6/fs/ext3/super.c	2007-09-13 15:09:05.0 +0200
@@ -670,7 +670,7 @@ static const struct super_operations ext
 #endif
 };
 
-static struct export_operations ext3_export_ops = {
+static const struct export_operations ext3_export_ops = {
 	.fh_to_dentry = ext3_fh_to_dentry,
 	.fh_to_parent = ext3_fh_to_parent,
 	.get_parent = ext3_get_parent,
Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c	2007-09-13 15:06:44.0 +0200
+++ linux-2.6/fs/ext4/super.c	2007-09-13 15:09:05.0 +0200
@@ -721,7 +721,7 @@ static const struct super_operations ext
 #endif
 };
 
-static struct export_operations ext4_export_ops = {
+static const struct export_operations ext4_export_ops = {
 	.fh_to_dentry = ext4_fh_to_dentry,
 	.fh_to_parent = ext4_fh_to_parent,
 	.get_parent = ext4_get_parent,
Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c	2007-09-13 15:08:12.0 +0200
+++ linux-2.6/fs/fat/inode.c	2007-09-13 15:09:05.0 +0200
@@ -769,7 +769,7 @@ out:
 	return parent;
 }
 
-static struct export_operations fat_export_ops = {
+static const struct export_operations fat_export_ops = {
 	.encode_fh = fat_encode_fh,
 	.fh_to_dentry = fat_fh_to_dentry,
 	.get_parent = fat_get_parent,
Index: linux-2.6/fs/gfs2/ops_export.c
===
--- linux-2.6.orig/fs/gfs2/ops_export.c	2007-09-13 15:08:53.0 +0200
+++ linux-2.6/fs/gfs2/ops_export.c	2007-09-13 15:09:05.0 +0200
@@ -294,7 +294,7 @@ static struct dentry *gfs2_fh_to_parent(
 	}
 }
 
-struct export_operations gfs2_export_ops = {
+const struct export_operations gfs2_export_ops = {
 	.encode_fh = gfs2_encode_fh,
 	.fh_to_dentry = gfs2_fh_to_dentry,
 	.fh_to_parent = gfs2_fh_to_parent,
Index: linux-2.6/fs/isofs/export.c
===
--- linux-2.6.orig/fs/isofs/export.c	2007-09-13 15:08:18.0 +0200
+++ linux-2.6/fs/isofs/export.c	2007-09-13 15:09:05.0 +0200
@@ -207,7 +207,7 @@ static struct dentry *isofs_fh_to_parent
 			fh_len > 4 ? ifid->parent_generation : 0);
 }
 
-struct export_operations isofs_export_ops = {
+const struct export_operations isofs_export_ops = {
 	.encode_fh = isofs_export_encode_fh,
 	.fh_to_dentry = isofs_fh_to_dentry,
 	.fh_to_parent = isofs_fh_to_parent,
Index: linux-2.6/fs/isofs/isofs.h
===
--- linux-2.6.orig/fs/isofs/isofs.h	2007-09-11 16:23:34.0 +0200
+++ linux-2.6/fs/isofs/isofs.h	2007-09-13 15:09:05.0 +0200
@@ -178,4 +178,4 @@ isofs_normalize_block_and_offset(struct
 extern const struct inode_operations isofs_dir_inode_operations;
 extern const struct file_operations isofs_dir_operations;
 extern const struct address_space_operations isofs_symlink_aops;
-extern struct export_operations isofs_export_ops;
+extern const struct export_operations isofs_export_ops;
Index: linux-2.6/fs/jfs/super.c
===
--- linux-2.6.orig/fs/jfs/super.c	2007-09-13 15:07:17.0 +0200
+++
[PATCH 19/19] exportfs: update documentation
Update documentation to the current state of affairs. Remove duplicated
method descriptions in exportfs.h and point to
Documentation/filesystems/Exporting instead. Add a little file header
comment in expfs.c describing what's going on and mentioning Neil's and my
copyright [1].

[1] Neil, in case you want a different/additional attribution just change
the patch in your queue to reflect the preferred version.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/Documentation/filesystems/Exporting
===
--- linux-2.6.orig/Documentation/filesystems/Exporting	2007-03-16 15:10:54.0 +0100
+++ linux-2.6/Documentation/filesystems/Exporting	2007-03-16 17:11:50.0 +0100
@@ -2,7 +2,10 @@
 Making Filesystems Exportable
 =============================
 
-Most filesystem operations require a dentry (or two) as a starting
+Overview
+--------
+
+All filesystem operations require a dentry (or two) as a starting
 point. Local applications have a reference-counted hold on suitable
 dentrys via open file descriptors or cwd/root. However remote
 applications that access a filesystem via a remote filesystem protocol
@@ -89,11 +92,9 @@ For a filesystem to be exportable it mus
   1/ provide the filehandle fragment routines described below.
   2/ make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.
-      Typically the ->lookup routine will end:
-		if (inode)
-			return d_splice(inode, dentry);
-		d_add(dentry, inode);
-		return NULL;
+      Typically the ->lookup routine will end with a:
+
+		return d_splice_alias(inode, dentry);
 	}
 
 
@@ -101,67 +102,39 @@ For a filesystem to be exportable it mus
 A file system implementation declares that instances of the filesystem
 are exportable by setting the s_export_op field in the struct
 super_block. This field must point to a "struct export_operations"
-struct which could potentially be full of NULLs, though normally at
-least get_parent will be set.
+struct which has the following members:
 
- The primary operations are decode_fh and encode_fh.
-decode_fh takes a filehandle fragment and tries to find or create a
-dentry for the object referred to by the filehandle.
-encode_fh takes a dentry and creates a filehandle fragment which can
-later be used to find/create a dentry for the same object.
-
-decode_fh will probably make use of find_exported_dentry.
-This function lives in the exportfs module which a filesystem does
-not need unless it is being exported. So rather that calling
-find_exported_dentry directly, each filesystem should call it through
-the find_exported_dentry pointer in it's export_operations table.
-This field is set correctly by the exporting agent (e.g. nfsd) when a
-filesystem is exported, and before any export operations are called.
-
-find_exported_dentry needs three support functions from the
-filesystem:
- get_name. When given a parent dentry and a child dentry, this
-should find a name in the directory identified by the parent
-dentry, which leads to the object identified by the child dentry.
-If no get_name function is supplied, a default implementation is
-provided which uses vfs_readdir to find potential names, and
-matches inode numbers to find the correct match.
-
- get_parent. When given a dentry for a directory, this should return
-a dentry for the parent. Quite possibly the parent dentry will
-have been allocated by d_alloc_anon.
-The default get_parent function just returns an error so any
-filehandle lookup that requires finding a parent will fail.
-->lookup("..") is *not* used as a default as it can leave ".."
-entries in the dcache which are too messy to work with.
-
- get_dentry. When given an opaque datum, this should find the
-implied object and create a dentry for it (possibly with
-d_alloc_anon).
-The opaque datum is whatever is passed down by the decode_fh
-function, and is often simply a fragment of the filehandle
-fragment.
-decode_fh passes two datums through find_exported_dentry. One that
-should be used to identify the target object, and one that can be
-used to identify the object's parent, should that be necessary.
-The default get_dentry function assumes that the datum contains an
-inode number and a generation number, and it attempts to get the
-inode using iget and check it's validity by matching the
-generation number. A filesystem should only depend on the default
-if iget can safely be used this way.
-
-If decode_fh and/or encode_fh are left as NULL, then default
-implementations are used. These defaults are suitable for ext2 and
-extremely similar filesystems (like ext3).
-
-The default encode_fh creates a filehandle fragment from the inode
-number and generation
[PATCH 06/19] ext4: new export ops
Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c	2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext4/super.c	2007-09-13 15:18:21.0 +0200
@@ -613,13 +613,10 @@ static int ext4_show_options(struct seq_
 }
 
 
-static struct dentry *ext4_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext4_nfs_get_inode(struct super_block *sb,
+		u64 ino, u32 generation)
 {
-	__u32 *objp = vobjp;
-	unsigned long ino = objp[0];
-	__u32 generation = objp[1];
 	struct inode *inode;
-	struct dentry *result;
 
 	if (ino < EXT4_FIRST_INO(sb) && ino != EXT4_ROOT_INO)
 		return ERR_PTR(-ESTALE);
@@ -642,15 +639,22 @@ static struct dentry *ext4_get_dentry(st
 		iput(inode);
 		return ERR_PTR(-ESTALE);
 	}
-	/* now to find a dentry.
-	 * If possible, get a well-connected one
-	 */
-	result = d_alloc_anon(inode);
-	if (!result) {
-		iput(inode);
-		return ERR_PTR(-ENOMEM);
-	}
-	return result;
+
+	return inode;
+}
+
+static struct dentry *ext4_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    ext4_nfs_get_inode);
+}
+
+static struct dentry *ext4_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    ext4_nfs_get_inode);
 }
 
 #ifdef CONFIG_QUOTA
@@ -720,8 +724,9 @@ static const struct super_operations ext
 };
 
 static struct export_operations ext4_export_ops = {
+	.fh_to_dentry = ext4_fh_to_dentry,
+	.fh_to_parent = ext4_fh_to_parent,
 	.get_parent = ext4_get_parent,
-	.get_dentry = ext4_get_dentry,
 };
 
 enum {
[PATCH 04/19] ext2: new export ops
Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ext2/super.c
===
--- linux-2.6.orig/fs/ext2/super.c	2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext2/super.c	2007-09-13 15:16:25.0 +0200
@@ -248,13 +248,10 @@ static const struct super_operations ext
 #endif
 };
 
-static struct dentry *ext2_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext2_nfs_get_inode(struct super_block *sb,
+		u64 ino, u32 generation)
 {
-	__u32 *objp = vobjp;
-	unsigned long ino = objp[0];
-	__u32 generation = objp[1];
 	struct inode *inode;
-	struct dentry *result;
 
 	if (ino < EXT2_FIRST_INO(sb) && ino != EXT2_ROOT_INO)
 		return ERR_PTR(-ESTALE);
@@ -275,15 +272,21 @@ static struct dentry *ext2_get_dentry(st
 		iput(inode);
 		return ERR_PTR(-ESTALE);
 	}
-	/* now to find a dentry.
-	 * If possible, get a well-connected one
-	 */
-	result = d_alloc_anon(inode);
-	if (!result) {
-		iput(inode);
-		return ERR_PTR(-ENOMEM);
-	}
-	return result;
+	return inode;
+}
+
+static struct dentry *ext2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    ext2_nfs_get_inode);
+}
+
+static struct dentry *ext2_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    ext2_nfs_get_inode);
 }
 
 /* Yes, most of these are left as NULL!!
@@ -292,8 +295,9 @@ static struct dentry *ext2_get_dentry(st
  * Currently only get_parent is required.
  */
 static struct export_operations ext2_export_ops = {
+	.fh_to_dentry = ext2_fh_to_dentry,
+	.fh_to_parent = ext2_fh_to_parent,
 	.get_parent = ext2_get_parent,
-	.get_dentry = ext2_get_dentry,
 };
 
 static unsigned long get_sb_block(void **data)
[PATCH 05/19] ext3: new export ops
Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ext3/super.c
===
--- linux-2.6.orig/fs/ext3/super.c	2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/ext3/super.c	2007-09-13 15:16:57.0 +0200
@@ -562,13 +562,10 @@ static int ext3_show_options(struct seq_
 }
 
 
-static struct dentry *ext3_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *ext3_nfs_get_inode(struct super_block *sb,
+		u64 ino, u32 generation)
 {
-	__u32 *objp = vobjp;
-	unsigned long ino = objp[0];
-	__u32 generation = objp[1];
 	struct inode *inode;
-	struct dentry *result;
 
 	if (ino < EXT3_FIRST_INO(sb) && ino != EXT3_ROOT_INO)
 		return ERR_PTR(-ESTALE);
@@ -591,15 +588,22 @@ static struct dentry *ext3_get_dentry(st
 		iput(inode);
 		return ERR_PTR(-ESTALE);
 	}
-	/* now to find a dentry.
-	 * If possible, get a well-connected one
-	 */
-	result = d_alloc_anon(inode);
-	if (!result) {
-		iput(inode);
-		return ERR_PTR(-ENOMEM);
-	}
-	return result;
+
+	return inode;
+}
+
+static struct dentry *ext3_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    ext3_nfs_get_inode);
+}
+
+static struct dentry *ext3_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    ext3_nfs_get_inode);
 }
 
 #ifdef CONFIG_QUOTA
@@ -669,8 +673,9 @@ static const struct super_operations ext
 };
 
 static struct export_operations ext3_export_ops = {
+	.fh_to_dentry = ext3_fh_to_dentry,
+	.fh_to_parent = ext3_fh_to_parent,
 	.get_parent = ext3_get_parent,
-	.get_dentry = ext3_get_dentry,
 };
 
 enum {
[PATCH 08/19] jfs: new export ops
Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/jfs/jfs_inode.h
===
--- linux-2.6.orig/fs/jfs/jfs_inode.h	2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/jfs_inode.h	2007-09-13 15:19:06.0 +0200
@@ -18,6 +18,8 @@
 #ifndef	_H_JFS_INODE
 #define _H_JFS_INODE
 
+struct fid;
+
 extern struct inode *ialloc(struct inode *, umode_t);
 extern int jfs_fsync(struct file *, struct dentry *, int);
 extern int jfs_ioctl(struct inode *, struct file *,
@@ -32,7 +34,10 @@ extern void jfs_truncate_nolock(struct i
 extern void jfs_free_zero_link(struct inode *);
 extern struct dentry *jfs_get_parent(struct dentry *dentry);
 extern void jfs_get_inode_flags(struct jfs_inode_info *);
-extern struct dentry *jfs_get_dentry(struct super_block *sb, void *vobjp);
+extern struct dentry *jfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+	int fh_len, int fh_type);
+extern struct dentry *jfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+	int fh_len, int fh_type);
 extern void jfs_set_inode_flags(struct inode *);
 extern int jfs_get_block(struct inode *, sector_t, struct buffer_head *, int);
 
Index: linux-2.6/fs/jfs/namei.c
===
--- linux-2.6.orig/fs/jfs/namei.c	2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/namei.c	2007-09-13 15:19:42.0 +0200
@@ -20,6 +20,7 @@
 #include <linux/fs.h>
 #include <linux/ctype.h>
 #include <linux/quotaops.h>
+#include <linux/exportfs.h>
 #include "jfs_incore.h"
 #include "jfs_superblock.h"
 #include "jfs_inode.h"
@@ -1477,13 +1478,10 @@ static struct dentry *jfs_lookup(struct
 	return dentry;
 }
 
-struct dentry *jfs_get_dentry(struct super_block *sb, void *vobjp)
+static struct inode *jfs_nfs_get_inode(struct super_block *sb,
+		u64 ino, u32 generation)
 {
-	__u32 *objp = vobjp;
-	unsigned long ino = objp[0];
-	__u32 generation = objp[1];
 	struct inode *inode;
-	struct dentry *result;
 
 	if (ino == 0)
 		return ERR_PTR(-ESTALE);
@@ -1493,20 +1491,25 @@ struct dentry *jfs_get_dentry(struct sup
 
 	if (is_bad_inode(inode) ||
 	    (generation && inode->i_generation != generation)) {
-		result = ERR_PTR(-ESTALE);
-		goto out_iput;
+		iput(inode);
+		return ERR_PTR(-ESTALE);
 	}
 
-	result = d_alloc_anon(inode);
-	if (!result) {
-		result = ERR_PTR(-ENOMEM);
-		goto out_iput;
-	}
-	return result;
+	return inode;
+}
 
- out_iput:
-	iput(inode);
-	return result;
+struct dentry *jfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+	int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    jfs_nfs_get_inode);
+}
+
+struct dentry *jfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+	int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    jfs_nfs_get_inode);
 }
 
 struct dentry *jfs_get_parent(struct dentry *dentry)
Index: linux-2.6/fs/jfs/super.c
===
--- linux-2.6.orig/fs/jfs/super.c	2007-09-13 15:10:46.0 +0200
+++ linux-2.6/fs/jfs/super.c	2007-09-13 15:19:06.0 +0200
@@ -738,7 +738,8 @@ static const struct super_operations jfs
 };
 
 static struct export_operations jfs_export_operations = {
-	.get_dentry = jfs_get_dentry,
+	.fh_to_dentry = jfs_fh_to_dentry,
+	.fh_to_parent = jfs_fh_to_parent,
 	.get_parent = jfs_get_parent,
 };
 
[PATCH 09/19] ntfs: new export ops
Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/fs/ntfs/namei.c
===
--- linux-2.6.orig/fs/ntfs/namei.c	2007-09-13 15:10:45.0 +0200
+++ linux-2.6/fs/ntfs/namei.c	2007-09-13 15:20:12.0 +0200
@@ -450,58 +450,40 @@ try_next:
 	return parent_dent;
 }
 
-/**
- * ntfs_get_dentry - find a dentry for the inode from a file handle sub-fragment
- * @sb:		super block identifying the mounted ntfs volume
- * @fh:		the file handle sub-fragment
- *
- * Find a dentry for the inode given a file handle sub-fragment. This function
- * is called from fs/exportfs/expfs.c::find_exported_dentry() which in turn is
- * called from the default ->decode_fh() which is export_decode_fh() in the
- * same file. The code is closely based on the default ->get_dentry() helper
- * fs/exportfs/expfs.c::get_object().
- *
- * The @fh contains two 32-bit unsigned values, the first one is the inode
- * number and the second one is the inode generation.
- *
- * Return the dentry on success or the error code on error (IS_ERR() is true).
- */
-static struct dentry *ntfs_get_dentry(struct super_block *sb, void *fh)
+static struct inode *ntfs_nfs_get_inode(struct super_block *sb,
+		u64 ino, u32 generation)
 {
-	struct inode *vi;
-	struct dentry *dent;
-	unsigned long ino = ((u32 *)fh)[0];
-	u32 gen = ((u32 *)fh)[1];
-
-	ntfs_debug("Entering for inode 0x%lx, generation 0x%x.", ino, gen);
-	vi = ntfs_iget(sb, ino);
-	if (IS_ERR(vi)) {
-		ntfs_error(sb, "Failed to get inode 0x%lx.", ino);
-		return (struct dentry *)vi;
-	}
-	if (unlikely(is_bad_inode(vi) || vi->i_generation != gen)) {
-		/* We didn't find the right inode. */
-		ntfs_error(sb, "Inode 0x%lx, bad count: %d %d or version 0x%x "
-				"0x%x.", vi->i_ino, vi->i_nlink,
-				atomic_read(&vi->i_count), vi->i_generation,
-				gen);
-		iput(vi);
-		return ERR_PTR(-ESTALE);
-	}
-	/* Now find a dentry. If possible, get a well-connected one. */
-	dent = d_alloc_anon(vi);
-	if (unlikely(!dent)) {
-		iput(vi);
-		return ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = ntfs_iget(sb, ino);
+	if (!IS_ERR(inode)) {
+		if (is_bad_inode(inode) || inode->i_generation != generation) {
+			iput(inode);
+			inode = ERR_PTR(-ESTALE);
+		}
 	}
-	ntfs_debug("Done for inode 0x%lx, generation 0x%x.", ino, gen);
-	return dent;
+
+	return inode;
+}
+
+static struct dentry *ntfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    ntfs_nfs_get_inode);
+}
+
+static struct dentry *ntfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+		int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    ntfs_nfs_get_inode);
 }
 
 /**
  * Export operations allowing NFS exporting of mounted NTFS partitions.
  *
- * We use the default ->decode_fh() and ->encode_fh() for now. Note that they
+ * We use the default ->encode_fh() for now. Note that they
  * use 32 bits to store the inode number which is an unsigned long so on 64-bit
  * architectures is usually 64 bits so it would all fail horribly on huge
  * volumes. I guess we need to define our own encode and decode fh functions
@@ -520,7 +502,6 @@ static struct dentry *ntfs_get_dentry(st
 struct export_operations ntfs_export_ops = {
 	.get_parent	= ntfs_get_parent,	/* Find the parent of a given
 						   directory. */
-	.get_dentry	= ntfs_get_dentry,	/* Find a dentry for the inode
-						   given a file handle
-						   sub-fragment. */
+	.fh_to_dentry	= ntfs_fh_to_dentry,
+	.fh_to_parent	= ntfs_fh_to_parent,
 };
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 09:06, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Nick Piggin wrote:
> > So lumpy reclaim does not change my formula nor significantly help
> > against a fragmentation attack. AFAIKS.
>
> Lumpy reclaim improves the situation significantly because the
> overwhelming majority of allocations during the lifetime of a system
> are movable and thus it is able to opportunistically restore the
> availability of higher order pages by reclaiming neighboring pages.

I'm talking about non movable allocations.

> > [*] ok, this isn't quite true because if you can actually put a hard
> > limit on unmovable allocations then anti-frag will fundamentally help
> > -- get back to me on that when you get patches to move most of the
> > obvious ones.
>
> We have this hard limit using ZONE_MOVABLE in 2.6.23.

So we're back to 2nd class support.

> > Sure, and I pointed out the theoretical figure for 64K pages as well.
> > Is that figure not problematic to you? Where do you draw the limit for
> > what is acceptable? Why? What happens with tiny memory machines where
> > a reserve or even the anti-frag patches may not be acceptable and/or
> > work very well? When do you require reserve pools? Why are reserve
> > pools acceptable for first-class support of filesystems when it has
> > very loudly been made a known policy decision by Linus in the past
> > (and for some valid reasons) that we should not put limits on the
> > sizes of caches in the kernel.
>
> 64K pages may be problematic because it is above the PAGE_ORDER_COSTLY
> in 2.6.23. 32K is currently much safer because lumpy reclaim can
> restore these and does so on my systems. I expect the situation for 64K
> pages to improve when more of Mel's patches go in. We have long term
> experience with 32k sized allocation through Andrew's tree.
>
> Reserve pools as handled (by the not yet available) large page pool
> patches (which again has altogether another purpose) are not a limit.
> The reserve pools are used to provide a minimum of higher order pages
> that is not broken down in order to ensure that a minimum number of the
> desired order of pages is even available in your worst case scenario.
> Mainly I think that is needed during the period when memory
> defragmentation is still under development.

fsblock doesn't need any of those hacks, of course.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 12:01, Nick Piggin wrote:
> On Thursday 13 September 2007 23:03, David Chinner wrote:
> > Then just do operations on directories with lots of files in them
> > (tens of thousands). Every directory operation will require at
> > least one vmap in this situation - e.g. a traversal will result in
> > lots and lots of blocks being read that will require vmap() for every
> > directory block read from disk and an unmap almost immediately
> > afterwards when the reference is dropped
>
> Ah, wow, thanks: I can reproduce it.

OK, the vunmap batching code wipes your TLB flushing and IPIs off
the table. Diffstat below, but the TLB portions are here (besides that
_everything_ is probably lower due to fewer TLB misses caused by the
TLB flushing):

-170 -99.4% sn2_send_IPI
-343 -100.0% sn_send_IPI_phys
-17911 -99.9% smp_call_function

Total performance went up by 30% on a 64-way system (248 seconds to
172 seconds to run parallel finds over different huge directories).

23012 54790.5% _read_lock
9427 329.0% __get_vm_area_node
5792 0.0% __find_vm_area
1590 53000.0% __vunmap
10726.0% _spin_lock
74 119.4% _xfs_buf_find
58 0.0% __unmap_kernel_range
5336.6% kmem_zone_alloc
-129 -100.0% pio_phys_write_mmr
-144 -100.0% unmap_kernel_range
-170 -99.4% sn2_send_IPI
-233 -59.1% kfree
-266 -100.0% find_next_bit
-343 -100.0% sn_send_IPI_phys
-564 -19.9% xfs_iget_core
-1946 -100.0% remove_vm_area
-17911 -99.9% smp_call_function
-62726-7.2% _write_lock
-438360 -64.2% default_idle
-482631 -30.4% total

Next I have some patches to scale the vmap locks and data
structures better, but they're not quite ready yet. This looks like it
should result in a further speedup of several times when combined
with the TLB flushing reductions here...
Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change
On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
> On Fri, 14 Sep 2007 20:25:45 +1000
> Greg Banks [EMAIL PROTECTED] wrote:
> > I'm curious about the reasons behind this change. You mention
> > credential issues; how exactly is it that you have the correct
> > creds to perform a WRITE rpc but not a SETATTR rpc?
>
> Consider this case. user1 and user2 are both members of group
> allusers:
>
> user1$ echo foo > foo
> user1$ chgrp allusers foo
> user1$ chmod 04770 foo
> user2$ echo bar >> foo
>
> On most local filesystems, this would work correctly. The end result
> would be a file with mode 0770 and the expected contents. On NFS
> though, the write by user2 fails. When the write is attempted, the
> kernel tries to squash the setuid bit using the credentials of user2,
> who's not allowed to change the mode. The write then fails because
> the setattr fails.

Ok, I ran an experiment and I see this failure mode. So the SETATTR
rpc is really a side effect of the client kernel's behaviour and not an
operation directly requested by the user process on the client.

Is there any reason why that rpc needs to have user2's creds? Why not
do the rpc with a fake set of creds with uid and gid set to the uid and
gid of the file, in this case user1/allusers ? That way the rpc will
most likely pass the server's permission check.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere. Which MPHG character are you?
I don't speak for SGI.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 09:17, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Nick Piggin wrote:
> > I will still argue that my approach is the better technical solution
> > for large block support than yours, I don't think we made progress on
> > that. And I'm quite sure we agreed at the VM summit not to rely on
> > your patches for VM or IO scalability.
>
> The approach has already been tried (see the XFS layer) and found
> lacking.

It is lacking because our vmap algorithms are simplistic to the point
of being utterly inadequate for the new requirements. There has not
been any fundamental problem revealed (like the fragmentation
approach has).

However fsblock can do everything that higher order pagecache can
do in terms of avoiding vmap and giving contiguous memory to block
devices by opportunistically allocating higher orders of pages, and
falling back to vmap if they cannot be satisfied.

So if you argue that vmap is a downside, then please tell me how you
consider the -ENOMEM of your approach to be better?

> Having a fake linear block through vmalloc means that a special
> software layer must be introduced and we may face special casing in
> the block / fs layer to check if we have one of these strange vmalloc
> blocks.

I guess you're a little confused. There is nothing fake about the
linear address. Filesystems of course need changes (the block layer
needs none -- why would it?). But changing APIs to accommodate a
better solution is what Linux is about.

If, by special software layer, you mean the vmap/vunmap support in
fsblock, let's see... that's probably all of a hundred or two lines.
Contrast that with anti-fragmentation, lumpy reclaim, higher order
pagecache and its new special mmap layer... Hmm, seems like a no
brainer to me. You really still want to pursue the extra layer
argument as a point against fsblock here?

> > But you just showed in two emails that you don't understand what the
> > problem is. To reiterate: lumpy reclaim does *not* invalidate my
> > formulae; and antifrag does *not* isolate the issue.
>
> I do understand what the problem is. I just do not get what your
> problem with this is and why you have this drive to demand perfection.
> We are

Oh. I don't think I could explain that if you still don't understand by
now. But that's not the main issue: all that I ask is you consider
fsblock on technical grounds.

> working a variety of approaches on the (potential) issue but you
> categorically state that it cannot be solved.

Of course I wouldn't state that. On the contrary, I categorically state
that I have already solved it :)

> > But what do you say about viable alternatives that do not have to
> > worry about these unlikely scenarios, full stop? So, why should we
> > not use fs block for higher order page support?
>
> Because it has already been rejected in another form and adds more

You have rejected it. But they are bogus reasons, as I showed above.

You also describe some other real (if lesser) issues like number of page
structs to manage in the pagecache. But this is hardly enough to reject
my patch now... for every downside you can point out in my approach, I
can point out one in yours.

- fsblock doesn't require changes to virtual memory layer
- fsblock can retain cache of just 4K in a 4K block size file

How about those? I know very well how Linus feels about both of them.

> Maybe we could get to something like a hybrid that avoids some of
> these issues? Add support so something like a virtual compound page
> can be handled transparently in the filesystem layer with special
> casing if such a beast reaches the block layer?

That's conceptually much worse, IMO.

And practically worse as well: vmap space is limited on 32-bit; fsblock
approach can avoid vmap completely in many cases; for two reasons.

The fsblock data accessor APIs aren't _that_ bad changes. They change
zero conceptually in the filesystem, are arguably cleaner, and can be
essentially nooped if we wanted to stay with a b_data type approach
(but they give you that flexibility to replace it with any
implementation).
Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change
On Fri, 14 Sep 2007 23:09:24 +1000 Greg Banks [EMAIL PROTECTED] wrote:

> On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
> > On Fri, 14 Sep 2007 20:25:45 +1000 Greg Banks [EMAIL PROTECTED] wrote:
> > > I'm curious about the reasons behind this change. You mention credential issues; how exactly is it that you have the correct creds to perform a WRITE rpc but not a SETATTR rpc?
> >
> > Consider this case. user1 and user2 are both members of group allusers:
> >
> > user1$ echo foo > foo
> > user1$ chgrp allusers foo
> > user1$ chmod 04770 foo
> > user2$ echo bar >> foo
> >
> > On most local filesystems, this would work correctly. The end result would be a file with mode 0770 and the expected contents. On NFS though, the write by user2 fails. When the write is attempted, the kernel tries to squash the setuid bit using the credentials of user2, who's not allowed to change the mode. The write then fails because the setattr fails.
>
> Ok, I ran an experiment and I see this failure mode. So the SETATTR rpc is really a side effect of the client kernel's behaviour and not an operation directly requested by the user process on the client. Is there any reason why that rpc needs to have user2's creds? Why not do the rpc with a fake set of creds with uid and gid set to the uid and gid of the file, in this case user1/allusers? That way the rpc will most likely pass the server's permission check.

That might work in some cases, but there are many where it wouldn't... Suppose user1 here is root and all of the user1 operations are being done on the server. If the server has root squashing enabled, then user2's operation would still fail.

Another problem: Suppose we're using gssapi. There's no guarantee that the client will have the proper credentials to fake up a call as user1 (you might need user1 krb5 tickets, etc).

--
Jeff Layton [EMAIL PROTECTED]
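The mode change that the client tries to push via SETATTR in the scenario above can be sketched in userspace C. This is only a minimal illustration, assuming the rule that an unprivileged write always clears setuid and clears setgid only when the group-execute bit is set; `mode_after_write` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: the mode a file is expected to have after an
 * unprivileged write, under the rule assumed in the text above.
 */
static mode_t mode_after_write(mode_t mode)
{
	/* setuid is always dropped */
	mode &= ~(mode_t)S_ISUID;

	/* setgid is dropped only when group-execute is also set */
	if ((mode & S_ISGID) && (mode & S_IXGRP))
		mode &= ~(mode_t)S_ISGID;

	return mode;
}
```

For the file in the example, `mode_after_write(04770)` gives `0770`, which matches the local-filesystem result described in the message. The open question in the thread is who applies this change on NFS: the client (via SETATTR with user2's credentials) or the server (as a side effect of WRITE).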
Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change
On Fri, Sep 14, 2007 at 09:38:46AM -0400, Jeff Layton wrote:

> On Fri, 14 Sep 2007 23:09:24 +1000 Greg Banks [EMAIL PROTECTED] wrote:
> > On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
> > > On Fri, 14 Sep 2007 20:25:45 +1000 Greg Banks [EMAIL PROTECTED] wrote:
> > > > I'm curious about the reasons behind this change. You mention credential issues; how exactly is it that you have the correct creds to perform a WRITE rpc but not a SETATTR rpc?
> > >
> > > Consider this case. user1 and user2 are both members of group allusers:
> > >
> > > user1$ echo foo > foo
> > > user1$ chgrp allusers foo
> > > user1$ chmod 04770 foo
> > > user2$ echo bar >> foo
> > >
> > > On most local filesystems, this would work correctly. The end result would be a file with mode 0770 and the expected contents. On NFS though, the write by user2 fails. When the write is attempted, the kernel tries to squash the setuid bit using the credentials of user2, who's not allowed to change the mode. The write then fails because the setattr fails.
> >
> > Ok, I ran an experiment and I see this failure mode. So the SETATTR rpc is really a side effect of the client kernel's behaviour and not an operation directly requested by the user process on the client. Is there any reason why that rpc needs to have user2's creds? Why not do the rpc with a fake set of creds with uid and gid set to the uid and gid of the file, in this case user1/allusers? That way the rpc will most likely pass the server's permission check.
>
> That might work in some cases, but there are many where it wouldn't... Suppose user1 here is root and all of the user1 operations are being done on the server. If the server has root squashing enabled, then user2's operation would still fail.

In that case, user1's operations would also fail, which is even more serious a problem. Also arguably you actually *want* writes by a nonroot user to a setuid root executable to fail ;-)

> Another problem: Suppose we're using gssapi. There's no guarantee that the client will have the proper credentials to fake up a call as user1 (you might need user1 krb5 tickets, etc).

Yes, good point. You could use the root creds, except for root squashing. Ok, you convinced me.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI.
Re: [NFS] [PATH 04/19] ext2: new export ops
On Thu, Aug 30, 2007 at 03:16:09PM +0200, Christoph Hellwig wrote:

> +
> +static struct dentry *ext2_fh_to_dentry(struct super_block *sb, struct fid *fid,
> +		int fh_len, int fh_type)
> +{
> +	return generic_fh_to_dentry(sb, fid, fh_len, fh_type, ext2_nfs_get_inode);
> +}
> +
> +static struct dentry *ext2_fh_to_parent(struct super_block *sb, struct fid *fid,
> +		int fh_len, int fh_type)
> +{
> +	return generic_fh_to_parent(sb, fid, fh_len, fh_type, ext2_nfs_get_inode);
>  }

These patches look good, and cleanup in this area is certainly a good thing.

One small comment: the easy filesystems (ext[234], efs, ntfs) might be cleaner if the per-fs get_inode function were a member of export_ops instead of an extra argument to generic_fh_to_dentry(). That way you wouldn't need these two little helper functions in each filesystem, because you could point export_ops.fh_to_dentry directly at generic_fh_to_dentry.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI.
Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change
On Sat, 15 Sep 2007 00:40:33 +1000 Greg Banks [EMAIL PROTECTED] wrote:

> On Fri, Sep 14, 2007 at 09:38:46AM -0400, Jeff Layton wrote:
> > On Fri, 14 Sep 2007 23:09:24 +1000 Greg Banks [EMAIL PROTECTED] wrote:
> > > On Fri, Sep 14, 2007 at 07:02:58AM -0400, Jeff Layton wrote:
> > > > On Fri, 14 Sep 2007 20:25:45 +1000 Greg Banks [EMAIL PROTECTED] wrote:
> > > > > I'm curious about the reasons behind this change. You mention credential issues; how exactly is it that you have the correct creds to perform a WRITE rpc but not a SETATTR rpc?
> > > >
> > > > Consider this case. user1 and user2 are both members of group allusers:
> > > >
> > > > user1$ echo foo > foo
> > > > user1$ chgrp allusers foo
> > > > user1$ chmod 04770 foo
> > > > user2$ echo bar >> foo
> > > >
> > > > On most local filesystems, this would work correctly. The end result would be a file with mode 0770 and the expected contents. On NFS though, the write by user2 fails. When the write is attempted, the kernel tries to squash the setuid bit using the credentials of user2, who's not allowed to change the mode. The write then fails because the setattr fails.
> > >
> > > Ok, I ran an experiment and I see this failure mode. So the SETATTR rpc is really a side effect of the client kernel's behaviour and not an operation directly requested by the user process on the client. Is there any reason why that rpc needs to have user2's creds? Why not do the rpc with a fake set of creds with uid and gid set to the uid and gid of the file, in this case user1/allusers? That way the rpc will most likely pass the server's permission check.
> >
> > That might work in some cases, but there are many where it wouldn't... Suppose user1 here is root and all of the user1 operations are being done on the server. If the server has root squashing enabled, then user2's operation would still fail.
>
> In that case, user1's operations would also fail, which is even more serious a problem. Also arguably you actually *want* writes by a nonroot user to a setuid root executable to fail ;-)

Well, user1's operations would fail if done from the client, which is why I mentioned that they would have to be done on the server. The second point is a good one, but POSIX says that it should be allowed if the permissions allow for it. The whole situation is somewhat contrived anyway, I can't think of a place where this is something you'd really want to do, but I think we need to try to follow the spec as best as possible...

> > Another problem: Suppose we're using gssapi. There's no guarantee that the client will have the proper credentials to fake up a call as user1 (you might need user1 krb5 tickets, etc).
>
> Yes, good point. You could use the root creds, except for root squashing. Ok, you convinced me.

Right. When I was first looking at this, I considered some similar approaches, but hit roadblocks with all of them. The only real option seems to be to leave this to the server, but that does assume that the server handles this properly. Servers that don't are broken, IMO. If Irix isn't clearing these bits on a write then it might be good to see if they can fix that...

--
Jeff Layton [EMAIL PROTECTED]
Re: [NFS] [PATH 04/19] ext2: new export ops
On Sat, Sep 15, 2007 at 12:58:03AM +1000, Greg Banks wrote:

> These patches look good, and cleanup in this area is certainly a good thing.
>
> One small comment: the easy filesystems (ext[234], efs, ntfs) might be cleaner if the per-fs get_inode function were a member of export_ops instead of an extra argument to generic_fh_to_dentry(). That way you wouldn't need these two little helper functions in each filesystem, because you could point export_ops.fh_to_dentry directly at generic_fh_to_dentry.

I was pondering this, but in the end I prefer the cleaner layering of the callback version. This also mirrors what we do in other areas (e.g. address_space operations).
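The two layouts being weighed in this exchange can be contrasted in a small userspace sketch. Everything below is a mock: the `mock_*` and `toy_*` names, the reduced argument lists and the single-inode "filesystem" are assumptions made for illustration and do not reproduce the real export_operations interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock stand-ins for kernel types; all names here are hypothetical. */
struct mock_inode { uint64_t ino; uint32_t gen; };
struct mock_fid   { uint32_t ino; uint32_t gen; };

typedef struct mock_inode *(*get_inode_fn)(uint64_t ino, uint32_t gen);

/* Variant A (callback version): the per-fs lookup is an extra argument. */
static struct mock_inode *generic_decode_cb(const struct mock_fid *fid,
					    get_inode_fn get_inode)
{
	return get_inode(fid->ino, fid->gen);
}

/* Variant B (ops-member version): the lookup lives in the ops table. */
struct mock_export_ops {
	struct mock_inode *(*fh_to_inode)(const struct mock_export_ops *ops,
					  const struct mock_fid *fid);
	get_inode_fn get_inode;
};

static struct mock_inode *generic_decode_ops(const struct mock_export_ops *ops,
					     const struct mock_fid *fid)
{
	return ops->get_inode(fid->ino, fid->gen);
}

/* A toy filesystem that knows exactly one inode. */
static struct mock_inode toy_root = { 2, 1 };

static struct mock_inode *toy_get_inode(uint64_t ino, uint32_t gen)
{
	if (ino == toy_root.ino && gen == toy_root.gen)
		return &toy_root;
	return NULL;	/* unknown inode or stale generation */
}

/* Variant A needs this small per-filesystem wrapper... */
static struct mock_inode *toy_fh_to_inode(const struct mock_fid *fid)
{
	return generic_decode_cb(fid, toy_get_inode);
}

/* ...while variant B points the ops table straight at the generic helper. */
static const struct mock_export_ops toy_ops = {
	.fh_to_inode = generic_decode_ops,
	.get_inode   = toy_get_inode,
};
```

Both variants resolve the same handles. The trade-off is one wrapper function per filesystem (variant A) against a generic helper that has to reach back into the ops table (variant B), which is the layering concern raised in the reply.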
Re: [PATCH 15/19] gfs2: new export ops
Hi,

On Fri, 2007-09-14 at 13:49 +0200, [EMAIL PROTECTED] wrote:

> plain text document attachment (gfs2-implement-fh_to_dentry)
> Convert gfs2 to the new ops. Uses a similar structure to the generic helpers, but gfs2 has its own file handle formats.
>
> Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

This looks good from a GFS2 point of view:

Acked-by: Steven Whitehouse [EMAIL PROTECTED]
Acked-by: Wendy Cheng [EMAIL PROTECTED]

Steve.

> Index: linux-2.6/fs/gfs2/ops_export.c
> ===
> --- linux-2.6.orig/fs/gfs2/ops_export.c 2007-07-19 15:56:46.0 +0200
> +++ linux-2.6/fs/gfs2/ops_export.c 2007-07-20 19:58:06.0 +0200
> @@ -31,40 +31,6 @@
>  #define GFS2_LARGE_FH_SIZE 8
>  #define GFS2_OLD_FH_SIZE 10
>
> -static struct dentry *gfs2_decode_fh(struct super_block *sb,
> -				     __u32 *p,
> -				     int fh_len,
> -				     int fh_type,
> -				     int (*acceptable)(void *context,
> -						       struct dentry *dentry),
> -				     void *context)
> -{
> -	__be32 *fh = (__force __be32 *)p;
> -	struct gfs2_inum_host inum, parent;
> -
> -	memset(&parent, 0, sizeof(struct gfs2_inum));
> -
> -	switch (fh_len) {
> -	case GFS2_LARGE_FH_SIZE:
> -	case GFS2_OLD_FH_SIZE:
> -		parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
> -		parent.no_formal_ino |= be32_to_cpu(fh[5]);
> -		parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
> -		parent.no_addr |= be32_to_cpu(fh[7]);
> -	case GFS2_SMALL_FH_SIZE:
> -		inum.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
> -		inum.no_formal_ino |= be32_to_cpu(fh[1]);
> -		inum.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
> -		inum.no_addr |= be32_to_cpu(fh[3]);
> -		break;
> -	default:
> -		return NULL;
> -	}
> -
> -	return gfs2_export_ops.find_exported_dentry(sb, &inum, &parent,
> -						    acceptable, context);
> -}
> -
>  static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
>  			  int connectable)
>  {
> @@ -189,10 +155,10 @@ static struct dentry *gfs2_get_parent(st
>  	return dentry;
>  }
>
> -static struct dentry *gfs2_get_dentry(struct super_block *sb, void *inum_obj)
> +static struct dentry *gfs2_get_dentry(struct super_block *sb,
> +				      struct gfs2_inum_host *inum)
>  {
>  	struct gfs2_sbd *sdp = sb->s_fs_info;
> -	struct gfs2_inum_host *inum = inum_obj;
>  	struct gfs2_holder i_gh, ri_gh, rgd_gh;
>  	struct gfs2_rgrpd *rgd;
>  	struct inode *inode;
> @@ -289,11 +255,50 @@ fail:
>  	return ERR_PTR(error);
>  }
>
> +static struct dentry *gfs2_fh_to_dentry(struct super_block *sb, struct fid *fid,
> +		int fh_len, int fh_type)
> +{
> +	struct gfs2_inum_host this;
> +	__be32 *fh = (__force __be32 *)fid->raw;
> +
> +	switch (fh_type) {
> +	case GFS2_SMALL_FH_SIZE:
> +	case GFS2_LARGE_FH_SIZE:
> +	case GFS2_OLD_FH_SIZE:
> +		this.no_formal_ino = ((u64)be32_to_cpu(fh[0])) << 32;
> +		this.no_formal_ino |= be32_to_cpu(fh[1]);
> +		this.no_addr = ((u64)be32_to_cpu(fh[2])) << 32;
> +		this.no_addr |= be32_to_cpu(fh[3]);
> +		return gfs2_get_dentry(sb, &this);
> +	default:
> +		return NULL;
> +	}
> +}
> +
> +static struct dentry *gfs2_fh_to_parent(struct super_block *sb, struct fid *fid,
> +		int fh_len, int fh_type)
> +{
> +	struct gfs2_inum_host parent;
> +	__be32 *fh = (__force __be32 *)fid->raw;
> +
> +	switch (fh_type) {
> +	case GFS2_LARGE_FH_SIZE:
> +	case GFS2_OLD_FH_SIZE:
> +		parent.no_formal_ino = ((u64)be32_to_cpu(fh[4])) << 32;
> +		parent.no_formal_ino |= be32_to_cpu(fh[5]);
> +		parent.no_addr = ((u64)be32_to_cpu(fh[6])) << 32;
> +		parent.no_addr |= be32_to_cpu(fh[7]);
> +		return gfs2_get_dentry(sb, &parent);
> +	default:
> +		return NULL;
> +	}
> +}
> +
>  struct export_operations gfs2_export_ops = {
> -	.decode_fh = gfs2_decode_fh,
>  	.encode_fh = gfs2_encode_fh,
> +	.fh_to_dentry = gfs2_fh_to_dentry,
> +	.fh_to_parent = gfs2_fh_to_parent,
>  	.get_name = gfs2_get_name,
>  	.get_parent = gfs2_get_parent,
> -	.get_dentry = gfs2_get_dentry,
>  };
>
> --
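The handle decoding in the patch joins pairs of big-endian 32-bit words into 64-bit values. A minimal userspace sketch of that step follows, assuming `ntohl()` as a stand-in for the kernel's `be32_to_cpu()`; `struct inum`, `be32_pair_to_u64` and `decode_inum` are hypothetical names chosen for the example, not GFS2 code.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>	/* ntohl()/htonl() stand in for be32_to_cpu()/cpu_to_be32() */

/* Hypothetical userspace mirror of the two fields the patch fills in. */
struct inum {
	uint64_t no_formal_ino;
	uint64_t no_addr;
};

/* Join two consecutive big-endian 32-bit words: high word first. */
static uint64_t be32_pair_to_u64(const uint32_t *fh)
{
	return ((uint64_t)ntohl(fh[0]) << 32) | ntohl(fh[1]);
}

/*
 * Decode one inode number from a handle of `words` 32-bit words.
 * Returns 0 on success, -1 if the handle is too short.
 */
static int decode_inum(const uint32_t *fh, int words, struct inum *out)
{
	if (words < 4)
		return -1;
	out->no_formal_ino = be32_pair_to_u64(fh);
	out->no_addr = be32_pair_to_u64(fh + 2);
	return 0;
}
```

In this layout the parent's inode number would be decoded the same way starting at `fh + 4`, which is why only the larger handle sizes can supply a parent.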
Re: [NFS] [PATH 18/19] exportfs: update documentation
On Thu, Aug 30, 2007 at 03:17:24PM +0200, Christoph Hellwig wrote:

> + encode_fh (optinonal)
                   ^
> +Takes a dentry and creates a filehandle fragment which can later be used
> +to find/create a dentry for the same object. The default implementation
> +creates a filehandle fragment that encodes a 32bit inode and generation
> +number for the inode encoded, and if nessecary the same information for
                                        ^
> +the parent.
> +
> + fh_to_dentry (mandatory)
> +Given a filehandle fragment, this should find the implied object and
> +create a dentry for it (possibly with d_alloc_anon).
> +
> + fh_to_parent (optional but strongly recommended)
> +Given a filehandle fragment, this should find the parent of the
> +implied object and create a dentry for it (possibly with d_alloc_anon).
> +May simplify fail if the filehandle fragment is too small.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI.
Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change
On Fri, Sep 14, 2007 at 10:58:38AM -0400, Jeff Layton wrote:

> On Sat, 15 Sep 2007 00:40:33 +1000 Greg Banks [EMAIL PROTECTED] wrote:
> > Ok, you convinced me.
>
> Right. When I was first looking at this, I considered some similar approaches, but hit roadblocks with all of them. The only real option seems to be to leave this to the server, but that does assume that the server handles this properly. Servers that don't are broken, IMO.

According to what spec? A quick trip around the machine room shows that neither Solaris 10 nor Darwin 7.9.0 clobber setuid on write either.

> If Irix isn't clearing these bits on a write then it might be good to see if they can fix that...

I think first you'd have to mount a serious argument that it's broken, more serious than "it works differently from Linux".

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI.
Re: [NFS] [PATH 10/19] xfs: new export ops
On Sat, Sep 15, 2007 at 01:22:16AM +1000, Greg Banks wrote:

> Not really a comment on your patches, but I got the original logic wrong here. The VFS_32BITINODES flag only affects newly allocated inodes and is no guarantee that any particular inode is < 2^32-1. It's possible for an unlucky user to perform a sequence of mounts and IO which results in large inode numbers despite the presence of that flag; we recently saw this happen by accident on a customer site. So the right thing to do is probably to check the inode number against (u32)~0. Unfortunately, given the current encoding scheme, you have to check both the inode and the parent inode, which complicates the logic.

I'll see if we can do anything later on. But for now I'll leave it as-is because this file will be merge hell anyway when both vfs removal and exporting changes hit the tree..
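The check suggested in the quoted text, comparing the actual inode numbers against `(u32)~0` instead of trusting a mount flag, could look roughly like this in userspace C. It is a sketch only: `fits_32bit_handle` and its `has_parent` argument are hypothetical and do not correspond to a function in the XFS code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if a file handle can use the 32-bit
 * encoding, 0 if the 64-bit encoding is needed. The parent inode number
 * only matters when the handle also encodes the parent.
 */
static int fits_32bit_handle(uint64_t ino, uint64_t parent_ino, int has_parent)
{
	if (ino > (uint32_t)~0)
		return 0;
	if (has_parent && parent_ino > (uint32_t)~0)
		return 0;
	return 1;
}
```

The second test is the complication mentioned in the message: a small inode under a parent with a large inode number still forces the wider format once the parent is encoded.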
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Hi,

Nick Piggin [EMAIL PROTECTED] writes:

> In my attack, I cause the kernel to allocate lots of unmovable allocations and deplete movable groups. I theoretically then only need to keep a small number (1/2^N) of these allocations around in order to DoS a page allocation of order N.

I'm assuming that when an unmovable allocation hijacks a movable group any further unmovable alloc will evict movable objects out of that group before hijacking another one, right?

> And it doesn't even have to be a DoS. The natural fragmentation that occurs today in a kernel today has the possibility to slowly push out the movable groups and give you the same situation.

How would you cause that? Say you do want to purposefully place one unmovable 4k page into every 64k compound page. So you allocate 4K. First 64k page locked. But now, to get 4K into the second 64K page you have to first use up all the rest of the first 64k page. Meaning one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then will a new 64k chunk be broken and become locked.

So to get the last 64k chunk used all previous 32k chunks need to be blocked and you need to allocate 32k (or less if more is blocked). For all previous 32k chunks to be blocked every second 16k needs to be blocked. To block the last of those 16k chunks all previous 8k chunks need to be blocked and you need to allocate 8k. For all previous 8k chunks to be blocked every second 4k page needs to be used. To alloc the last of those 4k pages all previous 4k pages need to be used.

So to construct a situation where no contiguous 64k chunk is free you have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or thereabouts) of memory first. Only then could you free memory again while still keeping every 64k page blocked. Does that occur naturally given enough ram to start with?
To see how bad fragmentation could be I wrote a little program to simulate allocations with the following simplified algorithm:

Memory management:
- Free pages are kept in buckets, one per order, and sorted by address.
- alloc() takes the front page (smallest address) out of the bucket of the right order or recursively splits the next higher bucket.
- free() recursively tries to merge a page with its neighbour and puts the result back into the proper bucket (sorted by address).

Allocation and lifetime:
- Every tick a new page is allocated with random order.
- The order is a triangle distribution with max at 0 (throw 2 dice, add the eyes, subtract 7, abs() the number).
- The page is scheduled to be freed after X ticks, where X is nearly a Gaussian curve centered at 0 and maximum at total num pages * 1.5. (What I actually do is throw 8 dice and sum them up and shift the result.)

Display:
I start with a white window. Every page allocation draws a black box from the address of the page and as wide as the page is big (-1 pixel to give a separation to the next page). Every page free draws a yellow box in place of the black one. Yellow to show where a page was in use at one point while white means the page was never used.

As the time ticks the memory fills up. Quickly at first and then comes to a stop around 80% filled. And then something interesting happens. The yellow regions (previously used but now free) start drifting up. Small pages tend to end up in the lower addresses and big pages at the higher addresses. The memory defragments itself to some degree.
http://mrvn.homeip.net/fragment/

Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k, 295296 16k, 176647 32k and 59064 64k allocations you get this:
http://mrvn.homeip.net/fragment/256mb.png

Simulating 1GB ram and after 5881185 ticks and 2116671 4k, 1645957 8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
http://mrvn.homeip.net/fragment/1gb.png

Regards,
Goswin
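The memory-management part of the simulation described above (lowest address first, split on alloc, merge on free) can be sketched in a few lines of C. This is not the program behind the linked images; the array-based free map, the 64-page memory size and the function names are assumptions made to keep the example short.

```c
#include <assert.h>
#include <string.h>

#define MAX_ORDER 4	/* largest block: 2^4 pages, i.e. 64K with 4K pages */
#define NPAGES    64	/* simulated memory size in 4K pages (assumption) */

/* free_order[i] is the order of the free block starting at page i, or -1. */
static signed char free_order[NPAGES];

static void buddy_init(void)
{
	memset(free_order, -1, sizeof(free_order));
	for (int i = 0; i < NPAGES; i += 1 << MAX_ORDER)
		free_order[i] = MAX_ORDER;
}

/* Return the first page of a block of the given order, or -1 on failure. */
static int buddy_alloc(int order)
{
	for (int o = order; o <= MAX_ORDER; o++) {
		for (int i = 0; i < NPAGES; i += 1 << o) {
			if (free_order[i] != o)
				continue;
			/* Lowest-address block of the smallest usable order:
			 * split it down, leaving the upper halves free. */
			free_order[i] = -1;
			while (o > order) {
				o--;
				free_order[i + (1 << o)] = o;
			}
			return i;
		}
	}
	return -1;
}

/* Free a block and merge it with its buddy for as long as that is free. */
static void buddy_free(int page, int order)
{
	while (order < MAX_ORDER) {
		int buddy = page ^ (1 << order);

		if (free_order[buddy] != order)
			break;
		free_order[buddy] = -1;
		if (buddy < page)
			page = buddy;
		order++;
	}
	free_order[page] = order;
}
```

With this model the counting argument from the message can be replayed directly: allocate all 64 pages one at a time, free everything except one page per 64K block, and an order-4 allocation fails although 60 of the 64 pages are free.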
Re: [NFS] [PATCH 2/7] NFS: if ATTR_KILL_S*ID bits are set, then skip mode change
On Sat, 15 Sep 2007 01:43:45 +1000 Greg Banks [EMAIL PROTECTED] wrote:

> On Fri, Sep 14, 2007 at 10:58:38AM -0400, Jeff Layton wrote:
> > On Sat, 15 Sep 2007 00:40:33 +1000 Greg Banks [EMAIL PROTECTED] wrote:
> > > Ok, you convinced me.
> >
> > Right. When I was first looking at this, I considered some similar approaches, but hit roadblocks with all of them. The only real option seems to be to leave this to the server, but that does assume that the server handles this properly. Servers that don't are broken, IMO.
>
> According to what spec? A quick trip around the machine room shows that neither Solaris 10 nor Darwin 7.9.0 clobber setuid on write either.

Hmm, last time I checked Solaris, I thought it did, but that was Solaris 11. I'll plan to fire up my solaris qemu image and test it again...

> > If Irix isn't clearing these bits on a write then it might be good to see if they can fix that...
>
> I think first you'd have to mount a serious argument that it's broken, more serious than "it works differently from Linux".

Good point. POSIX is frustratingly ambiguous on this:

"Upon successful completion, where nbyte is greater than 0, write() shall mark for update the st_ctime and st_mtime fields of the file, and if the file is a regular file, the S_ISUID and S_ISGID bits of the file mode may be cleared."

...the "may" in that last sentence makes it optional, I suppose. Even if it weren't then I guess there's also an argument that a write that comes in via a nfs server may not be subject to the same semantics as the write() syscall.

In any case, "broken" is probably too strong a term :-)

--
Jeff Layton [EMAIL PROTECTED]
[PATCH] exportfs: fix doc types
And here's a patch to fix the typos Greg found:

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/Documentation/filesystems/Exporting
===
--- linux-2.6.orig/Documentation/filesystems/Exporting 2007-09-14 17:59:35.0 +0200
+++ linux-2.6/Documentation/filesystems/Exporting 2007-09-14 18:01:41.0 +0200
@@ -104,7 +104,7 @@ are exportable by setting the s_export_o
 super_block. This field must point to a struct export_operations
 struct which has the following members:

- encode_fh (optinonal)
+ encode_fh (optional)
 Takes a dentry and creates a filehandle fragment which can later be used
 to find/create a dentry for the same object. The default implementation
 creates a filehandle fragment that encodes a 32bit inode and generation
@@ -118,7 +118,7 @@ struct which has the following members:
 fh_to_parent (optional but strongly recommended)
 Given a filehandle fragment, this should find the parent of the
 implied object and create a dentry for it (possibly with d_alloc_anon).
-May simplify fail if the filehandle fragment is too small.
+May fail if the filehandle fragment is too small.

 get_parent (optional but strongly recommended)
 When given a dentry for a directory, this should return a dentry for
[RFC][PATCH] 9p: add readahead support for loose mode
This patch adds readpages support in support of readahead when using loose cache mode. It substantially increases performance for certain workloads.

Signed-off-by: Eric Van Hensbergen [EMAIL PROTECTED]
---
 fs/9p/v9fs.c | 2 +-
 fs/9p/vfs_addr.c | 98 ++
 include/net/9p/client.h | 3 +-
 net/9p/client.c | 82 +--
 4 files changed, 143 insertions(+), 42 deletions(-)

diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c
index 89ee0ba..ca97404 100644
--- a/fs/9p/v9fs.c
+++ b/fs/9p/v9fs.c
@@ -131,7 +131,7 @@ static void v9fs_parse_options(struct v9fs_session_info *v9ses)
 	char *s, *e;
 	/* setup defaults */
-	v9ses->maxdata = 8192;
+	v9ses->maxdata = (64*1024);
 	v9ses->afid = ~0;
 	v9ses->debug = 0;
 	v9ses->cache = 0;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 6248f0e..86c6e0d 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -31,8 +31,11 @@
 #include <linux/string.h>
 #include <linux/inet.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/idr.h>
 #include <linux/sched.h>
+#include <linux/uio.h>
+#include <linux/task_io_accounting_ops.h>
 #include <net/9p/9p.h>
 #include <net/9p/client.h>
@@ -50,31 +53,108 @@
 static int v9fs_vfs_readpage(struct file *filp, struct page *page)
 {
-	int retval;
 	loff_t offset;
 	char *buffer;
 	struct p9_fid *fid;
+	int retval = 0;
+	int total = 0;
+	int count = PAGE_SIZE;
 	P9_DPRINTK(P9_DEBUG_VFS, "\n");
 	fid = filp->private_data;
 	buffer = kmap(page);
 	offset = page_offset(page);
-	retval = p9_client_readn(fid, buffer, offset, PAGE_CACHE_SIZE);
-	if (retval < 0)
-		goto done;
+	while (count) {
+		struct kvec kv = {buffer+offset, PAGE_SIZE-count};
+		retval = p9_client_readv(fid, &kv, offset, 1);
+		if (retval <= 0)
+			break;
-	memset(buffer + retval, 0, PAGE_CACHE_SIZE - retval);
-	flush_dcache_page(page);
-	SetPageUptodate(page);
-	retval = 0;
+		buffer += retval;
+		offset += retval;
+		count -= retval;
+		total += retval;
+	}
+
+	if (retval >= 0) {
+		flush_dcache_page(page);
+		SetPageUptodate(page);
+		retval = 0;
+	}
-done:
 	kunmap(page);
 	unlock_page(page);
 	return retval;
 }
+/* large chunks copied and adapted from fs/cifs/file.c */
+static int v9fs_vfs_readpages(struct file *file, struct address_space *mapping,
+		struct list_head *page_list, unsigned num_pages)
+{
+	struct page *tmp_page;
+	loff_t offset;
+	struct pagevec lru_pvec;
+	struct p9_fid *fid;
+	u32 read_size;
+	int retval = 0;
+	unsigned int count = 0;
+	struct list_head *p, *n;
+
+	struct kvec *kv = kmalloc(sizeof(struct kvec)*num_pages, GFP_KERNEL);
+
+	P9_DPRINTK(P9_DEBUG_VFS, "%d pages\n", num_pages);
+
+	if (!kv)
+		return -ENOMEM;
+
+	if (list_empty(page_list))
+		goto free_vec;
+
+	pagevec_init(&lru_pvec, 0);
+
+	fid = file->private_data;
+	tmp_page = list_entry(page_list->prev, struct page, lru);
+	offset = (loff_t)tmp_page->index << PAGE_CACHE_SHIFT;
+
+	list_for_each_entry_reverse(tmp_page, page_list, lru) {
+		BUG_ON(count > num_pages);
+		if (add_to_page_cache(tmp_page, mapping,
+				tmp_page->index, GFP_KERNEL)) {
+			page_cache_release(tmp_page);
+			continue;
+		}
+
+		kv[count].iov_base = kmap(tmp_page);
+		kv[count].iov_len = PAGE_CACHE_SIZE;
+		count++;
+	}
+
+	read_size = count * PAGE_CACHE_SIZE;
+	if (!read_size)
+		goto cleanup;
+
+	retval = p9_client_readv(fid, kv, offset, count);
+
+cleanup:
+	list_for_each_safe(p, n, page_list) {
+		tmp_page = list_entry(p, struct page, lru);
+		list_del(&tmp_page->lru);
+		if (!pagevec_add(&lru_pvec, tmp_page))
+			__pagevec_lru_add(&lru_pvec);
+		kunmap(tmp_page);
+		flush_dcache_page(tmp_page);
+		SetPageUptodate(tmp_page);
+		unlock_page(tmp_page);
+	}
+	pagevec_lru_add(&lru_pvec);
+
+free_vec:
+	kfree(kv);
+	return retval;
+}
+
 const struct address_space_operations v9fs_addr_operations = {
 	.readpage = v9fs_vfs_readpage,
+	.readpages = v9fs_vfs_readpages,
 };
diff --git a/include/net/9p/client.h b/include/net/9p/client.h
index 9b9221a..6f17d0a 100644
--- a/include/net/9p/client.h
+++ b/include/net/9p/client.h
@@ -67,8 +67,7 @@ int p9_client_fcreate(struct p9_fid *fid, char *name, u32 perm, int mode,
Re: [NFS] [PATCH] exportfs: fix doc types
On Fri, Sep 14, 2007 at 06:03:01PM +0200, Christoph Hellwig wrote:

> And here's a patch to fix the typos Greg found:

Thanks. I already had a couple of those in a separate patch, so I've folded this in.

--b.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote: Nick Piggin [EMAIL PROTECTED] writes: In my attack, I cause the kernel to allocate lots of unmovable allocations and deplete movable groups. I theoretically then only need to keep a small number (1/2^N) of these allocations around in order to DoS a page allocation of order N. I'm assuming that when an unmovable allocation hijacks a movable group any further unmovable alloc will evict movable objects out of that group before hijacking another one. right? No eviction takes place. If an unmovable allocation gets placed in a movable group, then steps are taken to ensure that future unmovable allocations will take place in the same range (these decisions take place in __rmqueue_fallback()). When choosing a movable block to pollute, it will also choose the lowest possible block in PFN terms to steal so that fragmentation pollution will be as confined as possible. Evicting the unmovable pages would be one of those expensive steps that have been avoided to date. And it doesn't even have to be a DoS. The natural fragmentation that occurs today in a kernel today has the possibility to slowly push out the movable groups and give you the same situation. How would you cause that? Say you do want to purposefully place one unmovable 4k page into every 64k compund page. So you allocate 4K. First 64k page locked. But now, to get 4K into the second 64K page you have to first use up all the rest of the first 64k page. Meaning one 4k chunk, one 8k chunk, one 16k cunk, one 32k chunk. Only then will a new 64k chunk be broken and become locked. It would be easier early in the boot to mmap a large area and fault it in in virtual address order then mlock every a page every 64K. Early in the systems lifetime, there will be a rough correlation between physical and virtual memory. 
Without mlock(), the most successful attack will like mmap() a 60K region and fault it in as an attempt to get pagetable pages placed in every 64K region. This strategy would not work with grouping pages by mobility though as it would group the pagetable pages together. Targetted attacks on grouping pages by mobility are not very easy and not that interesting either. As Nick suggests, the natural fragmentation over long periods of time is what is interesting. So to get the last 64k chunk used all previous 32k chunks need to be blocked and you need to allocate 32k (or less if more is blocked). For all previous 32k chunks to be blocked every second 16k needs to be blocked. To block the last of those 16k chunks all previous 8k chunks need to be blocked and you need to allocate 8k. For all previous 8k chunks to be blocked every second 4k page needs to be used. To alloc the last of those 4k pages all previous 4k pages need to be used. So to construct a situation where no continious 64k chunk is free you have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or there about) of memory first. Only then could you free memory again while still keeping every 64k page blocked. Does that occur naturally given enough ram to start with? I believe it's very difficult to craft an attack that will work in a short period of time. An attack that worked on 2.6.22 as well may have no success on 2.6.23-rc4-mm1 for example as grouping pages by mobility does it make it exceedingly hard to craft an attack unless the attacker can mlock large amounts of memory. Too see how bad fragmentation could be I wrote a little progamm to simulate allocations with the following simplified alogrithm: Memory management: - Free pages are kept in buckets, one per order, and sorted by address. - alloc() the front page (smallest address) out of the bucket of the right order or recursively splits the next higher bucket. 
- free() recursively tries to merge a page with its neighbour and puts the result back into the proper bucket (sorted by address). Allocation and lifetime: - Every tick a new page is allocated with random order. This step in itself is not representative of what happens in the kernel. The vast vast majority of allocations are order-0. It's a fun analysis but I'm not sure we can draw any conclusions from it. Statistical analyses of the buddy algorithm have implied that it doesn't suffer that badly from external fragmentation but we know in practice that things are different. A model is hard because minimally the lifetime of pages varies widely. - The order is a triangle distribution with max at 0 (throw 2 dice, add the eyes, subtract 7, abs() the number). - The page is scheduled to be freed after X ticks. Where X is nearly a Gaussian curve centered at 0 and maximum at total num pages * 1.5. (What I actually do is throw 8 dice and sum them up and shift the result.) I doubt this is how the kernel behaves either. Display: I start with a white window. Every page allocation draws a black box from the address of the page and as wide as
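The simplified memory-management rules quoted above (one address-sorted bucket per order, split on alloc, merge with the neighbour on free) can be sketched roughly as follows. This is not Goswin's program, which is not shown in the thread; the class name `BuddyAllocator`, the method names, and the helper `random_order` are all made up for illustration, and addresses are counted in base pages.

```python
import bisect
import random


class BuddyAllocator:
    """Toy buddy allocator (hypothetical names): one address-sorted free list per order."""

    def __init__(self, max_order, num_blocks=1):
        self.max_order = max_order
        # free[k] holds the start address (in base pages) of each free order-k block.
        self.free = [[] for _ in range(max_order + 1)]
        self.free[max_order] = [i << max_order for i in range(num_blocks)]

    def alloc(self, order):
        """Take the lowest-address block of this order, splitting a larger one if needed."""
        if order > self.max_order:
            return None
        if self.free[order]:
            return self.free[order].pop(0)
        addr = self.alloc(order + 1)
        if addr is None:
            return None
        # Keep the lower half, return the upper half (the buddy) to this bucket.
        bisect.insort(self.free[order], addr + (1 << order))
        return addr

    def release(self, addr, order):
        """Free a block and keep merging while its buddy is free as well."""
        while order < self.max_order:
            buddy = addr ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            addr = min(addr, buddy)
            order += 1
        bisect.insort(self.free[order], addr)


def random_order(rng):
    """Two dice, add, subtract 7, abs(): an order in 0..5, skewed towards small orders."""
    return abs(rng.randint(1, 6) + rng.randint(1, 6) - 7)
```

Allocating a single order-0 page from a fresh 16-page block leaves one free block in each of the orders 0 to 3, which is the splitting cascade described in the quoted text; freeing the pages again merges everything back into one order-4 block.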
Re: [NFS] [PATCH 00/19] export operations rewrite
On Fri, Sep 14, 2007 at 01:47:46PM +0200, [EMAIL PROTECTED] wrote: This patchset is a medium scale rewrite of the export operations interface. The goal is to make the interface less complex, and easier to understand from the filesystem side, as well as preparing generic support for exporting of 64bit inode numbers. This touches all nfs exporting filesystems, and I've done testing on all of the filesystems I have here locally (xfs, ext2, ext3, reiserfs, jfs). Compared to the last version I've fixed the white space issues that checkpatch.pl complained about. OK, thanks again. Everything I have now is in git://linux-nfs.org/~bfields/linux.git for-mm I'm hoping Neil can take a quick look as well (and make a response to the comment on patch #1 along the way). Note that this patch series is against mainline. There will be some xfs changes landing in -mm soon that revamp lots of the code touched here. They should hopefully include the first patch in the series so it can be simply dropped, but the xfs conversion will need some smaller updates. I will send this update as soon as the xfs tree updates get pulled into -mm. OK. Let me know if you need me to do anything when that happens. --b.
Re: [NFS] [PATH 10/19] xfs: new export ops
On Fri, Sep 14, 2007 at 06:03:49PM +0200, Christoph Hellwig wrote: On Sat, Sep 15, 2007 at 01:22:16AM +1000, Greg Banks wrote: Not really a comment on your patches, but I got the original logic wrong here. The VFS_32BITINODES flag only affects newly allocated inodes and is no guarantee that any particular inode is no larger than 2^32-1. It's possible for an unlucky user to perform a sequence of mounts and IO which results in large inode numbers despite the presence of that flag; we recently saw this happen by accident on a customer site. So the right thing to do is probably to check the inode number against (u32)~0. Unfortunately, given the current encoding scheme, you have to check both the inode and the parent inode, which complicates the logic. I'll see if we can do anything later on. But for now I'll leave it as-is because this file will be merge hell anyway when both vfs removal and exporting changes hit the tree. Fair 'nuff. Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI.
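The check Greg describes, comparing the inode number (and, where the handle also encodes it, the parent inode number) against (u32)~0, might look like the sketch below. This is not the XFS code; `fits_32bit_handle` and `U32_MAX` are hypothetical names chosen for the example.

```python
U32_MAX = 0xFFFFFFFF  # the value of (u32)~0 in C


def fits_32bit_handle(ino, parent_ino=None):
    """True if every inode number that would go into the handle fits in 32 bits.

    parent_ino is None when the file handle does not encode the parent.
    """
    if ino > U32_MAX:
        return False
    if parent_ino is not None and parent_ino > U32_MAX:
        return False
    return True
```

With a check like this, a mount flag alone is never trusted: a handle gets the short encoding only when the numbers actually in hand are small enough.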
Re: [2/4] 2.6.23-rc6: known regressions
Michal Piotrowski wrote:

Hi all,

Here is a list of some known regressions in 2.6.23-rc6. Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name                    Regressions fixed since 21-Jun-2007
Adrian Bunk             10
Linus Torvalds          6
Alan Stern              5
Andi Kleen              5
Hugh Dickins            5
Trond Myklebust         5
Andrew Morton           4
Al Viro                 3
Alexey Starikovskiy     3
Cornelia Huck           3
David S. Miller         3
Jens Axboe              3
Stephen Hemminger       3
Tejun Heo               3

ACPI

Subject         : 2.6.23-rc5 hangs on boot, apparently when initializing the EC
References      : http://lkml.org/lkml/2007/9/11/369
Last known good : ?
Submitter       : Chuck Ebbert [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : unknown

CPUFREQ

Subject         : ide problems: 2.6.22-git17 working, 2.6.23-rc1* is not
References      : http://lkml.org/lkml/2007/7/27/298
                  http://lkml.org/lkml/2007/7/29/371
Last known good : ?
Submitter       : dth [EMAIL PROTECTED]
Caused-By       : Len Brown [EMAIL PROTECTED]
                  commit f79e3185dd0f8650022518d7624c876d8929061b
Handled-By      : Len Brown [EMAIL PROTECTED]
Status          : problem is being debugged

FS

Subject         : hanging ext3 dbench tests
References      : http://lkml.org/lkml/2007/9/11/176
Last known good : ?
Submitter       : Andy Whitcroft [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : unknown

Subject         : umount triggers a warning in jfs and takes almost a minute
References      : http://lkml.org/lkml/2007/9/4/73
Last known good : ?
Submitter       : Oliver Neukum [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : unknown

Subject         : [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall
References      : http://lkml.org/lkml/2007/9/4/53
                  http://bugzilla.kernel.org/show_bug.cgi?id=9003
Last known good : ?
Submitter       : Daniel J Blueman [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Workaround      : http://bugzilla.kernel.org/attachment.cgi?id=12797
Status          : unknown

Subject         : [NFSD OOPS] 2.6.23-rc1-git10
References      : http://lkml.org/lkml/2007/8/2/462
Last known good : ?
Submitter       : Andrew Clayton [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : unknown

Regards,
Michal
--
LOG http://www.stardust.webpages.pl/log/

Hi Michal,

The NFSV4 [BUG] 2.6.23-rc5 kernel BUG at fs/nfs/nfs4xdr.c:945, is again seen in 2.6.23-rc6. Can this bug be added as known regression for 2.6.23-rc6. References http://lkml.org/lkml/2007/9/7/27.

--
Thanks & Regards,
Kamalesh Babulal, Linux Technology Center, IBM, ISTL.
Re: [NFS] [PATCH 00/19] export operations rewrite
On Fri, Sep 14, 2007 at 12:43:55PM -0400, J. Bruce Fields wrote: I'm hoping Neil can take a quick look as well (and make a response to the comment on patch #1 along the way). The exportfs_d_alloc one? I have a patch for that done, I just need to run it through my testing setup before sending it out.
Re: [NFS] [PATCH 00/19] export operations rewrite
On Fri, Sep 14, 2007 at 07:00:25PM +0200, Christoph Hellwig wrote: On Fri, Sep 14, 2007 at 12:43:55PM -0400, J. Bruce Fields wrote: I'm hoping Neil can take a quick look as well (and make a response to the comment on patch #1 along the way). The exportfs_d_alloc one? I have a patch for that done, I just need to run it through my testing setup before sending it out. Oops, I meant #19, not #1--there's still a note addressed to Neil sitting in there. Typing faster than I was thinking, sorry--b.
Re: [2/4] 2.6.23-rc6: known regressions
On Wed, Sep 12, 2007 at 06:58:54PM +0200, Michal Piotrowski wrote: Subject : [NFSv4] 2.6.23-rc4 oops in nfs4_cb_recall References : http://lkml.org/lkml/2007/9/4/53 http://bugzilla.kernel.org/show_bug.cgi?id=9003 Last known good : ? Submitter : Daniel J Blueman [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Workaround : http://bugzilla.kernel.org/attachment.cgi?id=12797 Status : unknown I have patches which fix this, which we're testing. (See bugzilla.) It's a long-standing bug, not a regression. Subject : [NFSD OOPS] 2.6.23-rc1-git10 References : http://lkml.org/lkml/2007/8/2/462 Last known good : ? Submitter : Andrew Clayton [EMAIL PROTECTED] Caused-By : ? Handled-By : ? Status : unknown Neil's working on this. Also not a regression. --b. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Fri, 14 Sep 2007, Nick Piggin wrote: [*] ok, this isn't quite true because if you can actually put a hard limit on unmovable allocations then anti-frag will fundamentally help -- get back to me on that when you get patches to move most of the obvious ones. We have this hard limit using ZONE_MOVABLE in 2.6.23. So we're back to 2nd class support. 2nd class support for me means a feature that is not enabled by default but that can be enabled in order to increase performance. The 2nd class support is there because we are not yet sure about the maturity of the memory allocation methods. Reserve pools as handled (by the not yet available) large page pool patches (which again has altogether another purpose) are not a limit. The reserve pools are used to provide a minimum of higher order pages that is not broken down, in order to ensure that a minimum number of the desired order of pages is even available in your worst case scenario. Mainly I think that is needed during the period when memory defragmentation is still under development. fsblock doesn't need any of those hacks, of course. Nor does mine for the low orders that we are considering. For order MAX_ORDER this is unavoidable since the page allocator cannot manage such large pages. It can be used for lower order if there are issues (that I have not seen yet).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Fri, 14 Sep 2007, Nick Piggin wrote: However fsblock can do everything that higher order pagecache can do in terms of avoiding vmap and giving contiguous memory to block devices by opportunistically allocating higher orders of pages, and falling back to vmap if they cannot be satisfied. fsblock is restricted to the page cache and cannot be used in other contexts where subsystems can benefit from larger linear memory. So if you argue that vmap is a downside, then please tell me how you consider the -ENOMEM of your approach to be better? That is again pretty undifferentiated. Are we talking about low page orders? There we will reclaim all of the reclaimable memory before getting an -ENOMEM. Given the quantities of pages on today's machines--a 1 G machine has 256 million 4k pages--and the unmovable ratios we see today it would require a very strange setup to get an allocation failure while still being able to allocate order 0 pages. With ZONE_MOVABLE you can remove the unmovable objects into a defined pool, then higher order success rates become reasonable. If, by special software layer, you mean the vmap/vunmap support in fsblock, let's see... that's probably all of a hundred or two lines. Contrast that with anti-fragmentation, lumpy reclaim, higher order pagecache and its new special mmap layer... Hmm, seems like a no brainer to me. You really still want to pursue the extra layer argument as a point against fsblock here? Yes sure. Your code could not live without these approaches. Without the antifragmentation measures your fsblock code would not be very successful in getting the larger contiguous segments you need to improve performance. (There is no new mmap layer, the higher order pagecache is simply the old API with set_blocksize expanded). Of course I wouldn't state that. On the contrary, I categorically state that I have already solved it :) Well then I guess that you have not read the requirements...
Because it has already been rejected in another form and adds more You have rejected it. But they are bogus reasons, as I showed above. That's not me. I am working on this because many of the filesystem people have repeatedly asked me to do this. I am no expert on filesystems. You also describe some other real (if lesser) issues like number of page structs to manage in the pagecache. But this is hardly enough to reject my patch now... for every downside you can point out in my approach, I can point out one in yours. - fsblock doesn't require changes to virtual memory layer Therefore it is not a generic change but special to the block layer. So other subsystems still have to deal with the single page issues on their own. Maybe we could get to something like a hybrid that avoids some of these issues? Add support so something like a virtual compound page can be handled transparently in the filesystem layer with special casing if such a beast reaches the block layer? That's conceptually much worse, IMO. Why: It is the same approach that you use. If it is barely ever used and satisfies your concern then I am fine with it. And practically worse as well: vmap space is limited on 32-bit; fsblock approach can avoid vmap completely in many cases; for two reasons. The fsblock data accessor APIs aren't _that_ bad changes. They change zero conceptually in the filesystem, are arguably cleaner, and can be essentially nooped if we wanted to stay with a b_data type approach (but they give you that flexibility to replace it with any implementation). The largeblock changes are generic. They improve general handling of compound pages, they make the existing APIs work for large units of memory, they are not adding additional new API layers.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Fri, 14 Sep 2007, Christoph Lameter wrote:

an -ENOMEM. Given the quantities of pages on today's machines--a 1 G machine

s/1G/1T/ Sigh.

has 256 million 4k pages--and the unmovable ratios we see today it

256k for 1G.
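The correction is plain arithmetic and easy to check. A small sketch, assuming binary units (1G = 2^30 bytes, 1T = 2^40 bytes) and 4k base pages; the names are made up for the example:

```python
PAGE_SIZE = 4 * 1024  # 4k base pages
GIB = 1 << 30
TIB = 1 << 40


def pages(mem_bytes, page_size=PAGE_SIZE):
    """Number of base pages needed to cover mem_bytes of memory."""
    return mem_bytes // page_size


pages_1g = pages(GIB)  # 262144, i.e. 256k pages
pages_1t = pages(TIB)  # 268435456, i.e. 256M pages
```

So "256 million 4k pages" matches a 1T machine, while a 1G machine has 256k of them.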
[PATCH] JBD slab cleanups
jbd/jbd2: Replace slab allocations with page cache allocations

From: Christoph Lameter [EMAIL PROTECTED]

JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset.

Tested on 2.6.23-rc6 with fsx runs fine.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/jbd/checkpoint.c   |    2
 fs/jbd/commit.c       |    6 +-
 fs/jbd/journal.c      |  107 -
 fs/jbd/transaction.c  |   10 ++--
 fs/jbd2/checkpoint.c  |    2
 fs/jbd2/commit.c      |    6 +-
 fs/jbd2/journal.c     |  109 --
 fs/jbd2/transaction.c |   18
 include/linux/jbd.h   |   23 +-
 include/linux/jbd2.h  |   28 ++--
 10 files changed, 83 insertions(+), 228 deletions(-)

Index: linux-2.6.23-rc5/fs/jbd/journal.c
===
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c	2007-09-13 13:37:57.0 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c	2007-09-13 13:45:39.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd_slab_free(tmp, bh_in->b_size);
+			jbd_free(tmp, bh_in->b_size);
 			goto repeat;
 		}
 
@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
 	/* Set up a default-sized revoke table for the new mount. */
 	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
 	if (err) {
-		kfree(journal);
+		jbd_kfree(journal);
 		goto fail;
 	}
 	return journal;
@@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		journal = NULL;
 		goto out;
 	}
@@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		return NULL;
 	}
 
@@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
 	if (err) {
 		printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		return NULL;
 	}
 
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
 		}
 	}
 
-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (journal_recover(journal))
@@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
 	if (journal->j_revoke)
 		journal_destroy_revoke(journal);
 	kfree(journal->j_wbuf);
-	kfree(journal);
+	jbd_kfree(journal);
 }
 
 
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-	for (i = 0; i < JBD_MAX_SLABS; i++) {
-		if (jbd_slab[i])
-			kmem_cache_destroy(jbd_slab[i]);
-		jbd_slab[i] = NULL;
-	}
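The removed slab-name table has five slots for four block sizes, with a NULL in slot 3. That follows from the index macro if the operator stripped from "(size 11)" in this archive copy is a right shift by 11, which is an assumption here. A small sketch of the mapping, with Python names invented to mirror the C ones:

```python
# Mirrors the removed jbd_slab_names[] table; None stands in for the NULL slot.
JBD_SLAB_NAMES = ["jbd_1k", "jbd_2k", "jbd_4k", None, "jbd_8k"]


def jbd_slab_index(size):
    """Block size in bytes -> slab slot, assuming the macro was (size >> 11)."""
    return size >> 11


# 1k -> 0, 2k -> 1, 4k -> 2, 8k -> 4; slot 3 is never produced by these sizes.
indices = [jbd_slab_index(s) for s in (1024, 2048, 4096, 8192)]
```

The shift is not a logarithm, so 8k lands in slot 4 rather than 3, which would explain both the NULL entry and JBD_MAX_SLABS being 5.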
Re: [PATCH] JBD slab cleanups
Thanks Mingming.
Re: Distributed storage. Move away from char device ioctls.
Evgeniy Polyakov wrote: Hi. I'm pleased to announce the fourth release of the distributed storage subsystem, which allows one to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes a new configuration interface (kernel connector over netlink socket) and a number of fixes of various bugs found during the move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far plans so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether it's your version or the others out there. DBS are not very useful, because they still rely on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster.
A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. Jeff
Re: Distributed storage. Move away from char device ioctls.
Jeff Garzik wrote: Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far planes so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projectsitem=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. 
A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. This http://lkml.org/lkml/2007/8/12/159 may provide a fast-path to reaching that goal. Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic. Cutting those out may still leave you with something pretty complicated, though. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? --b.
Re: [PATCH 19/19] exportfs: update documentation
On Fri, 14 Sep 2007 13:49:28 +0200 [EMAIL PROTECTED] wrote:

typos only below:

Update deocumentation to the current state of affairs. Remove duplicated method descruptions in exportfs.h and point to Documentation/filesystems/Exporting instead. Add a little file header comment in expfs.c describing what's going on and mentioning Neils and my copyright [1].

[1] Neil, in case you want a different/additional attribution just change the patch in your queue to reflect the preferred version.

Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Index: linux-2.6/Documentation/filesystems/Exporting
===
--- linux-2.6.orig/Documentation/filesystems/Exporting	2007-03-16 15:10:54.0 +0100
+++ linux-2.6/Documentation/filesystems/Exporting	2007-03-16 17:11:50.0 +0100
+  encode_fh (optinonal)

(optional)

+    Takes a dentry and creates a filehandle fragment which can later be used
+    to find/create a dentry for the same object. The default implementation
+    creates a filehandle fragment that encodes a 32bit inode and generation
+    number for the inode encoded, and if nessecary the same information for

necessary

+    the parent.
+
+  fh_to_dentry (mandatory)
+    Given a filehandle fragment, this should find the implied object and
+    create a dentry for it (possibly with d_alloc_anon).
+
+  fh_to_parent (optional but strongly recommended)
+    Given a filehandle fragment, this should find the parent of the
+    implied object and create a dentry for it (possibly with d_alloc_anon).
+    May simplify fail if the filehandle fragment is too small.\

simply (?)

+
+  get_parent (optional but strongly recommended)
+    When given a dentry for a directory, this should return a dentry for
+    the parent. Quite possibly the parent dentry will have been allocated
+    by d_alloc_anon. The default get_parent function just returns an error
+    so any filehandle lookup that requires finding a parent will fail.
+    ->lookup("..") is *not* used as a default as it can leave ".." entries
+    in the dcache which are too messy to work with.

Index: linux-2.6/fs/exportfs/expfs.c
===
--- linux-2.6.orig/fs/exportfs/expfs.c	2007-03-16 17:06:10.0 +0100
+++ linux-2.6/fs/exportfs/expfs.c	2007-03-16 23:45:14.0 +0100
@@ -1,4 +1,13 @@
-
+/*
+ * Copyright (C) Neil Brown 2002
+ * Copyright (C) Christoph Hellwig 2007
+ *
+ * This file contains the code mapping from inodes to NFS file handles,
+ * and for mapping back from file handles to dentries.
+ *
+ * For details on why we doo all the strange and hairy things in here

do

+ * take a look at Documentation/filesystems/Exporting.
+ */
 #include <linux/exportfs.h>
 #include <linux/fs.h>
 #include <linux/file.h>

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. Jeff
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether its your version or the others out there. DBS are not very useful, because it still relies on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitrarion to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). It is quite logical to extend the concepts of RAID across the network, but ultimately you are still bound by the inflexibility and simplicity of the block device. In contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster. A distributed filesystem is also much more complex, which is why distributed block devices are so appealing :) With a redundant, distributed filesystem, you simply do not need any complexity at all at the block device level. You don't even need RAID. It is my hope that you will put your skills towards a distributed filesystem :) Of the current solutions, GFS (currently in kernel) scales poorly, and NFS v4.1 is amazingly bloated and overly complex. I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? --b. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic. Cutting those out may still leave you with something pretty complicated, though. Far less complicated than NFSv4.1 though (which is easy :)) NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk to another NFSv4.1 implementation. Even if Linux supports the entire NFSv4.1 RFC (as it stands in draft 13 anyway), there is no guarantee at all that Linux will be able to store and retrieve data, since it's entirely possible that a proprietary protocol is required to access your data. NFSv4.1 is no longer a completely open architecture. Jeff
Re: Distributed storage. Move away from char device ioctls.
On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 03:07:46PM -0400, Jeff Garzik wrote: I've been waiting for years for a smart person to come along and write a POSIX-only distributed filesystem. What exactly do you mean by POSIX-only? Don't bother supporting attributes, file modes, and other details not supported by POSIX. The prime example being NFSv4, which is larded down with Windows features. I am sympathetic. Cutting those out may still leave you with something pretty complicated, though. Far less complicated than NFSv4.1 though (which is easy :)) One would hope so. NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk to another NFSv4.1 implementation. No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? --b.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Mel Gorman [EMAIL PROTECTED] writes: On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote: Nick Piggin [EMAIL PROTECTED] writes: In my attack, I cause the kernel to allocate lots of unmovable allocations and deplete movable groups. I theoretically then only need to keep a small number (1/2^N) of these allocations around in order to DoS a page allocation of order N. I'm assuming that when an unmovable allocation hijacks a movable group any further unmovable alloc will evict movable objects out of that group before hijacking another one, right? No eviction takes place. If an unmovable allocation gets placed in a movable group, then steps are taken to ensure that future unmovable allocations will take place in the same range (these decisions take place in __rmqueue_fallback()). When choosing a movable block to pollute, it will also choose the lowest possible block in PFN terms to steal so that fragmentation pollution will be as confined as possible. Evicting the unmovable pages would be one of those expensive steps that have been avoided to date. But then you can have all blocks filled with movable data, free 4K in one group, allocate 4K unmovable to take over the group, free 4k in the next group, take that group and so on. You can end with 4k unmovable in every 64k easily by accident. There should be a lot of pressure for movable objects to vacate a mixed group or you do get fragmentation catastrophes. Looking at my little test program, evicting movable objects from a mixed group should not be that expensive as it doesn't happen often. The cost of it should be freeing some pages (or finding free ones in a movable group) and then memcpy. With my simplified simulation it never happens so I expect it to only happen when the work set changes. And it doesn't even have to be a DoS. The natural fragmentation that occurs today in a kernel has the possibility to slowly push out the movable groups and give you the same situation. How would you cause that?
Say you do want to purposefully place one unmovable 4k page into every 64k compound page. So you allocate 4K. First 64k page locked. But now, to get 4K into the second 64K page you have to first use up all the rest of the first 64k page. Meaning one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then will a new 64k chunk be broken and become locked. It would be easier early in the boot to mmap a large area and fault it in in virtual address order then mlock a page every 64K. Early in the system's lifetime, there will be a rough correlation between physical and virtual memory. Without mlock(), the most successful attack will likely mmap() a 60K region and fault it in as an attempt to get pagetable pages placed in every 64K region. This strategy would not work with grouping pages by mobility though as it would group the pagetable pages together. But even with mlock the virtual pages should still be movable. So if you evict movable objects from a mixed group when needed all the pagetable pages would end up in the same mixed group slowly taking it over completely. No fragmentation at all. See how essential that feature is. :) Targeted attacks on grouping pages by mobility are not very easy and not that interesting either. As Nick suggests, the natural fragmentation over long periods of time is what is interesting. So to get the last 64k chunk used all previous 32k chunks need to be blocked and you need to allocate 32k (or less if more is blocked). For all previous 32k chunks to be blocked every second 16k needs to be blocked. To block the last of those 16k chunks all previous 8k chunks need to be blocked and you need to allocate 8k. For all previous 8k chunks to be blocked every second 4k page needs to be used. To alloc the last of those 4k pages all previous 4k pages need to be used. So to construct a situation where no contiguous 64k chunk is free you have to allocate total mem - 64k - 32k - 16k - 8k - 4k (or thereabouts) of memory first.
Only then could you free memory again while still keeping every 64k page blocked. Does that occur naturally given enough ram to start with? I believe it's very difficult to craft an attack that will work in a short period of time. An attack that worked on 2.6.22 as well may have no success on 2.6.23-rc4-mm1 for example as grouping pages by mobility does make it exceedingly hard to craft an attack unless the attacker can mlock large amounts of memory. To see how bad fragmentation could be I wrote a little program to simulate allocations with the following simplified algorithm:
Memory management:
- Free pages are kept in buckets, one per order, and sorted by address.
- alloc() takes the front page (smallest address) out of the bucket of the right order or recursively splits the next higher bucket.
- free() recursively tries to merge a page with its neighbour and puts the result back into the proper bucket (sorted
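The quoted message breaks off at this point in the archive. As an illustration only, the three rules above can be sketched in Python roughly as follows. This is not Goswin's test program; the class name `BuddyAllocator`, the choice of 4K pages as the unit, and the 64K (order 4) maximum block are assumptions made for the example, and nothing here models grouping pages by mobility.

```python
import bisect


class BuddyAllocator:
    """Toy buddy allocator; addresses and sizes are counted in 4K pages."""

    def __init__(self, total_pages, max_order=4):
        # Assumption: max_order=4 makes the largest block 16 pages, i.e. 64K.
        assert total_pages % (1 << max_order) == 0
        self.max_order = max_order
        # One bucket of free blocks per order, each kept sorted by address.
        self.buckets = [[] for _ in range(max_order + 1)]
        self.buckets[max_order] = list(range(0, total_pages, 1 << max_order))

    def alloc(self, order):
        """Return the address of a free block of 2**order pages, or None."""
        if order > self.max_order:
            return None
        if self.buckets[order]:
            return self.buckets[order].pop(0)  # front page = smallest address
        addr = self.alloc(order + 1)  # recursively split the next higher order
        if addr is None:
            return None
        # Keep the lower half, put the upper half back as a free block.
        bisect.insort(self.buckets[order], addr + (1 << order))
        return addr

    def free(self, addr, order):
        """Free a block, merging with its buddy for as long as the buddy is free."""
        while order < self.max_order:
            buddy = addr ^ (1 << order)
            bucket = self.buckets[order]
            i = bisect.bisect_left(bucket, buddy)
            if i == len(bucket) or bucket[i] != buddy:
                break  # buddy is in use, stop merging
            bucket.pop(i)
            addr = min(addr, buddy)
            order += 1
        bisect.insort(self.buckets[order], addr)


mem = BuddyAllocator(total_pages=32)        # two 64K blocks
pages = [mem.alloc(0) for _ in range(32)]   # fill all memory with 4K pages
for p in pages:
    if p % 16 != 0:                         # keep one 4K page per 64K block
        mem.free(p, 0)
print(mem.alloc(4))                         # None: 120K is free, but no 64K block
```

In this toy model sixteen consecutive 4K allocations all land in the first 64K block before the second block is split, which matches the argument made earlier in the message. Pinning a single page in every 64K block is only possible after memory has been filled almost completely.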
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Christoph Lameter [EMAIL PROTECTED] writes: On Fri, 14 Sep 2007, Christoph Lameter wrote: an -ENOMEM. Given the quantities of pages on today's machine--a 1 G machine s/1G/1T/ Sigh. has 256 million 4k pages--and the unmovable ratios we see today it 256k for 1G. 256k == 64 pages for 1GB ram or 256k pages == 1Mb? MfG Goswin
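For reference, the page counts being corrected here can be checked with a few lines of Python. This is a sketch that assumes binary units (1G = 2^30 bytes, 1T = 2^40 bytes) and 4K pages; the helper name `page_count` is made up for the example.

```python
PAGE_SIZE = 4 * 1024  # 4K pages, as in the thread


def page_count(ram_bytes, page_size=PAGE_SIZE):
    """Number of whole pages that fit in the given amount of RAM."""
    return ram_bytes // page_size


GIB = 1 << 30  # assumption: "1G" read as 2**30 bytes
TIB = 1 << 40  # assumption: "1T" read as 2**40 bytes

print(page_count(GIB))  # 262144, i.e. 256k pages
print(page_count(TIB))  # 268435456, i.e. 256 million pages
```

Under those assumptions, 256k pages corresponds to 1G of RAM and 256 million pages to 1T, which is consistent with the s/1G/1T/ correction.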
Re: [2/3] 2.6.23-rc6: known regressions v2
Hi all,

Here is a list of some known regressions in 2.6.23-rc6. Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name                    Regressions fixed since 21-Jun-2007
Adrian Bunk             10
Andi Kleen              7
Linus Torvalds          6
Alan Stern              5
Hugh Dickins            5
Trond Myklebust         5
Andrew Morton           4
David S. Miller         4
Al Viro                 3
Alexey Starikovskiy     3
Cornelia Huck           3
Jens Axboe              3
Stephen Hemminger       3
Tejun Heo               3

FS

Subject         : hanging ext3 dbench tests
References      : http://lkml.org/lkml/2007/9/11/176
Last known good : ?
Submitter       : Andy Whitcroft [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : under test -- unreproducible at present

Subject         : umount triggers a warning in jfs and takes almost a minute
References      : http://lkml.org/lkml/2007/9/4/73
Last known good : ?
Submitter       : Oliver Neukum [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : unknown

Networking

Subject         : build #301 failed for 2.6.23-rc6-g0d4cbb5 in linux/drivers/net/wireless/libertas/
References      : http://lkml.org/lkml/2007/9/11/150
Last known good : ?
Submitter       : Toralf Förster [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : unknown

Subject         : zd1211rw regression, device does not enumerate
References      : http://marc.info/?l=linux-usb-devel&m=118854967709322&w=2
                  http://bugzilla.kernel.org/show_bug.cgi?id=8972
Last known good : ?
Submitter       : Oliver Neukum [EMAIL PROTECTED]
Caused-By       : Daniel Drake [EMAIL PROTECTED]
                  commit 74553aedd46b3a2cae986f909cf2a3f99369decc
Handled-By      : ?
Status          : unknown

Subject         : NETDEV WATCHDOG: eth0: transmit timed out
References      : http://lkml.org/lkml/2007/8/13/737
Last known good : ?
Submitter       : Karl Meyer [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : Francois Romieu [EMAIL PROTECTED]
Status          : problem is being debugged

Subject         : Weird network problems with 2.6.23-rc2
References      : http://lkml.org/lkml/2007/8/11/40
Last known good : ?
Submitter       : Shish [EMAIL PROTECTED]
Caused-By       : ?
Handled-By      : ?
Status          : unknown

Farewell!
Michal -- LOGOUT http://www.stardust.webpages.pl/
Re: Distributed storage. Move away from char device ioctls.
On 9/14/07, Jeff Garzik [EMAIL PROTECTED] wrote: Evgeniy Polyakov wrote: Hi. I'm pleased to announce fourth release of the distributed storage subsystem, which allows to form a storage on top of remote and local nodes, which in turn can be exported to another storage as a node to form tree-like storages. This release includes new configuration interface (kernel connector over netlink socket) and number of fixes of various bugs found during move to it (in error path). Further TODO list includes: * implement optional saving of mirroring/linear information on the remote nodes (simple) * new redundancy algorithm (complex) * some thoughts about distributed filesystem tightly connected to DST (far-far plans so far) Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] My thoughts. But first a disclaimer: Perhaps you will recall me as one of the people who really reads all your patches, and examines your code and proposals closely. So, with that in mind... I question the value of distributed block services (DBS), whether it's your version or the others out there. DBS are not very useful, because they still rely on a useful filesystem sitting on top of the DBS. It devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF). This distributed storage is very much needed; even if it were to act as a more capable/performant replacement for NBD (or MD+NBD) in the near term. Many high availability applications don't _need_ all the additional complexity of a full distributed filesystem. So given that, it's discouraging to see you trying to gently push Evgeniy away from all the promising work he has published. Evgeniy, please continue your current work.
Mike
Re: Distributed storage. Move away from char device ioctls.
J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 06:32:11PM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: On Fri, Sep 14, 2007 at 05:14:53PM -0400, Jeff Garzik wrote: NFSv4.1 adds to the fun, by throwing interoperability completely out the window. What parts are you worried about in particular? I'm not worried; I'm stating facts as they exist today (draft 13): NFS v4.1 does something completely without precedent in the history of NFS: the specification is defined such that interoperability is -impossible- to guarantee. pNFS permits private and unspecified layout types. This means it is impossible to guarantee that one NFSv4.1 implementation will be able to talk to another NFSv4.1 implementation. No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? I just sorta assumed you could fall back to the NFSv4.0 mode of operation, going through the metadata server for all data accesses. But look at that choice in practice: you can either ditch pNFS completely, or use a proprietary solution. The market incentives are CLEARLY tilted in favor of makers of proprietary solutions. But it's a poor choice (really little choice at all). Overall, my main concern is that NFSv4.1 is no longer an open architecture solution. The no-pNFS or proprietary platform choice merely illustrates one of many negative aspects of this architecture. One of NFS's biggest value propositions is its interoperability. To quote some Wall Street guys, NFS is like crack. It Just Works. We love it. Now, for the first time in NFS's history (AFAIK), the protocol is no longer completely specified, completely known. No longer a closed loop. Private layout types mean that it is _highly_ unlikely that any OS or appliance or implementation will be able to claim full NFS compatibility.
And when the proprietary portion of the spec involves something as basic as accessing one's own data, I consider that a fundamental flaw. NFS is no longer completely open. Jeff
Re: Distributed storage. Move away from char device ioctls.
On Sat, Sep 15, 2007 at 12:08:42AM -0400, Jeff Garzik wrote: J. Bruce Fields wrote: No, servers are required to support ordinary nfs operations to the metadata server. At least, that's the way it was last I heard, which was a while ago. I agree that it'd stink (for any number of reasons) if you ever *had* to get a layout to access some file. Was that your main concern? I just sorta assumed you could fall back to the NFSv4.0 mode of operation, going through the metadata server for all data accesses. Right. So any two pNFS implementations *will* be able to talk to each other; they just may not be able to use the (possibly higher-bandwidth) read/write path that pNFS gives them. But look at that choice in practice: you can either ditch pNFS completely, or use a proprietary solution. The market incentives are CLEARLY tilted in favor of makers of proprietary solutions. I doubt somebody would go to all the trouble to implement pNFS and then present their customers with that kind of choice. But maybe I'm missing something. What market incentives do you see that would make that more attractive than either 1) using a standard fully-specified layout type, or 2) just implementing your own proprietary protocol instead of pNFS? Overall, my main concern is that NFSv4.1 is no longer an open architecture solution. The no-pNFS or proprietary platform choice merely illustrate one of many negative aspects of this architecture. It's always been possible to extend NFS in various ways if you want. You could use sideband protocols with v2 and v3, for example. People have done that. Some of them have been standardized and widely implemented, some haven't. You could probably add your own compound ops to v4 if you wanted, I guess. And there's advantages to experimenting with extensions first and then standardizing when you figure out what works. I wish it happened that way more often. 
Now, for the first time in NFS's history (AFAIK), the protocol is no longer completely specified, completely known. No longer a closed loop. Private layout types mean that it is _highly_ unlikely that any OS or appliance or implementation will be able to claim full NFS compatibility. Do you know of any such private layout types? This is kind of a boring argument, isn't it? I'd rather hear whatever ideas you have for a new distributed filesystem protocol. --b.