Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree
Junjiro Okajima,

first of all, thanks for the feedback on my union mount patches.

On Tue, Nov 06, [EMAIL PROTECTED] wrote:

> Whiteouts in your code can cause serious memory pressure, since they
> are kept in the dcache. I know that only one inode exists for whiteouts
> and that it is shared, but the dentries for whiteouts are not. They are
> created for each name and are resident in the dcache.
> I am afraid this can easily become a problem when you create and unlink
> temporary files many times, since their filenames are generally unique.

The problem you describe exists only when tmpfs is the topmost union layer. In all other cases the whiteout dentries can be shrunk like the dentries of any other file type. This is the price you have to pay for using union mounts, because this information must be stored somewhere. With ext3 or other disk-based filesystems the whiteouts are stored on disk like normal files. Therefore the dentry cache can be shrunk and the entries reread by a lookup.

> Regarding struct path in nameidata, I basically have no objection.
> But I think it is better to create macros for backward compatibility,
> as was done for struct file.

In the case of f_dentry and f_mnt that was easy because you could use macros for it. Still, people tend to be lazy and don't change their code if you don't force them (or do it for them). Anyway, in nameidata we used dentry and mnt as the field names. Therefore it isn't possible to use macros, apart from something like ND2DENTRY(nd), which is even worse.
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree
On Mon, Nov 05, Jörn Engel wrote:

> Subject: Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
> From: Jan Blunck [EMAIL PROTECTED]
>
> Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
>
> Signed-off-by: Jan Blunck [EMAIL PROTECTED]
> Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
> Acked-by: Christoph Hellwig [EMAIL PROTECTED]
> Cc: Al Viro [EMAIL PROTECTED]
> CC: linux-fsdevel@vger.kernel.org
> Signed-off-by: Andrew Morton [EMAIL PROTECTED]
>
> Frowned-upon-by: Joern Engel [EMAIL PROTECTED]
>
> This patch changes some 400 lines, most if not all of which get longer
> and more complicated to read. 23 get sufficiently longer to require an
> additional linebreak. I can't remember complexity being invited into
> the kernel without good reasoning, yet the patch description is
> surprisingly low on reasoning:
>
> Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

I don't measure complexity by lines of code or length of lines. Maybe I was not verbose enough in the description, fair.

This is a cleanup series. In almost no case is there a reason why someone would want to use a dentry by itself. This series reflects that fact in nameidata, where there is absolutely no reason at all. It enforces the correct order of getting/releasing the refcounts on <dentry,vfsmount> pairs. It enables us to do some more cleanups wrt lookup (which are coming later). For stacking support in VFS it is essential to have the <dentry,vfsmount> pair in every place where you want to traverse the stack.

> If churn is the only effect of this, please consider it NAKed again.

I wonder why you didn't speak up when this series was posted to LKML. It was posted at least three times before.

Did I break your COW link patches? ;)
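The refcount-ordering argument can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel code: `struct dentry` and `struct vfsmount` are mock types with plain integer counters, and the helper bodies are assumptions that only mirror the idea behind the `path_get`/`path_put` helpers named in the series (take the mount reference before the dentry reference, drop them in reverse order).

```c
#include <stddef.h>

/* Mock objects: plain counters stand in for the kernel's refcounts. */
struct dentry   { int count; };
struct vfsmount { int count; };

/* The pair that the series embeds into struct nameidata. */
struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

/* Take both references in one place: mount first, then dentry. */
static void path_get(struct path *path)
{
	path->mnt->count++;
	path->dentry->count++;
}

/* Drop them in the reverse order: dentry first, then mount. */
static void path_put(struct path *path)
{
	path->dentry->count--;
	path->mnt->count--;
}
```

With a single pair of helpers, a caller can no longer get the order wrong or forget one half of the pair, which is the point made above about enforcing the ordering.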
Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree
On Tue, Nov 06, Jörn Engel wrote:

> > This is a cleanup series. In almost no case is there a reason why
> > someone would want to use a dentry by itself. This series reflects
> > that fact in nameidata, where there is absolutely no reason at all.
>
> 400+ lines changed in this patch, some 10 in a followup patch that
> combines dentry/vfsmount assignments into a single path assignment. If
> your argument above was valid, I would expect more simplifications and
> fewer complications. Call me a sceptic until further patches show up to
> support your point.

These are the patches currently in -mm related to the struct path cleanup:

  move-struct-path-into-its-own-header.patch
  embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch
  introduce-path_put.patch
  use-path_put-in-a-few-places-instead-of-mntdput.patch
  introduce-path_get.patch
  use-struct-path-in-fs_struct.patch
  make-set_fs_rootpwd-take-a-struct-path.patch
  introduce-path_get-unionfs.patch
  embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs.patch
  introduce-path_put-unionfs.patch
  one-less-parameter-to-__d_path.patch
  d_path-kerneldoc-cleanup.patch
  d_path-use-struct-path-in-struct-avc_audit_data.patch
  d_path-make-proc_get_link-use-a-struct-path-argument.patch
  d_path-make-get_dcookie-use-a-struct-path-argument.patch
  use-struct-path-in-struct-svc_export.patch
  use-struct-path-in-struct-svc_expkey.patch
  d_path-make-seq_path-use-a-struct-path-argument.patch
  d_path-make-d_path-use-a-struct-path.patch

> > It enables us to do some more cleanups wrt lookup (which are coming
> > later).
>
> Please send those patches. I invite cleanups that do clean things up
> and won't argue against them. ;)

I'll send them in a later series.

> > For stacking support in VFS it is essential to have the
> > <dentry,vfsmount> pair in every place where you want to traverse
> > the stack.
>
> True, but unrelated to this patch.

I started sending out the patches in multiple chunks because nobody reviewed the union mount series except for coding style violations. So this is the preparatory work for the changes that come with my union mount series. They are related to, but not part of, the union mount patch series. It seems that people prefer a patch series with small, self-contained changes over one big fat series.

> > I wonder why you didn't speak up when this series was posted to
> > LKML. It was posted at least three times before.
>
> I did speak up. Once. If you missed that thread, please forgive me
> missing those in which the same patch I disapproved of were resent
> without me on Cc.

Sorry for missing your feedback, but now I found your mail ("mental masturbation that complicates the source"). I guess this is what happens when multiple people start posting the same patch series.

Coming back to the mental stuff: the savings of the first bunch of patches that already hit -mm:

Textsize without patches: 0x20e572
Textsize with patches:    0x20e042
                          --------
                          0x530 = 1328 bytes

Cheers,
Jan
[RFC 0/2] readdir() as an inode operation
This is a first attempt at moving readdir() to become an inode operation. This is necessary for a VFS implementation of something like union mounts, where a readdir() needs to read the directory contents of multiple directories. Besides that, the new interface no longer passes the struct file to the filesystem implementations.

Comments, please?
Jan
[RFC 1/2] i_op-readdir: Change readdir() to be an inode operation
This patch adds a new readdir() inode operation. The purpose of this patch is to enable the VFS to support directory reading on a stack of directories. The new interface no longer passes the struct file to the filesystem implementation. Normally the filesystem implementation shouldn't depend on any information in struct file except for the dentry, the cookie (f_pos) and the user's credentials.

The new interface for the readdir inode operation is as follows:

int (*readdir) (struct dentry *dentry, loff_t *pos, void *private,
		filldir_t filler, void *dirent);

@dentry:  the dentry of the directory
@pos:     pointer to the cookie
@private: the credentials (at the moment it is still filp->private_data)
@filler:  the filldir to call
@dirent:  the dirent buffer

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/readdir.c       |   14 --
 include/linux/fs.h |    2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

Index: b/fs/readdir.c
===================================================================
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,6 +16,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/unistd.h>
+#include <linux/kallsyms.h>
 
 #include <asm/uaccess.h>
 
@@ -23,7 +24,8 @@ int vfs_readdir(struct file *file, filld
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
 	int res = -ENOTDIR;
-	if (!file->f_op || !file->f_op->readdir)
+	if ((!file->f_op || !file->f_op->readdir) &&
+	    (!inode->i_op || !inode->i_op->readdir))
 		goto out;
 
 	res = security_file_permission(file, MAY_READ);
@@ -33,7 +35,15 @@ int vfs_readdir(struct file *file, filld
 	mutex_lock(&inode->i_mutex);
 	res = -ENOENT;
 	if (!IS_DEADDIR(inode)) {
-		res = file->f_op->readdir(file, buf, filler);
+		if (inode->i_op->readdir) {
+			printk(KERN_DEBUG "i_op->readdir @ ");
+			print_ip_sym((unsigned long)inode->i_op);
+			res = inode->i_op->readdir(file->f_path.dentry,
+						   &file->f_pos,
+						   file->private_data,
+						   filler, buf);
+		} else
+			res = file->f_op->readdir(file, buf, filler);
 		file_accessed(file);
 	}
 	mutex_unlock(&inode->i_mutex);
Index: b/include/linux/fs.h
===================================================================
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1214,6 +1214,8 @@ struct inode_operations {
 	int (*mkdir) (struct inode *,struct dentry *,int);
 	int (*rmdir) (struct inode *,struct dentry *);
 	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+	/* readdir(dentry, position, private/credential, filler, buffer) */
+	int (*readdir) (struct dentry *, loff_t *, void *, filldir_t, void *);
 	int (*rename) (struct inode *, struct dentry *,
 			struct inode *, struct dentry *);
 	int (*readlink) (struct dentry *, char __user *,int);
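The dispatch rule in the vfs_readdir() hunk (prefer the inode operation, fall back to the file operation, fail with -ENOTDIR if neither exists) can be sketched in userspace. Everything below is a mock under stated assumptions: the struct layouts are reduced to the fields the sketch needs, `loff_t` is replaced by `long long`, `MOCK_ENOTDIR` stands in for the kernel's errno value, and `new_style_readdir`/`old_style_readdir` are hypothetical filesystem callbacks. Unlike the posted hunk, the sketch also checks `i_op` for NULL before dereferencing it.

```c
#include <stddef.h>

/* Userspace stand-ins for kernel types; names mirror the patch, bodies are mock. */
typedef int (*filldir_t)(void *buf, const char *name, int namelen);

struct dentry;
struct file;

struct inode_operations {
	int (*readdir)(struct dentry *, long long *, void *, filldir_t, void *);
};

struct file_operations {
	int (*readdir)(struct file *, void *, filldir_t);
};

struct inode  { const struct inode_operations *i_op; };
struct dentry { struct inode *d_inode; };

struct file {
	struct dentry *f_dentry;
	const struct file_operations *f_op;
	long long f_pos;
	void *private_data;
};

#define MOCK_ENOTDIR 20

/* Prefer i_op->readdir, fall back to f_op->readdir, else fail. */
static int mock_vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
	struct inode *inode = file->f_dentry->d_inode;
	int have_iop = inode->i_op && inode->i_op->readdir;
	int have_fop = file->f_op && file->f_op->readdir;

	if (!have_iop && !have_fop)
		return -MOCK_ENOTDIR;
	if (have_iop)
		return inode->i_op->readdir(file->f_dentry, &file->f_pos,
					    file->private_data, filler, buf);
	return file->f_op->readdir(file, buf, filler);
}

/* Hypothetical new-style callback: sees only dentry, cookie and private data. */
static int new_style_readdir(struct dentry *d, long long *pos, void *priv,
			     filldir_t fill, void *buf)
{
	(void)d; (void)priv; (void)fill; (void)buf;
	*pos += 1;	/* advance the cookie through the pointer */
	return 1;
}

/* Hypothetical old-style callback: still receives the whole struct file. */
static int old_style_readdir(struct file *f, void *buf, filldir_t fill)
{
	(void)f; (void)buf; (void)fill;
	return 2;
}
```

The new-style callback never sees the struct file, so the cookie has to be passed by pointer for the position update to reach the caller.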
[RFC 2/2] i_op-readdir: Change libfs users to the new interface
This patch changes dcache_readdir() to the new inode operations readdir interface. Hence all the users of libfs.c are changed to use the new interface, too.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/autofs4/autofs_i.h |    5 ++---
 fs/autofs4/root.c     |   41 -
 fs/cifs/inode.c       |    1 +
 fs/hugetlbfs/inode.c  |    1 +
 fs/libfs.c            |   27 ++-
 fs/ocfs2/dlm/dlmfs.c  |    1 +
 fs/ramfs/inode.c      |    1 +
 include/linux/fs.h    |    3 ++-
 mm/shmem.c            |    1 +
 9 files changed, 47 insertions(+), 34 deletions(-)

Index: b/fs/autofs4/autofs_i.h
===================================================================
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -168,10 +168,9 @@ static inline int autofs4_ispending(stru
 	return pending;
 }
 
-static inline void autofs4_copy_atime(struct file *src, struct file *dst)
+static inline void autofs4_copy_atime(struct inode *src, struct inode *dst)
 {
-	dst->f_path.dentry->d_inode->i_atime =
-		src->f_path.dentry->d_inode->i_atime;
+	dst->i_atime = src->i_atime;
 	return;
 }
 
Index: b/fs/autofs4/root.c
===================================================================
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -35,7 +35,6 @@ const struct file_operations autofs4_roo
 	.open		= dcache_dir_open,
 	.release	= dcache_dir_close,
 	.read		= generic_read_dir,
-	.readdir	= autofs4_root_readdir,
 	.ioctl		= autofs4_root_ioctl,
 };
 
@@ -43,7 +42,6 @@ const struct file_operations autofs4_dir
 	.open		= autofs4_dir_open,
 	.release	= autofs4_dir_close,
 	.read		= generic_read_dir,
-	.readdir	= autofs4_dir_readdir,
 };
 
 const struct inode_operations autofs4_indirect_root_inode_operations = {
@@ -52,6 +50,7 @@ const struct inode_operations autofs4_in
 	.symlink	= autofs4_dir_symlink,
 	.mkdir		= autofs4_dir_mkdir,
 	.rmdir		= autofs4_dir_rmdir,
+	.readdir	= autofs4_root_readdir,
 };
 
 const struct inode_operations autofs4_direct_root_inode_operations = {
@@ -59,6 +58,7 @@ const struct inode_operations autofs4_di
 	.unlink		= autofs4_dir_unlink,
 	.mkdir		= autofs4_dir_mkdir,
 	.rmdir		= autofs4_dir_rmdir,
+	.readdir	= autofs4_root_readdir,
 	.follow_link	= autofs4_follow_link,
 };
 
@@ -68,15 +68,17 @@ const struct inode_operations autofs4_di
 	.symlink	= autofs4_dir_symlink,
 	.mkdir		= autofs4_dir_mkdir,
 	.rmdir		= autofs4_dir_rmdir,
+	.readdir	= autofs4_dir_readdir,
 };
 
-static int autofs4_root_readdir(struct file *file, void *dirent,
-				filldir_t filldir)
+static int autofs4_root_readdir(struct dentry *dentry, loff_t *pos,
+				void *private,
+				filldir_t filldir, void *dirent)
 {
-	struct autofs_sb_info *sbi = autofs4_sbi(file->f_path.dentry->d_sb);
+	struct autofs_sb_info *sbi = autofs4_sbi(dentry->d_sb);
 	int oz_mode = autofs4_oz_mode(sbi);
 
-	DPRINTK("called, filp->f_pos = %lld", file->f_pos);
+	DPRINTK("called, filp->f_pos = %lld", *pos);
 
 	/*
 	 * Don't set reghost flag if:
@@ -84,12 +86,12 @@ static int autofs4_root_readdir(struct f
 	 * 2) we haven't even enabled reghosting in the 1st place.
 	 * 3) this is the daemon doing a readdir
 	 */
-	if (oz_mode && file->f_pos == 0 && sbi->reghost_enabled)
+	if (oz_mode && *pos == 0 && sbi->reghost_enabled)
 		sbi->needs_reghost = 1;
 
 	DPRINTK("needs_reghost = %d", sbi->needs_reghost);
 
-	return dcache_readdir(file, dirent, filldir);
+	return dcache_inode_readdir(dentry, pos, private,filldir, dirent);
 }
 
 static int autofs4_dir_open(struct inode *inode, struct file *file)
@@ -201,15 +203,16 @@ out:
 	return status;
 }
 
-static int autofs4_dir_readdir(struct file *file, void *dirent, filldir_t filldir)
+static int autofs4_dir_readdir(struct dentry *dentry, loff_t *pos,
+			       void *private,
+			       filldir_t filldir, void *dirent)
 {
-	struct dentry *dentry = file->f_path.dentry;
 	struct autofs_sb_info *sbi = autofs4_sbi(dentry->d_sb);
-	struct dentry *cursor = file->private_data;
+	struct dentry *cursor = private;
 	int status;
 
-	DPRINTK("file=%p dentry=%p %.*s",
-		file, dentry, dentry->d_name.len, dentry->d_name.name);
+	DPRINTK("dentry=%p %.*s", dentry, dentry->d_name.len,
+		dentry->d_name.name);
 
 	if (autofs4_oz_mode(sbi))
 		goto out;
@@ -221,21 +224,25 @@ static int autofs4_dir_readdir(struct fi
 	if (d_mountpoint(dentry)) {
 		struct file *fp = cursor
[RFC 10/26] VFS white-out handling
Introduce white-out handling in the VFS.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/inode.c         |   22 ++
 fs/namei.c         |  417 +++--
 fs/readdir.c       |    6
 include/linux/fs.h |    7
 4 files changed, 441 insertions(+), 11 deletions(-)

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1410,6 +1410,26 @@ void __init inode_init(unsigned long mem
 		INIT_HLIST_HEAD(&inode_hashtable[loop]);
 }
 
+/*
+ * Dummy default file-operations:
+ * Never open a whiteout. This is always a bug.
+ */
+static int whiteout_no_open(struct inode *irrelevant, struct file *dontcare)
+{
+	printk("WARNING: at %s:%d %s(): Attempted to open a whiteout!\n",
+	       __FILE__, __LINE__, __FUNCTION__);
+	/*
+	 * Nobody should ever be able to open a whiteout. On the other hand
+	 * this isn't fatal so lets just print a warning message.
+	 */
+	WARN_ON(1);
+	return -ENXIO;
+}
+
+static struct file_operations def_wht_fops = {
+	.open		= whiteout_no_open,
+};
+
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 {
 	inode->i_mode = mode;
@@ -1423,6 +1443,8 @@ void init_special_inode(struct inode *in
 		inode->i_fop = &def_fifo_fops;
 	else if (S_ISSOCK(mode))
 		inode->i_fop = &bad_sock_fops;
+	else if (S_ISWHT(mode))
+		inode->i_fop = &def_wht_fops;
 	else
 		printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n",
 		       mode);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -887,7 +887,7 @@ static fastcall int __link_path_walk(con
 
 		err = -ENOENT;
 		inode = next.dentry->d_inode;
-		if (!inode)
+		if (!inode || S_ISWHT(inode->i_mode))
 			goto out_dput;
 		err = -ENOTDIR;
 		if (!inode->i_op)
@@ -951,6 +951,8 @@ last_component:
 		err = -ENOENT;
 		if (!inode)
 			break;
+		if (S_ISWHT(inode->i_mode))
+			break;
 		if (lookup_flags & LOOKUP_DIRECTORY) {
 			err = -ENOTDIR;
 			if (!inode->i_op || !inode->i_op->lookup)
@@ -1434,13 +1436,10 @@ static inline int check_sticky(struct in
  * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
  *     nfs_async_unlink().
  */
-static int may_delete(struct inode *dir,struct dentry *victim,int isdir)
+static int __may_delete(struct inode *dir, struct dentry *victim, int isdir)
 {
 	int error;
 
-	if (!victim->d_inode)
-		return -ENOENT;
-
 	BUG_ON(victim->d_parent->d_inode != dir);
 	audit_inode_child(victim->d_name.name, victim->d_inode, dir);
 
@@ -1466,6 +1465,14 @@ static int may_delete(struct inode *dir,
 	return 0;
 }
 
+static int may_delete(struct inode *dir, struct dentry *victim, int isdir)
+{
+	if (!victim->d_inode || S_ISWHT(victim->d_inode->i_mode))
+		return -ENOENT;
+
+	return __may_delete(dir, victim, isdir);
+}
+
 /*	Check whether we can create an object with dentry child in directory
  *  dir.
  *  1. We can't do it if child already exists (open has special treatment for
@@ -1477,7 +1484,7 @@ static int may_delete(struct inode *dir,
 static inline int may_create(struct inode *dir, struct dentry *child,
 			     struct nameidata *nd)
 {
-	if (child->d_inode)
+	if (child->d_inode && !S_ISWHT(child->d_inode->i_mode))
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
@@ -1559,6 +1566,13 @@ int vfs_create(struct inode *dir, struct
 	error = security_inode_create(dir, dentry, mode);
 	if (error)
 		return error;
+
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->create(dir, dentry, mode, nd);
 	if (!error)
@@ -1741,7 +1755,7 @@ do_last:
 	}
 
 	/* Negative dentry, just create the file */
-	if (!path.dentry->d_inode) {
+	if (!path.dentry->d_inode || S_ISWHT(path.dentry->d_inode->i_mode)) {
 		error = open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
@@ -1903,6 +1917,12 @@ int vfs_mknod(struct inode *dir, struct
 	if (error)
 		return error;
 
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->mknod(dir, dentry, mode, dev);
 	if (!error)
@@ -1969,6 +1989,7 @@ asmlinkage long sys_mknod(const char __u
 
 int vfs_mkdir(struct inode *dir, struct dentry *dentry, int
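The recurring pattern in the namei.c hunks is that a whiteout is treated like a negative dentry: lookups and deletes see "no such entry", while creates are allowed to replace it. A minimal userspace sketch of that rule follows. The types are mocks, and the `MOCK_S_IFWHT` value (0160000, the value BSD uses for whiteouts) is an assumption for illustration; the value defined by the patch series is not shown in the excerpt above.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed file-type encoding for the sketch only. */
#define MOCK_S_IFMT	0170000
#define MOCK_S_IFWHT	0160000
#define MOCK_S_ISWHT(m)	(((m) & MOCK_S_IFMT) == MOCK_S_IFWHT)

struct inode  { unsigned int i_mode; };
struct dentry { struct inode *d_inode; };

/* A whiteout counts as "nothing there" for name resolution purposes. */
static int is_negative_or_whiteout(const struct dentry *d)
{
	return !d->d_inode || MOCK_S_ISWHT(d->d_inode->i_mode);
}

/* Creating is refused only if a real (non-whiteout) object exists. */
static int mock_may_create(const struct dentry *child)
{
	return is_negative_or_whiteout(child) ? 0 : -EEXIST;
}

/* Deleting a whiteout looks the same as deleting a missing name. */
static int mock_may_delete(const struct dentry *victim)
{
	return is_negative_or_whiteout(victim) ? -ENOENT : 0;
}
```

In the patch itself the create paths additionally remove the whiteout (via vfs_unlink_whiteout()) before calling into the filesystem; the sketch only covers the permission-style checks.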
[RFC 17/26] union-mount: Drive the union cache via dcache
If a dentry is removed from the dentry cache because its usage count drops to zero, the references to the underlying layer of the unions the dentry is in are dropped too. Therefore the union cache is driven by the dentry cache.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/dcache.c            |    8 +
 fs/union.c             |   72 +
 include/linux/dcache.h |    8 +
 include/linux/union.h  |    6
 4 files changed, 94 insertions(+)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -18,6 +18,7 @@
 #include <linux/string.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
+#include <linux/union.h>
 #include <linux/fsnotify.h>
 #include <linux/slab.h>
 #include <linux/init.h>
@@ -142,11 +143,14 @@ static struct dentry *__d_kill(struct de
 		list_add(&dentry->d_lru, list);
 		spin_unlock(&dentry->d_lock);
 		spin_unlock(&dcache_lock);
+		__shrink_d_unions(dentry, list);
 		return NULL;
 	}
 
 	/* drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
+	/* If the dentry was in an union delete them */
+	shrink_d_unions(dentry);
 	parent = dentry->d_parent;
 	d_free(dentry);
 	return dentry == parent ? NULL : parent;
@@ -721,6 +725,7 @@ static void shrink_dcache_for_umount_sub
 				iput(inode);
 			}
 
+			shrink_d_unions(dentry);
 			d_free(dentry);
 
 			/* finished when we fall off the top of the tree,
@@ -1464,7 +1469,9 @@ void d_delete(struct dentry * dentry)
 	spin_lock(&dentry->d_lock);
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	if (atomic_read(&dentry->d_count) == 1) {
+		__d_drop_unions(dentry);
 		dentry_iput(dentry);
+		shrink_d_unions(dentry);
 		fsnotify_nameremove(dentry, isdir);
 
 		/* remove this and other inotify debug checks after 2.6.18 */
@@ -1478,6 +1485,7 @@ void d_delete(struct dentry * dentry)
 
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
+	shrink_d_unions(dentry);
 	fsnotify_nameremove(dentry, isdir);
 }
 
--- a/fs/union.c
+++ b/fs/union.c
@@ -258,6 +258,8 @@ int append_to_union(struct vfsmount *mnt
 		union_put(this);
 		return 0;
 	}
+	list_add(&this->u_unions, &dentry->d_unions);
+	dest_dentry->d_unionized++;
 	__union_hash(this);
 	spin_unlock(&union_lock);
 	return 0;
@@ -333,3 +335,73 @@ int follow_union_mount(struct vfsmount *
 
 	return res;
 }
+
+/*
+ * This must be called when unhashing a dentry. This is called with dcache_lock
+ * and unhashes all unions this dentry is in.
+ */
+void __d_drop_unions(struct dentry *dentry)
+{
+	struct union_mount *this, *next;
+
+	spin_lock(&union_lock);
+	list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions)
+		__union_unhash(this);
+	spin_unlock(&union_lock);
+}
+
+/*
+ * This must be called after __d_drop_unions() without holding any locks.
+ * Note: The dentry might still be reachable via a lookup but at that time it
+ * already a negative dentry. Otherwise it would be unhashed. The union_mount
+ * structure itself is still reachable through mnt->mnt_unions (which we
+ * protect against with union_lock).
+ */
+void shrink_d_unions(struct dentry *dentry)
+{
+	struct union_mount *this, *next;
+
+repeat:
+	spin_lock(&union_lock);
+	list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
+		BUG_ON(!hlist_unhashed(&this->u_hash));
+		BUG_ON(!hlist_unhashed(&this->u_rhash));
+		list_del(&this->u_unions);
+		this->u_next.dentry->d_unionized--;
+		spin_unlock(&union_lock);
+		union_put(this);
+		goto repeat;
+	}
+	spin_unlock(&union_lock);
+}
+
+extern void __dput(struct dentry *, struct list_head *);
+
+/*
+ * This is the special variant for use in dput() only.
+ */
+void __shrink_d_unions(struct dentry *dentry, struct list_head *list)
+{
+	struct union_mount *this, *next;
+
+	BUG_ON(!d_unhashed(dentry));
+
+repeat:
+	spin_lock(&union_lock);
+	list_for_each_entry_safe(this, next, &dentry->d_unions, u_unions) {
+		struct dentry *n_dentry = this->u_next.dentry;
+		struct vfsmount *n_mnt = this->u_next.mnt;
+
+		BUG_ON(!hlist_unhashed(&this->u_hash));
+		BUG_ON(!hlist_unhashed(&this->u_rhash));
+		list_del(&this->u_unions);
+		this->u_next.dentry->d_unionized--;
+		spin_unlock(&union_lock);
+		if (__union_put(this)) {
+			__dput(n_dentry, list);
+			mntput(n_mnt);
+		}
+		goto repeat;
+	}
+	spin_unlock(&union_lock);
+}
--- a/include/linux
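The "goto repeat" loops in shrink_d_unions() follow a common pattern: detach one entry while holding the lock, drop the lock, release the entry (which may sleep or take other locks), then restart from the list head. A minimal userspace sketch of that pattern is given below. It is not the kernel code: the lock is a plain flag so the sketch can record whether a release ever ran with the lock held, and `struct node`, `struct owner`, `drain` and `release` are hypothetical names.

```c
#include <stddef.h>

struct node  { struct node *next; int released; };
struct owner { struct node *head; int locked; };

/* Mock lock: a flag is enough to observe the ordering in one thread. */
static void mock_lock(struct owner *o)   { o->locked = 1; }
static void mock_unlock(struct owner *o) { o->locked = 0; }

/* Counts releases that wrongly happened while the lock was held. */
static int released_under_lock;

static void release(struct owner *o, struct node *n)
{
	if (o->locked)
		released_under_lock++;
	n->released = 1;
}

/* Detach one entry under the lock, release it unlocked, start over. */
static int drain(struct owner *o)
{
	int n = 0;

	for (;;) {
		struct node *first;

		mock_lock(o);
		first = o->head;
		if (!first) {
			mock_unlock(o);
			break;
		}
		o->head = first->next;	/* unlink while still locked */
		mock_unlock(o);
		release(o, first);	/* the potentially heavy part */
		n++;
	}
	return n;
}
```

Restarting from the head after every release means the loop never relies on a saved "next" pointer that could have become stale while the lock was dropped.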
[RFC 01/26] [PATCH 14/18] shmem: convert to using splice instead of sendfile()
From: Hugh Dickins [EMAIL PROTECTED]

Remove shmem_file_sendfile and resurrect shmem_readpage, as used by tmpfs to support loop and sendfile in 2.4 and 2.5. Now tmpfs can support splice, loop and sendfile in the simplest way, using generic_file_splice_read and generic_file_splice_write (with the aid of shmem_prepare_write).

We could make some efficiency tweaks later, if there's a real need; but this is stable and works well as is.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 mm/shmem.c |   40
 1 file changed, 16 insertions(+), 24 deletions(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1109,8 +1109,8 @@ static int shmem_getpage(struct inode *i
 	 * Normally, filepage is NULL on entry, and either found
 	 * uptodate immediately, or allocated and zeroed, or read
 	 * in under swappage, which is then assigned to filepage.
-	 * But shmem_write_begin passes in a locked filepage,
-	 * which may be found not uptodate by other callers too,
+	 * But shmem_readpage and shmem_write_begin passes in a locked
+	 * filepage, which may be found not uptodate by other callers too,
 	 * and may need to be copied from the swappage read in.
 	 */
 repeat:
@@ -1454,9 +1454,18 @@ static const struct inode_operations shm
 static const struct inode_operations shmem_symlink_inline_operations;
 
 /*
- * Normally tmpfs makes no use of shmem_write_begin, but it
- * lets a tmpfs file be used read-write below the loop driver.
+ * Normally tmpfs avoids the use of shmem_readpage and shmem_write_begin;
+ * but providing them allows a tmpfs file to be used for splice, sendfile, and
+ * below the loop driver, in the generic fashion that many filesystems support.
  */
+static int shmem_readpage(struct file *file, struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	int error = shmem_getpage(inode, page->index, &page, SGP_CACHE, NULL);
+	unlock_page(page);
+	return error;
+}
+
 static int
 shmem_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
@@ -1701,25 +1710,6 @@ static ssize_t shmem_file_read(struct fi
 	return desc.error;
 }
 
-static ssize_t shmem_file_sendfile(struct file *in_file, loff_t *ppos,
-			 size_t count, read_actor_t actor, void *target)
-{
-	read_descriptor_t desc;
-
-	if (!count)
-		return 0;
-
-	desc.written = 0;
-	desc.count = count;
-	desc.arg.data = target;
-	desc.error = 0;
-
-	do_shmem_file_read(in_file, ppos, &desc, actor);
-	if (desc.written)
-		return desc.written;
-	return desc.error;
-}
-
 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2376,6 +2366,7 @@ static const struct address_space_operat
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
+	.readpage	= shmem_readpage,
 	.write_begin	= shmem_write_begin,
 	.write_end	= shmem_write_end,
 #endif
@@ -2389,7 +2380,8 @@ static const struct file_operations shme
 	.read		= shmem_file_read,
 	.write		= shmem_file_write,
 	.fsync		= simple_sync_file,
-	.sendfile	= shmem_file_sendfile,
+	.splice_read	= generic_file_splice_read,
+	.splice_write	= generic_file_splice_write,
 #endif
 };
 
[RFC 20/26] union-mount: Simple union-mount readdir implementation
This is a very simple union mount readdir implementation. It modifies the readdir routine to merge the entries of union mounted directories and eliminate duplicates while walking the union stack.

FIXME: This patch needs to be reworked! At the moment this only works for ext2 and tmpfs. All kinds of index directories that return d_off > i_size don't work with this.

The directory entries are read starting from the top layer and they are maintained in a cache. Subsequently, when the entries from the bottom layers of the union stack are read, they are checked for duplicates (in the cache) before being passed out to user space. There can be multiple calls to the readdir/getdents routines for reading the entries of a single directory. But the union directory cache is not maintained across these calls. Instead, for every call, the previously read entries are re-read into the cache and newly read entries are compared against these for duplicates before they are returned to user space.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
---
 fs/readdir.c          |   11 -
 fs/union.c            |  336 ++
 include/linux/union.h |   25 +++
 3 files changed, 364 insertions(+), 8 deletions(-)

--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -16,13 +16,14 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/unistd.h>
+#include <linux/union.h>
 
 #include <asm/uaccess.h>
 
 int vfs_readdir(struct file *file, filldir_t filler, void *buf)
 {
-	struct inode *inode = file->f_path.dentry->d_inode;
 	int res = -ENOTDIR;
+
 	if (!file->f_op || !file->f_op->readdir)
 		goto out;
 
@@ -30,13 +31,7 @@ int vfs_readdir(struct file *file, filld
 	if (res)
 		goto out;
 
-	mutex_lock(&inode->i_mutex);
-	res = -ENOENT;
-	if (!IS_DEADDIR(inode)) {
-		res = file->f_op->readdir(file, buf, filler);
-		file_accessed(file);
-	}
-	mutex_unlock(&inode->i_mutex);
+	res = do_readdir(file, buf, filler);
 out:
 	return res;
 }
--- a/fs/union.c
+++ b/fs/union.c
@@ -18,6 +18,8 @@
 #include <linux/hash.h>
 #include <linux/fs.h>
 #include <linux/union.h>
+#include <linux/module.h>
+#include <linux/file.h>
 
 /*
  * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
@@ -462,3 +464,337 @@ void detach_mnt_union(struct vfsmount *m
 	union_put(um);
 	return;
 }
+
+
+/*
+ * Union mounts support for readdir.
+ */
+
+/* This is a copy from fs/readdir.c */
+struct getdents_callback {
+	struct linux_dirent __user *current_dir;
+	struct linux_dirent __user *previous;
+	int count;
+	int error;
+};
+
+/* The readdir union cache object */
+struct union_cache_entry {
+	struct list_head list;
+	struct qstr name;
+};
+
+static int union_cache_add_entry(struct list_head *list,
+				 const char *name, int namelen)
+{
+	struct union_cache_entry *this;
+	char *tmp_name;
+
+	this = kmalloc(sizeof(*this), GFP_KERNEL);
+	if (!this) {
+		printk(KERN_CRIT
+		       "union_cache_add_entry(): out of kernel memory\n");
+		return -ENOMEM;
+	}
+
+	tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
+	if (!tmp_name) {
+		printk(KERN_CRIT
+		       "union_cache_add_entry(): out of kernel memory\n");
+		kfree(this);
+		return -ENOMEM;
+	}
+
+	this->name.name = tmp_name;
+	this->name.len = namelen;
+	this->name.hash = 0;
+	memcpy(tmp_name, name, namelen);
+	tmp_name[namelen] = 0;
+	INIT_LIST_HEAD(&this->list);
+	list_add(&this->list, list);
+	return 0;
+}
+
+static void union_cache_free(struct list_head *uc_list)
+{
+	struct list_head *p;
+	struct list_head *ptmp;
+	int count = 0;
+
+	list_for_each_safe(p, ptmp, uc_list) {
+		struct union_cache_entry *this;
+
+		this = list_entry(p, struct union_cache_entry, list);
+		list_del_init(&this->list);
+		kfree(this->name.name);
+		kfree(this);
+		count++;
+	}
+	return;
+}
+
+static int union_cache_find_entry(struct list_head *uc_list,
+				  const char *name, int namelen)
+{
+	struct union_cache_entry *p;
+	int ret = 0;
+
+	list_for_each_entry(p, uc_list, list) {
+		if (p->name.len != namelen)
+			continue;
+		if (strncmp(p->name.name, name, namelen) == 0) {
+			ret = 1;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * There are four filldir() wrapper necessary for the union mount readdir
+ * implementation:
+ *
+ * - filldir_topmost(): fills the union's readdir cache and the user space
+ *   buffer
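The duplicate elimination described in the patch text (remember every name seen in the upper layers, suppress it when a lower layer returns it again) can be sketched in userspace. This is an illustration under assumptions, not the kernel implementation: it uses `malloc` and a singly linked list instead of `kmalloc` and `struct list_head`, and `name_entry`, `cache_add`, `cache_find`, `cache_free` and `emit_unique` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* One remembered directory entry name. */
struct name_entry {
	struct name_entry *next;
	size_t len;
	char *name;
};

/* Remember a name; returns 0 on success, -1 if out of memory. */
static int cache_add(struct name_entry **head, const char *name, size_t len)
{
	struct name_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->name = malloc(len + 1);
	if (!e->name) {
		free(e);
		return -1;
	}
	memcpy(e->name, name, len);
	e->name[len] = '\0';
	e->len = len;
	e->next = *head;
	*head = e;
	return 0;
}

/* Returns 1 if the name was already seen, 0 otherwise. */
static int cache_find(const struct name_entry *head, const char *name,
		      size_t len)
{
	for (; head; head = head->next)
		if (head->len == len && memcmp(head->name, name, len) == 0)
			return 1;
	return 0;
}

/* Free every remembered name and reset the list head. */
static void cache_free(struct name_entry **head)
{
	while (*head) {
		struct name_entry *e = *head;

		*head = e->next;
		free(e->name);
		free(e);
	}
}

/*
 * Decide whether an entry should be passed on to the caller:
 * returns 1 if it is new (and is now remembered), 0 if it is a
 * duplicate from an upper layer, -1 on allocation failure.
 */
static int emit_unique(struct name_entry **seen, const char *name)
{
	size_t len = strlen(name);

	if (cache_find(*seen, name, len))
		return 0;
	if (cache_add(seen, name, len))
		return -1;
	return 1;
}
```

Feeding the names of the top layer first and the lower layers afterwards gives the top layer precedence. The linear search makes each lookup proportional to the number of names already seen, which fits the patch's own description of the approach as very simple.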
[RFC 16/26] union-mount: Introduce union_mount structure
This patch adds the basic structures of VFS based union mounts. It is a new implementation based on some of my old ideas that influenced Bharata B Rao [EMAIL PROTECTED], who came up with the proposal to let the union_mount struct only point to the next layer in the union stack. I rewrote nearly all of the central patches around lookup and the dcache interaction.

Advantages of the new implementation:
- the new union stack is no longer tied directly to one dentry
- the union stack enables dentries to be part of more than one union
  (bind mounts)
- it is unnecessary to traverse the union stack when de/referencing a dentry
- caching of union stack information is still driven by the dentry cache

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/Kconfig             |    8 +
 fs/Makefile            |    2
 fs/dcache.c            |    4
 fs/union.c             |  335 +
 include/linux/dcache.h |    9 +
 include/linux/union.h  |   61
 6 files changed, 419 insertions(+)

--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -551,6 +551,14 @@ config INOTIFY_USER
 
 	  If unsure, say Y.
 
+config UNION_MOUNT
+	bool "Union mount support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	---help---
+	  If you say Y here, you will be able to mount file systems as
+	  union mount stacks. This is a VFS based implementation and
+	  should work with all file systems. If unsure, say N.
+
 config QUOTA
 	bool "Quota support"
 	help
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.
 obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
 
+obj-$(CONFIG_UNION_MOUNT)	+= union.o
+
 obj-$(CONFIG_QUOTA)		+= dquot.o
 obj-$(CONFIG_QFMT_V1)		+= quota_v1.o
 obj-$(CONFIG_QFMT_V2)		+= quota_v2.o
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -985,6 +985,10 @@ struct dentry *d_alloc(struct dentry * p
 #ifdef CONFIG_PROFILING
 	dentry->d_cookie = NULL;
 #endif
+#ifdef CONFIG_UNION_MOUNT
+	INIT_LIST_HEAD(&dentry->d_unions);
+	dentry->d_unionized = 0;
+#endif
 	INIT_HLIST_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,335 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ *
+ * Author(s): Jan Blunck ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/hash.h>
+#include <linux/fs.h>
+#include <linux/union.h>
+
+/*
+ * This is borrowed from fs/inode.c. The hashtable for lookups. Somebody
+ * should try to make this good - I've just made it work.
+ */
+static unsigned int union_hash_mask __read_mostly;
+static unsigned int union_hash_shift __read_mostly;
+static struct hlist_head *union_hashtable __read_mostly;
+static unsigned int union_rhash_mask __read_mostly;
+static unsigned int union_rhash_shift __read_mostly;
+static struct hlist_head *union_rhashtable __read_mostly;
+
+/*
+ * Locking Rules:
+ * - dcache_lock (for union_rlookup() only)
+ * - union_lock
+ */
+DEFINE_SPINLOCK(union_lock);
+
+static struct kmem_cache *union_cache __read_mostly;
+
+static unsigned long hash(struct dentry *dentry, struct vfsmount *mnt)
+{
+	unsigned long tmp;
+
+	tmp = ((unsigned long)mnt * (unsigned long)dentry) ^
+		(GOLDEN_RATIO_PRIME + (unsigned long)mnt) / L1_CACHE_BYTES;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> union_hash_shift);
+	return tmp & union_hash_mask;
+}
+
+static __initdata unsigned long union_hash_entries;
+
+static int __init set_union_hash_entries(char *str)
+{
+	if (!str)
+		return 0;
+	union_hash_entries = simple_strtoul(str, &str, 0);
+	return 1;
+}
+
+__setup("union_hash_entries=", set_union_hash_entries);
+
+static int __init init_union(void)
+{
+	int loop;
+
+	union_cache = kmem_cache_create("union_mount",
+					sizeof(struct union_mount), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_PANIC,
+					NULL, NULL);
+
+	union_hashtable = alloc_large_system_hash("Union-cache",
+						  sizeof(struct hlist_head),
+						  union_hash_entries,
+						  14,
+						  0,
+						  &union_hash_shift
[RFC 18/26] union-mount: Changes to the namespace handling
Creates the proper struct union_mount when mounting something into a
union. If the topmost filesystem isn't capable of handling the white-out
filetype, it can only be mounted read-only.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namespace.c        |   46 ++--
 fs/union.c            |   57 ++
 include/linux/mount.h |    3 ++
 include/linux/union.h |    6 +
 4 files changed, 110 insertions(+), 2 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -25,6 +25,7 @@
 #include <linux/security.h>
 #include <linux/mount.h>
 #include <linux/ramfs.h>
+#include <linux/union.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include "pnode.h"
@@ -68,6 +69,9 @@ struct vfsmount *alloc_vfsmnt(const char
 		INIT_LIST_HEAD(&mnt->mnt_share);
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
+#ifdef CONFIG_UNION_MOUNT
+		INIT_LIST_HEAD(&mnt->mnt_unions);
+#endif
 		if (name) {
 			int size = strlen(name) + 1;
 			char *newname = kmalloc(size, GFP_KERNEL);
@@ -157,6 +161,7 @@ static void __touch_mnt_namespace(struct

 static void detach_mnt(struct vfsmount *mnt, struct nameidata *old_nd)
 {
+	detach_mnt_union(mnt);
 	old_nd->dentry = mnt->mnt_mountpoint;
 	old_nd->mnt = mnt->mnt_parent;
 	mnt->mnt_parent = mnt;
@@ -180,6 +185,7 @@ static void attach_mnt(struct vfsmount *
 	list_add_tail(&mnt->mnt_hash, mount_hashtable +
 			hash(nd->mnt, nd->dentry));
 	list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
+	attach_mnt_union(mnt, nd->mnt, nd->dentry);
 }

 /*
@@ -202,6 +208,7 @@ static void commit_tree(struct vfsmount
 	list_add_tail(&mnt->mnt_hash, mount_hashtable +
 				hash(parent, mnt->mnt_mountpoint));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+	attach_mnt_union(mnt, mnt->mnt_parent, mnt->mnt_mountpoint);
 	touch_mnt_namespace(n);
 }

@@ -577,6 +584,7 @@ void release_mounts(struct list_head *he
 			struct dentry *dentry;
 			struct vfsmount *m;
 			spin_lock(&vfsmount_lock);
+			detach_mnt_union(mnt);
 			dentry = mnt->mnt_mountpoint;
 			m = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
@@ -999,6 +1007,10 @@ static int do_change_type(struct nameida
 	if (nd->dentry != nd->mnt->mnt_root)
 		return -EINVAL;

+	/* Don't change the type of union mounts */
+	if (IS_MNT_UNION(nd->mnt))
+		return -EINVAL;
+
 	down_write(&namespace_sem);
 	spin_lock(&vfsmount_lock);
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
@@ -1011,7 +1023,8 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int flags)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags,
+		       int mnt_flags)
 {
 	int clone_flags = 0;
 	uid_t owner = 0;
@@ -1049,6 +1062,18 @@ static int do_loopback(struct nameidata
 	if (IS_ERR(mnt))
 		goto out;

+	/*
+	 * Unions couldn't be writable if the filesystem doesn't know about
+	 * whiteouts
+	 */
+	err = -ENOTSUPP;
+	if ((mnt_flags & MNT_UNION) &&
+	    !(mnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+		goto out;
+
+	if (mnt_flags & MNT_UNION)
+		mnt->mnt_flags |= MNT_UNION;
+
 	err = graft_tree(mnt, nd);
 	if (err) {
 		LIST_HEAD(umount_list);
@@ -1121,6 +1146,13 @@ static int do_move_mount(struct nameidat
 	if (err)
 		return err;

+	/* moving to or from a union mount is not supported */
+	err = -EINVAL;
+	if (IS_MNT_UNION(nd->mnt))
+		goto exit;
+	if (IS_MNT_UNION(old_nd.mnt))
+		goto exit;
+
 	down_write(&namespace_sem);
 	while (d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
 		;
@@ -1176,6 +1208,7 @@ out:
 	up_write(&namespace_sem);
 	if (!err)
 		path_release(&parent_nd);
+exit:
 	path_release(&old_nd);
 	return err;
 }
@@ -1253,6 +1286,15 @@ int do_add_mount(struct vfsmount *newmnt
 	if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
 		goto unlock;

+	/*
+	 * Unions couldn't be writable if the filesystem doesn't know about
+	 * whiteouts
+	 */
+	err = -ENOTSUPP;
+	if ((mnt_flags & MNT_UNION) &&
+	    !(newmnt->mnt_sb->s_flags & (MS_WHITEOUT|MS_RDONLY)))
+		goto unlock;
+
 	/* some flags may have been set earlier */
 	newmnt->mnt_flags |= mnt_flags;
 	if ((err = graft_tree(newmnt, nd)))
@@ -1579,7 +1621,7 @@ long do_mount(char
[RFC 15/26] union-mount: Add union-mount mount flag
Introduce MNT_UNION and MS_UNION flags. You need additional patches for
util-linux for that to work.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namespace.c        |    6 +-
 include/linux/fs.h    |    1 +
 include/linux/mount.h |    1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -437,6 +437,7 @@ static int show_vfsmnt(struct seq_file *
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
 		{ MNT_NOMNT, ",nomnt" },
+		{ MNT_UNION, ",union" },
 		{ 0, NULL }
 	};
 	struct proc_fs_info *fs_infop;
@@ -1558,9 +1559,12 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_RELATIME;
 	if (flags & MS_NOMNT)
 		mnt_flags |= MNT_NOMNT;
+	if (flags & MS_UNION)
+		mnt_flags |= MNT_UNION;

 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT);
+		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT |
+		   MS_UNION );

 	/* ... and get the mountpoint */
 	retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -114,6 +114,7 @@ extern int dir_notify_enable;
 #define MS_REMOUNT	32	/* Alter flags of a mounted FS */
 #define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
 #define MS_DIRSYNC	128	/* Directory modifications are synchronous */
+#define MS_UNION	256
 #define MS_NOATIME	1024	/* Do not update access times. */
 #define MS_NODIRATIME	2048	/* Do not update directory access times */
 #define MS_BIND		4096
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -36,6 +36,7 @@ struct mnt_namespace;
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
 #define MNT_PNODE_MASK	0x3000	/* propagation flag mask */
+#define MNT_UNION	0x4000	/* if the vfsmount is a union mount */

 struct vfsmount {
 	struct list_head mnt_hash;
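For readers skimming the diff, here is a minimal userspace sketch of the flag handling the do_mount() hunk adds: the user-visible MS_UNION bit is translated into the per-mount MNT_UNION bit and then masked out of the flags that travel on to the filesystem. The two constants are taken from the patch; the helper name toy_take_union_flag() is made up for this illustration and does not exist in the kernel.

```c
/* Flag values as introduced by the patch. */
#define MS_UNION   256UL     /* mount(2) flag passed in from user space */
#define MNT_UNION  0x4000UL  /* per-vfsmount flag kept by the VFS */

/*
 * Hypothetical helper (not a kernel function): consume MS_UNION from the
 * mount(2) flags and return the matching per-mount flag, mirroring what
 * the patched do_mount() does inline.
 */
unsigned long toy_take_union_flag(unsigned long *flags)
{
	unsigned long mnt_flags = 0;

	if (*flags & MS_UNION)
		mnt_flags |= MNT_UNION;

	/* Strip the bit so only the per-mount flag carries it further. */
	*flags &= ~MS_UNION;
	return mnt_flags;
}
```

Keeping the two flag namespaces apart is what lets show_vfsmnt() print ",union" per mount rather than per superblock.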
[RFC 26/26] union-mount: Debug code
Some debugging code itself.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namei.c            |   26 ++
 fs/union.c            |   27 +++
 include/linux/namei.h |    4
 3 files changed, 57 insertions(+)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -32,6 +32,7 @@
 #include <linux/fcntl.h>
 #include <linux/namei.h>
 #include <linux/union.h>
+#include <linux/union_debug.h>
 #include <asm/namei.h>
 #include <asm/uaccess.h>

@@ -1794,11 +1795,15 @@ int hash_lookup_union(struct nameidata *
 	struct path safe = { .dentry = nd->dentry, .mnt = nd->mnt };
 	int res ;

+	UM_DEBUG_LOOKUP("name = \"%*s\"\n", name->len, name->name);
+
 	pathget(&safe);
 	res = __hash_lookup_topmost(nd, name, path);
 	if (res)
 		goto out;

+	UM_DEBUG_LOOKUP_DENTRY(path->dentry);
+
 	/* only directories can be part of a union stack */
 	if (!path->dentry->d_inode ||
 	    !S_ISDIR(path->dentry->d_inode->i_mode))
@@ -1813,6 +1818,7 @@ int hash_lookup_union(struct nameidata *
 		goto out;
 	}

+	UM_DEBUG_LOOKUP_DENTRY(path->dentry);
 out:
 	path_release(nd);
 	nd->dentry = safe.dentry;
@@ -2765,6 +2771,8 @@ out_freename:
 	kfree(name.name);
 out:
 	pathput(&safe);
+	UM_DEBUG("err = %d\n", err);
+	UM_DEBUG_DENTRY(dentry);
 	return err;
 }

@@ -2802,6 +2810,9 @@ int vfs_unlink_whiteout(struct inode *di
 	}
 	mutex_unlock(&dentry->d_inode->i_mutex);

+	UM_DEBUG("err = %d\n", error);
+	UM_DEBUG_DENTRY(dentry);
+
 	/*
 	 * We can call dentry_iput() since nobody could actually do something
 	 * useful with a whiteout. So dropping the reference to the inode
@@ -3490,6 +3501,10 @@ int vfs_rename_union(struct nameidata *o
 	struct dentry *dentry;
 	int error;

+	UM_DEBUG_DENTRY(old->dentry);
+	UM_DEBUG_DENTRY(new->dentry);
+/*	return -EPERM; */
+
 	if (old->dentry->d_inode == new->dentry->d_inode)
 		return 0;

@@ -3530,6 +3545,9 @@ int vfs_rename_union(struct nameidata *o

 	/* possibly delete the existing new file */
 	if ((newnd->dentry == new->dentry->d_parent) && new->dentry->d_inode) {
+		UM_DEBUG("unlink:\n");
+		UM_DEBUG_DENTRY(new->dentry);
+
 		/* FIXME: inode may be truncated while we hold a lock */
 		error = vfs_unlink(new_dir, new->dentry);
 		if (error)
@@ -3540,6 +3558,9 @@ int vfs_rename_union(struct nameidata *o
 		if (IS_ERR(dentry))
 			goto freename;

+		UM_DEBUG("new target:\n");
+		UM_DEBUG_DENTRY(new->dentry);
+
 		dput(new->dentry);
 		new->dentry = dentry;
 	}
@@ -3554,6 +3575,10 @@ int vfs_rename_union(struct nameidata *o
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto freename;
+
+	UM_DEBUG("whiteout:\n");
+	UM_DEBUG_DENTRY(dentry);
+
 	error = vfs_whiteout(old_dir, dentry);
 	dput(dentry);

@@ -3567,6 +3592,7 @@ int vfs_rename_union(struct nameidata *o
 */
 freename:
 	kfree(old_name.name);
+	UM_DEBUG("err = %d\n", error);
 	return error;
 }

--- a/fs/union.c
+++ b/fs/union.c
@@ -18,6 +18,7 @@
 #include <linux/hash.h>
 #include <linux/fs.h>
 #include <linux/union.h>
+#include <linux/union_debug.h>
 #include <linux/module.h>
 #include <linux/file.h>
 #include <linux/mm.h>
@@ -253,6 +254,9 @@ int append_to_union(struct vfsmount *mnt

 	BUG_ON(!IS_MNT_UNION(mnt));

+	UM_DEBUG_DENTRY(dentry);
+	UM_DEBUG_DENTRY(dest_dentry);
+
 	this = union_alloc(dentry, mnt, dest_dentry, dest_mnt);
 	if (!this)
 		return -ENOMEM;
@@ -822,6 +826,8 @@ int union_relookup_topmost(struct nameid
 	char *kbuf, *name;
 	struct nameidata this;

+	UM_DEBUG_DENTRY(nd->dentry);
+
 	kbuf = (char *)__get_free_page(GFP_KERNEL);
 	if (!kbuf)
 		return -ENOMEM;
@@ -838,6 +844,7 @@ int union_relookup_topmost(struct nameid
 	path_release(nd);
 	nd->dentry = this.dentry;
 	nd->mnt = this.mnt;
+	UM_DEBUG_DENTRY(nd->dentry);

 	/*
 	 * the nd->flags should be unchanged
@@ -846,6 +853,7 @@ int union_relookup_topmost(struct nameid
 	nd->um_flags &= ~LAST_LOWLEVEL;
 free_page:
 	free_page((unsigned long)kbuf);
+	UM_DEBUG("err = %d\n", err);
 	return err;
 }

@@ -895,6 +903,8 @@ struct dentry *union_create_topmost(stru
 	if (IS_ERR(dentry))
 		goto out_unlock;

+	UM_DEBUG_DENTRY(dentry);
+
 	switch (mode & S_IFMT) {
 	case S_IFREG:
 		/*
@@ -916,6 +926,9 @@ struct dentry *union_create_topmost(stru
 			dentry = ERR_PTR(res);
 			goto out_unlock;
 		}
+
+		UM_DEBUG_DENTRY(dentry);
+
 		break;
 	case S_IFDIR:
 		res = vfs_mkdir
[RFC 19/26] union-mount: Make lookup work for union-mounted file systems
On union-mounted file systems the lookup function must also visit lower
layers of the union-stack when doing a lookup. This patch adds support for
union-mounts to cached lookups and real lookups.

We have 3 different styles of lookup functions now:
- multiple pathname components, follow mounts, follow union, follow symlinks
- single pathname component, doesn't follow mounts, follow union, doesn't
  follow symlinks
- single pathname component, doesn't follow mounts, doesn't follow unions,
  doesn't follow symlinks

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namei.c            |  467 +-
 include/linux/namei.h |    6
 2 files changed, 465 insertions(+), 8 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -31,6 +31,7 @@
 #include <linux/file.h>
 #include <linux/fcntl.h>
 #include <linux/namei.h>
+#include <linux/union.h>
 #include <asm/namei.h>
 #include <asm/uaccess.h>

@@ -415,6 +416,167 @@ static struct dentry *cache_lookup(struc
 }

 /*
+ * cache_lookup_topmost - lookup the topmost (non-)negative dentry
+ *
+ * This is used for union mount lookups from dcache. The first non-negative
+ * dentry is searched on all layers of the union stack. Otherwise the topmost
+ * negative dentry is return.
+ */
+static int __cache_lookup_topmost(struct nameidata *nd, struct qstr *name,
+				  struct path *path)
+{
+	struct dentry *dentry;
+
+	dentry = d_lookup(nd->dentry, name);
+	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
+		dentry = do_revalidate(dentry, nd);
+
+	/*
+	 * Remember the topmost negative dentry in case we don't find anything
+	 */
+	path->dentry = dentry;
+	path->mnt = dentry ? nd->mnt : NULL;
+
+	if (!dentry || dentry->d_inode)
+		return !dentry;
+
+	/* look for the first non-negative dentry */
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		dentry = d_hash_and_lookup(nd->dentry, name);
+
+		/*
+		 * If parts of the union stack are not in the dcache we need
+		 * to do a real lookup
+		 */
+		if (!dentry)
+			goto out_dput;
+
+		/*
+		 * If parts of the union don't survive the revalidation we
+		 * need to do a real lookup
+		 */
+		if (dentry->d_op && dentry->d_op->d_revalidate) {
+			dentry = do_revalidate(dentry, nd);
+			if (!dentry)
+				goto out_dput;
+		}
+
+		if (dentry->d_inode)
+			goto out_dput;
+
+		dput(dentry);
+	}
+
+	return !dentry;
+
+out_dput:
+	dput(path->dentry);
+	path->dentry = dentry;
+	path->mnt = dentry ? mntget(nd->mnt) : NULL;
+	return !dentry;
+}
+
+/*
+ * cache_lookup_union - lookup the rest of the union stack
+ *
+ * This is called after you have the topmost dentry in @path.
+ */
+static int __cache_lookup_union(struct nameidata *nd, struct qstr *name,
+				struct path *path)
+{
+	struct path last = *path;
+	struct dentry *dentry;
+
+	while (follow_union_down(&nd->mnt, &nd->dentry)) {
+		dentry = d_hash_and_lookup(nd->dentry, name);
+		if (!dentry)
+			return 1;
+
+		if (dentry->d_op && dentry->d_op->d_revalidate) {
+			dentry = do_revalidate(dentry, nd);
+			if (!dentry)
+				return 1;
+		}
+
+		if (!dentry->d_inode) {
+			dput(dentry);
+			continue;
+		}
+
+		/* only directories can be part of a union stack */
+		if (!S_ISDIR(dentry->d_inode->i_mode)) {
+			dput(dentry);
+			break;
+		}
+
+		/* now we know we found something real */
+		append_to_union(last.mnt, last.dentry, nd->mnt, dentry);
+
+		if (last.dentry != path->dentry)
+			pathput(&last);
+		last.dentry = dentry;
+		last.mnt = mntget(nd->mnt);
+	}
+
+	if (last.dentry != path->dentry)
+		pathput(&last);
+
+	return 0;
+}
+
+/*
+ * cache_lookup - lookup a single pathname part from dcache
+ *
+ * This is a union mount capable version of what d_lookup() & revalidate()
+ * would do. This function returns a valid (union) dentry on success.
+ *
+ * Remember: On failure it means that parts of the union aren't cached. You
+ * should call real_lookup() afterwards to find the proper (union) dentry.
+ */
+static int cache_lookup_union(struct nameidata *nd, struct qstr *name,
+			      struct path *path)
+{
+	int res ;
+
+	if (!IS_MNT_UNION(nd->mnt)) {
+		path->dentry = cache_lookup(nd
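The lookup rule described in this patch (walk the union stack from the top, stop at the first non-negative entry, and treat a whiteout as "no such file") can be sketched in plain userspace C. This is only a toy model under stated assumptions: the toy_* types and functions are invented for illustration, layers are flat name tables rather than dentry trees, and nothing here corresponds to the dcache or revalidation logic in the patch.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical model: what a single layer knows about a name. */
enum toy_kind { TOY_NEGATIVE = 0, TOY_FILE, TOY_DIR, TOY_WHITEOUT };

struct toy_entry {
	const char *name;
	enum toy_kind kind;
};

struct toy_layer {
	const struct toy_entry *entries;
	int count;
};

/* Look a name up in one layer only; TOY_NEGATIVE means "not present". */
enum toy_kind toy_layer_lookup(const struct toy_layer *layer, const char *name)
{
	for (int i = 0; i < layer->count; i++)
		if (strcmp(layer->entries[i].name, name) == 0)
			return layer->entries[i].kind;
	return TOY_NEGATIVE;
}

/*
 * Walk the stack top-down (index 0 is the topmost layer). Returns the
 * index of the layer that provides the name, or -ENOENT if every layer
 * is negative or a whiteout hides the name.
 */
int toy_union_lookup(const struct toy_layer *stack, int depth, const char *name)
{
	for (int i = 0; i < depth; i++) {
		enum toy_kind kind = toy_layer_lookup(&stack[i], name);

		if (kind == TOY_NEGATIVE)
			continue;	/* keep descending the stack */
		if (kind == TOY_WHITEOUT)
			return -ENOENT;	/* hides anything in lower layers */
		return i;		/* first non-negative entry wins */
	}
	return -ENOENT;
}
```

The real code additionally has to fall back to real_lookup() when a layer is not cached, which the sketch leaves out.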
[RFC 21/26] union-mount: in-kernel file copy between union mounted filesystems
This patch introduces in-kernel file copy between union mounted
filesystems. When a file is opened for writing but resides on a lower (thus
read-only) layer of the union stack it is copied to the topmost union layer
first.

This patch uses the do_splice() for doing the in-kernel file copy.

Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namei.c            |   73 ++-
 fs/union.c            |  312 ++
 include/linux/union.h |    9 +
 3 files changed, 389 insertions(+), 5 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -994,7 +994,7 @@ static int __follow_mount(struct path *p
 	return res;
 }

-static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
+void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
 {
 	while (d_mountpoint(*dentry)) {
 		struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
@@ -1213,6 +1213,21 @@ static fastcall int __link_path_walk(con
 		if (err)
 			break;

+		if ((nd->flags & LOOKUP_TOPMOST) &&
+		    (nd->um_flags & LAST_LOWLEVEL)) {
+			struct dentry *dentry;
+
+			dentry = union_create_topmost(nd, &this, &next);
+			if (IS_ERR(dentry)) {
+				err = PTR_ERR(dentry);
+				goto out_dput;
+			}
+			dput_path(&next, nd);
+			next.mnt = nd->mnt;
+			next.dentry = dentry;
+			nd->um_flags &= ~LAST_LOWLEVEL;
+		}
+
 		err = -ENOENT;
 		inode = next.dentry->d_inode;
 		if (!inode || S_ISWHT(inode->i_mode))
@@ -1267,6 +1282,22 @@ last_component:
 		err = do_lookup(nd, &this, &next);
 		if (err)
 			break;
+
+		if ((nd->flags & LOOKUP_TOPMOST) &&
+		    (nd->um_flags & LAST_LOWLEVEL)) {
+			struct dentry *dentry;
+
+			dentry = union_create_topmost(nd, &this, &next);
+			if (IS_ERR(dentry)) {
+				err = PTR_ERR(dentry);
+				goto out_dput;
+			}
+			dput_path(&next, nd);
+			next.mnt = nd->mnt;
+			next.dentry = dentry;
+			nd->um_flags &= ~LAST_LOWLEVEL;
+		}
+
 		inode = next.dentry->d_inode;
 		if ((lookup_flags & LOOKUP_FOLLOW)
 		    && inode && inode->i_op && inode->i_op->follow_link) {
@@ -1755,7 +1786,7 @@ out:
 	return err;
 }

-static int hash_lookup_union(struct nameidata *nd, struct qstr *name,
+int hash_lookup_union(struct nameidata *nd, struct qstr *name,
 		      struct path *path)
 {
 	struct path safe = { .dentry = nd->dentry, .mnt = nd->mnt };
@@ -2169,6 +2200,11 @@ int open_namei(int dfd, const char *path
 					 nd, flag);
 		if (error)
 			return error;
+		if (flag & FMODE_WRITE) {
+			error = union_copyup(nd, flag);
+			if (error)
+				return error;
+		}
 		goto ok;
 	}

@@ -2188,6 +2224,16 @@ int open_namei(int dfd, const char *path
 	if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
 		goto exit;

+	/*
+	 * If this dentry is on an union mount we need the topmost dentry here.
+	 * This creates all topmost directories on the path to this dentry too.
+	 */
+	if (is_unionized(nd->dentry, nd->mnt)) {
+		error = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+		if (error)
+			goto exit;
+	}
+
 	dir = nd->dentry;
 	nd->flags &= ~LOOKUP_PARENT;
 	mutex_lock(&dir->d_inode->i_mutex);
@@ -2235,10 +2281,21 @@ do_last:
 	if (path.dentry->d_inode->i_op && path.dentry->d_inode->i_op->follow_link)
 		goto do_link;

-	path_to_nameidata(&path, nd);
 	error = -EISDIR;
 	if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
-		goto exit;
+		goto exit_dput;
+
+	/*
+	 * If this file is on a lower layer of the union stack, copy it to the
+	 * topmost layer before opening it
+	 */
+	if (path.dentry->d_inode && (path.dentry->d_parent != dir)) {
+		error = __union_copyup(&path, nd, &path);
+		if (error)
+			goto exit_dput;
+	}
+
+	path_to_nameidata(&path, nd);
 ok:
 	error = may_open(nd, acc_mode, flag);
 	if (error)
@@ -3437,9 +3494,15 @@ static int do_rename(int olddfd, const c
 	error = -ENOTEMPTY;
 	if (new.dentry
[RFC 25/26] union-mount: Debug Infrastructure
This adds debugfs/relay based debugging infrastructure helpful when doing
development of the union-mount code itself. The debugging output can be
enabled during runtime by:

 echo 1 > /proc/sys/fs/union-debug

This registers the relayfs files where the debug code is writing its output
to. There are different levels of debugging output available which can be
ORed together. For the valid sysctl values see include/linux/union_debug.h.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 include/linux/union_debug.h |   91 ++
 lib/Kconfig.debug           |    9 +
 lib/Makefile                |    2
 lib/union_debug.c           |  268
 4 files changed, 370 insertions(+)

--- /dev/null
+++ b/include/linux/union_debug.h
@@ -0,0 +1,91 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007 Novell Inc.
+ *   Author(s): Jan Blunck ([EMAIL PROTECTED])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_DEBUG_H
+#define __LINUX_UNION_DEBUG_H
+
+#ifdef __KERNEL__
+
+#ifdef CONFIG_DEBUG_UNION_MOUNT
+
+#include <linux/sched.h>
+
+/* This is taken from klog debugging facility */
+extern void klog(const void *data, int len);
+extern void klog_printk(const char *fmt, ...);
+extern void klog_printk_dentry(const char *func, struct dentry *dentry);
+
+extern int sysctl_union_debug;
+
+#define UNION_MOUNT_DEBUG		1
+#define UNION_MOUNT_DEBUG_DCACHE	2
+#define UNION_MOUNT_DEBUG_LOCK		4
+#define UNION_MOUNT_DEBUG_READDIR	8
+#define UNION_MOUNT_DEBUG_LOOKUP	16
+
+#define UM_DEBUG(fmt, args...)					\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG)		\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_DENTRY(dentry)					\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG)		\
+		klog_printk_dentry(__FUNCTION__, (dentry));	\
+} while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)				\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_DCACHE)	\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_DCACHE_DENTRY(dentry)				\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_DCACHE)	\
+		klog_printk_dentry(__FUNCTION__, (dentry));	\
+} while (0)
+#define UM_DEBUG_LOCK(fmt, args...)				\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOCK)	\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_READDIR(fmt, args...)				\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_READDIR)	\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_LOOKUP(fmt, args...)				\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOOKUP)	\
+		klog_printk("%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(dentry)				\
+do {								\
+	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOOKUP)	\
+		klog_printk_dentry(__FUNCTION__, (dentry));	\
+} while (0)
+
+#else /* CONFIG_DEBUG_UNION_MOUNT */
+
+#define UM_DEBUG(fmt, args...)			do { /* empty */ } while (0)
+#define UM_DEBUG_DENTRY(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE_DENTRY(fmt, args...)	do { /* empty */ } while (0)
+#define UM_DEBUG_LOCK(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_READDIR(fmt, args...)		do { /* empty */ } while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(fmt, args...)	do { /* empty */ } while (0)
+#define UM_DEBUG_LOOKUP_DENTRY(fmt, args...)	do { /* empty */ } while
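To make the "levels which can be ORed together" point concrete, here is a small userspace sketch of the same macro pattern: a global bitmask gates each class of debug message. The two level constants and the name sysctl_union_debug come from the patch; toy_log(), toy_log_count and toy_lookup_traced() are stand-ins invented for this example (the real code calls klog_printk(), which is not available outside the kernel). Like the original, it relies on the GNU `##` variadic-macro extension, so it assumes gcc or clang.

```c
#include <stdarg.h>
#include <stdio.h>

/* Level bits as defined in the patch. */
#define UNION_MOUNT_DEBUG		1
#define UNION_MOUNT_DEBUG_LOOKUP	16

/* In the kernel this is a sysctl; here it is just a global. */
int sysctl_union_debug;

/* Hypothetical stand-in for klog_printk(): print and count messages. */
int toy_log_count;

void toy_log(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	toy_log_count++;
}

/* Same shape as the patch's macro: emit only if the level bit is set. */
#define UM_DEBUG_LOOKUP(fmt, ...)					\
do {									\
	if (sysctl_union_debug & UNION_MOUNT_DEBUG_LOOKUP)		\
		toy_log("%s: " fmt, __func__, ##__VA_ARGS__);		\
} while (0)

/* Hypothetical traced function; returns how many messages were logged. */
int toy_lookup_traced(const char *name)
{
	UM_DEBUG_LOOKUP("name = \"%s\"\n", name);
	return toy_log_count;
}
```

Setting the mask to UNION_MOUNT_DEBUG | UNION_MOUNT_DEBUG_LOOKUP would enable both the general and the lookup messages, which is what writing 17 to the sysctl is meant to achieve.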
[RFC 06/26] VFS: Make real_lookup() return a struct path
This patch changes real_lookup() into returning a struct path.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namei.c |   77 ++---
 1 file changed, 48 insertions(+), 29 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -462,10 +462,11 @@ ok:
  * make sure that nobody added the entry to the dcache in the meantime..
  * SMP-safe
  */
-static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+static int real_lookup(struct nameidata *nd, struct qstr *name,
+		       struct path *path)
 {
-	struct dentry * result;
-	struct inode *dir = parent->d_inode;
+	struct inode *dir = nd->dentry->d_inode;
+	int res = 0;

 	mutex_lock(&dir->i_mutex);
 	/*
@@ -482,19 +483,27 @@ static struct dentry * real_lookup(struc
 	 *
 	 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
 	 */
-	result = d_lookup(parent, name);
-	if (!result) {
-		struct dentry * dentry = d_alloc(parent, name);
-		result = ERR_PTR(-ENOMEM);
+	path->dentry = d_lookup(nd->dentry, name);
+	path->mnt = nd->mnt;
+	if (!path->dentry) {
+		struct dentry *dentry = d_alloc(nd->dentry, name);
 		if (dentry) {
-			result = dir->i_op->lookup(dir, dentry, nd);
-			if (result)
+			path->dentry = dir->i_op->lookup(dir, dentry, nd);
+			if (path->dentry) {
 				dput(dentry);
-			else
-				result = dentry;
+				if (IS_ERR(path->dentry)) {
+					res = PTR_ERR(path->dentry);
+					path->dentry = NULL;
+					path->mnt = NULL;
+				}
+			} else
+				path->dentry = dentry;
+		} else {
+			res = -ENOMEM;
+			path->mnt = NULL;
 		}
 		mutex_unlock(&dir->i_mutex);
-		return result;
+		return res;
 	}

 	/*
@@ -502,12 +511,20 @@ static struct dentry * real_lookup(struc
 	 * we waited on the semaphore. Need to revalidate.
 	 */
 	mutex_unlock(&dir->i_mutex);
-	if (result->d_op && result->d_op->d_revalidate) {
-		result = do_revalidate(result, nd);
-		if (!result)
-			result = ERR_PTR(-ENOENT);
+	if (path->dentry->d_op && path->dentry->d_op->d_revalidate) {
+		path->dentry = do_revalidate(path->dentry, nd);
+		if (!path->dentry) {
+			res = -ENOENT;
+			path->mnt = NULL;
+		}
+		if (IS_ERR(path->dentry)) {
+			res = PTR_ERR(path->dentry);
+			path->dentry = NULL;
+			path->mnt = NULL;
+		}
 	}
-	return result;
+
+	return res;
 }

 static int __emul_lookup_dentry(const char *, struct nameidata *);
@@ -748,35 +765,37 @@ static __always_inline void follow_dotdo
 static int do_lookup(struct nameidata *nd, struct qstr *name,
 		     struct path *path)
 {
-	struct vfsmount *mnt = nd->mnt;
-	struct dentry *dentry = __d_lookup(nd->dentry, name);
+	int err;

-	if (!dentry)
+	path->dentry = __d_lookup(nd->dentry, name);
+	path->mnt = nd->mnt;
+	if (!path->dentry)
 		goto need_lookup;
-	if (dentry->d_op && dentry->d_op->d_revalidate)
+	if (path->dentry->d_op && path->dentry->d_op->d_revalidate)
 		goto need_revalidate;
+
 done:
-	path->mnt = mnt;
-	path->dentry = dentry;
 	__follow_mount(path);
 	return 0;

 need_lookup:
-	dentry = real_lookup(nd->dentry, name, nd);
-	if (IS_ERR(dentry))
+	err = real_lookup(nd, name, path);
+	if (err)
 		goto fail;
 	goto done;

 need_revalidate:
-	dentry = do_revalidate(dentry, nd);
-	if (!dentry)
+	path->dentry = do_revalidate(path->dentry, nd);
+	if (!path->dentry)
 		goto need_lookup;
-	if (IS_ERR(dentry))
+	if (IS_ERR(path->dentry)) {
+		err = PTR_ERR(path->dentry);
 		goto fail;
+	}
 	goto done;

 fail:
-	return PTR_ERR(dentry);
+	return err;
 }

 /*
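The calling-convention change in this patch (from "return a dentry pointer that may encode an errno" to "return an int and fill in a struct path") can be illustrated with a short userspace sketch. The ERR_PTR/PTR_ERR/IS_ERR helpers below are re-implemented along the lines of the kernel's linux/err.h so the example compiles on its own; the toy_* functions and the simplified struct definitions are assumptions made for illustration, not kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures. */
struct dentry { const char *name; };
struct vfsmount { int id; };
struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

/* Userspace re-implementation of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Old style: the result is either a valid dentry or an encoded errno. */
struct dentry *toy_lookup(const char *name)
{
	static struct dentry known = { "etc" };

	if (strcmp(name, known.name) == 0)
		return &known;
	return ERR_PTR(-ENOENT);
}

/*
 * New style: return 0 or a negative errno and hand back the
 * <vfsmount, dentry> pair through @path. On failure both members are
 * NULL, so a caller can never pair a stale mnt with a bad dentry.
 */
int toy_lookup_path(struct vfsmount *mnt, const char *name, struct path *path)
{
	struct dentry *dentry = toy_lookup(name);

	if (IS_ERR(dentry)) {
		path->dentry = NULL;
		path->mnt = NULL;
		return (int)PTR_ERR(dentry);
	}
	path->dentry = dentry;
	path->mnt = mnt;
	return 0;
}
```

Returning the pair together is what the union-mount lookups later in the series depend on, since the dentry alone does not say which layer it came from.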
[RFC 04/26] VFS: Make lookup_create() return a struct path
This patch changes lookup_create() into returning a struct path.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 arch/powerpc/platforms/cell/spufs/inode.c |   15 ++
 fs/namei.c                                |   75 +-
 include/linux/dcache.h                    |    1
 include/linux/namei.h                     |    1
 net/unix/af_unix.c                        |   17 +++---
 5 files changed, 50 insertions(+), 59 deletions(-)

--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -456,7 +456,7 @@ static struct file_system_type spufs_typ

 long spufs_create(struct nameidata *nd, unsigned int flags, mode_t mode)
 {
-	struct dentry *dentry;
+	struct path path;
 	int ret;

 	ret = -EINVAL;
@@ -475,26 +475,25 @@ long spufs_create(struct nameidata *nd,
 		goto out;
 	}

-	dentry = lookup_create(nd, 1);
-	ret = PTR_ERR(dentry);
-	if (IS_ERR(dentry))
+	ret = lookup_create(nd, 1, &path);
+	if (ret)
 		goto out_dir;

 	ret = -EEXIST;
-	if (dentry->d_inode)
+	if (path.dentry->d_inode)
 		goto out_dput;

 	mode &= ~current->fs->umask;

 	if (flags & SPU_CREATE_GANG)
 		return spufs_create_gang(nd->dentry->d_inode,
-					 dentry, nd->mnt, mode);
+					 path.dentry, path.mnt, mode);
 	else
 		return spufs_create_context(nd->dentry->d_inode,
-					 dentry, nd->mnt, flags, mode);
+					    path.dentry, path.mnt, flags, mode);

 out_dput:
-	dput(dentry);
+	dput_path(&path, nd);
 out_dir:
 	mutex_unlock(&nd->dentry->d_inode->i_mutex);
 out:
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1833,10 +1833,9 @@ do_link:
  *
  * Returns with nd->dentry->d_inode->i_mutex locked.
  */
-struct dentry *lookup_create(struct nameidata *nd, int is_dir)
+int lookup_create(struct nameidata *nd, int is_dir, struct path *path)
 {
-	struct path path = { .dentry = ERR_PTR(-EEXIST) } ;
-	int err;
+	int err = -EEXIST;

 	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	/*
@@ -1852,11 +1851,9 @@ struct dentry *lookup_create(struct name
 	/*
 	 * Do the final lookup.
 	 */
-	err = lookup_hash(nd, &nd->last, &path);
-	if (err) {
-		path.dentry = ERR_PTR(err);
+	err = lookup_hash(nd, &nd->last, path);
+	if (err)
 		goto fail;
-	}

 	/*
 	 * Special case - lookup gave negative, but... we had foo/bar/
@@ -1864,16 +1861,14 @@ struct dentry *lookup_create(struct name
 	 * all is fine. Let's be bastards - you had / on the end, you've
 	 * been asking for (non-existent) directory. -ENOENT for you.
 	 */
-	if (!is_dir && nd->last.name[nd->last.len] && !path.dentry->d_inode)
+	if (!is_dir && nd->last.name[nd->last.len] && !path->dentry->d_inode)
 		goto enoent;
-	if (nd->mnt != path.mnt)
-		mntput(path.mnt);
-	return path.dentry;
+	return 0;
 enoent:
-	dput_path(&path, nd);
-	path.dentry = ERR_PTR(-ENOENT);
+	dput_path(path, nd);
+	err = -ENOENT;
 fail:
-	return path.dentry;
+	return err;
 }
 EXPORT_SYMBOL_GPL(lookup_create);

@@ -1906,7 +1901,7 @@ asmlinkage long sys_mknodat(int dfd, con
 {
 	int error = 0;
 	char * tmp;
-	struct dentry * dentry;
+	struct path path;
 	struct nameidata nd;

 	if (S_ISDIR(mode))
@@ -1918,22 +1913,23 @@ asmlinkage long sys_mknodat(int dfd, con
 	error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
 	if (error)
 		goto out;
-	dentry = lookup_create(&nd, 0);
-	error = PTR_ERR(dentry);
+	error = lookup_create(&nd, 0, &path);

 	if (!IS_POSIXACL(nd.dentry->d_inode))
 		mode &= ~current->fs->umask;
-	if (!IS_ERR(dentry)) {
+	if (!error) {
 		switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
-			error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
+			error = vfs_create(nd.dentry->d_inode, path.dentry,
+					   mode, &nd);
 			break;
 		case S_IFCHR: case S_IFBLK:
-			error = vfs_mknod(nd.dentry->d_inode,dentry,mode,
-					new_decode_dev(dev));
+			error = vfs_mknod(nd.dentry->d_inode, path.dentry,
+					  mode, new_decode_dev(dev));
 			break;
 		case S_IFIFO: case S_IFSOCK:
-			error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
+			error = vfs_mknod(nd.dentry->d_inode, path.dentry,
+					  mode, 0
[RFC 23/26] union-mount: copyup on rename
Add copyup renaming of regular files on union mounts. Directories are still
lazily copied with the help of user-space.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namei.c |  133 -
 fs/union.c |    8 ++-
 2 files changed, 129 insertions(+), 12 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1491,6 +1491,8 @@ static int fastcall do_path_lookup(int d
 		nd->mnt = mntget(fs->pwdmnt);
 		nd->dentry = dget(fs->pwd);
 		read_unlock(&fs->lock);
+		/* Force a union_relookup() */
+		nd->um_flags = LAST_LOWLEVEL;
 	} else {
 		struct dentry *dentry;

@@ -3478,6 +3480,97 @@ int vfs_rename(struct inode *old_dir, st
 	return error;
 }

+int vfs_rename_union(struct nameidata *oldnd, struct path *old,
+		     struct nameidata *newnd, struct path *new)
+{
+	struct inode *old_dir = oldnd->dentry->d_inode;
+	struct inode *new_dir = newnd->dentry->d_inode;
+	struct qstr old_name;
+	char *name;
+	struct dentry *dentry;
+	int error;
+
+	if (old->dentry->d_inode == new->dentry->d_inode)
+		return 0;
+
+	error = may_whiteout(old->dentry, 0);
+	if (error)
+		return error;
+	if (!old_dir->i_op || !old_dir->i_op->whiteout)
+		return -EPERM;
+
+	if (!new->dentry->d_inode)
+		error = may_create(new_dir, new->dentry, NULL);
+	else
+		error = may_delete(new_dir, new->dentry, 0);
+	if (error)
+		return error;
+
+	DQUOT_INIT(old_dir);
+	DQUOT_INIT(new_dir);
+
+	error = security_inode_rename(old_dir, old->dentry,
+				      new_dir, new->dentry);
+	if (error)
+		return error;
+
+	error = -EBUSY;
+	if (d_mountpoint(old->dentry) || d_mountpoint(new->dentry))
+		return error;
+
+	error = -ENOMEM;
+	name = kmalloc(old->dentry->d_name.len, GFP_KERNEL);
+	if (!name)
+		return error;
+	strncpy(name, old->dentry->d_name.name, old->dentry->d_name.len);
+	name[old->dentry->d_name.len] = 0;
+	old_name.len = old->dentry->d_name.len;
+	old_name.hash = old->dentry->d_name.hash;
+	old_name.name = name;
+
+	/* possibly delete the existing new file */
+	if ((newnd->dentry == new->dentry->d_parent) && new->dentry->d_inode) {
+		/* FIXME: inode may be truncated while we hold a lock */
+		error = vfs_unlink(new_dir, new->dentry);
+		if (error)
+			goto freename;
+
+		dentry = __lookup_hash_kern(&new->dentry->d_name,
+					    newnd->dentry, newnd);
+		if (IS_ERR(dentry))
+			goto freename;
+
+		dput(new->dentry);
+		new->dentry = dentry;
+	}
+
+	/* copyup to the new file */
+	error = __union_copyup(old, newnd, new);
+	if (error)
+		goto freename;
+
+	/* whiteout the old file */
+	dentry = __lookup_hash_kern(&old_name, oldnd->dentry, oldnd);
+	error = PTR_ERR(dentry);
+	if (IS_ERR(dentry))
+		goto freename;
+	error = vfs_whiteout(old_dir, dentry);
+	dput(dentry);
+
+	/* FIXME: This is acutally unlink() & create() ... */
+/*
+	if (!error) {
+		const char *new_name = old_dentry->d_name.name;
+		fsnotify_move(old_dir, new_dir, old_name.name, new_name, 0,
+			      new_dentry->d_inode, old_dentry->d_inode);
+	}
+*/
+freename:
+	kfree(old_name.name);
+	return error;
+}
+
+
 static int do_rename(int olddfd, const char *oldname,
 			int newdfd, const char *newname)
 {
@@ -3495,10 +3588,7 @@ static int do_rename(int olddfd, const c
 	if (error)
 		goto exit1;

-	error = -EXDEV;
-	if (oldnd.mnt != newnd.mnt)
-		goto exit2;
-
+lock:
 	old_dir = oldnd.dentry;
 	error = -EBUSY;
 	if (oldnd.last_type != LAST_NORM)
@@ -3536,15 +3626,40 @@ static int do_rename(int olddfd, const c
 	error = -ENOTEMPTY;
 	if (new.dentry == trap)
 		goto exit5;
-	/* renaming on unions is done by the user-space */
+	/* renaming of directories on unions is done by the user-space */
 	error = -EXDEV;
-	if (is_unionized(oldnd.dentry, oldnd.mnt))
+	if (is_unionized(oldnd.dentry, oldnd.mnt) &&
+	    S_ISDIR(old.dentry->d_inode->i_mode))
 		goto exit5;
-	if (is_unionized(newnd.dentry, newnd.mnt))
+	/* renameing of other files on unions is done by copyup */
+	if ((is_unionized(oldnd.dentry, oldnd.mnt) &&
+	     (oldnd.um_flags & LAST_LOWLEVEL)) ||
+	    (is_unionized(newnd.dentry, newnd.mnt) &&
+	     (newnd.um_flags & LAST_LOWLEVEL))) {
+		dput_path(&new
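The core of vfs_rename_union() is two steps: copy the file up to the topmost layer under its new name, then leave a whiteout under the old name so the lower-layer copy stays hidden. Below is a minimal userspace model of that sequence. Everything in it is an assumption made for illustration: the toy_* names, the fixed two-layer union, and the flat name tables are invented here, and permission checks, locking and the actual data copy are left out.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX 8

enum toy_kind { TOY_FILE = 1, TOY_WHITEOUT };

struct toy_entry { char name[32]; enum toy_kind kind; };
struct toy_layer { struct toy_entry e[TOY_MAX]; int n; };

/* Hypothetical two-layer union: writable top, read-only lower. */
struct toy_union { struct toy_layer top, lower; };

struct toy_entry *toy_find(struct toy_layer *l, const char *name)
{
	for (int i = 0; i < l->n; i++)
		if (strcmp(l->e[i].name, name) == 0)
			return &l->e[i];
	return NULL;
}

/* Create or overwrite an entry in one layer. */
int toy_set(struct toy_layer *l, const char *name, enum toy_kind kind)
{
	struct toy_entry *e = toy_find(l, name);

	if (!e) {
		if (l->n == TOY_MAX)
			return -ENOSPC;
		e = &l->e[l->n++];
		snprintf(e->name, sizeof(e->name), "%s", name);
	}
	e->kind = kind;
	return 0;
}

/* Returns 0 (found in top), 1 (found in lower) or -ENOENT. */
int toy_lookup(struct toy_union *u, const char *name)
{
	struct toy_entry *e = toy_find(&u->top, name);

	if (e)
		return e->kind == TOY_WHITEOUT ? -ENOENT : 0;
	return toy_find(&u->lower, name) ? 1 : -ENOENT;
}

/* Rename as "copy up to the new name, then whiteout the old name". */
int toy_rename(struct toy_union *u, const char *from, const char *to)
{
	int err;

	if (strcmp(from, to) == 0)
		return 0;
	err = toy_lookup(u, from);
	if (err < 0)
		return err;
	err = toy_set(&u->top, to, TOY_FILE);	/* the "copyup" step */
	if (err)
		return err;
	return toy_set(&u->top, from, TOY_WHITEOUT);
}
```

Note that the lower layer is never modified, which is why the whiteout is needed at all: without it, the old name would reappear from the lower layer after the rename.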
[RFC 11/26] tmpfs white-out support
Introduce white-out support to tmpfs.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 include/linux/shmem_fs.h |    1
 mm/shmem.c               |   54 +++
 2 files changed, 55 insertions(+)

--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -33,6 +33,7 @@ struct shmem_sb_info {
 	int policy;		    /* Default NUMA memory alloc policy */
 	nodemask_t policy_nodes;    /* nodemask for preferred and bind */
 	spinlock_t    stat_lock;
+	struct inode *whiteout_inode;
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1784,6 +1784,42 @@ static int shmem_create(struct inode *di
 }
 
 /*
+ * This is the whiteout support for tmpfs. It uses one singleton whiteout
+ * inode per superblock thus it is very similar to shmem_link().
+ */
+static int shmem_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+	struct inode *inode = sbinfo->whiteout_inode;
+
+	if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+		return -EPERM;
+
+	/*
+	 * No ordinary (disk based) filesystem counts whiteouts as inodes;
+	 * but each new link needs a new dentry, pinning lowmem, and
+	 * tmpfs dentries cannot be pruned until they are unlinked.
+	 */
+	if (sbinfo->max_inodes) {
+		spin_lock(&sbinfo->stat_lock);
+		if (!sbinfo->free_inodes) {
+			spin_unlock(&sbinfo->stat_lock);
+			return -ENOSPC;
+		}
+		sbinfo->free_inodes--;
+		spin_unlock(&sbinfo->stat_lock);
+	}
+
+	dir->i_size += BOGO_DIRENT_SIZE;
+	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+	inc_nlink(inode);
+	atomic_inc(&inode->i_count);	/* New dentry reference */
+	dget(dentry);		/* Extra pinning count for the created dentry */
+	d_instantiate(dentry, inode);
+	return 0;
+}
+
+/*
  * Link a file..
  */
 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
@@ -2231,6 +2267,9 @@ out:
 
 static void shmem_put_super(struct super_block *sb)
 {
+	struct shmem_sb_info *sbinfo = sb->s_fs_info;
+
+	iput(sbinfo->whiteout_inode);
 	kfree(sb->s_fs_info);
 	sb->s_fs_info = NULL;
 }
@@ -2305,6 +2344,19 @@ static int shmem_fill_super(struct super
 	if (!root)
 		goto failed_iput;
 	sb->s_root = root;
+
+#ifdef CONFIG_TMPFS
+	if (!(sb->s_flags & MS_NOUSER)) {
+		inode = shmem_get_inode(sb, S_IRUGO | S_IWUGO | S_IFWHT, 0);
+		if (!inode) {
+			dput(root);
+			goto failed;
+		}
+		sbinfo->whiteout_inode = inode;
+		sb->s_flags |= MS_WHITEOUT;
+	}
+#endif
+
 	return 0;
 
 failed_iput:
@@ -2410,6 +2462,7 @@ static const struct inode_operations shm
 	.rmdir		= shmem_rmdir,
 	.mknod		= shmem_mknod,
 	.rename		= shmem_rename,
+	.whiteout	= shmem_whiteout,
 #endif
 #ifdef CONFIG_TMPFS_POSIX_ACL
 	.setattr	= shmem_notify_change,
@@ -2464,6 +2517,7 @@ static struct file_system_type tmpfs_fs_
 	.name		= "tmpfs",
 	.get_sb		= shmem_get_sb,
 	.kill_sb	= kill_litter_super,
+	.fs_flags	= FS_WHT,
 };
 
 static struct vfsmount *shm_mnt;
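The accounting idea in shmem_whiteout() (one shared whiteout inode per superblock, with every whited-out name still charged against the inode budget) can be sketched in user space. This is only an illustration under stated assumptions: the demo_* types and the -EEXIST check are hypothetical stand-ins, not the kernel structures or the patch's behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel structures. */
struct demo_inode {
	unsigned int nlink;		/* number of names linked to this inode */
};

struct demo_sb {
	struct demo_inode whiteout;	/* the single shared whiteout inode */
	unsigned long max_inodes;	/* 0 means "no limit" */
	unsigned long free_inodes;	/* remaining budget when limited */
};

struct demo_dentry {
	const char *name;
	struct demo_inode *inode;	/* NULL while the name is negative */
};

/*
 * Turn a negative name into a whiteout by linking it to the shared inode.
 * Each whiteout still consumes one unit of the inode budget, mirroring
 * the accounting in the patch.
 */
int demo_whiteout(struct demo_sb *sb, struct demo_dentry *d)
{
	assert(sb != NULL && d != NULL);

	/* Assumption for this sketch: callers pass a negative name. */
	if (d->inode)
		return -EEXIST;

	if (sb->max_inodes) {
		if (!sb->free_inodes)
			return -ENOSPC;
		sb->free_inodes--;
	}

	sb->whiteout.nlink++;
	d->inode = &sb->whiteout;
	return 0;
}
```

With a budget of two, two names can be whited out and end up pointing at the same inode; a third attempt returns -ENOSPC. That is the memory-pressure trade-off mentioned in the comment inside the patch: the inode is shared, the names are not.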
[RFC 24/26] union-mount: don't report EROFS for union mounts
SuS v2 requires that we report a read-only fs too. For union mounts this is a
very expensive check, so I'm lazy and just disable the check if we are on a
lower layer of a union.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/open.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/open.c
+++ b/fs/open.c
@@ -483,7 +483,7 @@ asmlinkage long sys_faccessat(int dfd, c
 	   special_file(nd.dentry->d_inode->i_mode))
 		goto out_path_release;
 
-	if(IS_RDONLY(nd.dentry->d_inode))
+	if (!(nd.um_flags & LAST_LOWLEVEL) && IS_RDONLY(nd.dentry->d_inode))
 		res = -EROFS;
 
 out_path_release:
[RFC 03/26] VFS: Make lookup_hash() return a struct path
This patch changes lookup_hash() into returning a struct path. Signed-off-by: Jan Blunck [EMAIL PROTECTED] --- fs/namei.c | 113 ++--- 1 file changed, 57 insertions(+), 56 deletions(-) --- a/fs/namei.c +++ b/fs/namei.c @@ -1297,27 +1297,27 @@ out: * needs parent already locked. Doesn't follow mounts. * SMP-safe. */ -static inline struct dentry * __lookup_hash(struct qstr *name, struct dentry *base, struct nameidata *nd) +static int lookup_hash(struct nameidata *nd, struct qstr *name, + struct path *path) { - struct dentry *dentry; struct inode *inode; int err; - inode = base-d_inode; + inode = nd-dentry-d_inode; err = permission(inode, MAY_EXEC, nd); - dentry = ERR_PTR(err); if (err) goto out; - dentry = __lookup_hash_kern(name, base, nd); + path-mnt = nd-mnt; + path-dentry = __lookup_hash_kern(name, nd-dentry, nd); + if (IS_ERR(path-dentry)) { + err = PTR_ERR(path-dentry); + path-dentry = NULL; + path-mnt = NULL; + } out: - return dentry; -} - -static struct dentry *lookup_hash(struct nameidata *nd) -{ - return __lookup_hash(nd-last, nd-dentry, nd); + return err; } /* SMP-safe */ @@ -1351,7 +1351,10 @@ struct dentry *lookup_one_len_nd(const c err = __lookup_one_len(name, this, base, len); if (err) return ERR_PTR(err); - return __lookup_hash(this, base, nd); + err = permission(base-d_inode, MAY_EXEC, nd); + if (err) + return ERR_PTR(err); + return __lookup_hash_kern(this, base, nd); } struct dentry *lookup_one_len_kern(const char *name, struct dentry *base, int len) @@ -1709,12 +1712,10 @@ int open_namei(int dfd, const char *path dir = nd-dentry; nd-flags = ~LOOKUP_PARENT; mutex_lock(dir-d_inode-i_mutex); - path.dentry = lookup_hash(nd); - path.mnt = nd-mnt; + error = lookup_hash(nd, nd-last, path); do_last: - error = PTR_ERR(path.dentry); - if (IS_ERR(path.dentry)) { + if (error) { mutex_unlock(dir-d_inode-i_mutex); goto exit; } @@ -1817,8 +1818,7 @@ do_link: } dir = nd-dentry; mutex_lock(dir-d_inode-i_mutex); - path.dentry = lookup_hash(nd); - path.mnt = nd-mnt; + 
error = lookup_hash(nd, nd-last, path); __putname(nd-last.name); goto do_last; } @@ -1835,7 +1835,8 @@ do_link: */ struct dentry *lookup_create(struct nameidata *nd, int is_dir) { - struct dentry *dentry = ERR_PTR(-EEXIST); + struct path path = { .dentry = ERR_PTR(-EEXIST) } ; + int err; mutex_lock_nested(nd-dentry-d_inode-i_mutex, I_MUTEX_PARENT); /* @@ -1851,9 +1852,11 @@ struct dentry *lookup_create(struct name /* * Do the final lookup. */ - dentry = lookup_hash(nd); - if (IS_ERR(dentry)) + err = lookup_hash(nd, nd-last, path); + if (err) { + path.dentry = ERR_PTR(err); goto fail; + } /* * Special case - lookup gave negative, but... we had foo/bar/ @@ -1861,14 +1864,16 @@ struct dentry *lookup_create(struct name * all is fine. Let's be bastards - you had / on the end, you've * been asking for (non-existent) directory. -ENOENT for you. */ - if (!is_dir nd-last.name[nd-last.len] !dentry-d_inode) + if (!is_dir nd-last.name[nd-last.len] !path.dentry-d_inode) goto enoent; - return dentry; + if (nd-mnt != path.mnt) + mntput(path.mnt); + return path.dentry; enoent: - dput(dentry); - dentry = ERR_PTR(-ENOENT); + dput_path(path, nd); + path.dentry = ERR_PTR(-ENOENT); fail: - return dentry; + return path.dentry; } EXPORT_SYMBOL_GPL(lookup_create); @@ -2075,7 +2080,7 @@ static long do_rmdir(int dfd, const char { int error = 0; char * name; - struct dentry *dentry; + struct path path; struct nameidata nd; name = getname(pathname); @@ -2098,12 +2103,11 @@ static long do_rmdir(int dfd, const char goto exit1; } mutex_lock_nested(nd.dentry-d_inode-i_mutex, I_MUTEX_PARENT); - dentry = lookup_hash(nd); - error = PTR_ERR(dentry); - if (IS_ERR(dentry)) + error = lookup_hash(nd, nd.last, path); + if (error) goto exit2; - error = vfs_rmdir(nd.dentry-d_inode, dentry); - dput(dentry); + error = vfs_rmdir(nd.dentry-d_inode, path.dentry); + dput_path(path, nd); exit2: mutex_unlock(nd.dentry-d_inode-i_mutex); exit1: @@ -2158,7 +2162,7 @@ static long do_unlinkat(int dfd, const c { int 
error = 0; char * name; - struct
[RFC 07/26] VFS: Introduce dput() variant that maintains a kill-list
This patch introduces a new variant of dput(). This becomes necessary to
prevent a recursive call to dput() from the union mount code.

	void __dput(struct dentry *dentry, struct list_head *list);

__dput() works mostly like the original dput() did. The main difference is
that it doesn't do a full d_kill() at the end but puts the dentry on a list
as soon as it isn't reachable anymore. Therefore the union mount code can
safely call __dput() when it wants to get rid of underlying dentry references
during a dput(). After calling __dput() the caller must make sure that
__d_kill_final() is called on all dentries on the list. __d_kill_final() does
the actual dentry_iput() work and also drops the reference on the parent.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/dcache.c |   60 +++-
 1 file changed, 55 insertions(+), 5 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -129,19 +129,56 @@ static void dentry_iput(struct dentry *
  *
  * If this is the root of the dentry tree, return NULL.
  */
-static struct dentry *d_kill(struct dentry *dentry)
+static struct dentry *__d_kill(struct dentry *dentry, struct list_head *list)
 {
 	struct dentry *parent;
 
 	list_del(&dentry->d_u.d_child);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
-	/*drops the locks, at that point nobody can reach this dentry */
+
+	if (list) {
+		list_del_init(&dentry->d_alias);
+		/* at this point nobody can reach this dentry */
+		list_add(&dentry->d_lru, list);
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_lock);
+		return NULL;
+	}
+
+	/* drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
 	d_free(dentry);
 	return dentry == parent ? NULL : parent;
 }
 
+void __dput(struct dentry *, struct list_head *);
+
+static void __d_kill_final(struct dentry *dentry, struct list_head *list)
+{
+	struct dentry *parent = dentry->d_parent;
+	struct inode *inode = dentry->d_inode;
+
+	if (inode) {
+		dentry->d_inode = NULL;
+		if (!inode->i_nlink)
+			fsnotify_inoderemove(inode);
+		if (dentry->d_op && dentry->d_op->d_iput)
+			dentry->d_op->d_iput(dentry, inode);
+		else
+			iput(inode);
+	}
+
+	d_free(dentry);
+	if (dentry != parent)
+		__dput(parent, list);
+}
+
+static struct dentry *d_kill(struct dentry *dentry)
+{
+	return __d_kill(dentry, NULL);
+}
+
 /*
  * This is dput
  *
@@ -171,7 +208,7 @@ static struct dentry *d_kill(struct dent
  * no dcache lock, please.
  */
 
-void dput(struct dentry *dentry)
+void __dput(struct dentry *dentry, struct list_head *list)
 {
 	if (!dentry)
 		return;
@@ -215,14 +252,27 @@ kill_it:
 	 * delete it from there
 	 */
 	if (!list_empty(&dentry->d_lru)) {
-		list_del(&dentry->d_lru);
+		list_del_init(&dentry->d_lru);
 		dentry_stat.nr_unused--;
 	}
-	dentry = d_kill(dentry);
+
+	dentry = __d_kill(dentry, list);
 	if (dentry)
 		goto repeat;
 }
 
+void dput(struct dentry *dentry)
+{
+	LIST_HEAD(mortuary);
+
+	__dput(dentry, &mortuary);
+	while (!list_empty(&mortuary)) {
+		dentry = list_entry(mortuary.next, struct dentry, d_lru);
+		list_del(&dentry->d_lru);
+		__d_kill_final(dentry, &mortuary);
+	}
+}
+
 /**
  * d_invalidate - invalidate a dentry
  * @dentry: dentry to invalidate
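The two-phase scheme described above (queue unreachable objects on a kill-list first, then drain the list so that dropping a parent reference never recurses) can be shown with a small user-space sketch. It is not the kernel code: the demo_node type, the singly linked list and the demo_freed counter are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type standing in for a dentry. */
struct demo_node {
	int refcount;
	struct demo_node *parent;	/* holds one reference on the parent */
	struct demo_node *next_dead;	/* link used only while on the kill-list */
};

int demo_freed;	/* counts freed nodes, for illustration only */

struct demo_node *demo_new(struct demo_node *parent)
{
	struct demo_node *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	n->refcount = 1;
	n->parent = parent;
	if (parent)
		parent->refcount++;
	return n;
}

/* Phase 1: drop a reference; an unreachable node is queued, not freed. */
void demo_put_deferred(struct demo_node *n, struct demo_node **list)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	if (--n->refcount)
		return;
	n->next_dead = *list;
	*list = n;
}

/* Phase 2: drain the list; freeing a node may queue its parent. */
void demo_put(struct demo_node *n)
{
	struct demo_node *mortuary = NULL;

	demo_put_deferred(n, &mortuary);
	while (mortuary) {
		struct demo_node *dead = mortuary;
		struct demo_node *parent = dead->parent;

		mortuary = dead->next_dead;
		free(dead);
		demo_freed++;
		demo_put_deferred(parent, &mortuary);
	}
}
```

The point of the design is that demo_put_deferred() never frees anything and never calls itself, so code running in the middle of a put can safely drop further references; all the teardown happens in the loop in demo_put().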
[RFC 13/26] ext3 whiteout support
Introduce whiteout support for ext3.

- Needs a reserved inode number for white-outs
- S_OPAQUE isn't persistently stored

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/ext3/dir.c           |    3 ++-
 fs/ext3/namei.c         |   33 +
 fs/ext3/super.c         |    5 -
 include/linux/ext3_fs.h |    5 -
 4 files changed, 43 insertions(+), 3 deletions(-)

--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -29,7 +29,8 @@
 #include <linux/rbtree.h>
 
 static unsigned char ext3_filetype_table[] = {
-	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
+	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK,
+	DT_WHT
 };
 
 static int ext3_readdir(struct file *, void *, filldir_t);
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1081,6 +1081,7 @@ static unsigned char ext3_type_by_mode[S
 	[S_IFIFO >> S_SHIFT]	= EXT3_FT_FIFO,
 	[S_IFSOCK >> S_SHIFT]	= EXT3_FT_SOCK,
 	[S_IFLNK >> S_SHIFT]	= EXT3_FT_SYMLINK,
+	[S_IFWHT >> S_SHIFT]	= EXT3_FT_WHT,
 };
 
 static inline void ext3_set_de_type(struct super_block *sb,
@@ -2070,6 +2071,37 @@ end_rmdir:
 	return retval;
 }
 
+static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode;
+	int err, retries = 0;
+	handle_t *handle;
+
+retry:
+	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
+					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+					2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	if (IS_DIRSYNC(dir))
+		handle->h_sync = 1;
+
+	inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out_stop;
+
+	init_special_inode(inode, inode->i_mode, 0);
+	err = ext3_add_nondir(handle, dentry, inode);
+
+out_stop:
+	ext3_journal_stop(handle);
+	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
+		goto retry;
+	return err;
+}
+
 static int ext3_unlink(struct inode * dir, struct dentry *dentry)
 {
 	int retval;
@@ -2387,6 +2419,7 @@ const struct inode_operations ext3_dir_i
 	.mkdir		= ext3_mkdir,
 	.rmdir		= ext3_rmdir,
 	.mknod		= ext3_mknod,
+	.whiteout	= ext3_whiteout,
 	.rename		= ext3_rename,
 	.setattr	= ext3_setattr,
 #ifdef CONFIG_EXT3_FS_XATTR
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1500,6 +1500,9 @@ static int ext3_fill_super (struct super
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT3_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
+	if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_WHITEOUT))
+		sb->s_flags |= MS_WHITEOUT;
+
 	if (le32_to_cpu(es->s_rev_level) == EXT3_GOOD_OLD_REV &&
 	    (EXT3_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT3_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -2764,7 +2767,7 @@ static struct file_system_type ext3_fs_t
 	.name		= "ext3",
 	.get_sb		= ext3_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_WHT,
 };
 
 static int __init init_ext3_fs(void)
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -63,6 +63,7 @@
 #define EXT3_UNDEL_DIR_INO	 6	/* Undelete directory inode */
 #define EXT3_RESIZE_INO		 7	/* Reserved group descriptors inode */
 #define EXT3_JOURNAL_INO	 8	/* Journal inode */
+#define EXT3_WHT_INO		 9	/* Whiteout inode */
 
 /* First non-reserved inode for old ext3 filesystems */
 #define EXT3_GOOD_OLD_FIRST_INO	11
@@ -582,6 +583,7 @@ static inline int ext3_valid_inum(struct
 #define EXT3_FEATURE_INCOMPAT_RECOVER		0x0004 /* Needs recovery */
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008 /* Journal device */
 #define EXT3_FEATURE_INCOMPAT_META_BG		0x0010
+#define EXT3_FEATURE_INCOMPAT_WHITEOUT		0x0020
 
 #define EXT3_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT3_FEATURE_INCOMPAT_SUPP	(EXT3_FEATURE_INCOMPAT_FILETYPE| \
@@ -648,8 +650,9 @@ struct ext3_dir_entry_2 {
 #define EXT3_FT_FIFO		5
 #define EXT3_FT_SOCK		6
 #define EXT3_FT_SYMLINK		7
+#define EXT3_FT_WHT		8
 
-#define EXT3_FT_MAX		8
+#define EXT3_FT_MAX		9
 
 /*
  * EXT3_DIR_PAD defines the directory entries boundaries
[RFC 12/26] ext2 white-out support
Introduce white-out support to ext2.

Known Bugs:
- Needs a reserved inode number for white-outs
- S_OPAQUE isn't persistently stored

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/ext2/dir.c           |    2 ++
 fs/ext2/namei.c         |   18 ++
 fs/ext2/super.c         |    5 -
 include/linux/ext2_fs.h |    4
 4 files changed, 28 insertions(+), 1 deletion(-)

--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -230,6 +230,7 @@ static unsigned char ext2_filetype_table
 	[EXT2_FT_FIFO]		= DT_FIFO,
 	[EXT2_FT_SOCK]		= DT_SOCK,
 	[EXT2_FT_SYMLINK]	= DT_LNK,
+	[EXT2_FT_WHT]		= DT_WHT,
 };
 
 #define S_SHIFT 12
@@ -241,6 +242,7 @@ static unsigned char ext2_type_by_mode[S
 	[S_IFIFO >> S_SHIFT]	= EXT2_FT_FIFO,
 	[S_IFSOCK >> S_SHIFT]	= EXT2_FT_SOCK,
 	[S_IFLNK >> S_SHIFT]	= EXT2_FT_SYMLINK,
+	[S_IFWHT >> S_SHIFT]	= EXT2_FT_WHT,
 };
 
 static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -288,6 +288,23 @@ static int ext2_rmdir (struct inode * di
 	return err;
 }
 
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode;
+	int err;
+
+	inode = ext2_new_inode (dir, S_IFWHT | S_IRUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out;
+
+	init_special_inode(inode, inode->i_mode, 0);
+	mark_inode_dirty(inode);
+	err = ext2_add_nondir(dentry, inode);
+out:
+	return err;
+}
+
 static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 	struct inode * new_dir, struct dentry * new_dentry )
 {
@@ -382,6 +399,7 @@ const struct inode_operations ext2_dir_i
 	.mkdir		= ext2_mkdir,
 	.rmdir		= ext2_rmdir,
 	.mknod		= ext2_mknod,
+	.whiteout	= ext2_whiteout,
 	.rename		= ext2_rename,
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -752,6 +752,9 @@ static int ext2_fill_super(struct super_
 	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
 				    EXT2_MOUNT_XIP if not */
 
+	if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
+		sb->s_flags |= MS_WHITEOUT;
+
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -1299,7 +1302,7 @@ static struct file_system_type ext2_fs_t
 	.name		= "ext2",
 	.get_sb		= ext2_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_WHT,
 };
 
 static int __init init_ext2_fs(void)
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -61,6 +61,7 @@
 #define EXT2_ROOT_INO		 2	/* Root inode */
 #define EXT2_BOOT_LOADER_INO	 5	/* Boot loader inode */
 #define EXT2_UNDEL_DIR_INO	 6	/* Undelete directory inode */
+#define EXT2_WHT_INO		 7	/* Whiteout inode */
 
 /* First non-reserved inode for old ext2 filesystems */
 #define EXT2_GOOD_OLD_FIRST_INO	11
@@ -479,10 +480,12 @@ struct ext2_super_block {
 #define EXT3_FEATURE_INCOMPAT_RECOVER		0x0004
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008
 #define EXT2_FEATURE_INCOMPAT_META_BG		0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT		0x0020
 #define EXT2_FEATURE_INCOMPAT_ANY		0xffffffff
 
 #define EXT2_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT2_FEATURE_INCOMPAT_SUPP	(EXT2_FEATURE_INCOMPAT_FILETYPE| \
+					 EXT2_FEATURE_INCOMPAT_WHITEOUT| \
 					 EXT2_FEATURE_INCOMPAT_META_BG)
 #define EXT2_FEATURE_RO_COMPAT_SUPP	(EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -549,6 +552,7 @@ enum {
 	EXT2_FT_FIFO,
 	EXT2_FT_SOCK,
 	EXT2_FT_SYMLINK,
+	EXT2_FT_WHT,
 	EXT2_FT_MAX
 };
 
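Both the ext2 and ext3 patches extend a table that maps the file-type bits of an inode mode to an on-disk directory-entry type. A minimal user-space sketch of that lookup follows. The DEMO_* constants are assumptions: the type bits for whiteouts are taken from BSD's S_IFWHT (0160000), which may or may not match the value used in this series, and only a few types are listed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the patch series defines its own. */
#define DEMO_S_IFMT	0170000
#define DEMO_S_IFDIR	0040000
#define DEMO_S_IFREG	0100000
#define DEMO_S_IFLNK	0120000
#define DEMO_S_IFWHT	0160000	/* BSD's whiteout type bits, assumed here */
#define DEMO_S_SHIFT	12

enum demo_ft {
	DEMO_FT_UNKNOWN,
	DEMO_FT_REG_FILE,
	DEMO_FT_DIR,
	DEMO_FT_SYMLINK,
	DEMO_FT_WHT,
	DEMO_FT_MAX
};

/* Indexed by the top four mode bits; unlisted slots stay 0 (UNKNOWN). */
static const unsigned char demo_type_by_mode[(DEMO_S_IFMT >> DEMO_S_SHIFT) + 1] = {
	[DEMO_S_IFREG >> DEMO_S_SHIFT]	= DEMO_FT_REG_FILE,
	[DEMO_S_IFDIR >> DEMO_S_SHIFT]	= DEMO_FT_DIR,
	[DEMO_S_IFLNK >> DEMO_S_SHIFT]	= DEMO_FT_SYMLINK,
	[DEMO_S_IFWHT >> DEMO_S_SHIFT]	= DEMO_FT_WHT,
};

/* Map a full mode (type bits plus permissions) to a directory-entry type. */
unsigned char demo_mode_to_ft(unsigned int mode)
{
	size_t idx = (mode & DEMO_S_IFMT) >> DEMO_S_SHIFT;

	assert(idx < sizeof(demo_type_by_mode));
	return demo_type_by_mode[idx];
}
```

Because the type is stored in the directory entry itself, a reader of the directory can recognise a whiteout without touching the inode, which is what makes the DT_WHT entry in the filetype tables useful for readdir().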
[RFC 22/26] union-mount: white-out changes for copy-on-open
When files on an upper layer of the union stack are removed we need to
white-out the removed filename.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namei.c |   46 --
 1 file changed, 44 insertions(+), 2 deletions(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2253,6 +2253,13 @@ do_last:
 
 	/* Negative dentry, just create the file */
 	if (!path.dentry->d_inode || S_ISWHT(path.dentry->d_inode->i_mode)) {
+		if (path.dentry->d_parent != dir) {
+			dput_path(&path, nd);
+			path.dentry = __lookup_hash_kern(&nd->last, dir, nd);
+			path.mnt = nd->mnt;
+			goto do_last;
+		}
+
 		error = open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
@@ -2373,6 +2380,16 @@ int lookup_create(struct nameidata *nd,
 {
 	int err = -EEXIST;
 
+	if (is_unionized(nd->dentry, nd->mnt)) {
+		err = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+		if (err) {
+			/* FIXME: This really sucks */
+			mutex_lock_nested(&nd->dentry->d_inode->i_mutex,
+					  I_MUTEX_PARENT);
+			goto fail;
+		}
+	}
+
 	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	/*
 	 * Yucky last component or no last component at all?
@@ -2391,6 +2408,16 @@ int lookup_create(struct nameidata *nd,
 	if (err)
 		goto fail;
 
+	/* Special case - we found a whiteout */
+	if (path->dentry->d_inode && S_ISWHT(path->dentry->d_inode->i_mode)) {
+		if (path->dentry->d_parent != nd->dentry) {
+			dput_path(path, nd);
+			path->dentry = __lookup_hash_kern(&nd->last, nd->dentry,
+							  nd);
+			path->mnt = nd->mnt;
+		}
+	}
+
 	/*
 	 * Special case - lookup gave negative, but... we had foo/bar/
 	 * From the vfs_mknod() POV we just have a negative dentry -
@@ -2682,6 +2709,15 @@ static int do_whiteout(struct nameidata
 	if (isdir && !directory_is_empty(path->dentry, path->mnt))
 		goto out;
 
+	mutex_unlock(&nd->dentry->d_inode->i_mutex);
+	err = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+	if (err) {
+		mutex_lock_nested(&nd->dentry->d_inode->i_mutex,
+				  I_MUTEX_PARENT);
+		goto out;
+	}
+	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
+
 	/* safe the name for a later lookup */
 	err = -ENOMEM;
 	name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
@@ -3012,7 +3048,10 @@ static long do_rmdir(int dfd, const char
 	error = hash_lookup_union(&nd, &nd.last, &path);
 	if (error)
 		goto exit2;
-	error = vfs_rmdir(nd.dentry->d_inode, path.dentry);
+	if (is_unionized(nd.dentry, nd.mnt))
+		error = do_whiteout(&nd, &path, 1);
+	else
+		error = vfs_rmdir(nd.dentry->d_inode, path.dentry);
 	dput_path(&path, &nd);
 exit2:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
@@ -3091,7 +3130,10 @@ static long do_unlinkat(int dfd, const c
 		inode = path.dentry->d_inode;
 		if (inode)
 			atomic_inc(&inode->i_count);
-		error = vfs_unlink(nd.dentry->d_inode, path.dentry);
+		if (is_unionized(nd.dentry, nd.mnt))
+			error = do_whiteout(&nd, &path, 0);
+		else
+			error = vfs_unlink(nd.dentry->d_inode, path.dentry);
 	exit2:
 		dput_path(&path, &nd);
 	}
[RFC 00/26] VFS based Union Mount (V2)
Here is another post of the VFS based union mount implementation. Unlike the
traditional mount, which hides the contents of the mount point, union mounts
present a merged view of the mount point and the mounted filesystem.

Recent changes:
- brand new union structure, no longer tied to the dentry; now works with
  bind mounts
- generic part of the whiteout patches extracted
- introduces MS_WHITEOUT to make the white-out patches independent of the
  union-mount stuff
- uses a singleton whiteout inode for the tmpfs filesystem (I need to fix
  this for ext2/3, too)
- renaming files on unions uses copyup now
- rewrote the union mount debugging code: it is now debugfs/relay based
- random cleanups

I'm able to compile the kernel with these patches applied on a 3-layer union
mount with the separate layers bind mounted to different locations. I haven't
done any performance tests since I think there is a more important topic
ahead: better readdir() support.

This series is against 2.6.22-rc6-mm1.

Comments are welcome,
Jan
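To make the "merged view" and the role of white-outs concrete, here is a minimal user-space sketch of a name lookup across layers. It is only a model under stated assumptions: the demo_* types and the flat per-layer arrays are hypothetical, and the real series resolves names through the dcache rather than by scanning lists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_kind { DEMO_FILE, DEMO_WHITEOUT };

struct demo_entry {
	const char *name;
	enum demo_kind kind;
};

/* One layer of the union; layers[0] is the topmost (writable) one. */
struct demo_layer {
	const struct demo_entry *entries;
	size_t count;
};

/*
 * Look a name up through the layers, topmost first. Returns the index of
 * the layer that provides the name, or -1 if the name is absent or has
 * been whited out by a higher layer.
 */
int demo_union_lookup(const struct demo_layer *layers, size_t nlayers,
		      const char *name)
{
	assert(layers != NULL && name != NULL);

	for (size_t i = 0; i < nlayers; i++) {
		for (size_t j = 0; j < layers[i].count; j++) {
			const struct demo_entry *e = &layers[i].entries[j];

			if (strcmp(e->name, name) != 0)
				continue;
			/* The first hit decides; a whiteout hides lower layers. */
			return e->kind == DEMO_WHITEOUT ? -1 : (int)i;
		}
	}
	return -1;
}
```

With a top layer holding a file "new.txt" and a whiteout "gone.txt", and a lower layer holding "gone.txt" and "old.txt", the lookup finds "new.txt" in layer 0 and "old.txt" in layer 1, and reports "gone.txt" as absent even though the lower layer still contains it.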
[RFC 14/26] union-mount: Documentation
Add simple documentation about union mounting in general and this
implementation in particular.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 Documentation/filesystems/union-mounts.txt |  172 +
 1 file changed, 172 insertions(+)

--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,172 @@
+VFS based Union Mounts
+----------------------
+
+ 1. What are Union Mounts
+ 2. The Union Stack
+ 3. The White-out Filetype
+ 4. Renaming Unions
+ 5. Directory Reading
+ 6. Known Problems
+ 7. References
+
+-------------------------------------------------------------------------------
+
+1. What are Union Mounts
+========================
+
+Please note: this is NOT about UnionFS and it is NOT derived work!
+
+Traditionally the mount operation is opaque, which means that the content of
+the mount point, the directory where the file system is mounted on, is hidden
+by the content of the mounted file system's root directory until the file
+system is unmounted again. Unlike the traditional UNIX mount mechanism, that
+hides the contents of the mount point, a union mount presents a view as if
+both filesystems are merged together. Although only the topmost layer of the
+mount stack can be altered, it appears as if transparent file system mounts
+allow any file to be created, modified or deleted.
+
+Most people know the concepts and features of union mounts from other
+operating systems like Sun's Translucent Filesystem, Plan9 or BSD.
+
+Here are the key features of this implementation:
+- completely VFS based
+- does not change the namespace stacking
+- directory listings have duplicate entries removed
+- writable unions: only the topmost file system layer may be writable
+- writable unions: new white-out filetype handled inside the kernel
+
+-------------------------------------------------------------------------------
+
+2. The Union Stack
+==================
+
+The mounted file systems are organized in the file system hierarchy (tree of
+vfsmount structures), which keeps track of the stacking of file systems
+upon each other. The per-directory view on the file system hierarchy is called
+mount stack and reflects the order of file systems, which are mounted on a
+specific directory.
+
+Union mounts present a single unified view of the contents of two or more file
+systems as if they are merged together. Since the information which file
+system objects are part of a unified view is not directly available from the
+file system hierarchy there is a need for a new structure. The file system
+objects, which are part of a unified view are ordered in a so-called union
+stack. Only directories can be part of a unified view.
+
+The link between two layers of the union stack is maintained using the
+union_mount structure (#include <linux/union.h>):
+
+struct union_mount {
+	atomic_t u_count;		/* reference count */
+	struct mutex u_mutex;
+	struct list_head u_unions;	/* list head for d_unions */
+	struct hlist_node u_hash;	/* list head for searching */
+	struct hlist_node u_rhash;	/* list head for reverse searching */
+
+	struct path u_this;		/* this is me */
+	struct path u_next;		/* this is what I overlay */
+};
+
+The union_mount structure holds a reference (dget,mntget) to the next lower
+layer of the union stack. Since a dentry can be part of multiple unions
+(e.g. with bind mounts) they are tied together via the d_unions field of the
+dentry structure.
+
+All union_mount structures are cached in two hash tables, one for lookups of
+the next lower layer of the union stack and one for reverse lookups of the
+next upper layer of the union stack. The reverse lookup is necessary to
+resolve CWD relative path lookups. For calculation of the hash value, the
+(dentry,vfsmount) pair is used. The u_this field is used for the hash table
+which is used in forward lookups and the u_next field for the reverse lookups.
+
+During every new mount (or mount propagation), a new union_mount structure is
+allocated. A reference to the mountpoint's vfsmount and dentry is taken and
+stored in the u_next field. In almost the same manner a union_mount
+structure is created during the first time lookup of a directory within a
+union mount point. In this case the lookup proceeds to all lower layers of the
+union. Therefore the complete union stack is constructed during lookups.
+
+The union_mount structures of a dentry are destroyed when the dentry itself is
+destroyed. Therefore the dentry cache is indirectly driving the union_mount
+cache like this is done for inodes too. Please note that lower layer
+union_mount structures are kept in memory until the topmost dentry is
+destroyed.
+
+-------------------------------------------------------------------------------
+
+3. Writable Unions: The White-out Filetype and Copy-On-Open
+===========================================================
+
+The white-out
[RFC 08/26] VFS: Export lives_below_in_same_fs()
Export lives_below_in_same_fs() for use in union mount code.

Signed-off-by: Jan Blunck [EMAIL PROTECTED]
---
 fs/namespace.c        |    3 ++-
 include/linux/mount.h |    1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -793,7 +793,7 @@ static bool permit_mount(struct nameidat
 	return true;
 }
 
-static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
+int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
 {
 	while (1) {
 		if (d == dentry)
@@ -803,6 +803,7 @@ static int lives_below_in_same_fs(struct
 		d = d->d_parent;
 	}
 }
+EXPORT_SYMBOL_GPL(lives_below_in_same_fs);
 
 struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
 					int flag, uid_t owner)
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -106,6 +106,7 @@ extern void shrink_submounts(struct vfsm
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
+extern int lives_below_in_same_fs(struct dentry *, struct dentry *);
 #endif
 
 #endif /* _LINUX_MOUNT_H */
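For readers who do not have fs/namespace.c at hand: the diff context shows only the start and the end of the loop in lives_below_in_same_fs(). The sketch below shows the general idea of such an ancestry walk in user space. The demo_dentry type is hypothetical, and the stop condition in the middle of the loop (a root whose parent pointer refers to itself) is an assumption about the part the hunks do not show.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; the root points at itself. */
struct demo_dentry {
	struct demo_dentry *parent;
};

/*
 * Return 1 if d is the given ancestor or lives somewhere below it,
 * 0 once the walk reaches the root without meeting the ancestor.
 */
int demo_lives_below(const struct demo_dentry *d,
		     const struct demo_dentry *ancestor)
{
	while (1) {
		if (d == ancestor)
			return 1;
		/* Assumed stop condition: NULL or a self-parented root. */
		if (d == NULL || d == d->parent)
			return 0;
		d = d->parent;
	}
}
```

Given a tree root/usr/bin plus a sibling root/other, "bin" lives below both "usr" and the root, while "usr" does not live below "bin" and "other" does not live below "usr".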
Re: [RFC PATCH 1/4] Union mount documentation.
On Wed, 20 Jun 2007 11:21:57 +0530, Bharata B Rao wrote:

> +4. Union stack: building and traversal
> +--------------------------------------
> +Union stack needs to be built from two places: during an explicit union
> +mount (or mount propagation) and during the lookup of a directory that
> +appears in more than one layer of the union.
> +
> +The link between two layers of union stack is maintained using the
> +union_mount structure:
> +
> +struct union_mount {
> +	/* vfsmount and dentry of this layer */
> +	struct vfsmount *src_mnt;
> +	struct dentry *src_dentry;
> +
> +	/* vfsmount and dentry of the next lower layer */
> +	struct vfsmount *dst_mnt;
> +	struct dentry *dst_dentry;
> +
> +	/*
> +	 * This list_head hashes this union_mount based on this layer's
> +	 * vfsmount and dentry. This is used to get to the next layer of
> +	 * the stack (dst_mnt, dst_dentry) given the (src_mnt, src_dentry)
> +	 * and is used for stack traversal.
> +	 */
> +	struct list_head hash;
> +
> +	/*
> +	 * All union_mounts under a vfsmount(src_mnt) are linked together
> +	 * at mnt->mnt_union using this list_head. This is needed to destroy
> +	 * all the union_mounts when the mnt goes away.
> +	 */
> +	struct list_head list;
> +};
> +
> +These union mount structures are stored in a hash table(union_mount_hashtable)
> +which uses the same hash as used for mount_hashtable since both of them use
> +(vfsmount, dentry) pairs to calculate the hash.
> +
> +During a new mount (or mount propagation), a new union_mount structure is
> +created. A reference to the mountpoint's vfsmount and dentry is taken and
> +stored in the union_mount (as dst_mnt, dst_dentry). And this union_mount
> +is inserted in the union_mount_hashtable based on the hash generated by
> +the mount root's vfsmount and dentry.
> +
> +Similar method is employed to create a union stack during first time lookup
> +of a common named directory within a union mount point. But here, the top
> +level directory's vfsmount and dentry are hashed to get to the lower level
> +directory's vfsmount and dentry.
> +
> +The insertion, deletion and lookup of union_mounts in the
> +union_mount_hashtable is protected by vfsmount_lock. While traversing the
> +stack, we hold this spinlock only briefly during lookup time and release
> +it as soon as we get the next union stack member. The top level of the
> +stack holds a reference to the next level (via union_mount structure) and
> +so on. Therefore, as long as we hold a reference to a union stack member,
> +its lower layers can't go away. And since we don't do the complete
> +traversal under any lock, it is possible for the stack to change over the
> +level from where we started traversing. For example, when traversing the
> +stack downwards, a new filesystem can be mounted on top of it. When this
> +happens, the user who had a reference to the old top wouldn't have
> +visibility to the new top and would continue as if the new top didn't
> +exist for him. I believe this is fine as long as members of the stack
> +don't go away from under us(CHECK). And to be sure of this, we need to
> +hold a reference to the level from where we start the traversal and
> +should continue to hold it till we are done with the traversal.

Well done. I like your approach much more than the simple chaining of
dentries. When I told you about the idea of maintaining a list of
(dentry, vfsmount) objects I always thought about one big structure for
all the layers of a union. Smaller objects that only point to the next
layer seem to be better but make the search for the topmost layer
impossible. You should maintain a reference to the topmost struct
union_mount though.

> +5. Union stack: destroying
> +--------------------------
> +In addition to storing the union_mounts in a hash table for quick lookups,
> +they are also stored as a list, headed at vfsmount->mnt_union. So, all
> +union_mounts that occur under a vfsmount (starting from the mountpoint
> +followed by the subdir unions) are stored within the vfsmount. During
> +umount (specifically, during the last mntput()), this list is traversed
> +to destroy all union stacks under this vfsmount.
> +
> +Hence, all union stacks under a vfsmount continue to exist until the
> +vfsmount is unmounted. It may be noted that the union_mount structure
> +holds a reference to the current dentry also. Because of this, for
> +subdir unions, both the top and bottom level dentries become pinned
> +till the upper layer filesystem is unmounted. Is this behaviour
> +acceptable ? Would this lead to a lot of pinned dentries over a period
> +of time ? (CHECK) If we don't do this, the top layer dentry might go
> +out of cache, during which time we have no means to release the
> +corresponding union_mount and the union_mount becomes stale. Would it
> +be necessary and worthwhile to add intelligence to prune_dcache() to
> +prune unused union_mounts thereby releasing the dentries ?
> +
> +As noted above, we hold the reference to current dentry from union_mount
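
As a rough illustration of the traversal described in section 4, here is a userspace sketch. It is not the kernel code: the src/dst field names follow the posted struct union_mount, but the hash table is replaced by a plain linked list, locking and refcounting are left out, and union_next() and union_depth() are hypothetical helper names.

```c
#include <stddef.h>

/* Stand-ins for the kernel types; only pointer identity matters here. */
struct vfsmount { int id; };
struct dentry   { int id; };

/*
 * Field names follow the posted struct union_mount. The "next" pointer
 * is an assumption of this sketch and replaces the hash chain.
 */
struct union_mount {
	struct vfsmount *src_mnt;
	struct dentry *src_dentry;
	struct vfsmount *dst_mnt;
	struct dentry *dst_dentry;
	struct union_mount *next;
};

/* Hypothetical helper: find the link whose source is (mnt, dentry). */
static struct union_mount *union_next(struct union_mount *table,
				      struct vfsmount *mnt,
				      struct dentry *dentry)
{
	struct union_mount *um;

	for (um = table; um != NULL; um = um->next)
		if (um->src_mnt == mnt && um->src_dentry == dentry)
			return um;
	return NULL;
}

/* Hypothetical helper: count the layers reachable from (mnt, dentry). */
static int union_depth(struct union_mount *table,
		       struct vfsmount *mnt, struct dentry *dentry)
{
	struct union_mount *um;
	int depth = 1;

	while ((um = union_next(table, mnt, dentry)) != NULL) {
		/* step down to the next lower layer */
		mnt = um->dst_mnt;
		dentry = um->dst_dentry;
		depth++;
	}
	return depth;
}
```

Nothing in this layout lets a caller walk from a lower layer back up, which is the limitation raised in the reply about finding the topmost layer.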
Re: [RFC PATCH] file as directory
On 5/23/07, Miklos Szeredi [EMAIL PROTECTED] wrote:
> So your question is, which mount takes priority on the lookup?
>
> It probably should be the propagated real mount, rather than the
> dir-on-file one, shouldn't it?

Maybe this belongs in __link_path_walk(), similar to the handling of
symbolic links. If the real mount always has higher priority, why do we
bother about it in follow_mount()?

Jan
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] file as directory
On 5/23/07, Miklos Szeredi [EMAIL PROTECTED] wrote:
> > As for unlink... How do you deal with having that thing mounted,
> > mounting something _under_ it (so that vfsmount would be kept busy)
> > and then unlinking that sucker?
>
> Yeah, that's a good point. Current patch doesn't deal with that.
> Simplest solution could be to disallow submounting these. Don't think
> it makes much sense anyway.

Hmm, think about /your/path/qemu-disk1.img/boot,
/your/path/qemu-disk2.img/usr, ...

Jan
Re: [RFC][PATCH 3/14] Add the whiteout file type
On 5/14/07, Jan Engelhardt [EMAIL PROTECTED] wrote:
> On May 14 2007 15:09, Bharata B Rao wrote:
> > A white-out stops the VFS from further lookups of the white-out's name
> > and returns -ENOENT. This is the same behaviour as if the filename
> > isn't found. This can be used in combination with union mounts to
> > virtually delete (white-out) files by creating a file with this file
> > type.
> >
> > Signed-off-by: Jan Blunck [EMAIL PROTECTED]
> > Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
> > ---
> >  include/linux/stat.h |    2 ++
> >  1 files changed, 2 insertions(+)
> >
> > --- a/include/linux/stat.h
> > +++ b/include/linux/stat.h
> > @@ -10,6 +10,7 @@
> >  #if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)
> >
> >  #define S_IFMT  00170000
> > +#define S_IFWHT  0160000	/* whiteout */
> >  #define S_IFSOCK 0140000
> >  #define S_IFLNK  0120000
> >  #define S_IFREG  0100000
>
> I wonder why 11, 13 or 15 could not also be used?

I used the S_IFWHT definition as it is referenced in stat(2). I guess it
would be a good idea to use the same flag on BSD and Linux. As you can
see in stat(2), other OSes use 011, 013 and 015.
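
To make the file-type discussion concrete, here is a small userspace sketch of how such a type would be tested against the S_IFMT mask. The value 0160000 for S_IFWHT is the one from the BSD stat(2) definition that the patch follows; mode_is_whiteout() is a hypothetical helper, and the #ifndef fallbacks are only there so the sketch compiles on systems whose headers lack these names.

```c
#include <sys/stat.h>

/* Fallbacks, using the octal values from include/linux/stat.h. */
#ifndef S_IFMT
#define S_IFMT   0170000
#endif
#ifndef S_IFSOCK
#define S_IFSOCK 0140000
#endif
/* Assumption: whiteout type value as on BSD; not defined by Linux headers. */
#ifndef S_IFWHT
#define S_IFWHT  0160000
#endif

/* Hypothetical helper in the style of S_ISREG() and friends. */
static int mode_is_whiteout(unsigned int mode)
{
	return (mode & S_IFMT) == S_IFWHT;
}
```

The permission bits are masked off before the comparison, so a whiteout created with S_IFWHT | S_IRUGO is still recognised by type alone.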
Re: [RFC][PATCH 5/14] Introduce union stack
On 5/14/07, Jan Engelhardt [EMAIL PROTECTED] wrote:
> > +static inline void union_lock(struct dentry *dentry)
> > +{
> > +	if (unlikely(dentry && dentry->d_union)) {
> > +		struct union_info *ui = dentry->d_union;
> > +
> > +		UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
> > +			      dentry->d_name.name, ui,
> > +			      atomic_read(&ui->u_count));
> > +		__union_lock(dentry->d_union);
> > +	}
> > +}
> > +
> > +static inline void union_unlock(struct dentry *dentry)
> > +{
> > +	if (unlikely(dentry && dentry->d_union)) {
> > +		struct union_info *ui = dentry->d_union;
> > +
> > +		UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
> > +			      dentry->d_name.name, ui,
> > +			      atomic_read(&ui->u_count));
> > +		__union_unlock(dentry->d_union);
> > +	}
> > +}
>
> Do we really need the unlikely()? d_union may be a new feature, but it
> may very well be possible that someone puts the bigger part of his/her
> files under a union. And when d_unions get stable, people will probably
> begin making their root filesystem unioned for livecds, and then
> unlikely() will rather be a likely penalty. My stance: just
>
>	if (dentry != NULL && dentry->d_union != NULL)
>
> This also goes for union_trylock.

Good question. My intention was that since most of the union code costs
performance (stack traversal, readdir) I optimize for the normal (not
unified) case.

> > +static inline int union_trylock(struct dentry *dentry)
> > +{
> > +	int locked = 1;
> > +
> > +	if (unlikely(dentry && dentry->d_union)) {
> > +		UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
> > +			      dentry->d_name.name, dentry->d_union,
> > +			      atomic_read(&dentry->d_union->u_count));
> > +		BUG_ON(!atomic_read(&dentry->d_union->u_count));
> > +		locked = mutex_trylock(&dentry->d_union->u_mutex);
> > +		UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
> > +			      dentry->d_union,
> > +			      locked ? "succeeded" : "failed");
> > +	}
> > +	return (locked ? 1 : 0);
> > +}
>
>	return locked ? 1 : 0
>
> or even
>
>	return !!locked;
>
> or since we're just passing up from mutex_trylock:
>
>	return locked;
>
> ?

Ahh, this seems to be a left-over of the semaphore -> mutex conversion.

> > +/*
> > + * This is a *I can't get no sleep* helper
>
> More commonly known as insomnia. :)

:)

Before I forget this: thank you (and Badari) for reviewing the patches!

Cheers,
Jan
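
For readers following the unlikely() and return-value discussion, here is a userspace sketch of the pattern. It assumes the usual GCC definition of unlikely() via __builtin_expect, and uses stand-in types and a stand-in trylock (fake_union, fake_dentry, fake_trylock are all hypothetical names) in place of the kernel's dentry and mutex_trylock().

```c
#include <stddef.h>

/* Branch hint only; it never changes the value of the condition. */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

/* Stand-ins for the kernel structures, for illustration only. */
struct fake_union  { int locked; };
struct fake_dentry { struct fake_union *d_union; };

/* Stand-in for mutex_trylock(): returns 1 on success, 0 if already held. */
static int fake_trylock(struct fake_union *u)
{
	if (u->locked)
		return 0;
	u->locked = 1;
	return 1;
}

/* Sketch of union_trylock(): non-unioned dentries count as "locked". */
static int union_trylock_sketch(struct fake_dentry *dentry)
{
	int locked = 1;

	if (unlikely(dentry && dentry->d_union))
		locked = fake_trylock(dentry->d_union);

	/* The trylock already yields 0 or 1, so no normalisation is needed. */
	return locked;
}
```

Dropping unlikely() would give the same results; the hint only affects how the compiler lays out the two branches.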
Re: [RFC][PATCH 13/14] ext3 whiteout support
On 5/15/07, Bharata B Rao [EMAIL PROTECTED] wrote:
> On Mon, May 14, 2007 at 01:16:57PM -0700, Badari Pulavarty wrote:
> > On Mon, 2007-05-14 at 15:14 +0530, Bharata B Rao wrote:
> > > From: Bharata B Rao [EMAIL PROTECTED]
> > >
> > > +static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
> > > +{
> > > +	struct inode * inode;
> > > +	int err, retries = 0;
> > > +	handle_t *handle;
> > > +
> > > +retry:
> > > +	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
> > > +					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
> > > +					2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
> > > +	if (IS_ERR(handle))
> > > +		return PTR_ERR(handle);
> > > +
> > > +	if (IS_DIRSYNC(dir))
> > > +		handle->h_sync = 1;
> > > +
> > > +	inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
> > > +	err = PTR_ERR(inode);
> > > +	if (IS_ERR(inode))
> > > +		goto out_stop;
> >
> > Don't you need to call init_special_inode() here ?
> > Or this is handled somewhere else ?
>
> Whiteout doesn't have any attributes and hence we are not explicitly
> doing init_special_inode() on this. Accesses to whiteout files are
> trapped at the VFS lookup itself and creation and deletion of whiteouts
> are handled automatically by VFS. So I believe init_special_inode()
> isn't necessary on a whiteout file.

I added default whiteout file operations, so calling
init_special_inode() seems to make sense. I know the ext2/ext3 whiteout
patches are not really where they should be. I plan to use a reserved
inode number to reflect the fact that the inode itself doesn't have any
attributes. It makes sense to have a singleton whiteout inode per
superblock.
Re: [RFC][PATCH 14/14] tmpfs whiteout support
On 5/14/07, Hugh Dickins [EMAIL PROTECTED] wrote:
> On Mon, 14 May 2007, Bharata B Rao wrote:
> > From: Jan Blunck [EMAIL PROTECTED]
> > Subject: tmpfs whiteout support
> >
> > Introduce whiteout support to tmpfs.
> >
> > Signed-off-by: Jan Blunck [EMAIL PROTECTED]
> > Signed-off-by: Bharata B Rao [EMAIL PROTECTED]
> > ---
> >  mm/shmem.c |    9 ++++++++-
> >  1 files changed, 8 insertions(+), 1 deletion(-)
> >
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -74,7 +74,7 @@
> >  #define LATENCY_LIMIT 64
> >
> >  /* Pretend that each entry is of this size in directory's i_size */
> > -#define BOGO_DIRENT_SIZE 20
> > +#define BOGO_DIRENT_SIZE 1
>
> Why would that change be needed for whiteout support?

Good question. It seems that this is a survivor of the changes necessary
for union readdir. It isn't necessary for white-outs.

BTW: Why do we claim this to be 20? Is there any meaning behind this?

Cheers,
Jan
Re: [RFC][PATCH 13/14] ext3 whiteout support
On 5/14/07, Andreas Dilger [EMAIL PROTECTED] wrote:
> On May 14, 2007 15:14 +0530, Bharata B Rao wrote:
> >  #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008 /* Journal device */
> >  #define EXT3_FEATURE_INCOMPAT_META_BG		0x0010
> > +#define EXT3_FEATURE_INCOMPAT_WHITEOUT	0x0020
>
> Is this flag reserved with Ted? It isn't listed in the e2fsprogs repo.

I don't know. I tried to contact him a few weeks ago but failed. My guess
is that he isn't reading the @thunk.org email address anymore, which was
referenced in the e2fsprogs source I used.

Ted, from ext2_fs.h I learn that the value 0x0020 is left unused.

#define EXT2_FEATURE_INCOMPAT_META_BG		0x0010
#define EXT3_FEATURE_INCOMPAT_EXTENTS		0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT		0x0080

Is this intentional?

Cheers,
Jan
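
Why the reservation question matters can be shown with a short sketch of how "incompat" bits are generally treated: if a superblock has an incompat bit set that the running code does not know, the filesystem must not be mounted. The flag values below are the ones quoted in this thread (0x0020 being the proposed, unreserved one); the shortened macro names and unknown_incompat() are hypothetical and only for illustration.

```c
/* Values as quoted in the thread; names shortened for this sketch. */
#define INCOMPAT_JOURNAL_DEV 0x0008u
#define INCOMPAT_META_BG     0x0010u
#define INCOMPAT_WHITEOUT    0x0020u /* proposed here, not yet reserved */
#define INCOMPAT_EXTENTS     0x0040u
#define INCOMPAT_64BIT       0x0080u

/*
 * Hypothetical helper: return the incompat bits present on disk that
 * the running code does not support. Non-zero means "refuse to mount".
 */
static unsigned int unknown_incompat(unsigned int on_disk,
				     unsigned int supported)
{
	return on_disk & ~supported;
}
```

If two features were ever assigned the same bit, a check like this could not tell them apart, which is why the value should be coordinated with the e2fsprogs maintainer.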
Re: Possible bug in ext3 filesystem
On 1/4/07, Dave Kleikamp [EMAIL PROTECTED] wrote:
> On Thu, 2007-01-04 at 12:34 +0100, Jens Nie wrote:
> > I think i found a bug in the ext3 filesystem. It deals with
> > dereferencing symlinks. I have installed a fresh openSUSE 10.2 on an
> > ext3 filesystem.

No, it seems to be a bug in the coreutils.

> I did a little playing around with strace and I suspect that it may
> have something to do with ext3 returning DT_LNK to the filldir routine
> (which gets returned through getdents64). A lot of file systems always
> return DT_UNKNOWN. Maybe ls is handling the DT_UNKNOWN case alright,
> but not the DT_LNK case. (When DT_UNKNOWN is returned, ls calls
> stat64() which would identify the symlink as a directory rather than a
> symlink.)

I guess your analysis is right. The problem is that ls only calls stat()
on DT_UNKNOWN entries but it should call it on DT_LNK too.

> > I reported this on the opensuse mailing list first. Someone there was
> > kind enough to point me directly to this list.

They forgot to tell you that not every bug is a kernel bug ;)

I opened a bug in bugzilla:
https://bugzilla.novell.com/show_bug.cgi?id=231916
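
The d_type handling discussed above can be sketched as follows. This is not the coreutils code; entry_resolves_to_dir() is a hypothetical helper that shows the intended rule: trust d_type when it is definite, but fall back to stat() (which follows symlinks) for both DT_UNKNOWN and DT_LNK.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for d_type and the DT_* constants on glibc */
#endif
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: does the entry, after following symlinks, name a
 * directory? Returns 1 for yes, 0 for no, -1 on error.
 */
static int entry_resolves_to_dir(const char *dirpath, const struct dirent *de)
{
	char path[4096];
	struct stat st;
	int n;

	if (de->d_type == DT_DIR)
		return 1;
	/* Any other definite type cannot be a directory. */
	if (de->d_type != DT_UNKNOWN && de->d_type != DT_LNK)
		return 0;

	/* DT_UNKNOWN or DT_LNK: ask stat(), which dereferences symlinks. */
	n = snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (stat(path, &st) != 0)
		return -1;
	return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

Handling only DT_UNKNOWN in the fallback branch would reproduce the behaviour described in the report on filesystems that return DT_LNK.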
Re: readdir behaviour
This was also a topic on lkml 2 weeks ago.

Quoting Tomas Hruby [EMAIL PROTECTED]:
> First of all I would like to know what exactly is the meaning of the
> 'offset' parameter of filldir and whether it is used somewhere? Unlike
> ext2, our directories are not easily read sequentially and this value
> (copied by filldir to dirent->d_off) seems to be quite useless outside
> our fs code.

The offset of the dirent has no common meaning. Think of it as a cookie
or something like that. It should not be interpreted either by the VFS
or by user space.

> Related question is what is the correct behaviour of readdir in case of
> user's seeking in the directory? If I understand correctly, in case of
> ext3 (indexed directories), when seeking is detected, readdir starts
> reading from the directory beginning again.

On various architectures libc seeks (to d_off) after each call to
getdents(). Therefore the implementation should honor it.

> The last is about concurrency. How is solved problem when a directory
> is read by readdir and between two readdir calls the same directory is
> changed?

It is the filesystem's duty to seek to the next valid dentry. However,
it is not defined whether the new directory contents are returned or the
contents as of opendir().

Although I think it would be nice (and would fit the "everything is a
file" paradigm) if the directory were presented like a sequential file,
this is not the common practice. Since there are no applications that
read and seek directories directly, this is a good tradeoff to achieve
high performance for readdir().

Jan
Re: [RFC] pdirops: vfs patch
Quoting Alex Tomas [EMAIL PROTECTED]:
> 1) i_sem protects dcache too

Where? i_sem is the per-inode lock, and shouldn't be used for anything
else.

> 2) tmpfs has no own data, so we can use it this way (see 2nd patch)
> 3) I have pdirops patch for ext3, but it needs some cleaning ...

I think you didn't get my point.

1) Your approach duplicates the locking effort for regular filesystems
   (like ext2):
   a) locking with s_pdirops_sems
   b) locking the low-level filesystem data
   It's cool that it speeds up tmpfs, but I don't think that this
   justifies the doubled locking for every other filesystem. I'm not
   sure that it also increases performance for regular filesystems, if
   you do the locking right.

2) In my opinion, a superblock-wide semaphore array which allows 1024
   different (different names and different operations) accesses to ONE
   single inode (which is the data it should protect) is not a good
   idea.

Jan
Re: [RFC] pdirops: vfs patch
Quoting Alex Tomas [EMAIL PROTECTED]:
> Jan Blunck (JB) writes:
>
> >> 1) i_sem protects dcache too
> JB> Where? i_sem is the per-inode lock, and shouldn't be used else.
>
> read comments in fs/namei.c:read_lookup()

i_sem does NOT protect the dcache. Also not in real_lookup(). The lock
must be acquired for ->lookup(), and because we might sleep on i_sem, we
have to get it early and check for repopulation of the dcache.

> we've already done this for ext3. it works.
> it speeds some loads up significantly.
> especially on big directories.
> and you can control this via mount option,
> so almost zero cost for fs that doesn't support pdirops.

Ok.

Jan
Re: [RFC] pdirops: vfs patch
Quoting Alex Tomas [EMAIL PROTECTED]:
> Jan Blunck (JB) writes:
>
> JB> i_sem does NOT protect the dcache. Also not in real_lookup(). The
> JB> lock must be acquired for ->lookup() and because we might sleep on
> JB> i_sem, we have to get it early and check for repopulation of the
> JB> dcache.
>
> dentry is part of dcache, right? i_sem protects dentry from being
> returned with incomplete data inside, right?

Nope, d_alloc() sets d_flags to DCACHE_UNHASHED. Therefore the dentry is
not found by __d_lookup() until it is rehashed, which is implicitly done
by ->lookup().

Jan
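
The visibility rule described in this reply can be modelled in a few lines of userspace C. This is a toy model, not the dcache: SK_UNHASHED stands in for DCACHE_UNHASHED, the sk_* functions are hypothetical names, and a linked list replaces the hash table. It only shows that a freshly allocated entry stays invisible to lookups until it is explicitly hashed.

```c
#include <stddef.h>
#include <string.h>

#define SK_UNHASHED 0x1u /* stand-in for DCACHE_UNHASHED */

struct sk_dentry {
	const char *name;
	unsigned int flags;
	struct sk_dentry *next;
};

/* Like d_alloc() in the reply: a new entry starts out unhashed. */
static void sk_d_alloc(struct sk_dentry *d, const char *name)
{
	d->name = name;
	d->flags = SK_UNHASHED;
	d->next = NULL;
}

/* Done once the entry is fully set up (implicitly by ->lookup()). */
static void sk_d_rehash(struct sk_dentry *d)
{
	d->flags &= ~SK_UNHASHED;
}

/* Lookups skip entries that are still marked unhashed. */
static struct sk_dentry *sk_d_lookup(struct sk_dentry *list, const char *name)
{
	for (; list != NULL; list = list->next)
		if (!(list->flags & SK_UNHASHED) &&
		    strcmp(list->name, name) == 0)
			return list;
	return NULL;
}
```

In this model no lock is needed to keep half-initialised entries from being returned, which is the point made against i_sem "protecting" the dentry.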
Re: [RFC] pdirops: vfs patch
Alex Tomas wrote:
> +static inline struct semaphore * lock_sem(struct inode *dir, struct qstr *name)
> +{
> +	if (IS_PDIROPS(dir)) {
> +		struct super_block *sb;
> +		/* name->hash expected to be already calculated */
> +		sb = dir->i_sb;
> +		BUG_ON(sb->s_pdirops_sems == NULL);
> +		return sb->s_pdirops_sems + name->hash % sb->s_pdirops_size;
> +	}
> +	return &dir->i_sem;
> +}
> +
> +static inline void lock_dir(struct inode *dir, struct qstr *name)
> +{
> +	down(lock_sem(dir, name));
> +}
> +
> @@ -1182,12 +1204,26 @@
>  /*
>   * p1 and p2 should be directories on the same fs.
>   */
> -struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
> +struct dentry *lock_rename(struct dentry *p1, struct qstr *n1,
> +				struct dentry *p2, struct qstr *n2)
>  {
>  	struct dentry *p;
>
>  	if (p1 == p2) {
> -		down(&p1->d_inode->i_sem);
> +		if (IS_PDIROPS(p1->d_inode)) {
> +			unsigned int h1, h2;
> +			h1 = n1->hash % p1->d_inode->i_sb->s_pdirops_size;
> +			h2 = n2->hash % p2->d_inode->i_sb->s_pdirops_size;
> +			if (h1 < h2) {
> +				lock_dir(p1->d_inode, n1);
> +				lock_dir(p2->d_inode, n2);
> +			} else if (h1 > h2) {
> +				lock_dir(p2->d_inode, n2);
> +				lock_dir(p1->d_inode, n1);
> +			} else
> +				lock_dir(p1->d_inode, n1);
> +		} else
> +			down(&p1->d_inode->i_sem);
>  		return NULL;
>  	}
>
> @@ -1195,31 +1231,35 @@
>
>  	for (p = p1; p->d_parent != p; p = p->d_parent) {
>  		if (p->d_parent == p2) {
> -			down(&p2->d_inode->i_sem);
> -			down(&p1->d_inode->i_sem);
> +			lock_dir(p2->d_inode, n2);
> +			lock_dir(p1->d_inode, n1);
>  			return p;
>  		}
>  	}
>
>  	for (p = p2; p->d_parent != p; p = p->d_parent) {
>  		if (p->d_parent == p1) {
> -			down(&p1->d_inode->i_sem);
> -			down(&p2->d_inode->i_sem);
> +			lock_dir(p1->d_inode, n1);
> +			lock_dir(p2->d_inode, n2);
>  			return p;
>  		}
>  	}
>
> -	down(&p1->d_inode->i_sem);
> -	down(&p2->d_inode->i_sem);
> +	lock_dir(p1->d_inode, n1);
> +	lock_dir(p2->d_inode, n2);
>  	return NULL;
>  }

With luck you have s_pdirops_size (or 1024) different renames
concurrently altering one directory inode. Therefore you need a lock
protecting your filesystem data. This is basically the job done by
i_sem. So in my opinion you only move The Problem from the VFS to the
low-level filesystems. But then there is no need for i_sem or your
s_pdirops_sems anymore.

Regards,
Jan
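
The same-directory branch of the quoted lock_rename() relies on always taking the two hashed locks in ascending bucket order, and on taking only one lock when both names fall into the same bucket. A userspace sketch of just that ordering rule, using pthread mutexes instead of kernel semaphores, could look like this. NBUCKETS is set to the 1024 mentioned in the thread, and buckets_init(), lock_two() and unlock_two() are hypothetical names for this illustration.

```c
#include <pthread.h>

#define NBUCKETS 1024 /* stands in for sb->s_pdirops_size */

static pthread_mutex_t bucket[NBUCKETS];

static void buckets_init(void)
{
	int i;

	for (i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&bucket[i], NULL);
}

/*
 * Lock the buckets for two name hashes in ascending bucket order.
 * Returns the number of locks taken: 1 if both hashes share a bucket,
 * otherwise 2.
 */
static int lock_two(unsigned int hash1, unsigned int hash2)
{
	unsigned int h1 = hash1 % NBUCKETS;
	unsigned int h2 = hash2 % NBUCKETS;

	if (h1 == h2) {
		pthread_mutex_lock(&bucket[h1]);
		return 1;
	}
	if (h1 < h2) {
		pthread_mutex_lock(&bucket[h1]);
		pthread_mutex_lock(&bucket[h2]);
	} else {
		pthread_mutex_lock(&bucket[h2]);
		pthread_mutex_lock(&bucket[h1]);
	}
	return 2;
}

/* Release whatever lock_two() took for the same pair of hashes. */
static void unlock_two(unsigned int hash1, unsigned int hash2)
{
	unsigned int h1 = hash1 % NBUCKETS;
	unsigned int h2 = hash2 % NBUCKETS;

	pthread_mutex_unlock(&bucket[h1]);
	if (h1 != h2)
		pthread_mutex_unlock(&bucket[h2]);
}
```

The ordering only prevents deadlock between the hashed locks themselves. It does not address the objection above that the directory's own on-disk data still needs a lock once several names in one directory can be modified in parallel.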