Enclosed 8 patches that implement shared subtree functionality as detailed in Al Viro's RFC found at http://lwn.net/Articles/119232/
I have incorporated all the comments received earlier in first round. Thanks to Miklos and Pekka for the valuable comments. Also I have optimized lots of code, especially in pnode.c . Code is unit tested. However the code in its current form does not handle ENOMEM error gracefully. I am working on it. The incremental patches provide the following functionality: 1) shared_private_slave.patch : Provides the ability to mark a subtree as shared or private or slave. 2) unclone.patch : provides the ability to mark a subtree as unclonable. NOTE: this feature is an addition to Al Viro's RFC, to solve the vfsmount explosion. The problem is detailed here: http://www.ussg.iu.edu/hypermail/linux/kernel/0502.0/0468.html 3) rbind.patch : this patch adds the ability to propagate binds/rbinds across vfsmounts. 4) move.patch : this patch provides the ability to move a shared/private/slave/unclonable subtree to some other mount-point. It also provides the same feature to pivot_root() 5) umount.patch: this patch provides the ability to propagate unmounts. 6) namespace.patch: this patch provides ability to clone a namespace, with propagation set to vfsmounts in the new namespace. 7) automount.patch: this patch provides the automatic propagation for mounts/unmounts done through automounter. 8) pnode_opt.patch: this patch optimizes the redundant code in pnode.c . Looking forward for comments, RP ----------------------------------------------------------------------- CHANGES DONE IN RESPONSE TO COMMENTS RECEIVED IN 1ST ROUND, response to Pekka J Enberg Comments: >>Inlining the patches to email would be greatly appreciated. Here are >>some comments. done > +int > +_do_make_mounted(struct nameidata *nd, struct vfsmount **mnt) >>Use two underscores to follow naming conventions. Yes done. In fact this function is renamed as make_mounted because that seemed to make more sense. But in general throughout the patches I have changed all newly introduced function that start with one underscore to two underscores. > Index: 2.6.12/fs/pnode.c > =================================================================== > --- /dev/null > +++ 2.6.12/fs/pnode.c > @@ -0,0 +1,362 @@ > + > +#define PNODE_MEMBER_VFS 0x01 > +#define PNODE_SLAVE_VFS 0x02 >>Enums, please. done > + > +static kmem_cache_t * pnode_cachep; > + > +/* spinlock for pnode related operations */ > + __cacheline_aligned_in_smp DEFINE_SPINLOCK(vfspnode_lock); > + > + > +static void > +pnode_init_fn(void *data, kmem_cache_t *cachep, unsigned long flags) > +{ > + struct vfspnode *pnode = (struct vfspnode *)data; >>Redundant cast. yes. removed. > + INIT_LIST_HEAD(&pnode->pnode_vfs); > + INIT_LIST_HEAD(&pnode->pnode_slavevfs); > + INIT_LIST_HEAD(&pnode->pnode_slavepnode); > + INIT_LIST_HEAD(&pnode->pnode_peer_slave); > + pnode->pnode_master = NULL; > + pnode->pnode_flags = 0; > + atomic_set(&pnode->pnode_count,0); > +} > + > +void __init > +pnode_init(unsigned long mempages) > +{ > + pnode_cachep = kmem_cache_create("pnode_cache", > + sizeof(struct vfspnode), 0, > + SLAB_HWCACHE_ALIGN|SLAB_PANIC, pnode_init_fn, NULL); > +} > + > + > +struct vfspnode * > +pnode_alloc(void) > +{ > + struct vfspnode *pnode = (struct vfspnode *)kmem_cache_alloc( > + pnode_cachep, GFP_KERNEL); >>Redundant cast. yes removed. > +struct inoutdata { >>Wants a better name. This datastructure is gone after optimizing pnode.c > + void *my_data; /* produced and consumed by me */ > + void *in_data; /* produced by master, consumed by slave */ > + void *out_data; /* produced by slave, comsume by master */ > +}; > + > +struct pcontext { > + struct vfspnode *start; > + int flag; > + int traversal; > + int level; > + struct vfspnode *master_pnode; > + struct vfspnode *pnode; > + struct vfspnode *slave_pnode; > +}; > + > + > +#define PNODE_UP 1 > +#define PNODE_DOWN 2 > +#define PNODE_MID 3 >>Enums, please. These #defines are gone after the optimizations and cleanup. > + > +/* > + * Walk the pnode tree for each pnode encountered. A given pnode in the tree > + * can be returned a minimum of 2 times. First time the pnode is > encountered, > + * it is returned with the flag PNODE_DOWN. Every time the pnode is > encountered > + * after having traversed through each of its children, it is returned with > the > + * flag PNODE_MID. And finally when the pnode is encountered after having > + * walked all of its children, it is returned with the flag PNODE_UP. > + * > + * @context: provides context on the state of the last walk in the pnode > + * tree. > + */ > +static int inline > +pnode_next(struct pcontext *context) >>Rather large function to be an inline. yes. done. > +{ > + int traversal = context->traversal; > + int ret=0; > + struct vfspnode *pnode = context->pnode, > + *slave_pnode=context->slave_pnode, > + *master_pnode=context->master_pnode; >>Add a separate declaration for each variable. The above is hard to read. yes done. > + struct list_head *next; > + > + spin_lock(&vfspnode_lock); > + /* > + * the first traversal will be on the root pnode > + * with flag PNODE_DOWN > + */ > + if (!pnode) { > + context->pnode = get_pnode(context->start); > + context->master_pnode = NULL; > + context->traversal = PNODE_DOWN; > + context->slave_pnode = NULL; > + context->level = 0; > + ret = 1; > + goto out; > + } > + > + /* > + * if the last traversal was PNODE_UP, than the current > + * traversal is PNODE_MID on the master pnode. > + */ > + if (traversal == PNODE_UP) { > + if (!master_pnode) { > + /* DONE. return */ > + put_pnode(pnode); > + ret = 0; >> Using goto out and dropping the else branch would make this more readable. This code is also gone after optimization and cleanups. > + } else { > + context->traversal = PNODE_MID; > + context->level--; > + context->pnode = master_pnode; > + put_pnode(slave_pnode); > + context->slave_pnode = pnode; > + context->master_pnode = (master_pnode ? > + master_pnode->pnode_master : NULL); > + ret = 1; > + } > + } else { > + if(traversal == PNODE_MID) { >> Missing space before parenthesis. Yes ensured that throughout the patches. Hope I have not missed any. > + next = slave_pnode->pnode_peer_slave.next; > + } else { > + next = pnode->pnode_slavepnode.next; > + } >> Please drop the extra braces. Yes done. > + put_pnode(slave_pnode); > + context->slave_pnode = NULL; > + /* > + * if the last traversal was PNODE_MID or PNODE_DOWN, and the > + * master pnode has some slaves to traverse, the current > + * traversal will be PNODE_DOWN on the slave pnode. > + */ > + if ((next != &pnode->pnode_slavepnode) && > + (traversal == PNODE_DOWN || traversal == PNODE_MID)) { > + context->traversal = PNODE_DOWN; > + context->level++; > + context->pnode = get_pnode(list_entry(next, > + struct vfspnode, pnode_peer_slave)); > + context->master_pnode = pnode; > + ret = 1; > + } else { > + /* > + * since there are no more children, the current > traversal > + * is PNODE_UP on the same pnode > + */ > + context->traversal = PNODE_UP; > + ret = 1; >> Would probably make more sense to check if >> (next == &pnode->pnode_slavepnode && traversal == PNODE_UP) and use goto out >> to >> get rid of the else branch. Again after code-cleanup and optimization, this code is gone. > + } > + } > +out: > + spin_unlock(&vfspnode_lock); > + return ret; > +} > + > + > +static void > +_pnode_disassociate_mnt(struct vfsmount *mnt) >> Two underscores, please. Yes. Did it! > +struct vfsmount * > +pnode_make_mounted(struct vfspnode *pnode, struct vfsmount *mnt, struct > dentry *dentry) > +{ > + struct vfsmount *child_mnt; > + int ret=0,traversal,level; >> Spaces, please. Yes done. > + struct vfspnode *slave_pnode, *master_pnode, *child_pnode, > *slave_child_pnode; > + struct vfsmount *slave_mnt, *member_mnt, *t_m; >> Formatting damage. Yep. Done > + while(pnode_next(&context)) { >> Missing space before parenthesis. Done > + traversal = context.traversal; > + level = context.level; > + pnode = context.pnode; > + slave_pnode = context.slave_pnode; > + master_pnode = context.master_pnode; > + > + if (traversal == PNODE_DOWN ) { >> Use switch statement here. yep! did it. > + if (master_pnode) { > + child_pnode = (struct vfspnode > *)p_array[level-1].in_data; >> Redundant cast. done. > + } else { > + child_pnode = NULL; > + } >> Extra braces. done > + while (!(child_pnode = pnode_alloc())) > + schedule(); >> This looks dangerous. Why this must not fail and in other places you >> return -ENOMEM? I have changed it to return -ENOMEM. But its not perfect yet. The code is not graceful returning. It leaves around some data-structures. I am working on this last aspect currently. > + p_array[level].my_data = (void *)child_pnode; >> Redundant cast. Yes. done. > + child_pnode = (struct vfspnode *)p_array[level].my_data; >> Redundant cast. Yes. done. > +#define MS_PRIVATE 262144 > +#define MS_SLAVE 524288 > +#define MS_SHARED 1048576 >> The expression (1<<bit) would make more sense here. Yes. done. > Index: 2.6.12/include/linux/pnode.h > =================================================================== > --- /dev/null > +++ 2.6.12/include/linux/pnode.h > @@ -0,0 +1,78 @@ > +/* > + * linux/fs/pnode.c > + * > + * (C) Copyright IBM Corporation 2005. > + * Released under GPL v2. > + * > + */ > +#ifndef _LINUX_PNODE_H > +#define _LINUX_PNODE_H > +#ifdef __KERNEL__ >> No need for the above. Kernel headers are not supposed to be included by >> userspace. Removed it. > + > +#include <linux/list.h> > +#include <linux/mount.h> > +#include <linux/spinlock.h> > +#include <asm/atomic.h> > + > +struct vfspnode { > + struct list_head pnode_vfs; /* list of vfsmounts anchored here */ > + struct list_head pnode_slavevfs; /* list of slave vfsmounts */ > + struct list_head pnode_slavepnode;/* list of slave pnode */ > + struct list_head pnode_peer_slave;/* going through master's slave pnode > + list*/ > + struct vfspnode *pnode_master; /* master pnode */ > + int pnode_flags; > + atomic_t pnode_count; > +}; > +#define PNODE_MAX_SLAVE_LEVEL 10 > +#define PNODE_DELETE 0x01 > +#define PNODE_SLAVE 0x02 >> Enums, please. I have left it as is for now. These defines can also go. I will fix them. > + > +#define IS_PNODE_DELETE(pn) ((pn->pnode_flags&PNODE_DELETE)==PNODE_DELETE) > +#define IS_PNODE_SLAVE(pn) ((pn->pnode_flags&PNODE_SLAVE)==PNODE_SLAVE) > +#define SET_PNODE_DELETE(pn) pn->pnode_flags |= PNODE_DELETE > +#define SET_PNODE_SLAVE(pn) pn->pnode_flags |= PNODE_SLAVE >> Static inline functions are preferred over #define. done > + > +extern spinlock_t vfspnode_lock; > +extern void __put_pnode(struct vfspnode *); > + > +static inline struct vfspnode * > +get_pnode(struct vfspnode *pnode) > +{ > + if (!pnode) > + return NULL; >> Can pnode really be NULL here? Looking at the callers in this patch, it >> can't. >> Please remember that you should do NULL checks like this only when it makes >> sense from API point of view to call the function with NULL. Yes there are cases where it they can be called with null. If I don't have these checks, than the check has to be done in the caller. So I chose it here. > +//TOBEDONE WRITE BETTER MACROS. .. >>Please use static inline functions instead. yes done. Response to Miklos comments follows: > -struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry) > +struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, > struct dentry *root) > { >> How about changing it to inline and calling it __lookup_mnt_root(), >> and calling it from lookup_mnt() (which could keep the old signature) >> and lookup_mnt_root(). That way the compiler can optimize away the >> root check for the plain lookup_mnt() case, and there's no need to >> modify callers of lookup_mnt(). Yes. DOne. > > +struct vfsmount *do_make_mounted(struct vfsmount *mnt, struct dentry *dentry) > +{ >> What does this function do? Can we have a header comment? Yes added a header. > +int > +_do_make_mounted(struct nameidata *nd, struct vfsmount **mnt) > +{ >> Ditto. Added a header. > +/* > + * recursively change the type of the mountpoint. > + */ > +static int do_change_type(struct nameidata *nd, int flag) > +{ > + struct vfsmount *m, *mnt; > + struct vfspnode *old_pnode = NULL; > + int err; > + > + if (!(flag & MS_SHARED) && !(flag & MS_PRIVATE) > + && !(flag & MS_SLAVE)) > + return -EINVAL; > + > + if ((err = _do_make_mounted(nd, &mnt))) > + return err; >> Why does this opertation do any mounting? If it's a type change, it >> should just change the type of something already mounted, no? Ok. Now it returns -EINVAL. > + case MS_SHARED: > + SET_MNT_PRIVATE(m); .... > + break; > + >> Can this be split into three functions? Yes split them into 3 functions. > +static int inline > +pnode_next(struct pcontext *context) > +{ >> Is such a generic traversal function really needed? Why? As I said in my earlier mail, this function is equivalent of next_mnt for traversing vfsmount tree. I have optimized this function as well as pnode_traverse(). Hopefully you may find the new code palatable. > +struct vfsmount * > +pnode_make_mounted(struct vfspnode *pnode, struct vfsmount *mnt, struct > dentry *dentry) > +{ >>Again a header comment would be nice, on what exactly this function >>does. Also the implementation is really cryptic, but I can't even >>start to decipher without knowing what it's supposed to do. Yes. added a header comment. > +static inline struct vfspnode * > +get_pnode_n(struct vfspnode *pnode, size_t n) > +{ >> Seems to be unused throughout the patch series True removed. > > + > static inline struct vfsmount *mntget(struct vfsmount *mnt) >> Please don't add empty lines. Yes done. ----------------------------------------------------------------------- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/