Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On Sat, Nov 01, 2014 at 06:30:37PM +, Al Viro wrote: > On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote: > > On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: > > > OK, interim branch (_completely_ untested, and there's quite a bit of > > > work remaining) is in vfs.git#nsfs. > > > > ... except that what got pushed was completely buggered - the last changes > > not committed *and* include/linux/ns_common.h not git-add'ed at the very > > beginning. Oh, well - it wasn't hard to reconstruct its history... > > > > Anyway, that much got fixed and pushed; sorry about the noise - shouldn't > > have posted before grabbing some sleep... > > And now it even seems to work. Poking in /proc/*/ns/*, mount --bind of those > suckers to regular files, unshare and nsenter (both by PID and by file, > including the ones we'd bound somewhere). Eric, could you hit it with > whatever testsuite you are using and see if it survives? > > It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2. The meat > of that sucker is in the next-to-last commit; the rest is preparation and > a minor cleanup. ping? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On Sat, Nov 01, 2014 at 06:30:37PM +, Al Viro wrote: On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote: On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: OK, interim branch (_completely_ untested, and there's quite a bit of work remaining) is in vfs.git#nsfs. ... except that what got pushed was completely buggered - the last changes not committed *and* include/linux/ns_common.h not git-add'ed at the very beginning. Oh, well - it wasn't hard to reconstruct its history... Anyway, that much got fixed and pushed; sorry about the noise - shouldn't have posted before grabbing some sleep... And now it even seems to work. Poking in /proc/*/ns/*, mount --bind of those suckers to regular files, unshare and nsenter (both by PID and by file, including the ones we'd bound somewhere). Eric, could you hit it with whatever testsuite you are using and see if it survives? It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2. The meat of that sucker is in the next-to-last commit; the rest is preparation and a minor cleanup. ping? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On Sat, Nov 01, 2014 at 06:38:02PM +, Al Viro wrote: > On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote: > > > >From your description I am concerned about using the letter combination > > >nsfs. I once used nsfd, and that was so close to nfsd that Linus got > > >confused, and hilarity ensued. nsfs isn't quite as bad but the > > >abbreviation still seems close enough to nfs that confusion could result. > > Well, yes, but... the only non-static object in there with "nsfs" in the > name is nsfs_init(). As for the filename itself... any better suggestions > would be welcome, but it doesn't really mess the completion tree. > > I've no strong preferences here - we might even move it into kernel/nsproxy.c. > In the latest variant it's ~160 lines... BTW, as an immediate followup, definition of struct nameidata can be taken to fs/namei.c now. nd_get_link/nd_set_link become exported, actual declaration of struct nameidata goes into fs/namei.c and for everything outside of fs/namei.c it becomes completely opaque. TBH, I'm somewhat tempted to do something even more extreme - try to treat it the way we treat pt_regs. I.e. have a pointer to such sucker in task_struct (NULL most of the time) and chain the instances. That way we'd be rid of passing that sucker even to ->follow_link() (and to nd_{set,get}_link() as well). path_init() would push the sucker to the top of the chain, path_finish(nd, base) (currently open-coded bunch of if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT)) path_put(>root); if (base) fput(base); ) - remove it from the chain. Not sure if it's worth doing - it's slightly less boilerplate in ->follow_link() and ->put_link() instances and it _might_ reduce the stack footprint in a bunch of places in fs/namei.c (including the recursion in there), but it might easily end up increasing it instead... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote: > >From your description I am concerned about using the letter combination > >nsfs. I once used nsfd, and that was so close to nfsd that Linus got > >confused, and hilarity ensued. nsfs isn't quite as bad but the > >abbreviation still seems close enough to nfs that confusion could result. Well, yes, but... the only non-static object in there with "nsfs" in the name is nsfs_init(). As for the filename itself... any better suggestions would be welcome, but it doesn't really mess the completion tree. I've no strong preferences here - we might even move it into kernel/nsproxy.c. In the latest variant it's ~160 lines... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote: > On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: > > OK, interim branch (_completely_ untested, and there's quite a bit of > > work remaining) is in vfs.git#nsfs. > > ... except that what got pushed was completely buggered - the last changes > not committed *and* include/linux/ns_common.h not git-add'ed at the very > beginning. Oh, well - it wasn't hard to reconstruct its history... > > Anyway, that much got fixed and pushed; sorry about the noise - shouldn't > have posted before grabbing some sleep... And now it even seems to work. Poking in /proc/*/ns/*, mount --bind of those suckers to regular files, unshare and nsenter (both by PID and by file, including the ones we'd bound somewhere). Eric, could you hit it with whatever testsuite you are using and see if it survives? It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2. The meat of that sucker is in the next-to-last commit; the rest is preparation and a minor cleanup. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On November 1, 2014 8:06:20 AM PDT, Al Viro wrote: >On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: >> OK, interim branch (_completely_ untested, and there's quite a bit of >> work remaining) is in vfs.git#nsfs. > >... except that what got pushed was completely buggered - the last >changes >not committed *and* include/linux/ns_common.h not git-add'ed at the >very >beginning. Oh, well - it wasn't hard to reconstruct its history... > >Anyway, that much got fixed and pushed; sorry about the noise - >shouldn't >have posted before grabbing some sleep... No worries. I am effectively without internet access right from now until about the 3rd week in November as I move so I really can't look anyway. >From your description I am concerned about using the letter combination nsfs. > I once used nsfd, and that was so close to nfsd that Linus got confused, and >hilarity ensued. nsfs isn't quite as bad but the abbreviation still seems >close enough to nfs that confusion could result. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: > OK, interim branch (_completely_ untested, and there's quite a bit of > work remaining) is in vfs.git#nsfs. ... except that what got pushed was completely buggered - the last changes not committed *and* include/linux/ns_common.h not git-add'ed at the very beginning. Oh, well - it wasn't hard to reconstruct its history... Anyway, that much got fixed and pushed; sorry about the noise - shouldn't have posted before grabbing some sleep... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On Mon, Oct 13, 2014 at 03:42:28PM -0700, Eric W. Biederman wrote: > >From looking at your proposed code that doesn't look like it will be > any more difficult to maintain from the namespace side. So I have no > objects with moving the code in that direction. > > > It's obviously a project for the next cycle and it'll require > > some cooperation between vfs and userns trees, but not too much of it - > > all we really need is a never-changed commit embedding that structure into > > all ...ns and passing _that_ to proc_{alloc,free}_inum() replacements > > Merged both into vfs.git #for-next and usern one. We can do that right > > after -rc1. Incidentally, it might make sense to move refcount into that > > common piece as well, but that's independent from the stuff above. > > That sounds like a reasonable direction to go. I think your schedule > may be a tad optimistic time-wise, if for no other reason that code > reviews take time, but that plan should work. OK, interim branch (_completely_ untested, and there's quite a bit of work remaining) is in vfs.git#nsfs. It will be refactored; moreover, most of that stuff will move out from fs/proc/namespaces.c - either into fs/nsfs.c or into kernel/nsproxy.c. And purely VFS followups are not even touched yet - I know what I want to do, and I think there's no obstacles left, but that's a separate story. "Filesystem for ns dentries/inodes" part is done, though (modulo debugging and moving it out of fs/proc). I wonder if we would be better off not using proc_alloc_inum() there - it's not hard to do a separate allocator, especially since there's no requirements re non-intersecting sets of inumbers for procfs and for this. That's the last part shared with procfs... Anyway, I'm going down right now (nearly 5am here), so further fun will have to wait a bit... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On Mon, Oct 13, 2014 at 03:42:28PM -0700, Eric W. Biederman wrote: From looking at your proposed code that doesn't look like it will be any more difficult to maintain from the namespace side. So I have no objects with moving the code in that direction. It's obviously a project for the next cycle and it'll require some cooperation between vfs and userns trees, but not too much of it - all we really need is a never-changed commit embedding that structure into all ...ns and passing _that_ to proc_{alloc,free}_inum() replacements Merged both into vfs.git #for-next and usern one. We can do that right after -rc1. Incidentally, it might make sense to move refcount into that common piece as well, but that's independent from the stuff above. That sounds like a reasonable direction to go. I think your schedule may be a tad optimistic time-wise, if for no other reason that code reviews take time, but that plan should work. OK, interim branch (_completely_ untested, and there's quite a bit of work remaining) is in vfs.git#nsfs. It will be refactored; moreover, most of that stuff will move out from fs/proc/namespaces.c - either into fs/nsfs.c or into kernel/nsproxy.c. And purely VFS followups are not even touched yet - I know what I want to do, and I think there's no obstacles left, but that's a separate story. Filesystem for ns dentries/inodes part is done, though (modulo debugging and moving it out of fs/proc). I wonder if we would be better off not using proc_alloc_inum() there - it's not hard to do a separate allocator, especially since there's no requirements re non-intersecting sets of inumbers for procfs and for this. That's the last part shared with procfs... Anyway, I'm going down right now (nearly 5am here), so further fun will have to wait a bit... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: OK, interim branch (_completely_ untested, and there's quite a bit of work remaining) is in vfs.git#nsfs. ... except that what got pushed was completely buggered - the last changes not committed *and* include/linux/ns_common.h not git-add'ed at the very beginning. Oh, well - it wasn't hard to reconstruct its history... Anyway, that much got fixed and pushed; sorry about the noise - shouldn't have posted before grabbing some sleep... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On November 1, 2014 8:06:20 AM PDT, Al Viro v...@zeniv.linux.org.uk wrote: On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: OK, interim branch (_completely_ untested, and there's quite a bit of work remaining) is in vfs.git#nsfs. ... except that what got pushed was completely buggered - the last changes not committed *and* include/linux/ns_common.h not git-add'ed at the very beginning. Oh, well - it wasn't hard to reconstruct its history... Anyway, that much got fixed and pushed; sorry about the noise - shouldn't have posted before grabbing some sleep... No worries. I am effectively without internet access right from now until about the 3rd week in November as I move so I really can't look anyway. From your description I am concerned about using the letter combination nsfs. I once used nsfd, and that was so close to nfsd that Linus got confused, and hilarity ensued. nsfs isn't quite as bad but the abbreviation still seems close enough to nfs that confusion could result. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote: On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote: OK, interim branch (_completely_ untested, and there's quite a bit of work remaining) is in vfs.git#nsfs. ... except that what got pushed was completely buggered - the last changes not committed *and* include/linux/ns_common.h not git-add'ed at the very beginning. Oh, well - it wasn't hard to reconstruct its history... Anyway, that much got fixed and pushed; sorry about the noise - shouldn't have posted before grabbing some sleep... And now it even seems to work. Poking in /proc/*/ns/*, mount --bind of those suckers to regular files, unshare and nsenter (both by PID and by file, including the ones we'd bound somewhere). Eric, could you hit it with whatever testsuite you are using and see if it survives? It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2. The meat of that sucker is in the next-to-last commit; the rest is preparation and a minor cleanup. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote: From your description I am concerned about using the letter combination nsfs. I once used nsfd, and that was so close to nfsd that Linus got confused, and hilarity ensued. nsfs isn't quite as bad but the abbreviation still seems close enough to nfs that confusion could result. Well, yes, but... the only non-static object in there with nsfs in the name is nsfs_init(). As for the filename itself... any better suggestions would be welcome, but it doesn't really mess the completion tree. I've no strong preferences here - we might even move it into kernel/nsproxy.c. In the latest variant it's ~160 lines... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On Sat, Nov 01, 2014 at 06:38:02PM +, Al Viro wrote: On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote: From your description I am concerned about using the letter combination nsfs. I once used nsfd, and that was so close to nfsd that Linus got confused, and hilarity ensued. nsfs isn't quite as bad but the abbreviation still seems close enough to nfs that confusion could result. Well, yes, but... the only non-static object in there with nsfs in the name is nsfs_init(). As for the filename itself... any better suggestions would be welcome, but it doesn't really mess the completion tree. I've no strong preferences here - we might even move it into kernel/nsproxy.c. In the latest variant it's ~160 lines... BTW, as an immediate followup, definition of struct nameidata can be taken to fs/namei.c now. nd_get_link/nd_set_link become exported, actual declaration of struct nameidata goes into fs/namei.c and for everything outside of fs/namei.c it becomes completely opaque. TBH, I'm somewhat tempted to do something even more extreme - try to treat it the way we treat pt_regs. I.e. have a pointer to such sucker in task_struct (NULL most of the time) and chain the instances. That way we'd be rid of passing that sucker even to -follow_link() (and to nd_{set,get}_link() as well). path_init() would push the sucker to the top of the chain, path_finish(nd, base) (currently open-coded bunch of if (nd-root.mnt !(nd-flags LOOKUP_ROOT)) path_put(nd-root); if (base) fput(base); ) - remove it from the chain. Not sure if it's worth doing - it's slightly less boilerplate in -follow_link() and -put_link() instances and it _might_ reduce the stack footprint in a bunch of places in fs/namei.c (including the recursion in there), but it might easily end up increasing it instead... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
On Mon, Oct 13, 2014 at 07:29:45AM +0100, Al Viro wrote: > there and get rid of pointless aliases. Oh, and we get a decent chance to > kill d_instantiate_unique(), which is also nice. Or at least fold it into > d_add_unique(), if we can't kill that sucker as well. And if we manage to > kill it outright, we get rid of __d_instantiate_unique() for free - in my > local queue d_materialise_unique() had been subsumed by d_splice_alias(), > getting rid of the other caller of __d_instantiate_unique(). We have *WAY* > too many primitives in that area, and trimming that forest is definitely > a good thing. FWIW, looking at that only caller of d_add_unique(), I really think that it's killable. Call graph for that sucker is d_add_unique() <- _nfs4_open_and_get_state <- _nfs4_do_open <- nfs4_do_open <- nfs4_atomic_open = nfs_rpc_ops.open_context <- nfs_atomic_open = inode_operations.atomic_open <- nfs4_file_open = file_operations.open <- nfs4_proc_create = nfs_rpc_ops.create <- nfs_creat = inode_operations.create Now, to have file_operations.open called, we need dentry to be pinned down and positive. In that case d_add_unique() isn't called, obviously. OTOH, inode_operations.create and inode_operations.atomic_open are called with parent directory having its ->i_mutex held and we have every right to just do d_splice_alias() in both cases. IOW, d_add_unique() can be simply killed, and with aforementioned change to fs/proc/namespaces.c we can kill d_instantiate_unique() and __d_instantiate_unique() along with it. OK, that drives the number of primitives further down... Actually, I suspect that a bunch of places calling d_add() in their ->lookup() instances should call d_splice_alias() instead - it won't cost more unless the inode happens to be a directory with existing aliases. And to hell with "we should use d_splice_alias() if it's exportable, no, make it d_materialise_unique() if tree topology might change unexpectedly, but if neither is true we can just use d_add() and return NULL" - it's too complicated for no good reason. "Always use d_splice_alias() in lookups" make a lot more sense... Oh, and I strongly suspect that d_instantiate() ought to be taught to act the same way wrt directory aliases (i.e. relocate if one is found). Which kills d_instantiate_no_diralias() (only one caller for that one) and closes very narrow nfsd races in ext2_mkdir() and friends... We really have too many primitives in that area; in the beginning it was just d_instantiate() for object creation (when dentry was already hashed) and d_add() for lookups (when dentry was hashed in addition to being tied to inode). It served well for local never-exported filesystems, but once we added knfsd (and then open-by-fhandle) the variants started to breed like rabbits. Filesystems with tree topology that might be silently changed behind our backs made it only worse... The real reason why we didn't just change d_add()/d_instantiate() behaviour back then was that in "the inode is a directory with existing aliases" case we need to go with another dentry (after moving it in place of ours), which changes the calling conventions. OTOH, in that case calling d_add() or d_instantiate() currently means a serious bug - we end up with multiple aliases for a directory, which should never happen. Hell knows, maybe it's feasible to change d_add() and d_instantiate() calling conventions and really fold all that shit back into them... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries
Al Viro writes: > But more fundamentally, this stuff has no business being in procfs. The > *only* reason why we are putting them there (and get those inodes and > dentries duplicated for different procfs instances) is this in > do_loopback(): > if (!check_mnt(parent) || !check_mnt(old)) > goto out2; > You want to be able to bind those suckers by mount --bind. Which is an > odd API design (and it gives you another headache from hell with > mnt_ns_loop()) > but it's too late to change, sadly ;-/ So you want ->follow_link() to > yield dentries matching some vfsmount in our namespace. The things would be > much simpler if we didn't try that. > > Look: do_loopback() is *already* aware of those suckers. Has to be, due to > aforementioned mnt_ns_loop(). So we might as well weaken the constraints on > old - check_mnt(old) || proc_ns_inode(old_path.dentry->d_inode) would do. > The thing is, we calculate that proc_ns_inode() anyway. Truthfully mnt_ns_loop only exists for because we have these for the mount namespace. But I take your point. Your proposal sounds like it should work. After the bind mount is created we have a new struct mount so the existing chk_mnt checks won't get us into trouble. At some point we may need to allow multiple internal superblocks to allow having a preset inode number for the purposes of checkpoint/restart but I don't see that being a bottleneck to process migration anytime in the near future, and your design does not preclude that possibility. There was a potential issue with your earlier suggestion to modify check_mnt to only include mounted filesystems as otherwise there is an issue with references to filesystems unmounted with MNT_DETACH being able to be bind mounted. Having an explicit proc_ns_inode test avoids those issues. > Now, suppose we'd done that. In principle, it allows some things that are > not allowed right now - you could open such file and pass it (via SCM_RIGHTS) > to a process in another namespace. With the current tree you can't bind > that bugger via /proc/self/fd/ - it will have alien vfsmount. After such > change it will become allowed. Is there any problem with that? I don't see > any (after all, recepient might have held it open and anyone who could > get to that sucker via /proc//fd/ could've simply stopped the > recepient > and made it pass the damn thing to them - being able to ptrace is enough > for that). Am I missing anything? The only thing you can't do with a passed file descriptor today is to mount it. Everything else is already doable in any context, and all mounting does is give you the ability to keep the namespace alive. Something you can already do by just keeping the file descriptor open. These filedescriptors are guarded by directory permissions so if /proc//fd had more open permissions I might be concerned, but as it stands only the user of the process possesors of the appropriate capability can access that directory. So I really think we are good. Thank you for making such a careful analysis. > *IF* we can get away with such change, everything becomes much easier. We > can add a filesystem just for those guys - one for the entire system. And > kern_mount() it. Don't bother with register_filesystem() - no point showing > it in the /proc/filesystems, etc. No need to abuse procfs inodes either. > Or procfs icache, for that matter. All we need is a small standard piece > embedded into each {mnt,net,pid,user,...}ns. > > Namely, let's stash a pointer to such dentry (if any) into that standard > piece. > Not a counting reference, of course, and used in a very limited way. > * have ->d_prune() for those guys replace the stashed pointer with NULL. > * on the "get dentry for ns" side, > again: > rcu_read_lock() > d = atomic_long_read(stashed pointer) > if (!d) > goto slow; > if (!lockref_get_not_dead(>d_lockref)) > goto slow; > rcu_read_unlock(); > return d; > slow: > rcu_read_unlock(); > allocate new inode > allocate new dentry > d_instantiate() > d = atomic_long_cmpxchg(stashed pointer, NULL, new dentry); > if (d) { > dput(new dentry); > cpu_relax();// might not be needed > goto again; > } > return new dentry; > I'd suggest having ->i_private point to that standard piece and storing > ns_ops and proc_inum in there. I think it is a shame that we don't have an equivalent of d_materialise_unique that does thoe whole d_find_alias logic non-directories. Sigh. But what should drive which functions to keep are the real and needs of filesystems not my weird namespace case. >From looking at your proposed code that doesn't look like it will be any more difficult to
[RFC] dealing with proc_ns_follow_link() and "namespace" dentries
proc_ns_follow_link() is unique among the ->follow_link() instances in several respects, some of them violating general asserts expected by all kinds of VFS code. First of all, it can lead to pairs with dentry being completely unrelated to vfsmount. Suppose we bind such a symlink over something else. ->follow_link() is called with nd->path set to directory we'd found the symlink in - after all, that's what the normal symlinks are interpreted against. The same symlink present in two different directories can resolve to different places. In this case we generate a dentry/inode in procfs and combine that dentry with vfsmount of parent. Which might be completely unrelated to procfs. The thing is, procfs isn't a good match for those inodes either. For one thing, they really don't give a damn which procfs instance are they in - if anything, they should be associated with ns. For another, using procfs icache for looking them up (by really strange inumbers) solves the problem only partially. d_instantiate_unique(), used there in attempt to keep dentries from breeding, does not do what you seem to expect it to do. Namely, it *never* picks an existing alias. Why? Because it looks for aliases with the same ->d_parent (and name) as the dentry we'd given it. And that dentry comes from d_alloc_pseudo(), i.e. it has itself for ->d_parent. *All* aliases for those inodes happen that way. IOW, none of them have the same ->d_parent, and d_instantiate_unique() here is simply a fancy way to spin through all existing aliases and call d_instantiate() on our new one. Quick experiment: main() { int i; for (i = 0; i < 10; i++) { open("/proc/self/ns/mnt", 0); if (!(i % 1000)) system("grep dentry /proc/slabinfo"); } run with ulimit -n 10 and watch it create a new dentry per open(). Profile of that (sans system(3)) is also rather interesting - each d_instantiate_unique() is O(N), so we spend a quadratic time in that sucker for no good reason whatsoever. If nothing else, d = d_find_any_alias(inode); if (d) {dput(dentry);return d;} d_set_d_op(dentry, ...); d_instantiate(dentry, inode); return dentry; would be a much better strategy - *this* would almost always find an alias when one is to be found. We might occasionally get more than one (when several processes race), but that beats the hell out of easily getting millions of them... But more fundamentally, this stuff has no business being in procfs. The *only* reason why we are putting them there (and get those inodes and dentries duplicated for different procfs instances) is this in do_loopback(): if (!check_mnt(parent) || !check_mnt(old)) goto out2; You want to be able to bind those suckers by mount --bind. Which is an odd API design (and it gives you another headache from hell with mnt_ns_loop()) but it's too late to change, sadly ;-/ So you want ->follow_link() to yield dentries matching some vfsmount in our namespace. The things would be much simpler if we didn't try that. Look: do_loopback() is *already* aware of those suckers. Has to be, due to aforementioned mnt_ns_loop(). So we might as well weaken the constraints on old - check_mnt(old) || proc_ns_inode(old_path.dentry->d_inode) would do. The thing is, we calculate that proc_ns_inode() anyway. Now, suppose we'd done that. In principle, it allows some things that are not allowed right now - you could open such file and pass it (via SCM_RIGHTS) to a process in another namespace. With the current tree you can't bind that bugger via /proc/self/fd/ - it will have alien vfsmount. After such change it will become allowed. Is there any problem with that? I don't see any (after all, recepient might have held it open and anyone who could get to that sucker via /proc//fd/ could've simply stopped the recepient and made it pass the damn thing to them - being able to ptrace is enough for that). Am I missing anything? *IF* we can get away with such change, everything becomes much easier. We can add a filesystem just for those guys - one for the entire system. And kern_mount() it. Don't bother with register_filesystem() - no point showing it in the /proc/filesystems, etc. No need to abuse procfs inodes either. Or procfs icache, for that matter. All we need is a small standard piece embedded into each {mnt,net,pid,user,...}ns. Namely, let's stash a pointer to such dentry (if any) into that standard piece. Not a counting reference, of course, and used in a very limited way. * have ->d_prune() for those guys replace the stashed pointer with NULL. * on the "get dentry for ns" side, again: rcu_read_lock() d = atomic_long_read(stashed pointer) if (!d) goto slow; if (!lockref_get_not_dead(>d_lockref)) goto slow; rcu_read_unlock();
[RFC] dealing with proc_ns_follow_link() and namespace dentries
proc_ns_follow_link() is unique among the -follow_link() instances in several respects, some of them violating general asserts expected by all kinds of VFS code. First of all, it can lead to vfsmount,dentry pairs with dentry being completely unrelated to vfsmount. Suppose we bind such a symlink over something else. -follow_link() is called with nd-path set to directory we'd found the symlink in - after all, that's what the normal symlinks are interpreted against. The same symlink present in two different directories can resolve to different places. In this case we generate a dentry/inode in procfs and combine that dentry with vfsmount of parent. Which might be completely unrelated to procfs. The thing is, procfs isn't a good match for those inodes either. For one thing, they really don't give a damn which procfs instance are they in - if anything, they should be associated with ns. For another, using procfs icache for looking them up (by really strange inumbers) solves the problem only partially. d_instantiate_unique(), used there in attempt to keep dentries from breeding, does not do what you seem to expect it to do. Namely, it *never* picks an existing alias. Why? Because it looks for aliases with the same -d_parent (and name) as the dentry we'd given it. And that dentry comes from d_alloc_pseudo(), i.e. it has itself for -d_parent. *All* aliases for those inodes happen that way. IOW, none of them have the same -d_parent, and d_instantiate_unique() here is simply a fancy way to spin through all existing aliases and call d_instantiate() on our new one. Quick experiment: main() { int i; for (i = 0; i 10; i++) { open(/proc/self/ns/mnt, 0); if (!(i % 1000)) system(grep dentry /proc/slabinfo); } run with ulimit -n 10 and watch it create a new dentry per open(). Profile of that (sans system(3)) is also rather interesting - each d_instantiate_unique() is O(N), so we spend a quadratic time in that sucker for no good reason whatsoever. If nothing else, d = d_find_any_alias(inode); if (d) {dput(dentry);return d;} d_set_d_op(dentry, ...); d_instantiate(dentry, inode); return dentry; would be a much better strategy - *this* would almost always find an alias when one is to be found. We might occasionally get more than one (when several processes race), but that beats the hell out of easily getting millions of them... But more fundamentally, this stuff has no business being in procfs. The *only* reason why we are putting them there (and get those inodes and dentries duplicated for different procfs instances) is this in do_loopback(): if (!check_mnt(parent) || !check_mnt(old)) goto out2; You want to be able to bind those suckers by mount --bind. Which is an odd API design (and it gives you another headache from hell with mnt_ns_loop()) but it's too late to change, sadly ;-/ So you want -follow_link() to yield dentries matching some vfsmount in our namespace. The things would be much simpler if we didn't try that. Look: do_loopback() is *already* aware of those suckers. Has to be, due to aforementioned mnt_ns_loop(). So we might as well weaken the constraints on old - check_mnt(old) || proc_ns_inode(old_path.dentry-d_inode) would do. The thing is, we calculate that proc_ns_inode() anyway. Now, suppose we'd done that. In principle, it allows some things that are not allowed right now - you could open such file and pass it (via SCM_RIGHTS) to a process in another namespace. With the current tree you can't bind that bugger via /proc/self/fd/n - it will have alien vfsmount. After such change it will become allowed. Is there any problem with that? I don't see any (after all, recepient might have held it open and anyone who could get to that sucker via /proc/pid/fd/n could've simply stopped the recepient and made it pass the damn thing to them - being able to ptrace is enough for that). Am I missing anything? *IF* we can get away with such change, everything becomes much easier. We can add a filesystem just for those guys - one for the entire system. And kern_mount() it. Don't bother with register_filesystem() - no point showing it in the /proc/filesystems, etc. No need to abuse procfs inodes either. Or procfs icache, for that matter. All we need is a small standard piece embedded into each {mnt,net,pid,user,...}ns. Namely, let's stash a pointer to such dentry (if any) into that standard piece. Not a counting reference, of course, and used in a very limited way. * have -d_prune() for those guys replace the stashed pointer with NULL. * on the get dentry for ns side, again: rcu_read_lock() d = atomic_long_read(stashed pointer) if (!d) goto slow; if (!lockref_get_not_dead(d-d_lockref)) goto slow;
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
Al Viro v...@zeniv.linux.org.uk writes: But more fundamentally, this stuff has no business being in procfs. The *only* reason why we are putting them there (and get those inodes and dentries duplicated for different procfs instances) is this in do_loopback(): if (!check_mnt(parent) || !check_mnt(old)) goto out2; You want to be able to bind those suckers by mount --bind. Which is an odd API design (and it gives you another headache from hell with mnt_ns_loop()) but it's too late to change, sadly ;-/ So you want -follow_link() to yield dentries matching some vfsmount in our namespace. The things would be much simpler if we didn't try that. Look: do_loopback() is *already* aware of those suckers. Has to be, due to aforementioned mnt_ns_loop(). So we might as well weaken the constraints on old - check_mnt(old) || proc_ns_inode(old_path.dentry-d_inode) would do. The thing is, we calculate that proc_ns_inode() anyway. Truthfully mnt_ns_loop only exists for because we have these for the mount namespace. But I take your point. Your proposal sounds like it should work. After the bind mount is created we have a new struct mount so the existing chk_mnt checks won't get us into trouble. At some point we may need to allow multiple internal superblocks to allow having a preset inode number for the purposes of checkpoint/restart but I don't see that being a bottleneck to process migration anytime in the near future, and your design does not preclude that possibility. There was a potential issue with your earlier suggestion to modify check_mnt to only include mounted filesystems as otherwise there is an issue with references to filesystems unmounted with MNT_DETACH being able to be bind mounted. Having an explicit proc_ns_inode test avoids those issues. Now, suppose we'd done that. In principle, it allows some things that are not allowed right now - you could open such file and pass it (via SCM_RIGHTS) to a process in another namespace. With the current tree you can't bind that bugger via /proc/self/fd/n - it will have alien vfsmount. After such change it will become allowed. Is there any problem with that? I don't see any (after all, recepient might have held it open and anyone who could get to that sucker via /proc/pid/fd/n could've simply stopped the recepient and made it pass the damn thing to them - being able to ptrace is enough for that). Am I missing anything? The only thing you can't do with a passed file descriptor today is to mount it. Everything else is already doable in any context, and all mounting does is give you the ability to keep the namespace alive. Something you can already do by just keeping the file descriptor open. These filedescriptors are guarded by directory permissions so if /proc/pid/fd had more open permissions I might be concerned, but as it stands only the user of the process possesors of the appropriate capability can access that directory. So I really think we are good. Thank you for making such a careful analysis. *IF* we can get away with such change, everything becomes much easier. We can add a filesystem just for those guys - one for the entire system. And kern_mount() it. Don't bother with register_filesystem() - no point showing it in the /proc/filesystems, etc. No need to abuse procfs inodes either. Or procfs icache, for that matter. All we need is a small standard piece embedded into each {mnt,net,pid,user,...}ns. Namely, let's stash a pointer to such dentry (if any) into that standard piece. Not a counting reference, of course, and used in a very limited way. * have -d_prune() for those guys replace the stashed pointer with NULL. * on the get dentry for ns side, again: rcu_read_lock() d = atomic_long_read(stashed pointer) if (!d) goto slow; if (!lockref_get_not_dead(d-d_lockref)) goto slow; rcu_read_unlock(); return d; slow: rcu_read_unlock(); allocate new inode allocate new dentry d_instantiate() d = atomic_long_cmpxchg(stashed pointer, NULL, new dentry); if (d) { dput(new dentry); cpu_relax();// might not be needed goto again; } return new dentry; I'd suggest having -i_private point to that standard piece and storing ns_ops and proc_inum in there. I think it is a shame that we don't have an equivalent of d_materialise_unique that does thoe whole d_find_alias logic non-directories. Sigh. But what should drive which functions to keep are the real and needs of filesystems not my weird namespace case. From looking at your proposed code that doesn't look like it will be any more difficult to maintain from the namespace side. So I
Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries
On Mon, Oct 13, 2014 at 07:29:45AM +0100, Al Viro wrote: there and get rid of pointless aliases. Oh, and we get a decent chance to kill d_instantiate_unique(), which is also nice. Or at least fold it into d_add_unique(), if we can't kill that sucker as well. And if we manage to kill it outright, we get rid of __d_instantiate_unique() for free - in my local queue d_materialise_unique() had been subsumed by d_splice_alias(), getting rid of the other caller of __d_instantiate_unique(). We have *WAY* too many primitives in that area, and trimming that forest is definitely a good thing. FWIW, looking at that only caller of d_add_unique(), I really think that it's killable. Call graph for that sucker is d_add_unique() - _nfs4_open_and_get_state - _nfs4_do_open - nfs4_do_open - nfs4_atomic_open = nfs_rpc_ops.open_context - nfs_atomic_open = inode_operations.atomic_open - nfs4_file_open = file_operations.open - nfs4_proc_create = nfs_rpc_ops.create - nfs_creat = inode_operations.create Now, to have file_operations.open called, we need dentry to be pinned down and positive. In that case d_add_unique() isn't called, obviously. OTOH, inode_operations.create and inode_operations.atomic_open are called with parent directory having its -i_mutex held and we have every right to just do d_splice_alias() in both cases. IOW, d_add_unique() can be simply killed, and with aforementioned change to fs/proc/namespaces.c we can kill d_instantiate_unique() and __d_instantiate_unique() along with it. OK, that drives the number of primitives further down... Actually, I suspect that a bunch of places calling d_add() in their -lookup() instances should call d_splice_alias() instead - it won't cost more unless the inode happens to be a directory with existing aliases. And to hell with we should use d_splice_alias() if it's exportable, no, make it d_materialise_unique() if tree topology might change unexpectedly, but if neither is true we can just use d_add() and return NULL - it's too complicated for no good reason. Always use d_splice_alias() in lookups make a lot more sense... Oh, and I strongly suspect that d_instantiate() ought to be taught to act the same way wrt directory aliases (i.e. relocate if one is found). Which kills d_instantiate_no_diralias() (only one caller for that one) and closes very narrow nfsd races in ext2_mkdir() and friends... We really have too many primitives in that area; in the beginning it was just d_instantiate() for object creation (when dentry was already hashed) and d_add() for lookups (when dentry was hashed in addition to being tied to inode). It served well for local never-exported filesystems, but once we added knfsd (and then open-by-fhandle) the variants started to breed like rabbits. Filesystems with tree topology that might be silently changed behind our backs made it only worse... The real reason why we didn't just change d_add()/d_instantiate() behaviour back then was that in the inode is a directory with existing aliases case we need to go with another dentry (after moving it in place of ours), which changes the calling conventions. OTOH, in that case calling d_add() or d_instantiate() currently means a serious bug - we end up with multiple aliases for a directory, which should never happen. Hell knows, maybe it's feasible to change d_add() and d_instantiate() calling conventions and really fold all that shit back into them... -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/