Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-11-24 Thread Al Viro
On Sat, Nov 01, 2014 at 06:30:37PM +, Al Viro wrote:
> On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote:
> > On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
> > > OK, interim branch (_completely_ untested, and there's quite a bit of
> > > work remaining) is in vfs.git#nsfs.
> > 
> > ... except that what got pushed was completely buggered - the last changes
> > not committed *and* include/linux/ns_common.h not git-add'ed at the very
> > beginning.  Oh, well - it wasn't hard to reconstruct its history...
> > 
> > Anyway, that much got fixed and pushed; sorry about the noise - shouldn't
> > have posted before grabbing some sleep...
> 
> And now it even seems to work.  Poking in /proc/*/ns/*, mount --bind of those
> suckers to regular files, unshare and nsenter (both by PID and by file,
> including the ones we'd bound somewhere).  Eric, could you hit it with
> whatever testsuite you are using and see if it survives?
> 
> It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2.  The meat
> of that sucker is in the next-to-last commit; the rest is preparation and
> a minor cleanup.

ping?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-11-24 Thread Al Viro
On Sat, Nov 01, 2014 at 06:30:37PM +, Al Viro wrote:
 On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote:
  On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
   OK, interim branch (_completely_ untested, and there's quite a bit of
   work remaining) is in vfs.git#nsfs.
  
  ... except that what got pushed was completely buggered - the last changes
  not committed *and* include/linux/ns_common.h not git-add'ed at the very
  beginning.  Oh, well - it wasn't hard to reconstruct its history...
  
  Anyway, that much got fixed and pushed; sorry about the noise - shouldn't
  have posted before grabbing some sleep...
 
 And now it even seems to work.  Poking in /proc/*/ns/*, mount --bind of those
 suckers to regular files, unshare and nsenter (both by PID and by file,
 including the ones we'd bound somewhere).  Eric, could you hit it with
 whatever testsuite you are using and see if it survives?
 
 It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2.  The meat
 of that sucker is in the next-to-last commit; the rest is preparation and
 a minor cleanup.

ping?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 06:38:02PM +, Al Viro wrote:
> On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote:
> 
> > >From your description I am concerned about using the letter combination 
> > >nsfs.   I once used nsfd, and that was so close to nfsd that Linus got 
> > >confused, and hilarity ensued.   nsfs isn't quite as bad but the 
> > >abbreviation still seems close enough to nfs that confusion could result.
> 
> Well, yes, but... the only non-static object in there with "nsfs" in the
> name is nsfs_init().  As for the filename itself... any better suggestions
> would be welcome, but it doesn't really mess the completion tree.
> 
> I've no strong preferences here - we might even move it into kernel/nsproxy.c.
> In the latest variant it's ~160 lines...

BTW, as an immediate followup, definition of struct nameidata can be
taken to fs/namei.c now.  nd_get_link/nd_set_link become exported, actual
declaration of struct nameidata goes into fs/namei.c and for everything
outside of fs/namei.c it becomes completely opaque.

TBH, I'm somewhat tempted to do something even more extreme - try to treat
it the way we treat pt_regs.  I.e. have a pointer to such sucker in
task_struct (NULL most of the time) and chain the instances.  That way
we'd be rid of passing that sucker even to ->follow_link() (and to
nd_{set,get}_link() as well).  path_init() would push the sucker to the
top of the chain, path_finish(nd, base) (currently open-coded bunch of
if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT))
path_put(>root);
if (base)
fput(base);
) - remove it from the chain.  Not sure if it's worth doing - it's slightly
less boilerplate in ->follow_link() and ->put_link() instances and it _might_
reduce the stack footprint in a bunch of places in fs/namei.c (including the
recursion in there), but it might easily end up increasing it instead...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote:

> >From your description I am concerned about using the letter combination 
> >nsfs.   I once used nsfd, and that was so close to nfsd that Linus got 
> >confused, and hilarity ensued.   nsfs isn't quite as bad but the 
> >abbreviation still seems close enough to nfs that confusion could result.

Well, yes, but... the only non-static object in there with "nsfs" in the
name is nsfs_init().  As for the filename itself... any better suggestions
would be welcome, but it doesn't really mess the completion tree.

I've no strong preferences here - we might even move it into kernel/nsproxy.c.
In the latest variant it's ~160 lines...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote:
> On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
> > OK, interim branch (_completely_ untested, and there's quite a bit of
> > work remaining) is in vfs.git#nsfs.
> 
> ... except that what got pushed was completely buggered - the last changes
> not committed *and* include/linux/ns_common.h not git-add'ed at the very
> beginning.  Oh, well - it wasn't hard to reconstruct its history...
> 
> Anyway, that much got fixed and pushed; sorry about the noise - shouldn't
> have posted before grabbing some sleep...

And now it even seems to work.  Poking in /proc/*/ns/*, mount --bind of those
suckers to regular files, unshare and nsenter (both by PID and by file,
including the ones we'd bound somewhere).  Eric, could you hit it with
whatever testsuite you are using and see if it survives?

It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2.  The meat
of that sucker is in the next-to-last commit; the rest is preparation and
a minor cleanup.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-11-01 Thread Eric W.Biederman


On November 1, 2014 8:06:20 AM PDT, Al Viro  wrote:
>On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
>> OK, interim branch (_completely_ untested, and there's quite a bit of
>> work remaining) is in vfs.git#nsfs.
>
>... except that what got pushed was completely buggered - the last
>changes
>not committed *and* include/linux/ns_common.h not git-add'ed at the
>very
>beginning.  Oh, well - it wasn't hard to reconstruct its history...
>
>Anyway, that much got fixed and pushed; sorry about the noise -
>shouldn't
>have posted before grabbing some sleep...

No worries.  I am effectively without internet access right from now until 
about the 3rd week in November as I move so I really can't look anyway.

>From your description I am concerned about using the letter combination nsfs.  
> I once used nsfd, and that was so close to nfsd that Linus got confused, and 
>hilarity ensued.   nsfs isn't quite as bad but the abbreviation still seems 
>close enough to nfs that confusion could result.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
> OK, interim branch (_completely_ untested, and there's quite a bit of
> work remaining) is in vfs.git#nsfs.

... except that what got pushed was completely buggered - the last changes
not committed *and* include/linux/ns_common.h not git-add'ed at the very
beginning.  Oh, well - it wasn't hard to reconstruct its history...

Anyway, that much got fixed and pushed; sorry about the noise - shouldn't
have posted before grabbing some sleep...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-11-01 Thread Al Viro
On Mon, Oct 13, 2014 at 03:42:28PM -0700, Eric W. Biederman wrote:

> >From looking at your proposed code that doesn't look like it will be
> any more difficult to maintain from the namespace side.  So I have no
> objects with moving the code in that direction. 
> 
> > It's obviously a project for the next cycle and it'll require
> > some cooperation between vfs and userns trees, but not too much of it -
> > all we really need is a never-changed commit embedding that structure into
> > all ...ns and passing _that_ to proc_{alloc,free}_inum() replacements
> > Merged both into vfs.git #for-next and usern one.  We can do that right
> > after -rc1.  Incidentally, it might make sense to move refcount into that
> > common piece as well, but that's independent from the stuff above.
> 
> That sounds like a reasonable direction to go.  I think your schedule
> may be a tad optimistic time-wise, if for no other reason that code
> reviews take time, but that plan should work.

OK, interim branch (_completely_ untested, and there's quite a bit of
work remaining) is in vfs.git#nsfs.  It will be refactored; moreover,
most of that stuff will move out from fs/proc/namespaces.c - either into
fs/nsfs.c or into kernel/nsproxy.c.  And purely VFS followups are not even
touched yet - I know what I want to do, and I think there's no obstacles
left, but that's a separate story.  "Filesystem for ns dentries/inodes"
part is done, though (modulo debugging and moving it out of fs/proc).

I wonder if we would be better off not using proc_alloc_inum() there -
it's not hard to do a separate allocator, especially since there's no
requirements re non-intersecting sets of inumbers for procfs and for this.
That's the last part shared with procfs...

Anyway, I'm going down right now (nearly 5am here), so further fun will have
to wait a bit...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-11-01 Thread Al Viro
On Mon, Oct 13, 2014 at 03:42:28PM -0700, Eric W. Biederman wrote:

 From looking at your proposed code that doesn't look like it will be
 any more difficult to maintain from the namespace side.  So I have no
 objects with moving the code in that direction. 
 
  It's obviously a project for the next cycle and it'll require
  some cooperation between vfs and userns trees, but not too much of it -
  all we really need is a never-changed commit embedding that structure into
  all ...ns and passing _that_ to proc_{alloc,free}_inum() replacements
  Merged both into vfs.git #for-next and usern one.  We can do that right
  after -rc1.  Incidentally, it might make sense to move refcount into that
  common piece as well, but that's independent from the stuff above.
 
 That sounds like a reasonable direction to go.  I think your schedule
 may be a tad optimistic time-wise, if for no other reason that code
 reviews take time, but that plan should work.

OK, interim branch (_completely_ untested, and there's quite a bit of
work remaining) is in vfs.git#nsfs.  It will be refactored; moreover,
most of that stuff will move out from fs/proc/namespaces.c - either into
fs/nsfs.c or into kernel/nsproxy.c.  And purely VFS followups are not even
touched yet - I know what I want to do, and I think there's no obstacles
left, but that's a separate story.  Filesystem for ns dentries/inodes
part is done, though (modulo debugging and moving it out of fs/proc).

I wonder if we would be better off not using proc_alloc_inum() there -
it's not hard to do a separate allocator, especially since there's no
requirements re non-intersecting sets of inumbers for procfs and for this.
That's the last part shared with procfs...

Anyway, I'm going down right now (nearly 5am here), so further fun will have
to wait a bit...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
 OK, interim branch (_completely_ untested, and there's quite a bit of
 work remaining) is in vfs.git#nsfs.

... except that what got pushed was completely buggered - the last changes
not committed *and* include/linux/ns_common.h not git-add'ed at the very
beginning.  Oh, well - it wasn't hard to reconstruct its history...

Anyway, that much got fixed and pushed; sorry about the noise - shouldn't
have posted before grabbing some sleep...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-11-01 Thread Eric W.Biederman


On November 1, 2014 8:06:20 AM PDT, Al Viro v...@zeniv.linux.org.uk wrote:
On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
 OK, interim branch (_completely_ untested, and there's quite a bit of
 work remaining) is in vfs.git#nsfs.

... except that what got pushed was completely buggered - the last
changes
not committed *and* include/linux/ns_common.h not git-add'ed at the
very
beginning.  Oh, well - it wasn't hard to reconstruct its history...

Anyway, that much got fixed and pushed; sorry about the noise -
shouldn't
have posted before grabbing some sleep...

No worries.  I am effectively without internet access right from now until 
about the 3rd week in November as I move so I really can't look anyway.

From your description I am concerned about using the letter combination nsfs.  
 I once used nsfd, and that was so close to nfsd that Linus got confused, and 
hilarity ensued.   nsfs isn't quite as bad but the abbreviation still seems 
close enough to nfs that confusion could result.

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 03:06:20PM +, Al Viro wrote:
 On Sat, Nov 01, 2014 at 08:38:04AM +, Al Viro wrote:
  OK, interim branch (_completely_ untested, and there's quite a bit of
  work remaining) is in vfs.git#nsfs.
 
 ... except that what got pushed was completely buggered - the last changes
 not committed *and* include/linux/ns_common.h not git-add'ed at the very
 beginning.  Oh, well - it wasn't hard to reconstruct its history...
 
 Anyway, that much got fixed and pushed; sorry about the noise - shouldn't
 have posted before grabbing some sleep...

And now it even seems to work.  Poking in /proc/*/ns/*, mount --bind of those
suckers to regular files, unshare and nsenter (both by PID and by file,
including the ones we'd bound somewhere).  Eric, could you hit it with
whatever testsuite you are using and see if it survives?

It's vfs.git#nsfs (head should be at 2e64120), branched at -rc2.  The meat
of that sucker is in the next-to-last commit; the rest is preparation and
a minor cleanup.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote:

 From your description I am concerned about using the letter combination 
 nsfs.   I once used nsfd, and that was so close to nfsd that Linus got 
 confused, and hilarity ensued.   nsfs isn't quite as bad but the 
 abbreviation still seems close enough to nfs that confusion could result.

Well, yes, but... the only non-static object in there with nsfs in the
name is nsfs_init().  As for the filename itself... any better suggestions
would be welcome, but it doesn't really mess the completion tree.

I've no strong preferences here - we might even move it into kernel/nsproxy.c.
In the latest variant it's ~160 lines...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-11-01 Thread Al Viro
On Sat, Nov 01, 2014 at 06:38:02PM +, Al Viro wrote:
 On Sat, Nov 01, 2014 at 11:23:34AM -0700, Eric W.Biederman wrote:
 
  From your description I am concerned about using the letter combination 
  nsfs.   I once used nsfd, and that was so close to nfsd that Linus got 
  confused, and hilarity ensued.   nsfs isn't quite as bad but the 
  abbreviation still seems close enough to nfs that confusion could result.
 
 Well, yes, but... the only non-static object in there with nsfs in the
 name is nsfs_init().  As for the filename itself... any better suggestions
 would be welcome, but it doesn't really mess the completion tree.
 
 I've no strong preferences here - we might even move it into kernel/nsproxy.c.
 In the latest variant it's ~160 lines...

BTW, as an immediate followup, definition of struct nameidata can be
taken to fs/namei.c now.  nd_get_link/nd_set_link become exported, actual
declaration of struct nameidata goes into fs/namei.c and for everything
outside of fs/namei.c it becomes completely opaque.

TBH, I'm somewhat tempted to do something even more extreme - try to treat
it the way we treat pt_regs.  I.e. have a pointer to such sucker in
task_struct (NULL most of the time) and chain the instances.  That way
we'd be rid of passing that sucker even to -follow_link() (and to
nd_{set,get}_link() as well).  path_init() would push the sucker to the
top of the chain, path_finish(nd, base) (currently open-coded bunch of
if (nd-root.mnt  !(nd-flags  LOOKUP_ROOT))
path_put(nd-root);
if (base)
fput(base);
) - remove it from the chain.  Not sure if it's worth doing - it's slightly
less boilerplate in -follow_link() and -put_link() instances and it _might_
reduce the stack footprint in a bunch of places in fs/namei.c (including the
recursion in there), but it might easily end up increasing it instead...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-10-13 Thread Al Viro
On Mon, Oct 13, 2014 at 07:29:45AM +0100, Al Viro wrote:

> there and get rid of pointless aliases.  Oh, and we get a decent chance to
> kill d_instantiate_unique(), which is also nice.  Or at least fold it into
> d_add_unique(), if we can't kill that sucker as well.  And if we manage to
> kill it outright, we get rid of __d_instantiate_unique() for free - in my
> local queue d_materialise_unique() had been subsumed by d_splice_alias(),
> getting rid of the other caller of __d_instantiate_unique().  We have *WAY*
> too many primitives in that area, and trimming that forest is definitely
> a good thing.

FWIW, looking at that only caller of d_add_unique(), I really think that
it's killable.

Call graph for that sucker is
d_add_unique()
<- _nfs4_open_and_get_state
<- _nfs4_do_open
<- nfs4_do_open
<- nfs4_atomic_open
= nfs_rpc_ops.open_context
<- nfs_atomic_open
= inode_operations.atomic_open
<- nfs4_file_open
= file_operations.open
<- nfs4_proc_create
= nfs_rpc_ops.create
<- nfs_creat
= inode_operations.create

Now, to have file_operations.open called, we need dentry to be pinned down
and positive.  In that case d_add_unique() isn't called, obviously.

OTOH, inode_operations.create and inode_operations.atomic_open are called
with parent directory having its ->i_mutex held and we have every right to
just do d_splice_alias() in both cases.  IOW, d_add_unique() can be
simply killed, and with aforementioned change to fs/proc/namespaces.c we
can kill d_instantiate_unique() and __d_instantiate_unique() along with it.

OK, that drives the number of primitives further down...  Actually, I suspect
that a bunch of places calling d_add() in their ->lookup() instances should
call d_splice_alias() instead - it won't cost more unless the inode happens
to be a directory with existing aliases.  And to hell with "we should use
d_splice_alias() if it's exportable, no, make it d_materialise_unique() if
tree topology might change unexpectedly, but if neither is true we can just
use d_add() and return NULL" - it's too complicated for no good reason.
"Always use d_splice_alias() in lookups" make a lot more sense...

Oh, and I strongly suspect that d_instantiate() ought to be taught to act
the same way wrt directory aliases (i.e. relocate if one is found).  Which
kills d_instantiate_no_diralias() (only one caller for that one) and closes
very narrow nfsd races in ext2_mkdir() and friends...

We really have too many primitives in that area; in the beginning it was just
d_instantiate() for object creation (when dentry was already hashed) and
d_add() for lookups (when dentry was hashed in addition to being tied to
inode).  It served well for local never-exported filesystems, but once we
added knfsd (and then open-by-fhandle) the variants started to breed like
rabbits.  Filesystems with tree topology that might be silently changed
behind our backs made it only worse...  The real reason why we didn't just
change d_add()/d_instantiate() behaviour back then was that in "the inode
is a directory with existing aliases" case we need to go with another dentry
(after moving it in place of ours), which changes the calling conventions.
OTOH, in that case calling d_add() or d_instantiate() currently means a serious
bug - we end up with multiple aliases for a directory, which should never
happen.  Hell knows, maybe it's feasible to change d_add() and d_instantiate()
calling conventions and really fold all that shit back into them...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-10-13 Thread Eric W. Biederman
Al Viro  writes:

> But more fundamentally, this stuff has no business being in procfs.  The
> *only* reason why we are putting them there (and get those inodes and
> dentries duplicated for different procfs instances) is this in
> do_loopback():
> if (!check_mnt(parent) || !check_mnt(old))
> goto out2;
> You want to be able to bind those suckers by mount --bind.  Which is an
> odd API design (and it gives you another headache from hell with 
> mnt_ns_loop())
> but it's too late to change, sadly ;-/  So you want ->follow_link() to
> yield dentries matching some vfsmount in our namespace.  The things would be
> much simpler if we didn't try that.
>
> Look: do_loopback() is *already* aware of those suckers.  Has to be, due to
> aforementioned mnt_ns_loop().  So we might as well weaken the constraints on
> old - check_mnt(old) || proc_ns_inode(old_path.dentry->d_inode) would do.
> The thing is, we calculate that proc_ns_inode() anyway.

Truthfully mnt_ns_loop only exists for because we have these for the
mount namespace.  But I take your point.

Your proposal sounds like it should work.  After the bind mount is
created we have a new struct mount so the existing chk_mnt checks won't
get us into trouble.

At some point we may need to allow multiple internal superblocks to
allow having a preset inode number for the purposes of
checkpoint/restart but I don't see that being a bottleneck to process
migration anytime in the near future, and your design does not preclude
that possibility.

There was a potential issue with your earlier suggestion to modify
check_mnt to only include mounted filesystems as otherwise there is an
issue with references to filesystems unmounted with MNT_DETACH being
able to be bind mounted.  Having an explicit proc_ns_inode test avoids
those issues.


> Now, suppose we'd done that.  In principle, it allows some things that are
> not allowed right now - you could open such file and pass it (via SCM_RIGHTS)
> to a process in another namespace.  With the current tree you can't bind
> that bugger via /proc/self/fd/ - it will have alien vfsmount.  After such
> change it will become allowed.  Is there any problem with that?  I don't see
> any (after all, recepient might have held it open and anyone who could
> get to that sucker via /proc//fd/ could've simply stopped the 
> recepient
> and made it pass the damn thing to them - being able to ptrace is enough
> for that).  Am I missing anything?

The only thing you can't do with a passed file descriptor today is to
mount it.  Everything else is already doable in any context, and all
mounting does is give you the ability to keep the namespace alive.
Something you can already do by just keeping the file descriptor open.

These filedescriptors are guarded by directory permissions so if
/proc//fd had more open permissions I might be concerned, but
as it stands only the user of the process possesors of the appropriate
capability can access that directory.

So I really think we are good.  Thank you for making such a careful
analysis.

> *IF* we can get away with such change, everything becomes much easier.  We
> can add a filesystem just for those guys - one for the entire system.  And
> kern_mount() it.  Don't bother with register_filesystem() - no point showing
> it in the /proc/filesystems, etc.  No need to abuse procfs inodes either.
> Or procfs icache, for that matter.  All we need is a small standard piece
> embedded into each {mnt,net,pid,user,...}ns.
>
> Namely, let's stash a pointer to such dentry (if any) into that standard 
> piece.
> Not a counting reference, of course, and used in a very limited way.
>   * have ->d_prune() for those guys replace the stashed pointer with NULL.
>   * on the "get dentry for ns" side,
>   again:
>   rcu_read_lock()
>   d = atomic_long_read(stashed pointer)
>   if (!d)
>   goto slow;
>   if (!lockref_get_not_dead(>d_lockref))
>   goto slow;
>   rcu_read_unlock();
>   return d;
>   slow:
>   rcu_read_unlock();
>   allocate new inode
>   allocate new dentry
>   d_instantiate()
>   d = atomic_long_cmpxchg(stashed pointer, NULL, new dentry);
>   if (d) {
>   dput(new dentry);
>   cpu_relax();// might not be needed
>   goto again;
>   }
>   return new dentry;
> I'd suggest having ->i_private point to that standard piece and storing
> ns_ops and proc_inum in there.

I think it is a shame that we don't have an equivalent of
d_materialise_unique that does thoe whole d_find_alias logic
non-directories.  Sigh.  But what should drive which functions to keep
are the real and needs of filesystems not my weird namespace case.

>From looking at your proposed code that doesn't look like it will be
any more difficult to 

[RFC] dealing with proc_ns_follow_link() and "namespace" dentries

2014-10-13 Thread Al Viro
proc_ns_follow_link() is unique among the ->follow_link() instances
in several respects, some of them violating general asserts expected by
all kinds of VFS code.

First of all, it can lead to  pairs with dentry
being completely unrelated to vfsmount.  Suppose we bind such a symlink over
something else.  ->follow_link() is called with nd->path set to directory
we'd found the symlink in - after all, that's what the normal symlinks are
interpreted against.  The same symlink present in two different directories
can resolve to different places.  In this case we generate a dentry/inode in
procfs and combine that dentry with vfsmount of parent.  Which might be
completely unrelated to procfs.

The thing is, procfs isn't a good match for those inodes either.
For one thing, they really don't give a damn which procfs instance are
they in - if anything, they should be associated with ns.  For another,
using procfs icache for looking them up (by really strange inumbers)
solves the problem only partially.  d_instantiate_unique(), used there in
attempt to keep dentries from breeding, does not do what you seem to expect
it to do.  Namely, it *never* picks an existing alias.  Why?  Because it
looks for aliases with the same ->d_parent (and name) as the dentry we'd
given it.  And that dentry comes from d_alloc_pseudo(), i.e. it has
itself for ->d_parent.  *All* aliases for those inodes happen that way.
IOW, none of them have the same ->d_parent, and d_instantiate_unique()
here is simply a fancy way to spin through all existing aliases and call
d_instantiate() on our new one.  Quick experiment:
main()
{
int i;
for (i = 0; i < 10; i++) {
open("/proc/self/ns/mnt", 0);
if (!(i % 1000))
system("grep dentry /proc/slabinfo");
}
run with ulimit -n 10 and watch it create a new dentry per open().
Profile of that (sans system(3)) is also rather interesting -
each d_instantiate_unique() is O(N), so we spend a quadratic time in
that sucker for no good reason whatsoever.

If nothing else, d = d_find_any_alias(inode); if (d) {dput(dentry);return d;}
d_set_d_op(dentry, ...); d_instantiate(dentry, inode); return dentry; would be
a much better strategy - *this* would almost always find an alias when one is
to be found.  We might occasionally get more than one (when several processes
race), but that beats the hell out of easily getting millions of them...

But more fundamentally, this stuff has no business being in procfs.  The
*only* reason why we are putting them there (and get those inodes and
dentries duplicated for different procfs instances) is this in
do_loopback():
if (!check_mnt(parent) || !check_mnt(old))
goto out2;
You want to be able to bind those suckers by mount --bind.  Which is an
odd API design (and it gives you another headache from hell with mnt_ns_loop())
but it's too late to change, sadly ;-/  So you want ->follow_link() to
yield dentries matching some vfsmount in our namespace.  The things would be
much simpler if we didn't try that.

Look: do_loopback() is *already* aware of those suckers.  Has to be, due to
aforementioned mnt_ns_loop().  So we might as well weaken the constraints on
old - check_mnt(old) || proc_ns_inode(old_path.dentry->d_inode) would do.
The thing is, we calculate that proc_ns_inode() anyway.

Now, suppose we'd done that.  In principle, it allows some things that are
not allowed right now - you could open such file and pass it (via SCM_RIGHTS)
to a process in another namespace.  With the current tree you can't bind
that bugger via /proc/self/fd/ - it will have alien vfsmount.  After such
change it will become allowed.  Is there any problem with that?  I don't see
any (after all, recepient might have held it open and anyone who could
get to that sucker via /proc//fd/ could've simply stopped the recepient
and made it pass the damn thing to them - being able to ptrace is enough
for that).  Am I missing anything?

*IF* we can get away with such change, everything becomes much easier.  We
can add a filesystem just for those guys - one for the entire system.  And
kern_mount() it.  Don't bother with register_filesystem() - no point showing
it in the /proc/filesystems, etc.  No need to abuse procfs inodes either.
Or procfs icache, for that matter.  All we need is a small standard piece
embedded into each {mnt,net,pid,user,...}ns.

Namely, let's stash a pointer to such dentry (if any) into that standard piece.
Not a counting reference, of course, and used in a very limited way.
* have ->d_prune() for those guys replace the stashed pointer with NULL.
* on the "get dentry for ns" side,
again:
rcu_read_lock()
d = atomic_long_read(stashed pointer)
if (!d)
goto slow;
if (!lockref_get_not_dead(>d_lockref))
goto slow;
rcu_read_unlock();

[RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-10-13 Thread Al Viro
proc_ns_follow_link() is unique among the -follow_link() instances
in several respects, some of them violating general asserts expected by
all kinds of VFS code.

First of all, it can lead to vfsmount,dentry pairs with dentry
being completely unrelated to vfsmount.  Suppose we bind such a symlink over
something else.  -follow_link() is called with nd-path set to directory
we'd found the symlink in - after all, that's what the normal symlinks are
interpreted against.  The same symlink present in two different directories
can resolve to different places.  In this case we generate a dentry/inode in
procfs and combine that dentry with vfsmount of parent.  Which might be
completely unrelated to procfs.

The thing is, procfs isn't a good match for those inodes either.
For one thing, they really don't give a damn which procfs instance are
they in - if anything, they should be associated with ns.  For another,
using procfs icache for looking them up (by really strange inumbers)
solves the problem only partially.  d_instantiate_unique(), used there in
attempt to keep dentries from breeding, does not do what you seem to expect
it to do.  Namely, it *never* picks an existing alias.  Why?  Because it
looks for aliases with the same -d_parent (and name) as the dentry we'd
given it.  And that dentry comes from d_alloc_pseudo(), i.e. it has
itself for -d_parent.  *All* aliases for those inodes happen that way.
IOW, none of them have the same -d_parent, and d_instantiate_unique()
here is simply a fancy way to spin through all existing aliases and call
d_instantiate() on our new one.  Quick experiment:
main()
{
int i;
for (i = 0; i  10; i++) {
open(/proc/self/ns/mnt, 0);
if (!(i % 1000))
system(grep dentry /proc/slabinfo);
}
run with ulimit -n 10 and watch it create a new dentry per open().
Profile of that (sans system(3)) is also rather interesting -
each d_instantiate_unique() is O(N), so we spend a quadratic time in
that sucker for no good reason whatsoever.

If nothing else, d = d_find_any_alias(inode); if (d) {dput(dentry);return d;}
d_set_d_op(dentry, ...); d_instantiate(dentry, inode); return dentry; would be
a much better strategy - *this* would almost always find an alias when one is
to be found.  We might occasionally get more than one (when several processes
race), but that beats the hell out of easily getting millions of them...

But more fundamentally, this stuff has no business being in procfs.  The
*only* reason why we are putting them there (and get those inodes and
dentries duplicated for different procfs instances) is this in
do_loopback():
if (!check_mnt(parent) || !check_mnt(old))
goto out2;
You want to be able to bind those suckers by mount --bind.  Which is an
odd API design (and it gives you another headache from hell with mnt_ns_loop())
but it's too late to change, sadly ;-/  So you want -follow_link() to
yield dentries matching some vfsmount in our namespace.  The things would be
much simpler if we didn't try that.

Look: do_loopback() is *already* aware of those suckers.  Has to be, due to
aforementioned mnt_ns_loop().  So we might as well weaken the constraints on
old - check_mnt(old) || proc_ns_inode(old_path.dentry-d_inode) would do.
The thing is, we calculate that proc_ns_inode() anyway.

Now, suppose we'd done that.  In principle, it allows some things that are
not allowed right now - you could open such file and pass it (via SCM_RIGHTS)
to a process in another namespace.  With the current tree you can't bind
that bugger via /proc/self/fd/n - it will have alien vfsmount.  After such
change it will become allowed.  Is there any problem with that?  I don't see
any (after all, recepient might have held it open and anyone who could
get to that sucker via /proc/pid/fd/n could've simply stopped the recepient
and made it pass the damn thing to them - being able to ptrace is enough
for that).  Am I missing anything?

*IF* we can get away with such change, everything becomes much easier.  We
can add a filesystem just for those guys - one for the entire system.  And
kern_mount() it.  Don't bother with register_filesystem() - no point showing
it in the /proc/filesystems, etc.  No need to abuse procfs inodes either.
Or procfs icache, for that matter.  All we need is a small standard piece
embedded into each {mnt,net,pid,user,...}ns.

Namely, let's stash a pointer to such dentry (if any) into that standard piece.
Not a counting reference, of course, and used in a very limited way.
* have -d_prune() for those guys replace the stashed pointer with NULL.
* on the get dentry for ns side,
again:
rcu_read_lock()
d = atomic_long_read(stashed pointer)
if (!d)
goto slow;
if (!lockref_get_not_dead(d-d_lockref))
goto slow;

Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-10-13 Thread Eric W. Biederman
Al Viro v...@zeniv.linux.org.uk writes:

 But more fundamentally, this stuff has no business being in procfs.  The
 *only* reason why we are putting them there (and get those inodes and
 dentries duplicated for different procfs instances) is this in
 do_loopback():
 if (!check_mnt(parent) || !check_mnt(old))
 goto out2;
 You want to be able to bind those suckers by mount --bind.  Which is an
 odd API design (and it gives you another headache from hell with 
 mnt_ns_loop())
 but it's too late to change, sadly ;-/  So you want -follow_link() to
 yield dentries matching some vfsmount in our namespace.  The things would be
 much simpler if we didn't try that.

 Look: do_loopback() is *already* aware of those suckers.  Has to be, due to
 aforementioned mnt_ns_loop().  So we might as well weaken the constraints on
 old - check_mnt(old) || proc_ns_inode(old_path.dentry-d_inode) would do.
 The thing is, we calculate that proc_ns_inode() anyway.

Truthfully mnt_ns_loop only exists for because we have these for the
mount namespace.  But I take your point.

Your proposal sounds like it should work.  After the bind mount is
created we have a new struct mount so the existing chk_mnt checks won't
get us into trouble.

At some point we may need to allow multiple internal superblocks to
allow having a preset inode number for the purposes of
checkpoint/restart but I don't see that being a bottleneck to process
migration anytime in the near future, and your design does not preclude
that possibility.

There was a potential issue with your earlier suggestion to modify
check_mnt to only include mounted filesystems as otherwise there is an
issue with references to filesystems unmounted with MNT_DETACH being
able to be bind mounted.  Having an explicit proc_ns_inode test avoids
those issues.


 Now, suppose we'd done that.  In principle, it allows some things that are
 not allowed right now - you could open such file and pass it (via SCM_RIGHTS)
 to a process in another namespace.  With the current tree you can't bind
 that bugger via /proc/self/fd/n - it will have alien vfsmount.  After such
 change it will become allowed.  Is there any problem with that?  I don't see
 any (after all, recepient might have held it open and anyone who could
 get to that sucker via /proc/pid/fd/n could've simply stopped the 
 recepient
 and made it pass the damn thing to them - being able to ptrace is enough
 for that).  Am I missing anything?

The only thing you can't do with a passed file descriptor today is to
mount it.  Everything else is already doable in any context, and all
mounting does is give you the ability to keep the namespace alive.
Something you can already do by just keeping the file descriptor open.

These filedescriptors are guarded by directory permissions so if
/proc/pid/fd had more open permissions I might be concerned, but
as it stands only the user of the process possesors of the appropriate
capability can access that directory.

So I really think we are good.  Thank you for making such a careful
analysis.

 *IF* we can get away with such change, everything becomes much easier.  We
 can add a filesystem just for those guys - one for the entire system.  And
 kern_mount() it.  Don't bother with register_filesystem() - no point showing
 it in the /proc/filesystems, etc.  No need to abuse procfs inodes either.
 Or procfs icache, for that matter.  All we need is a small standard piece
 embedded into each {mnt,net,pid,user,...}ns.

 Namely, let's stash a pointer to such dentry (if any) into that standard 
 piece.
 Not a counting reference, of course, and used in a very limited way.
   * have -d_prune() for those guys replace the stashed pointer with NULL.
   * on the get dentry for ns side,
   again:
   rcu_read_lock()
   d = atomic_long_read(stashed pointer)
   if (!d)
   goto slow;
   if (!lockref_get_not_dead(d-d_lockref))
   goto slow;
   rcu_read_unlock();
   return d;
   slow:
   rcu_read_unlock();
   allocate new inode
   allocate new dentry
   d_instantiate()
   d = atomic_long_cmpxchg(stashed pointer, NULL, new dentry);
   if (d) {
   dput(new dentry);
   cpu_relax();// might not be needed
   goto again;
   }
   return new dentry;
 I'd suggest having -i_private point to that standard piece and storing
 ns_ops and proc_inum in there.

I think it is a shame that we don't have an equivalent of
d_materialise_unique that does thoe whole d_find_alias logic
non-directories.  Sigh.  But what should drive which functions to keep
are the real and needs of filesystems not my weird namespace case.

From looking at your proposed code that doesn't look like it will be
any more difficult to maintain from the namespace side.  So I 

Re: [RFC] dealing with proc_ns_follow_link() and namespace dentries

2014-10-13 Thread Al Viro
On Mon, Oct 13, 2014 at 07:29:45AM +0100, Al Viro wrote:

 there and get rid of pointless aliases.  Oh, and we get a decent chance to
 kill d_instantiate_unique(), which is also nice.  Or at least fold it into
 d_add_unique(), if we can't kill that sucker as well.  And if we manage to
 kill it outright, we get rid of __d_instantiate_unique() for free - in my
 local queue d_materialise_unique() had been subsumed by d_splice_alias(),
 getting rid of the other caller of __d_instantiate_unique().  We have *WAY*
 too many primitives in that area, and trimming that forest is definitely
 a good thing.

FWIW, looking at that only caller of d_add_unique(), I really think that
it's killable.

Call graph for that sucker is
d_add_unique()
- _nfs4_open_and_get_state
- _nfs4_do_open
- nfs4_do_open
- nfs4_atomic_open
= nfs_rpc_ops.open_context
- nfs_atomic_open
= inode_operations.atomic_open
- nfs4_file_open
= file_operations.open
- nfs4_proc_create
= nfs_rpc_ops.create
- nfs_creat
= inode_operations.create

Now, to have file_operations.open called, we need dentry to be pinned down
and positive.  In that case d_add_unique() isn't called, obviously.

OTOH, inode_operations.create and inode_operations.atomic_open are called
with parent directory having its -i_mutex held and we have every right to
just do d_splice_alias() in both cases.  IOW, d_add_unique() can be
simply killed, and with aforementioned change to fs/proc/namespaces.c we
can kill d_instantiate_unique() and __d_instantiate_unique() along with it.

OK, that drives the number of primitives further down...  Actually, I suspect
that a bunch of places calling d_add() in their -lookup() instances should
call d_splice_alias() instead - it won't cost more unless the inode happens
to be a directory with existing aliases.  And to hell with we should use
d_splice_alias() if it's exportable, no, make it d_materialise_unique() if
tree topology might change unexpectedly, but if neither is true we can just
use d_add() and return NULL - it's too complicated for no good reason.
Always use d_splice_alias() in lookups make a lot more sense...

Oh, and I strongly suspect that d_instantiate() ought to be taught to act
the same way wrt directory aliases (i.e. relocate if one is found).  Which
kills d_instantiate_no_diralias() (only one caller for that one) and closes
very narrow nfsd races in ext2_mkdir() and friends...

We really have too many primitives in that area; in the beginning it was just
d_instantiate() for object creation (when dentry was already hashed) and
d_add() for lookups (when dentry was hashed in addition to being tied to
inode).  It served well for local never-exported filesystems, but once we
added knfsd (and then open-by-fhandle) the variants started to breed like
rabbits.  Filesystems with tree topology that might be silently changed
behind our backs made it only worse...  The real reason why we didn't just
change d_add()/d_instantiate() behaviour back then was that in the inode
is a directory with existing aliases case we need to go with another dentry
(after moving it in place of ours), which changes the calling conventions.
OTOH, in that case calling d_add() or d_instantiate() currently means a serious
bug - we end up with multiple aliases for a directory, which should never
happen.  Hell knows, maybe it's feasible to change d_add() and d_instantiate()
calling conventions and really fold all that shit back into them...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/