Re: [PATCH 0/3] Introduce LSM-hook for socketpair(2)

2018-04-23 Thread Serge E. Hallyn
Quoting David Herrmann (dh.herrm...@gmail.com):
> Hi
> 
> This series adds a new LSM hook for the socketpair(2) syscall. The idea
> is to allow SO_PEERSEC to be called on AF_UNIX sockets created via
> socketpair(2), and return the same information as if you emulated
> socketpair(2) via a temporary listener socket. Right now SO_PEERSEC
> will return the unlabeled credentials for a socketpair, rather than the
> actual credentials of the creating process.
> 
> A simple call to:
> 
> socketpair(AF_UNIX, SOCK_STREAM, 0, out);
> 
> can be emulated via a temporary listener socket bound to a unique,
> random name in the abstract namespace. By connecting to this listener
> socket, accept(2) will return the second part of the pair. If
> SO_PEERSEC is queried on these, the correct credentials of the creating
> process are returned. A simple comparison between the behavior of
> SO_PEERSEC on socketpair(2) and an emulated socketpair is included in
> the dbus-broker test-suite [1].
> 
> This patch series tries to close this gap and makes both behave the
> same. A new LSM-hook is added which allows LSMs to cache the correct
> peer information on newly created socket-pairs.
> 
> Apart from fixing this behavioral difference, the dbus-broker project
> actually needs to query the credentials of socketpairs, and currently
> must resort to racy procfs(2) queries to get the LSM credentials of its
> controller socket. Several parts of the dbus-broker project allow you
> to pass in a socket during execve(2), which will be used by the child
> process to accept control-commands from its parent. One natural way to
> create this communication channel is to use socketpair(2). However,
> right now SO_PEERSEC does not return any useful information, hence, the
> child-process would need other means to retrieve this information. By
> avoiding socketpair(2) and using the hacky-emulated version, this is not
> an issue.
> 
> There was a previous discussion on this matter [2] roughly a year ago.
> Back then there was the suspicion that proper SO_PEERSEC would confuse
> applications. However, we could not find any evidence backing this
> suspicion. Furthermore, we now actually see the contrary. Lack of
> SO_PEERSEC makes it a hassle to use socketpairs with LSM credentials.
> Hence, we propose to implement full SO_PEERSEC for socketpairs.
> 
> This series only adds SELinux backends, since that is what we need for
> RHEL. I will gladly extend the other LSMs if needed.
> 
> Thanks
> David
> 
> [1] https://github.com/bus1/dbus-broker/blob/master/src/util/test-peersec.c
> [2] https://www.spinics.net/lists/selinux/msg22674.html
> 
> David Herrmann (3):
>   security: add hook for socketpair(AF_UNIX, ...)
>   net/unix: hook unix_socketpair() into LSM
>   selinux: provide unix_stream_socketpair callback
> 
>  include/linux/lsm_hooks.h |  8 
>  include/linux/security.h  |  7 +++
>  net/unix/af_unix.c|  5 +
>  security/security.c   |  6 ++
>  security/selinux/hooks.c  | 14 ++
>  5 files changed, 40 insertions(+)

Makes sense to me, thanks.

Acked-by: Serge Hallyn 



Re: [RFC PATCH V1 00/12] audit: implement container id

2018-03-06 Thread Serge E. Hallyn
Quoting Richard Guy Briggs (r...@redhat.com):
> Implement audit kernel container ID.
> 
> This patchset is a preliminary RFC based on the proposal document (V3)
> posted:
>   https://www.redhat.com/archives/linux-audit/2018-January/msg00014.html

Patchset looks good to me.

Acked-by: Serge Hallyn 



Re: [RFC PATCH V1 01/12] audit: add container id

2018-03-03 Thread Serge E. Hallyn
On Thu, Mar 01, 2018 at 02:41:04PM -0500, Richard Guy Briggs wrote:
...
> +static inline bool audit_containerid_set(struct task_struct *tsk)

Hi Richard,

the calls to audit_containerid_set() confused me.  Could you make it
is_audit_containerid_set() or audit_containerid_isset()?

> +{
> + return audit_get_containerid(tsk) != INVALID_CID;
> +}




Re: RFC(V3): Audit Kernel Container IDs

2018-02-02 Thread Serge E. Hallyn
On Fri, Feb 02, 2018 at 05:05:22PM -0500, Paul Moore wrote:
> On Tue, Jan 9, 2018 at 7:16 AM, Richard Guy Briggs  wrote:
> > Containers are a userspace concept.  The kernel knows nothing of them.
> >
> > The Linux audit system needs a way to be able to track the container
> > provenance of events and actions.  Audit needs the kernel's help to do
> > this.
> 
> Two small comments below, but I tend to think we are at a point where
> you can start cobbling together some prototype/RFC patches.  Surely

Agreed.

LGTM.

> there are going to be a few changes, and new comments, that come out
> once we see an initial implementation so let's see what those are.

thanks,
-serge


Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-09 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> On Mon, Jan 8, 2018 at 10:36 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> >> On Mon, Jan 8, 2018 at 10:11 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> >> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> >> >> On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn <se...@hallyn.com> 
> >> >> wrote:
> >> >> > Quoting James Morris (james.l.mor...@oracle.com):
> >> >> >> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
> >> >> >> I meant in terms of "marking" a user ns as "controlled" type -- it's
> >> >> >> unnecessary jargon from an end user point of view.
> >> >> >
> >> >> > Ah, yes, that was my point in
> >> >> >
> >> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
> >> >> > and
> >> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
> >> >> >
> >> >> >> This may happen internally but don't make it a special case with a
> >> >> >> different name and don't bother users with internal concepts: simply
> >> >> >> implement capability whitelists with the default having equivalent
> >> >
> >> > So the challenge is to have unprivileged users be contained, while
> >> > allowing trusted workloads in containers created by a root user to
> >> > bypass the restriction.
> >> >
> >> > Now, the current proposal actually doesn't support a root user starting
> >> > an application that it doesn't quite trust in such a way that it *is*
> >> > subject to the whitelist.
> >>
> >> Well, this is not hard since root process can spawn another process
> >> and loose privileges before creating user-ns to be controlled by the
> >> whitelist.
> >
> > It would have to drop cap_sys_admin for the container to be marked as
> > "controlled", which may prevent the container runtime from properly starting
> > the container.
> >
> Yes, but that's a conflict of trusted operations (that requires
> SYS_ADMIN) and untrusted processes it may spawn.

Not sure I understand what you're saying, but

I guess that in any case the task which is doing unshare(CLONE_NEWNS)
can drop cap_sys_admin first.  Though that is harder if using clone,
and it is awkward because it's not the container manager, but the user,
who will judge whether the container workload should be restricted.
So the container driver will add a flag like "run-controlled", and
the driver will convert that to dropping a capability; which again
is weird.  It would seem nicer to introduce a userns flag, 'caps-controlled'
For an unprivileged userns, it is always set to 1, and root cannot
change it.  For a root-created userns, it stays 0, but root can set it
to 1 (using /proc file?).  In this way a either container runtime or just an
admin script can say "no wait I want this container to still be controlled".

Or we could instead add a second sysctl to decide whether all or only
'controlled' user namespaces should be controlled.  That's not pretty though.



Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-08 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> On Mon, Jan 8, 2018 at 10:11 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> >> On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> >> > Quoting James Morris (james.l.mor...@oracle.com):
> >> >> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
> >> >> I meant in terms of "marking" a user ns as "controlled" type -- it's
> >> >> unnecessary jargon from an end user point of view.
> >> >
> >> > Ah, yes, that was my point in
> >> >
> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
> >> > and
> >> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
> >> >
> >> >> This may happen internally but don't make it a special case with a
> >> >> different name and don't bother users with internal concepts: simply
> >> >> implement capability whitelists with the default having equivalent
> >
> > So the challenge is to have unprivileged users be contained, while
> > allowing trusted workloads in containers created by a root user to
> > bypass the restriction.
> >
> > Now, the current proposal actually doesn't support a root user starting
> > an application that it doesn't quite trust in such a way that it *is*
> > subject to the whitelist.
> 
> Well, this is not hard since root process can spawn another process
> and loose privileges before creating user-ns to be controlled by the
> whitelist.

It would have to drop cap_sys_admin for the container to be marked as
"controlled", which may prevent the container runtime from properly starting
the container.

> You need an ability to preserve the creation of user-namespaces that
> exhibit 'the uncontrolled behavior' and only trusted/privileged (root)
> user should have it which is maintained here.
> 
> > Which is unfortunate.  But apart from using
> > ptags or a cgroup, I can't think of a good way to get us everything we
> > want:
> >
> > 1. unprivileged users always restricted
> > 2. existing unprivileged containers become restricted when whitelist
> > is enabled
> > 3. privileged users are able to create containers which are not restricted
> 
> all this is achieved by the patch-set without any changes to the
> application with the above knob.
> 
> > 4. privileged users are able to create containers which *are* restricted
> >
> With this patch-set; the root user process can fork another process
> with less privileges before creating a user-ns if the exec-ed process
> cannot be trusted. So there is a way with little modification as
> opposed to nothing available at this moment for this scenario.


Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-08 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> On Mon, Jan 8, 2018 at 7:47 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > Quoting James Morris (james.l.mor...@oracle.com):
> >> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
> >> I meant in terms of "marking" a user ns as "controlled" type -- it's
> >> unnecessary jargon from an end user point of view.
> >
> > Ah, yes, that was my point in
> >
> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
> > and
> > http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html
> >
> >> This may happen internally but don't make it a special case with a
> >> different name and don't bother users with internal concepts: simply
> >> implement capability whitelists with the default having equivalent

So the challenge is to have unprivileged users be contained, while
allowing trusted workloads in containers created by a root user to
bypass the restriction.

Now, the current proposal actually doesn't support a root user starting
an application that it doesn't quite trust in such a way that it *is*
subject to the whitelist.  Which is unfortunate.  But apart from using
ptags or a cgroup, I can't think of a good way to get us everything we
want:

1. unprivileged users always restricted
2. existing unprivileged containers become restricted when whitelist
is enabled
3. privileged users are able to create containers which are not restricted
4. privileged users are able to create containers which *are* restricted



Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-08 Thread Serge E. Hallyn
Quoting James Morris (james.l.mor...@oracle.com):
> On Mon, 8 Jan 2018, Serge E. Hallyn wrote:
> 
> > > Also, why do we need the concept of a controlled user-ns at all, if the 
> > > default whitelist maintains existing behavior?
> > 
> > In past discussions two uses have been brought up:
> > 
> > 1. if an 0-day is discovered which is exacerbated by a specific
> > privilege in user namespaces, that privilege could be turned off until a
> > reboot with a fixed kernel is scheduled, without fully disabling all
> > containers.
> > 
> > 2. some systems may be specifically designed to run software which
> > only requires a few capabilities in a userns.  In that case all others
> > could be disabled.
> > 
> 
> I meant in terms of "marking" a user ns as "controlled" type -- it's 
> unnecessary jargon from an end user point of view.

Ah, yes, that was my point in

http://lkml.iu.edu/hypermail/linux/kernel/1711.1/01845.html
and
http://lkml.iu.edu/hypermail/linux/kernel/1711.1/02276.html

> This may happen internally but don't make it a special case with a 
> different name and don't bother users with internal concepts: simply 
> implement capability whitelists with the default having equivalent 
> behavior of everything allowed.  Then, document the semantics of the 
> whitelist in terms of inheritance etc., as a feature of user namespaces, 
> not as a "type" of user namespace.

The problem with making them inheritable is that an adversarial user
can just create a user namespace at boot that sits and waits for an
0day to be published, then log in and attach to that namespace later,
since it has already inherited the open whitelist.

It feels like there must be some other approach that doesn't feel as...
band-aid-y as this does, but I'm not sure what.


Re: [PATCHv3 0/2] capability controlled user-namespaces

2018-01-07 Thread Serge E. Hallyn
On Mon, Jan 08, 2018 at 11:35:26AM +1100, James Morris wrote:
> On Tue, 2 Jan 2018, Mahesh Bandewar (महेश बंडेवार) wrote:
> 
> > On Sat, Dec 30, 2017 at 12:31 AM, James Morris
> >  wrote:
> > > On Wed, 27 Dec 2017, Mahesh Bandewar (महेश बंडेवार) wrote:
> > >
> > >> Hello James,
> > >>
> > >> Seems like I missed your name to be added into the review of this
> > >> patch series. Would you be willing be pull this into the security
> > >> tree? Serge Hallyn has already ACKed it.
> > >
> > > Sure!
> > >
> > Thank you James.
> 
> I'd like to see what Eric Biederman thinks of this.
> 
> Also, why do we need the concept of a controlled user-ns at all, if the 
> default whitelist maintains existing behavior?

In past discussions two uses have been brought up:

1. if an 0-day is discovered which is exacerbated by a specific
privilege in user namespaces, that privilege could be turned off until a
reboot with a fixed kernel is scheduled, without fully disabling all
containers.

2. some systems may be specifically designed to run software which
only requires a few capabilities in a userns.  In that case all others
could be disabled.


Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-12-06 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> On Wed, Nov 29, 2017 at 9:57 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> >> On Tue, Nov 28, 2017 at 3:04 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> >> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> >> > ...
> >> >> >> diff --git a/security/commoncap.c b/security/commoncap.c
> >> >> >> index fc46f5b85251..89103f16ac37 100644
> >> >> >> --- a/security/commoncap.c
> >> >> >> +++ b/security/commoncap.c
> >> >> >> @@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct 
> >> >> >> user_namespace *targ_ns,
> >> >> >>  {
> >> >> >>   struct user_namespace *ns = targ_ns;
> >> >> >>
> >> >> >> + /* If the capability is controlled and user-ns that process
> >> >> >> +  * belongs-to is 'controlled' then return EPERM and no need
> >> >> >> +  * to check the user-ns hierarchy.
> >> >> >> +  */
> >> >> >> + if (is_user_ns_controlled(cred->user_ns) &&
> >> >> >> + is_capability_controlled(cap))
> >> >> >> + return -EPERM;
> >> >> >
> >> >> > I'd be curious to see the performance impact on this on a regular
> >> >> > workload (kernel build?) in a controlled ns.
> >> >> >
> >> >> Should it affect? If at all, it should be +ve since, the recursive
> >> >> user-ns hierarchy lookup is avoided with the above check if the
> >> >> capability is controlled.
> >> >
> >> > Yes but I expect that to be the rare case for normal lxc installs
> >> > (which are of course what I am interested in)
> >> >
> >> >>  The additional cost otherwise is this check
> >> >> per cap_capable() call.
> >> >
> >> > And pipeline refetching?
> >> >
> >> > Capability calls also shouldn't be all that frequent, but still I'm
> >> > left wondering...
> >>
> >> Correct, and capability checks are part of the control-path and not
> >> the data-path so shouldn't matter but I guess it doesn't hurt to
> >> find-out the number. Do you have any workload in mind, that we can use
> >> for this test/benchmark?
> >
> > I suppose if you did both (a) a kernel build and (b) a webserve
> > like https://github.com/m3ng9i/ran , being hit for a minute by a
> > heavy load of requests, those two together would be re-assuring.
> >
> Well, I did (a) and (b). Here are the results.
> 
> (a0) I used the ubuntu-artful (17.10) vm instance with standard kernel
> to compile the kernel
> 
> mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s clean
> mahesh@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
> real 6m47.525s
> user 22m37.424s
> sys 2m44.745s
> 
> (b0) Now in an user-namespce create by an user that does not have
> SYS_ADMIN (just for apples-to-apples comparison)
> mahesh@mahesh-vm0-artful:~$ sysctl -q kernel.controlled_userns_caps_whitelist
> sysctl: cannot stat /proc/sys/kernel/controlled_userns_caps_whitelist:
> No such file or directory
> mahesh@mahesh-vm0-artful:~$ id
> uid=1000(mahesh) gid=1000(mahesh)
> groups=1000(mahesh),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),118(lpadmin),128(sambashare)
> mahesh@mahesh-vm0-artful:~/Work/Linux$ unshare -Uf -- bash
> nobody@mahesh-vm0-artful:~/Work/Linux$ id
> uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
> nobody@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s clean
> nobody@mahesh-vm0-artful:~/Work/Linux$ time make -j4 -s
> real 9m10.115s

Got some serious noise in this run?

But the numbers look good - thanks!


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-30 Thread Serge E. Hallyn
On Wed, Nov 29, 2017 at 07:35:31PM -0500, Theodore Ts'o wrote:
> On Wed, Nov 29, 2017 at 11:28:52AM -0600, Serge E. Hallyn wrote:
> > 
> > Just to be clear, module loading requires - and must always continue to
> > require - CAP_SYS_MODULE against the initial user namespace.  Containers
> > in user namespaces do not have that.
> > 
> > I don't believe anyone has ever claimed that containers which are not in
> > a user namespace are in any way secure.
> 
> Unless the container performs some action which causes the kernel to
> call request_module(), which then loads some kernel module,

A local unprivileged user can do the same thing.  I reject the popular
notion that linux is a single user operating system.  More interesting
are the (very real) cases where root in a container can do something
which a local unprivileged user could not do.  Since a local unprivileged
user can always create a new namespace, *those* constitute a real and
interesting problem.

> potentially containing cr*p unmaintained code which was included when
> the distro compiled the world, into the host kernel.

> This is an attack vector that doesn't exist if you are using VM's.

Until the vm tenant uses a trivial vm escape.

> And in general, the attack surface of the entire Linux
> kernel<->userspace API is far larger than that which is exposed by the
> guest<->host interface.
> 
> For that reason, containers are *far* more insecure than VM's, since
> once the attacker gets root on the guest VM, they then have to attack
> the hypervisor interface.  And if you compare the attack surface of
> the two, it's pretty clear which is larger, and it's not the
> hypervisor interface.

Any time anyone spends a day looking at either the black hole that is
the hardware emulators or the xen and kvm code itself they walk away
with a set of cve's.  It *should* be more secure, it's not.  You're
telling me your house is safe because you put up a no tresspassing
sign.

-serge


Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-11-29 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> On Tue, Nov 28, 2017 at 3:04 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> > ...
> >> >> diff --git a/security/commoncap.c b/security/commoncap.c
> >> >> index fc46f5b85251..89103f16ac37 100644
> >> >> --- a/security/commoncap.c
> >> >> +++ b/security/commoncap.c
> >> >> @@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct 
> >> >> user_namespace *targ_ns,
> >> >>  {
> >> >>   struct user_namespace *ns = targ_ns;
> >> >>
> >> >> + /* If the capability is controlled and user-ns that process
> >> >> +  * belongs-to is 'controlled' then return EPERM and no need
> >> >> +  * to check the user-ns hierarchy.
> >> >> +  */
> >> >> + if (is_user_ns_controlled(cred->user_ns) &&
> >> >> + is_capability_controlled(cap))
> >> >> + return -EPERM;
> >> >
> >> > I'd be curious to see the performance impact on this on a regular
> >> > workload (kernel build?) in a controlled ns.
> >> >
> >> Should it affect? If at all, it should be +ve since, the recursive
> >> user-ns hierarchy lookup is avoided with the above check if the
> >> capability is controlled.
> >
> > Yes but I expect that to be the rare case for normal lxc installs
> > (which are of course what I am interested in)
> >
> >>  The additional cost otherwise is this check
> >> per cap_capable() call.
> >
> > And pipeline refetching?
> >
> > Capability calls also shouldn't be all that frequent, but still I'm
> > left wondering...
> 
> Correct, and capability checks are part of the control-path and not
> the data-path so shouldn't matter but I guess it doesn't hurt to
> find-out the number. Do you have any workload in mind, that we can use
> for this test/benchmark?

I suppose if you did both (a) a kernel build and (b) a webserver
like https://github.com/m3ng9i/ran , being hit for a minute by a
heavy load of requests, those two together would be re-assuring.

thanks,
-serge


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Serge E. Hallyn
Quoting Theodore Ts'o (ty...@mit.edu):
> Half the problem here is that with containers, people are changing the
> security model, because they want to let untrusted users have "root",
> without really having "root".  Part of the fundamental problem is that
> there are some well-meaning, but fundamentally misguided people, who
> have been asserting: "Containers are just as secure as VM's".
> 
> Well, they are not.  And the sooner people get past this, the better
> off they'll be

Just to be clear, module loading requires - and must always continue to
require - CAP_SYS_MODULE against the initial user namespace.  Containers
in user namespaces do not have that.

I don't believe anyone has ever claimed that containers which are not in
a user namespace are in any way secure.

(And as for the other claim, I'd prefer to stick to "VMs are in most
cases as insecure as properly configured containers" :)

-serge


Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-11-28 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
...
> >> diff --git a/security/commoncap.c b/security/commoncap.c
> >> index fc46f5b85251..89103f16ac37 100644
> >> --- a/security/commoncap.c
> >> +++ b/security/commoncap.c
> >> @@ -73,6 +73,14 @@ int cap_capable(const struct cred *cred, struct 
> >> user_namespace *targ_ns,
> >>  {
> >>   struct user_namespace *ns = targ_ns;
> >>
> >> + /* If the capability is controlled and user-ns that process
> >> +  * belongs-to is 'controlled' then return EPERM and no need
> >> +  * to check the user-ns hierarchy.
> >> +  */
> >> + if (is_user_ns_controlled(cred->user_ns) &&
> >> + is_capability_controlled(cap))
> >> + return -EPERM;
> >
> > I'd be curious to see the performance impact on this on a regular
> > workload (kernel build?) in a controlled ns.
> >
> Should it affect? If at all, it should be +ve since, the recursive
> user-ns hierarchy lookup is avoided with the above check if the
> capability is controlled.

Yes but I expect that to be the rare case for normal lxc installs
(which are of course what I am interested in)

>  The additional cost otherwise is this check
> per cap_capable() call.

And pipeline refetching?

Capability calls also shouldn't be all that frequent, but still I'm
left wondering...


Re: [PATCHv2 2/2] userns: control capabilities of some user namespaces

2017-11-25 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (mah...@bandewar.net):
> From: Mahesh Bandewar 
> 
> With this new notion of "controlled" user-namespaces, the controlled
> user-namespaces are marked at the time of their creation while the
> capabilities of processes that belong to them are controlled using the
> global mask.
> 
> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> that belongs to uncontrolled user-ns can create another (child) user-
> namespace that is uncontrolled. Any other process (that either does
> not have SYS_ADMIN or belongs to a controlled user-ns) can only
> create a user-ns that is controlled.
> 
> global-capability-whitelist (controlled_userns_caps_whitelist) is used
> at the capability check-time and keeps the semantics for the processes
> that belong to uncontrolled user-ns as it is. Processes that belong to
> controlled user-ns however are subjected to different checks-
> 
>(a) if the capability in question is controlled and process belongs
>to controlled user-ns, then it's always denied.
>(b) if the capability in question is NOT controlled then fall back
>to the traditional check.
> 
> Signed-off-by: Mahesh Bandewar 

Acked-by: Serge Hallyn 

Although a few comment addition requests below:

> ---
> v2:
>   Don't recalculate user-ns flags for every setns() call.
> v1:
>   Initial submission.
> 
>  include/linux/capability.h |  1 +
>  include/linux/user_namespace.h | 20 
>  kernel/capability.c|  5 +
>  kernel/user_namespace.c|  4 
>  security/commoncap.c   |  8 
>  5 files changed, 38 insertions(+)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 7d79a4689625..a1fd9e460379 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -251,6 +251,7 @@ extern bool ptracer_capable(struct task_struct *tsk, 
> struct user_namespace *ns);
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
> cpu_vfs_cap_data *cpu_caps);
>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>void __user *buff, size_t *lenp, loff_t *ppos);

Here and at the definition below, please add a comment explaining
that a controlled cap is defined as not being in the sysctl.

> +bool is_capability_controlled(int cap);
>  
>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
> size);
>  
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 3fe714da7f5a..647f825c7b5f 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -23,6 +23,7 @@ struct uid_gid_map {/* 64 bytes -- 1 cache line */
>  };
>  
>  #define USERNS_SETGROUPS_ALLOWED 1UL
> +#define USERNS_CONTROLLED 2UL
>  
>  #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
>  
> @@ -103,6 +104,16 @@ static inline void put_user_ns(struct user_namespace *ns)
>   __put_user_ns(ns);
>  }
>  

Please add a comment explaining that a controlled ns
is one created by a user which did not have CAP_SYS_ADMIN
(or descended from such an ns).

> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
> +{
> + return ns->flags & USERNS_CONTROLLED;
> +}
> +
> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
> +{
> + ns->flags |= USERNS_CONTROLLED;
> +}
> +
>  struct seq_operations;
>  extern const struct seq_operations proc_uid_seq_operations;
>  extern const struct seq_operations proc_gid_seq_operations;
> @@ -161,6 +172,15 @@ static inline struct ns_common *ns_get_owner(struct 
> ns_common *ns)
>  {
>   return ERR_PTR(-EPERM);
>  }
> +
> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
> +{
> + return false;
> +}
> +
> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
> +{
> +}
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 4a859b7d4902..bffe249922de 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -511,6 +511,11 @@ bool ptracer_capable(struct task_struct *tsk, struct 
> user_namespace *ns)
>  }
>  
>  /* Controlled-userns capabilities routines */
> +bool is_capability_controlled(int cap)
> +{
> + return !cap_raised(controlled_userns_caps_whitelist, cap);
> +}
> +
>  #ifdef CONFIG_SYSCTL
>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>void __user *buff, size_t *lenp, loff_t *ppos)
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index c490f1e4313b..600c7dcb9ff7 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -139,6 +139,10 @@ int create_user_ns(struct cred *new)
>   goto fail_keyring;
>  
>   set_cred_user_ns(new, ns);
> + if (!ns_capable(parent_ns, CAP_SYS_ADMIN) ||
> + 

Re: [PATCHv2 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-25 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (mah...@bandewar.net):
> From: Mahesh Bandewar 
> 
> Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
> takes input as capability mask expressed as two comma separated hex
> u32 words. The mask, however, is stored in kernel as kernel_cap_t type.
> 
> Any capabilities that are not part of this mask will be controlled and
> will not be allowed to processes in controlled user-ns.
> 
> Signed-off-by: Mahesh Bandewar 

Acked-by: Serge Hallyn 

> ---
> v2:
>   Rebase
> v1:
>   Initial submission
> 
>  Documentation/sysctl/kernel.txt | 21 ++
>  include/linux/capability.h  |  3 +++
>  kernel/capability.c | 47 
> +
>  kernel/sysctl.c |  5 +
>  4 files changed, 76 insertions(+)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 694968c7523c..a1d39dbae847 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
>  - bootloader_version  [ X86 only ]
>  - callhome[ S390 only ]
>  - cap_last_cap
> +- controlled_userns_caps_whitelist
>  - core_pattern
>  - core_pipe_limit
>  - core_uses_pid
> @@ -187,6 +188,26 @@ CAP_LAST_CAP from the kernel.
>  
>  ==
>  
> +controlled_userns_caps_whitelist
> +
> +Capability mask that is whitelisted for "controlled" user namespaces.
> +Any capability that is missing from this mask will not be allowed to
> +any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
> +is not part of this mask, then processes running inside any controlled
> +userns's will not be allowed to perform action that needs CAP_NET_RAW
> +capability. However, processes that are attached to a parent user-ns
> +hierarchy that is *not* controlled and has CAP_NET_RAW can continue
> +performing those actions. User-namespaces are marked "controlled" at
> +the time of their creation based on the capabilities of the creator.
> +A process that does not have CAP_SYS_ADMIN will create user-namespaces
> +that are controlled.
> +
> +The value is expressed as two comma separated hex words (u32). This
> +sysctl is avaialble in init-ns and users with CAP_SYS_ADMIN in init-ns
> +are allowed to make changes.
> +
> +==
> +
>  core_pattern:
>  
>  core_pattern is used to specify a core dumpfile pattern name.
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index f640dcbc880c..7d79a4689625 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -14,6 +14,7 @@
>  #define _LINUX_CAPABILITY_H
>  
>  #include 
> +#include 
>  
>  
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> @@ -248,6 +249,8 @@ extern bool ptracer_capable(struct task_struct *tsk, 
> struct user_namespace *ns);
>  
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
> cpu_vfs_cap_data *cpu_caps);
> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
> +  void __user *buff, size_t *lenp, loff_t *ppos);
>  
>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
> size);
>  
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 1e1c0236f55b..4a859b7d4902 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -29,6 +29,8 @@ EXPORT_SYMBOL(__cap_empty_set);
>  
>  int file_caps_enabled = 1;
>  
> +kernel_cap_t controlled_userns_caps_whitelist = CAP_FULL_SET;
> +
>  static int __init file_caps_disable(char *str)
>  {
>   file_caps_enabled = 0;
> @@ -507,3 +509,48 @@ bool ptracer_capable(struct task_struct *tsk, struct 
> user_namespace *ns)
>   rcu_read_unlock();
>   return (ret == 0);
>  }
> +
> +/* Controlled-userns capabilities routines */
> +#ifdef CONFIG_SYSCTL
> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
> +  void __user *buff, size_t *lenp, loff_t *ppos)
> +{
> + DECLARE_BITMAP(caps_bitmap, CAP_LAST_CAP);
> + struct ctl_table caps_table;
> + char tbuf[NAME_MAX];
> + int ret;
> +
> + ret = bitmap_from_u32array(caps_bitmap, CAP_LAST_CAP,
> +controlled_userns_caps_whitelist.cap,
> +_KERNEL_CAPABILITY_U32S);
> + if (ret != CAP_LAST_CAP)
> + return -1;
> +
> + scnprintf(tbuf, NAME_MAX, "%*pb", CAP_LAST_CAP, caps_bitmap);
> +
> + caps_table.data = tbuf;
> + caps_table.maxlen = NAME_MAX;
> + caps_table.mode = table->mode;
> + ret = proc_dostring(_table, write, buff, lenp, ppos);
> + if (ret)
> + return ret;
> + if (write) {
> + kernel_cap_t tmp;
> +
> +

Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-09 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> single sandbox.  I am not at all certain that the capabilities is the
> proper place to limit code reachability.

Right, I keep having this gut feeling that there is another way we
should be doing that.  Maybe based on ksplice or perf, or maybe more
based on subsystems.  And I hope someone pursues that.  But I can't put
my finger on it, and meanwhile the capability checks obviously *are* in
fact gates...

-serge


Re: [PATCH resend 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-09 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
...
> >>
> >>  ==
> >>
> >> +controlled_userns_caps_whitelist
> >> +
> >> +Capability mask that is whitelisted for "controlled" user namespaces.
> >> +Any capability that is missing from this mask will not be allowed to
> >> +any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
> >> +is not part of this mask, then processes running inside any controlled
> >> +userns's will not be allowed to perform action that needs CAP_NET_RAW
> >> +capability. However, processes that are attached to a parent user-ns
> >> +hierarchy that is *not* controlled and has CAP_NET_RAW can continue
> >> +performing those actions. User-namespaces are marked "controlled" at
> >> +the time of their creation based on the capabilities of the creator.
> >> +A process that does not have CAP_SYS_ADMIN will create user-namespaces
> >> +that are controlled.
> >
> > Hm.  I think that's fine (the way 'controlled' user namespaces are
> > defined), but that is design decision in itself, and should perhaps be
> > discussed.
> >
> > Did you consider other ways?  What about using CAP_SETPCAP?
> >
> I did try other ways e.g. using another bounding-set etc. but
> eventually settled with this approach because of main two properties -

No, I meant did you try other ways of defining a controlled user
namespace, other than one which is created by a task lacking
CAP_SYS_ADMIN?

...

> >> +The value is expressed as two comma separated hex words (u32). This
> >
> > Why comma separated?  whitespace ok?  Leading 0x ok?  What is the
> > default at boot?  (Obviously the patch tells me, I'm asking for it
> > to be spelled out in the doc)
> >
> I tried multiple ways including representing capabilities in
> string/name form for better readability but didn't want to add
> additional complexities of dealing with strings and possible
> string-related-issues for this. Also didn't want to reinvent the new
> form so settled with something that is widely used (cpu
> bounding/affinity/irq mapping etc.) and is capable of handling growing
> bit set (currently 37 but possibly more later).

Ok, thanks.


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-09 Thread Serge E. Hallyn
Quoting chris hyser (chris.hy...@oracle.com):
> On 11/06/2017 10:23 PM, Serge E. Hallyn wrote:
> >I think I definately prefer what I mentioned in the email to Boris.
> >Basically a "permanent capability bounding set".  The normal bounding
> >set gets reset to a full set on every new user_ns creation.  In this
> >proposal, it would instead be set to the calling task's permanent
> >capability set, which starts (at boot) full, and which privileged
> >tasks can pull capabilities out of.
> 
> Actually, this may solve a similar problem I've been looking at. The
> idea was basically at strategic points in the kernel (possibly LSM
> hook sites, still evaluating, and probably syscall entry) validate
> that a task has not "magically" acquired capabilities that it or
> parent specifically said it cannot have and then take some action
> like say killing it immediately. Using your terms, basically make
> the "permanent capability set" a write-once privilege escalation
> defense. To handle the 0-day threat, perhaps make it writable but
> only with more "restrictive" values.

Would the existing capability bounding set not suffice for that?

The 'permanent' bounding set turns out to not be a good fit for
the problem being discussed in this thread, but please feel free
to start a new thread if you want to discuss your use case.


Re: [PATCH resend 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-09 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (mah...@bandewar.net):
> From: Mahesh Bandewar 
> 
> Add a sysctl variable kernel.controlled_userns_caps_whitelist. This

I understand the arguments in favor of whitelists in most cases for
security purposes.  But given that you've said the goal here is to
prevent use of a capability in a user namespace when a CVE has been
found, a whitelist seems the wrong choice, since

1. it means that an attacker may through some other means be able
to add a capability back into the whitelist when you specifically
wanted to drop it.  With a blacklist, you could say "once a cap has
been dropped it can never be re-added without rebooting".
2. it means by default all capabilities will be denied once the
switch is pulled which is specifically not what you want in this
case.
3. the admin can't just say "drop CAP_NET_ADMIN", but needs to
know to echo ~CAP_NET_ADMIN.

Why not make it a blacklist, and once a cap is dropped it can
never be re-added?

-serge


Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-09 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (mah...@bandewar.net):
> From: Mahesh Bandewar 
> 
> With this new notion of "controlled" user-namespaces, the controlled
> user-namespaces are marked at the time of their creation while the
> capabilities of processes that belong to them are controlled using the
> global mask.
> 
> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> that belongs to uncontrolled user-ns can create another (child) user-
> namespace that is uncontrolled. Any other process (that either does
> not have SYS_ADMIN or belongs to a controlled user-ns) can only
> create a user-ns that is controlled.
> 
> global-capability-whitelist (controlled_userns_caps_whitelist) is used
> at the capability check-time and keeps the semantics for the processes
> that belong to uncontrolled user-ns as it is. Processes that belong to
> controlled user-ns however are subjected to different checks-
> 
>(a) if the capability in question is controlled and process belongs
>to controlled user-ns, then it's always denied.
>(b) if the capability in question is NOT controlled then fall back
>to the traditional check.
> 
> Signed-off-by: Mahesh Bandewar 
> ---
>  include/linux/capability.h |  1 +
>  include/linux/user_namespace.h | 20 
>  kernel/capability.c|  5 +
>  kernel/user_namespace.c|  3 +++
>  security/commoncap.c   |  8 
>  5 files changed, 37 insertions(+)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 6c0b9677c03f..b8c6cac18658 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -250,6 +250,7 @@ extern bool ptracer_capable(struct task_struct *tsk, 
> struct user_namespace *ns);
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
> cpu_vfs_cap_data *cpu_caps);
>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>void __user *buff, size_t *lenp, loff_t *ppos);
> +bool is_capability_controlled(int cap);
>  
>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
> size);
>  
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index c18e01252346..e890fe81b47e 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -22,6 +22,7 @@ struct uid_gid_map {/* 64 bytes -- 1 cache line */
>  };
>  
>  #define USERNS_SETGROUPS_ALLOWED 1UL
> +#define USERNS_CONTROLLED 2UL
>  
>  #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
>  
> @@ -102,6 +103,16 @@ static inline void put_user_ns(struct user_namespace *ns)
>   __put_user_ns(ns);
>  }
>  
> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
> +{
> + return ns->flags & USERNS_CONTROLLED;
> +}
> +
> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
> +{
> + ns->flags |= USERNS_CONTROLLED;
> +}
> +
>  struct seq_operations;
>  extern const struct seq_operations proc_uid_seq_operations;
>  extern const struct seq_operations proc_gid_seq_operations;
> @@ -160,6 +171,15 @@ static inline struct ns_common *ns_get_owner(struct 
> ns_common *ns)
>  {
>   return ERR_PTR(-EPERM);
>  }
> +
> +static inline bool is_user_ns_controlled(const struct user_namespace *ns)
> +{
> + return false;
> +}
> +
> +static inline void mark_user_ns_controlled(struct user_namespace *ns)
> +{
> +}
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/capability.c b/kernel/capability.c
> index 62dbe3350c1b..40a38cc4ff43 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -510,6 +510,11 @@ bool ptracer_capable(struct task_struct *tsk, struct 
> user_namespace *ns)
>  }
>  
>  /* Controlled-userns capabilities routines */
> +bool is_capability_controlled(int cap)
> +{
> + return !cap_raised(controlled_userns_caps_whitelist, cap);
> +}
> +
>  #ifdef CONFIG_SYSCTL
>  int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
>void __user *buff, size_t *lenp, loff_t *ppos)
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index c490f1e4313b..f393ea5108f0 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -53,6 +53,9 @@ static void set_cred_user_ns(struct cred *cred, struct 
> user_namespace *user_ns)
>   cred->cap_effective = CAP_FULL_SET;
>   cred->cap_ambient = CAP_EMPTY_SET;
>   cred->cap_bset = CAP_FULL_SET;
> + if (!ns_capable(user_ns->parent, CAP_SYS_ADMIN) ||
> + is_user_ns_controlled(user_ns->parent))
> + mark_user_ns_controlled(user_ns);

Hm, why do this here, rather than at create_user_ns()?  It
shouldn't be recalculated when someone does setns() should it?



Re: [PATCH resend 1/2] capability: introduce sysctl for controlled user-ns capability whitelist

2017-11-09 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (mah...@bandewar.net):
> From: Mahesh Bandewar 
> 
> Add a sysctl variable kernel.controlled_userns_caps_whitelist. This
> takes input as capability mask expressed as two comma separated hex
> u32 words. The mask, however, is stored in kernel as kernel_cap_t type.
> 
> Any capabilities that are not part of this mask will be controlled and
> will not be allowed to processes in controlled user-ns.
> 
> Signed-off-by: Mahesh Bandewar 
> ---
>  Documentation/sysctl/kernel.txt | 21 ++
>  include/linux/capability.h  |  3 +++
>  kernel/capability.c | 47 
> +
>  kernel/sysctl.c |  5 +
>  4 files changed, 76 insertions(+)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 694968c7523c..a1d39dbae847 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -25,6 +25,7 @@ show up in /proc/sys/kernel:
>  - bootloader_version  [ X86 only ]
>  - callhome[ S390 only ]
>  - cap_last_cap
> +- controlled_userns_caps_whitelist
>  - core_pattern
>  - core_pipe_limit
>  - core_uses_pid
> @@ -187,6 +188,26 @@ CAP_LAST_CAP from the kernel.
>  
>  ==
>  
> +controlled_userns_caps_whitelist
> +
> +Capability mask that is whitelisted for "controlled" user namespaces.
> +Any capability that is missing from this mask will not be allowed to
> +any process that is attached to a controlled-userns. e.g. if CAP_NET_RAW
> +is not part of this mask, then processes running inside any controlled
> +userns's will not be allowed to perform action that needs CAP_NET_RAW
> +capability. However, processes that are attached to a parent user-ns
> +hierarchy that is *not* controlled and has CAP_NET_RAW can continue
> +performing those actions. User-namespaces are marked "controlled" at
> +the time of their creation based on the capabilities of the creator.
> +A process that does not have CAP_SYS_ADMIN will create user-namespaces
> +that are controlled.

Hm.  I think that's fine (the way 'controlled' user namespaces are
defined), but that is design decision in itself, and should perhaps be
discussed.

Did you consider other ways?  What about using CAP_SETPCAP?

> +The value is expressed as two comma separated hex words (u32). This

Why comma separated?  whitespace ok?  Leading 0x ok?  What is the
default at boot?  (Obviously the patch tells me, I'm asking for it
to be spelled out in the doc)

Otherwise looks good, thanks!

Serge

> +sysctl is avaialble in init-ns and users with CAP_SYS_ADMIN in init-ns
> +are allowed to make changes.
> +
> +==
> +
>  core_pattern:
>  
>  core_pattern is used to specify a core dumpfile pattern name.
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index b52e278e4744..6c0b9677c03f 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -13,6 +13,7 @@
>  #define _LINUX_CAPABILITY_H
>  
>  #include 
> +#include 
>  
>  
>  #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> @@ -247,6 +248,8 @@ extern bool ptracer_capable(struct task_struct *tsk, 
> struct user_namespace *ns);
>  
>  /* audit system wants to get cap info from files as well */
>  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct 
> cpu_vfs_cap_data *cpu_caps);
> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
> +  void __user *buff, size_t *lenp, loff_t *ppos);
>  
>  extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t 
> size);
>  
> diff --git a/kernel/capability.c b/kernel/capability.c
> index f97fe77ceb88..62dbe3350c1b 100644
> --- a/kernel/capability.c
> +++ b/kernel/capability.c
> @@ -28,6 +28,8 @@ EXPORT_SYMBOL(__cap_empty_set);
>  
>  int file_caps_enabled = 1;
>  
> +kernel_cap_t controlled_userns_caps_whitelist = CAP_FULL_SET;
> +
>  static int __init file_caps_disable(char *str)
>  {
>   file_caps_enabled = 0;
> @@ -506,3 +508,48 @@ bool ptracer_capable(struct task_struct *tsk, struct 
> user_namespace *ns)
>   rcu_read_unlock();
>   return (ret == 0);
>  }
> +
> +/* Controlled-userns capabilities routines */
> +#ifdef CONFIG_SYSCTL
> +int proc_douserns_caps_whitelist(struct ctl_table *table, int write,
> +  void __user *buff, size_t *lenp, loff_t *ppos)
> +{
> + DECLARE_BITMAP(caps_bitmap, CAP_LAST_CAP);
> + struct ctl_table caps_table;
> + char tbuf[NAME_MAX];
> + int ret;
> +
> + ret = bitmap_from_u32array(caps_bitmap, CAP_LAST_CAP,
> +controlled_userns_caps_whitelist.cap,
> +_KERNEL_CAPABILITY_U32S);
> + if (ret != CAP_LAST_CAP)
> + return -1;
> +
> + scnprintf(tbuf, NAME_MAX, "%*pb", 

Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-09 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> Of course. Let's take an example of the CVE that I have mentioned in
> my cover-letter -
> CVE-2017-7308(https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-7308).
> It's well documented and even has a
> exploit(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)
> c-program that can demonstrate how it can be used against non-patched
> kernel. There is very nice blog
> post(https://googleprojectzero.blogspot.kr/2017/05/exploiting-linux-kernel-via-packet.html)
> about this vulnerability by Andrey Konovalov.

Ok, thanks.  It's a good example because the fix for this CVE actually
came by itself 
(http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/changelog).
Normally multiple CVEs come at the same time, which would make a
workaround for one now helpful.  This is a good counter-example.

I'm going to maintain that I really don't like this.  But it looks
useful, so ack on the concept, I'll just have to look again at the
code now.  Thanks for indulging me.

-serge


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-08 Thread Serge E. Hallyn
On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner
>  wrote:
> > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) 
> > wrote:
> >> Sorry folks I was traveling and seems like lot happened on this thread. :p
> >>
> >> I will try to response few of these comments selectively -
> >>
> >> > The thing that makes me hesitate with this set is that it is a
> >> > permanent new feature to address what (I hope) is a temporary
> >> > problem.
> >> I agree this is permanent new feature but it's not solving a temporary
> >> problem. It's impossible to assess what and when new vulnerability
> >> that could show up. I think Daniel summed it up appropriately in his
> >> response
> >>
> >> > Seems like there are two naive ways to do it, the first being to just
> >> > look at all code under ns_capable() plus code called from there.  It
> >> > seems like looking at the result of that could be fruitful.
> >> This is really hard. The main issue that there were features designed
> >> and developed before user-ns days with an assumption that unprivileged
> >> users will never get certain capabilities which only root user gets.
> >> Now that is not true anymore with user-ns creation with mapping root
> >> for any process. Also at the same time blocking user-ns creation for
> >> eveyone is a big-hammer which is not needed too. So it's not that easy
> >> to just perform a code-walk-though and correct those decisions now.
> >>
> >> > It seems to me that the existing control in
> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> >> > in that case.
> >> This solution is essentially blocking unprivileged users from using
> >> the user-namespaces entirely. This is not really a solution that can
> >> work. The solution that this patch-set adds allows unprivileged users
> >> to create user-namespaces. Actually the proposed solution is more
> >> fine-grained approach than the unprivileged_userns_clone solution
> >> since you can selectively block capabilities rather than completely
> >> blocking the functionality.
> >
> > I've been talking to Stéphane today about this and we should also keep in 
> > mind
> > that we have:
> >
> > chb@conventiont|~
> >> ls -al /proc/sys/user/
> > total 0
> > dr-xr-xr-x 1 root root 0 Nov  6 23:32 .
> > dr-xr-xr-x 1 root root 0 Nov  2 22:13 ..
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_cgroup_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_instances
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_inotify_watches
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_ipc_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_mnt_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_net_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_pid_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_user_namespaces
> > -rw-r--r-- 1 root root 0 Nov  8 19:48 max_uts_namespaces
> >
> > These files allow you to limit the number of namespaces that can be created
> > *per namespace* type. So let's say your system runs a bunch of user 
> > namespaces
> > you can do:
> >
> > chb@conventiont|~
> >> echo 0 > /proc/sys/user/max_user_namespaces
> >
> > So that the next time you try to create a user namespaces you'd see:
> >
> > chb@conventiont|~
> >> unshare -U
> > unshare: unshare failed: No space left on device
> >
> > So there's not even a need to upstream a new sysctl since we have ways of
> > blocking this.
> >
> I'm not sure how it's solving the problem that my patch-set is addressing?
> I agree though that the need for unprivileged_userns_clone sysctl goes
> away as this is equivalent to setting that sysctl to 0 as you have
> described above.

oh right that was the reasoning iirc for not needing the other sysctl.

> However as I mentioned earlier, blocking processes from creating
> user-namespaces is not the solution. Processes should be able to
> create namespaces as they are designed but at the same time we need to
> have controls to 'contain' them if a need arise. Setting max_no to 0
> is not the solution that I'm looking for since it doesn't solve the
> problem.

well yesterday we were told that was explicitly not the goal, but that was 
not by you ... i just mention it to explain why we seem to be walking in
circles a bit.

anyway the bounding set doesn't actually make sense so forget that.   the
question then is just whether it makes sense to allow things to continue
at all in this situation.  would you mind indulging me by giving one or two
concrete examples in the previous known cves of what capabilities you would
have dropped tto allow the rest to continue to be safely used?

thanks,
serge


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-06 Thread Serge E. Hallyn
On Mon, Nov 06, 2017 at 07:01:58PM -0500, Boris Lukashev wrote:
> On Mon, Nov 6, 2017 at 6:39 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > Quoting Boris Lukashev (blukas...@sempervictus.com):
> >> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> >> > Quoting Daniel Micay (danielmi...@gmail.com):
> >> >> Substantial added attack surface will never go away as a problem. There
> >> >> aren't a finite number of vulnerabilities to be found.
> >> >
> >> > There's varying levels of usefulness and quality.  There is code which I
> >> > want to be able to use in a container, and code which I can't ever see a
> >> > reason for using there.  The latter, especially if it's also in a
> >> > staging driver, would be nice to have a toggle to disable.
> >> >
> >> > You're not advocating dropping the added attack surface, only adding a
> >> > way of dealing with an 0day after the fact.  Privilege raising 0days can
> >> > exist anywhere, not just in code which only root in a user namespace can
> >> > exercise.  So from that point of view, ksplice seems a more complete
> >> > solution.  Why not just actually fix the bad code block when we know
> >> > about it?
> >> >
> >> > Finally, it has been well argued that you can gain many new caps from
> >> > having only a few others.  Given that, how could you ever be sure that,
> >> > if an 0day is found which allows root in a user ns to abuse
> >> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> >> > would suffice?  It seems to me that the existing control in
> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> >> > in that case.
> >> >
> >> > -serge
> >>
> >> This seems to be heading toward "we need full zones in Linux" with
> >> their own procfs and sysfs namespace and a stricter isolation model
> >> for resources and capabilities. So long as things can happen in a
> >> namespace which have a privileged relationship with host resources,
> >> this is going to be cat-and-mouse to one degree or another.
> >>
> >> Containers and namespaces dont have a one-to-one relationship, so i'm
> >> not sure that's the best term to use in the kernel security context
> >
> > Sorry - what's not the best term to use?
> 
> Pardon, "containers," since they're namespaces+system construct.
> 
> >
> >> since there's a bunch of userspace and implementation delta across the
> >> different systems (with their own security models and so forth).
> >> Without accounting for what a specific implementation may or may not
> >> do, and only looking at "how do we reduce privileged impact on parent
> >> context from unprivileged namespaces," this patch does seem to provide
> >> a logical way of reducing the privileges available in such a namespace
> >> and often needed to mount escapes/impact parent context.
> >
> > What different implementations do is irrelevant - as an unprivileged user
> > I can always, with no help, create a new user namespace mapping my current
> > uid to root, and exercise this code.  So the security model implemented
> > by a particular userspace namespace-using driver doesn't matter, as it
> > only restricts me if I choose to use it.
> >
> > But, I guess you're actually saying that some program might know that it
> > should never use network code so want to drop CAP_NET_*?  And you're
> > saying that a "global capability bounding set" might be useful?
> >
> 
> The "global capability bounding set" with forced inheritance can be
> used to prevent the vector you describe wherein the capability of UID
> 0 in the child NS is restricted from the parent implicitly, so yes,
> that nomenclature seems appropriate.
> 
> > Would it be better to actually implement it as a new bounding set that
> > is maintained across user namespace creations, but is per-task (inherted
> > by children of course)?  Instead of a sysctl?
> >
> > -serge
> 
> In line with the previous comment, the inheritance across subsequent
> invocations should be forced to prevent the context you described.
> Please pardon my ignorance, not sure what you mean in terms of
> "per-task" across namespace creation.

I meant each task has a perm_cap_bset next to the cap_bset.  So task
p1 (if it has privilege) can drop CAP_SYS_ADMIN from perm_cap_bset,
p2 (if it has privilege) can drop CAP_NET_ADMIN.  When p1 creates a
new user_ns, that init task has its cap_bset set to all caps but
CAP_SYS_ADMIN.

I think for simplicity perm_cap_bset would *only* affect the filling
of cap_bset at user namespace creation.  So if you wanted to drop a
capability from your own cap_bset as well, you'd have to do that
separately.


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-06 Thread Serge E. Hallyn
On Mon, Nov 06, 2017 at 09:16:03PM -0500, Daniel Micay wrote:
> On Mon, 2017-11-06 at 16:14 -0600, Serge E. Hallyn wrote:
> > Quoting Daniel Micay (danielmi...@gmail.com):
> > > Substantial added attack surface will never go away as a problem.
> > > There
> > > aren't a finite number of vulnerabilities to be found.
> > 
> > There's varying levels of usefulness and quality.  There is code which
> > I
> > want to be able to use in a container, and code which I can't ever see
> > a
> > reason for using there.  The latter, especially if it's also in a
> > staging driver, would be nice to have a toggle to disable.
> > 
> > You're not advocating dropping the added attack surface, only adding a
> > way of dealing with an 0day after the fact.  Privilege raising 0days
> > can
> > exist anywhere, not just in code which only root in a user namespace
> > can
> > exercise.  So from that point of view, ksplice seems a more complete
> > solution.  Why not just actually fix the bad code block when we know
> > about it?
> 
> That's not what I'm advocating. I only care about it for proactive
> attack surface reduction downstream. I have no interest in using it to
> block access to known vulnerabilities.
> 
> > Finally, it has been well argued that you can gain many new caps from
> > having only a few others.  Given that, how could you ever be sure
> > that,
> > if an 0day is found which allows root in a user ns to abuse
> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> > would suffice?
> 
> I didn't suggest using it that way...
> 
> >  It seems to me that the existing control in
> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct
> > tape
> > in that case.
> 
> There's no such thing as unprivileged_userns_clone in mainline.

Hm.  I was sure Kees had gotten that in...  I guess I was wrong.

> The advantage of this over unprivileged_userns_clone in Debian and maybe
> some other distributions is not giving up unprivileged app containers /
> sandboxes implemented via user namespaces.  For example, Chromium's user
> namespace sandbox likely only needs to have CAP_SYS_CHROOT. Chromium
> will be dropping their setuid sandbox, forcing usage of user namespaces
> to avoid losing the sandbox which will greatly increase local kernel
> attack surface on the host by exposing netfilter management, etc. to
> unprivileged users.
> 
> The proposed approach isn't necessarily the best way to implement this
> kind of mitigation but I think it's filling a real need.

I think I definately prefer what I mentioned in the email to Boris.
Basically a "permanent capability bounding set".  The normal bounding
set gets reset to a full set on every new user_ns creation.  In this
proposal, it would instead be set to the calling task's permanent
capability set, which starts (at boot) full, and which privileged
tasks can pull capabilities out of.

-serge


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-06 Thread Serge E. Hallyn
Quoting Boris Lukashev (blukas...@sempervictus.com):
> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > Quoting Daniel Micay (danielmi...@gmail.com):
> >> Substantial added attack surface will never go away as a problem. There
> >> aren't a finite number of vulnerabilities to be found.
> >
> > There's varying levels of usefulness and quality.  There is code which I
> > want to be able to use in a container, and code which I can't ever see a
> > reason for using there.  The latter, especially if it's also in a
> > staging driver, would be nice to have a toggle to disable.
> >
> > You're not advocating dropping the added attack surface, only adding a
> > way of dealing with an 0day after the fact.  Privilege raising 0days can
> > exist anywhere, not just in code which only root in a user namespace can
> > exercise.  So from that point of view, ksplice seems a more complete
> > solution.  Why not just actually fix the bad code block when we know
> > about it?
> >
> > Finally, it has been well argued that you can gain many new caps from
> > having only a few others.  Given that, how could you ever be sure that,
> > if an 0day is found which allows root in a user ns to abuse
> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> > would suffice?  It seems to me that the existing control in
> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> > in that case.
> >
> > -serge
> 
> This seems to be heading toward "we need full zones in Linux" with
> their own procfs and sysfs namespace and a stricter isolation model
> for resources and capabilities. So long as things can happen in a
> namespace which have a privileged relationship with host resources,
> this is going to be cat-and-mouse to one degree or another.
> 
> Containers and namespaces dont have a one-to-one relationship, so i'm
> not sure that's the best term to use in the kernel security context

Sorry - what's not the best term to use?

> since there's a bunch of userspace and implementation delta across the
> different systems (with their own security models and so forth).
> Without accounting for what a specific implementation may or may not
> do, and only looking at "how do we reduce privileged impact on parent
> context from unprivileged namespaces," this patch does seem to provide
> a logical way of reducing the privileges available in such a namespace
> and often needed to mount escapes/impact parent context.

What different implementations do is irrelevant - as an unprivileged user
I can always, with no help, create a new user namespace mapping my current
uid to root, and exercise this code.  So the security model implemented
by a particular userspace namespace-using driver doesn't matter, as it
only restricts me if I choose to use it.

But, I guess you're actually saying that some program might know that it
should never use network code so want to drop CAP_NET_*?  And you're
saying that a "global capability bounding set" might be useful?

Would it be better to actually implement it as a new bounding set that
is maintained across user namespace creations, but is per-task (inherted
by children of course)?  Instead of a sysctl?

-serge


Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-06 Thread Serge E. Hallyn
Quoting Daniel Micay (danielmi...@gmail.com):
> Substantial added attack surface will never go away as a problem. There
> aren't a finite number of vulnerabilities to be found.

There's varying levels of usefulness and quality.  There is code which I
want to be able to use in a container, and code which I can't ever see a
reason for using there.  The latter, especially if it's also in a
staging driver, would be nice to have a toggle to disable.

You're not advocating dropping the added attack surface, only adding a
way of dealing with an 0day after the fact.  Privilege raising 0days can
exist anywhere, not just in code which only root in a user namespace can
exercise.  So from that point of view, ksplice seems a more complete
solution.  Why not just actually fix the bad code block when we know
about it?

Finally, it has been well argued that you can gain many new caps from
having only a few others.  Given that, how could you ever be sure that,
if an 0day is found which allows root in a user ns to abuse
CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
would suffice?  It seems to me that the existing control in
/proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
in that case.

-serge


Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-06 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (महेश बंडेवार) (mahe...@google.com):
> On Sat, Nov 4, 2017 at 4:53 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> >
> > Quoting Mahesh Bandewar (mah...@bandewar.net):
> > > Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> > > that belongs to uncontrolled user-ns can create another (child) user-
> > > namespace that is uncontrolled. Any other process (that either does
> > > not have SYS_ADMIN or belongs to a controlled user-ns) can only
> > > create a user-ns that is controlled.
> >
> > That's a huge change though.  It means that any system that previously
> > used unprivileged containers will need new privileged code (which always
> > risks more privilege leaks through the new code) to re-enable what was
> > possible without privilege before.  That's a regression.
> >
> I wouldn't call it a regression since the existing behavior is
> preserved as it is if the default-mask is not altered. i.e.
> uncontrolled process can create user-ns and have full control inside
> that user-ns. The only difference is - as an example if 'something'
> comes up which makes a specific capability expose ring-0, so admin can
> quickly remove the capability in question from the mask, so that no
> untrusted code can exploit that capability until either the kernel is

Oh, sorry, I misread then, and missed that step.  I thought the default
with this patchset was that there were no capabilities exposed to user
namespaces.

> patched or workloads are sanitized keeping in mind what was
> discovered. (I have given some real life example vulnerabilities
> published recently about CAP_NET_RAW in the cover letter)
> 
> > I'm very much interested in what you want to do,  But it seems like
> > it would be worth starting with some automated code analysis that shows
> > exactly what code becomes accessible to unprivileged users with user
> > namespaces which was accessible to unprivileged users before.  Then we
> > can reason about classifying that code and perhaps limiting access to
> > some of it.
> I would like to look at this as 'a tool' that is available to admins
> who can quickly take possible-compromise-situation under-control
> probably at the cost of some functionality-loss until kernel is
> patched and the mask is restored to default value.

The thing that makes me hesitate with this set is that it is a
permanent new feature to address what (I hope) is a temporary
problem.  What would you think about doing this as a stackable
(yama-style) LSM?

> I'm not sure if automated tools could discover anything since these
> changes should not alter behavior in any way.

Seems like there are two naive ways to do it, the first being to just
look at all code under ns_capable() plus code called from there.  It
seems like looking at the result of that could be fruitful.

-serge


Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces

2017-11-04 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (mah...@bandewar.net):
> Init-user-ns is always uncontrolled and a process that has SYS_ADMIN
> that belongs to uncontrolled user-ns can create another (child) user-
> namespace that is uncontrolled. Any other process (that either does
> not have SYS_ADMIN or belongs to a controlled user-ns) can only
> create a user-ns that is controlled.

That's a huge change though.  It means that any system that previously
used unprivileged containers will need new privileged code (which always
risks more privilege leaks through the new code) to re-enable what was
possible without privilege before.  That's a regression.

I'm very much interested in what you want to do,  But it seems like
it would be worth starting with some automated code analysis that shows
exactly what code becomes accessible to unprivileged users with user
namespaces which was accessible to unprivileged users before.  Then we
can reason about classifying that code and perhaps limiting access to
some of it.


Re: [kernel-hardening] [PATCH 0/2] capability controlled user-namespaces

2017-10-02 Thread Serge E. Hallyn
Quoting Mahesh Bandewar (mah...@bandewar.net):
> From: Mahesh Bandewar 
> 
> [Same as the previous RFC series sent on 9/21]
> 
> TL;DR version
> -
> Creating a sandbox environment with namespaces is challenging
> considering what these sandboxed processes can engage into. e.g.
> CVE-2017-6074, CVE-2017-7184, CVE-2017-7308 etc. just to name few.
> Current form of user-namespaces, however, if changed a bit can allow
> us to create a sandbox environment without locking down user-
> namespaces.
> 
> Detailed version
> 

Hi,

still struggling with how I feel about the idea in general.

So is the intent mainly that if/when there comes an 0-day which allows
users with CAP_NET_ADMIN in any namespace to gain privilege on the host,
then this can be used as a stop-gap measure until there is a proper fix?

Otherwise, do you have any guidance for how people should use this?

IMO it should be heavily discouraged to use this tool as a regular
day to day configuration, as I'm not sure there is any "educated"
decision to be made, even by those who are in the know, about what
to put in this set.

> Problem
> ---
> User-namespaces in the current form have increased the attack surface as
> any process can acquire capabilities which are not available to them (by
> default) by performing combination of clone()/unshare()/setns() syscalls.
> 
> #define _GNU_SOURCE
> #include 
> #include 
> #include 
> 
> int main(int ac, char **av)
> {
> int sock = -1;
> 
> printf("Attempting to open RAW socket before unshare()...\n");
> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
> if (sock < 0) {
> perror("socket() SOCK_RAW failed: ");
> } else {
> printf("Successfully opened RAW-Sock before unshare().\n");
> close(sock);
> sock = -1;
> }
> 
> if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) {
> perror("unshare() failed: ");
> return 1;
> }
> 
> printf("Attempting to open RAW socket after unshare()...\n");
> sock = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW);
> if (sock < 0) {
> perror("socket() SOCK_RAW failed: ");
> } else {
> printf("Successfully opened RAW-Sock after unshare().\n");
> close(sock);
> sock = -1;
> }
> 
> return 0;
> }
> 
> The above example shows how easy it is to acquire NET_RAW capabilities
> and once acquired, these processes could take benefit of above mentioned
> or similar issues discovered/undiscovered with malicious intent. Note
> that this is just an example and the problem/solution is not limited
> to NET_RAW capability *only*. 
> 
> The easiest fix one can apply here is to lock-down user-namespaces which
> many of the distros do (i.e. don't allow users to create user namespaces),
> but unfortunately that prevents everyone from using them.
> 
> Approach
> 
> Introduce a notion of 'controlled' user-namespaces. Every process on
> the host is allowed to create user-namespaces (governed by the limit
> imposed by per-ns sysctl) however, mark user-namespaces created by
> sandboxed processes as 'controlled'. Use this 'mark' at the time of
> capability check in conjunction with a global capability whitelist.
> If the capability is not whitelisted, processes that belong to 
> controlled user-namespaces will not be allowed.
> 
> Once a user-ns is marked as 'controlled'; all its child user-
> namespaces are marked as 'controlled' too.
> 
> A global whitelist is list of capabilities governed by the
> sysctl which is available to (privileged) user in init-ns to modify
> while it's applicable to all controlled user-namespaces on the host.
> 
> Marking user-namespaces controlled without modifying the whitelist is
> equivalent of the current behavior. The default value of whitelist includes
> all capabilities so that the compatibility is maintained. However it gives
> admins fine-grained ability to control various capabilities system wide
> without locking down user-namespaces.
> 
> Please see individual patches in this series.
> 
> Mahesh Bandewar (2):
>   capability: introduce sysctl for controlled user-ns capability
> whitelist
>   userns: control capabilities of some user namespaces
> 
>  Documentation/sysctl/kernel.txt | 21 +
>  include/linux/capability.h  |  4 
>  include/linux/user_namespace.h  | 20 
>  kernel/capability.c | 52 
> +
>  kernel/sysctl.c |  5 
>  kernel/user_namespace.c |  3 +++
>  security/commoncap.c|  8 +++
>  7 files changed, 113 insertions(+)
> 
> -- 
> 2.14.2.822.g60be5d43e6-goog


Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

2016-12-05 Thread Serge E. Hallyn
On Mon, Dec 05, 2016 at 04:36:51PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 5, 2016 at 4:28 PM, John Stultz  wrote:
> > On Tue, Nov 22, 2016 at 4:57 PM, John Stultz  wrote:
> >> On Tue, Nov 8, 2016 at 4:12 PM, Andy Lutomirski  
> >> wrote:
> >>> On Tue, Nov 8, 2016 at 4:03 PM, Alexei Starovoitov
> >>>  wrote:
>  On Tue, Nov 08, 2016 at 03:51:40PM -0800, Andy Lutomirski wrote:
> >
> > I hate to say it, but I think I may see a problem.  Current
> > developments are afoot to make cgroups do more than resource control.
> > For example, there's Landlock and there's Daniel's ingress/egress
> > filter thing.  Current cgroup controllers can mostly just DoS their
> > controlled processes.  These new controllers (or controller-like
> > things) can exfiltrate data and change semantics.
> >
> > Does anyone have a security model in mind for these controllers and
> > the cgroups that they're attached to?  I'm reasonably confident that
> > CAP_SYS_RESOURCE is not the answer...
> 
>  and specifically the answer is... ?
>  Also would be great if you start with specifying the question first
>  and the problem you're trying to solve.
> 
> >>>
> >>> I don't have a good answer right now.  Here are some constraints, though:
> >>>
> >>> 1. An insufficiently privileged process should not be able to move a
> >>> victim into a dangerous cgroup.
> >>>
> >>> 2. An insufficiently privileged process should not be able to move
> >>> itself into a dangerous cgroup and then use execve to gain privilege
> >>> such that the execve'd program can be compromised.
> >>>
> >>> 3. An insufficiently privileged process should not be able to make an
> >>> existing cgroup dangerous in a way that could compromise a victim in
> >>> that cgroup.
> >>>
> >>> 4. An insufficiently privileged process should not be able to make a
> >>> cgroup dangerous in a way that bypasses protections that would
> >>> otherwise protect execve() as used by itself or some other process in
> >>> that cgroup.
> >>>
> >>> Keep in mind that "dangerous" may apply to a cgroup's descendents in
> >>> addition to the cgroup being controlled.
> >>
> >> Sorry for taking awhile to get back to you here.  I'm a little
> >> befuddled as to what next steps I should consider (and honestly, I'm
> >> not totally sure I really grok your concern here, particularly what
> >> you mean with "dangrous cgroups").
> >>
> >> So is going back to the CAP_CGROUP_MIGRATE approach (to properly
> >> separate "sufficiently" from "insufficiently privileged") better?
> >>
> >> Or something closer to the original method Android used of each cgroup
> >> having an allow_attach() check which could determine what is
> >> sufficiently privledged for the respective level of danger the cgroup
> >> might poise?
> >>
> >> Or just stepping back, what method would you imagine to be reasonable
> >> to allow a specified task to migrate other tasks between cgroups
> >> without it having to be root/suid?
> >
> > Any suggested feedback here?
> 
> I really don't know.  The cgroupfs interface is a bit unfortunate in
> that it doesn't really express the constraints.  To safely migrate a
> task, ISTM you ought to have some form of privilege over the task
> *and* some form of privilege over the cgroup.

Agreed.  The problem is that the privilege required should depend on
the controller (I guess).  For memory and cpuset, CAP_SYS_NICE seems
right.  Perhaps CAP_SYS_RESOURCE would be needed for some..  but then,
as I look through the lists (capabilities(7) and the list of controllers),
it seems like CAP_SYS_NICE works for everything.  What else would we need?
Maybe CAP_NET_ADMIN for net_cls and net_prio?  CAP_SYS_RESOURCE|CAP_SYS_ADMIN
for pids?

>   cgroupfs only handles
> the latter.

If we need different checks for different controllers, we can add
checks to cgroupfs.

> CAP_CGROUP_MIGRATE ought to be okay.  Or maybe cgroupfs needs to gain
> a concept of "dangerous" cgroups and further restrict them and
> CAP_SYS_RESOURCE should be fine for non-dangerous cgroups?  I think I
> favor the latter, but it might be nice to hear from Tejun first.
> 
> --Andy


Re: [PATCH v2 10/10] mntns: Add a limit on the number of mount namespaces.

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

Thanks, Eric.

> ---
>  fs/namespace.c | 19 ++-
>  include/linux/user_namespace.h |  1 +
>  kernel/user_namespace.c|  1 +
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index aabe8e397fc3..3942ae6c34f5 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2718,9 +2718,20 @@ dput_out:
>   return retval;
>  }
>  
> +static bool inc_mnt_namespaces(struct user_namespace *ns)
> +{
> + return inc_ucount(ns, UCOUNT_MNT_NAMESPACES);
> +}
> +
> +static void dec_mnt_namespaces(struct user_namespace *ns)
> +{
> + dec_ucount(ns, UCOUNT_MNT_NAMESPACES);
> +}
> +
>  static void free_mnt_ns(struct mnt_namespace *ns)
>  {
>   ns_free_inum(>ns);
> + dec_mnt_namespaces(ns->user_ns);
>   put_user_ns(ns->user_ns);
>   kfree(ns);
>  }
> @@ -2739,12 +2750,18 @@ static struct mnt_namespace *alloc_mnt_ns(struct 
> user_namespace *user_ns)
>   struct mnt_namespace *new_ns;
>   int ret;
>  
> + if (!inc_mnt_namespaces(user_ns))
> + return ERR_PTR(-ENFILE);
> +
>   new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
> - if (!new_ns)
> + if (!new_ns) {
> + dec_mnt_namespaces(user_ns);
>   return ERR_PTR(-ENOMEM);
> + }
>   ret = ns_alloc_inum(_ns->ns);
>   if (ret) {
>   kfree(new_ns);
> + dec_mnt_namespaces(user_ns);
>   return ERR_PTR(ret);
>   }
>   new_ns->ns.ops = _operations;
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index f86afa536baf..ee0e9d7b2e67 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -28,6 +28,7 @@ enum ucounts {
>   UCOUNT_UTS_NAMESPACES,
>   UCOUNT_IPC_NAMESPACES,
>   UCOUNT_NET_NAMESPACES,
> + UCOUNT_MNT_NAMESPACES,
>   UCOUNT_CGROUP_NAMESPACES,
>   UCOUNT_COUNTS,
>  };
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index e326ca722ae0..f6702d84afab 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -81,6 +81,7 @@ static struct ctl_table userns_table[] = {
>   UCOUNT_ENTRY("max_uts_namespaces"),
>   UCOUNT_ENTRY("max_ipc_namespaces"),
>   UCOUNT_ENTRY("max_net_namespaces"),
> + UCOUNT_ENTRY("max_mnt_namespaces"),
>   UCOUNT_ENTRY("max_cgroup_namespaces"),
>   { }
>  };
> -- 
> 2.8.3


Re: [PATCH v2 09/10] netns: Add a limit on the number of net namespaces

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

> ---
>  include/linux/user_namespace.h |  1 +
>  kernel/user_namespace.c|  1 +
>  net/core/net_namespace.c   | 15 +++
>  3 files changed, 17 insertions(+)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 1a3a9cb93277..f86afa536baf 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -27,6 +27,7 @@ enum ucounts {
>   UCOUNT_PID_NAMESPACES,
>   UCOUNT_UTS_NAMESPACES,
>   UCOUNT_IPC_NAMESPACES,
> + UCOUNT_NET_NAMESPACES,
>   UCOUNT_CGROUP_NAMESPACES,
>   UCOUNT_COUNTS,
>  };
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 1cf074cb47e2..e326ca722ae0 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -80,6 +80,7 @@ static struct ctl_table userns_table[] = {
>   UCOUNT_ENTRY("max_pid_namespaces"),
>   UCOUNT_ENTRY("max_uts_namespaces"),
>   UCOUNT_ENTRY("max_ipc_namespaces"),
> + UCOUNT_ENTRY("max_net_namespaces"),
>   UCOUNT_ENTRY("max_cgroup_namespaces"),
>   { }
>  };
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 2c2eb1b629b1..a489f192d619 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -266,6 +266,16 @@ struct net *get_net_ns_by_id(struct net *net, int id)
>   return peer;
>  }
>  
> +static bool inc_net_namespaces(struct user_namespace *ns)
> +{
> + return inc_ucount(ns, UCOUNT_NET_NAMESPACES);
> +}
> +
> +static void dec_net_namespaces(struct user_namespace *ns)
> +{
> + dec_ucount(ns, UCOUNT_NET_NAMESPACES);
> +}
> +
>  /*
>   * setup_net runs the initializers for the network namespace object.
>   */
> @@ -276,6 +286,9 @@ static __net_init int setup_net(struct net *net, struct 
> user_namespace *user_ns)
>   int error = 0;
>   LIST_HEAD(net_exit_list);
>  
> + if (!inc_net_namespaces(user_ns))
> + return -ENFILE;
> +
>   atomic_set(>count, 1);
>   atomic_set(>passive, 1);
>   net->dev_base_seq = 1;
> @@ -372,6 +385,7 @@ struct net *copy_net_ns(unsigned long flags,
>   }
>   mutex_unlock(_mutex);
>   if (rv < 0) {
> + dec_net_namespaces(user_ns);
>   put_user_ns(user_ns);
>   net_drop_ns(net);
>   return ERR_PTR(rv);
> @@ -444,6 +458,7 @@ static void cleanup_net(struct work_struct *work)
>   /* Finally it is safe to free my network namespace structure */
>   list_for_each_entry_safe(net, tmp, _exit_list, exit_list) {
>   list_del_init(>exit_list);
> + dec_net_namespaces(net->user_ns);
>   put_user_ns(net->user_ns);
>   net_drop_ns(net);
>   }
> -- 
> 2.8.3


Re: [PATCH v2 08/10] cgroupns: Add a limit on the number of cgroup namespaces

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

> ---
>  include/linux/user_namespace.h |  1 +
>  kernel/cgroup.c| 15 +++
>  kernel/user_namespace.c|  1 +
>  3 files changed, 17 insertions(+)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 367cf08ff63d..1a3a9cb93277 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -27,6 +27,7 @@ enum ucounts {
>   UCOUNT_PID_NAMESPACES,
>   UCOUNT_UTS_NAMESPACES,
>   UCOUNT_IPC_NAMESPACES,
> + UCOUNT_CGROUP_NAMESPACES,
>   UCOUNT_COUNTS,
>  };
>  
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 86cb5c6e8932..240eec43390b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -6260,6 +6260,16 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd)
>  
>  /* cgroup namespaces */
>  
> +static bool inc_cgroup_namespaces(struct user_namespace *ns)
> +{
> + return inc_ucount(ns, UCOUNT_CGROUP_NAMESPACES);
> +}
> +
> +static void dec_cgroup_namespaces(struct user_namespace *ns)
> +{
> + dec_ucount(ns, UCOUNT_CGROUP_NAMESPACES);
> +}
> +
>  static struct cgroup_namespace *alloc_cgroup_ns(void)
>  {
>   struct cgroup_namespace *new_ns;
> @@ -6281,6 +6291,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
>  void free_cgroup_ns(struct cgroup_namespace *ns)
>  {
>   put_css_set(ns->root_cset);
> + dec_cgroup_namespaces(ns->user_ns);
>   put_user_ns(ns->user_ns);
>   ns_free_inum(>ns);
>   kfree(ns);
> @@ -6305,6 +6316,9 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long 
> flags,
>   if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>   return ERR_PTR(-EPERM);
>  
> + if (!inc_cgroup_namespaces(user_ns))
> + return ERR_PTR(-ENFILE);
> +
>   mutex_lock(_mutex);
>   spin_lock_bh(_set_lock);
>  
> @@ -6317,6 +6331,7 @@ struct cgroup_namespace *copy_cgroup_ns(unsigned long 
> flags,
>   new_ns = alloc_cgroup_ns();
>   if (IS_ERR(new_ns)) {
>   put_css_set(cset);
> + dec_cgroup_namespaces(user_ns);
>   return new_ns;
>   }
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 060d3e099f87..1cf074cb47e2 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -80,6 +80,7 @@ static struct ctl_table userns_table[] = {
>   UCOUNT_ENTRY("max_pid_namespaces"),
>   UCOUNT_ENTRY("max_uts_namespaces"),
>   UCOUNT_ENTRY("max_ipc_namespaces"),
> + UCOUNT_ENTRY("max_cgroup_namespaces"),
>   { }
>  };
>  #endif /* CONFIG_SYSCTL */
> -- 
> 2.8.3


Re: [PATCH v2 07/10] ipcns: Add a limit on the number of ipc namespaces

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

> ---
>  include/linux/user_namespace.h |  1 +
>  ipc/namespace.c| 42 
> +++---
>  kernel/user_namespace.c|  1 +
>  3 files changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index bed2506081fe..367cf08ff63d 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -26,6 +26,7 @@ enum ucounts {
>   UCOUNT_USER_NAMESPACES,
>   UCOUNT_PID_NAMESPACES,
>   UCOUNT_UTS_NAMESPACES,
> + UCOUNT_IPC_NAMESPACES,
>   UCOUNT_COUNTS,
>  };
>  
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 04cb07eb81f1..3996a1e41a1d 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -16,33 +16,42 @@
>  
>  #include "util.h"
>  
> +static bool inc_ipc_namespaces(struct user_namespace *ns)
> +{
> + return inc_ucount(ns, UCOUNT_IPC_NAMESPACES);
> +}
> +
> +static void dec_ipc_namespaces(struct user_namespace *ns)
> +{
> + dec_ucount(ns, UCOUNT_IPC_NAMESPACES);
> +}
> +
>  static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
>  struct ipc_namespace *old_ns)
>  {
>   struct ipc_namespace *ns;
>   int err;
>  
> + err = -ENFILE;
> + if (!inc_ipc_namespaces(user_ns))
> + goto fail;
> +
> + err = -ENOMEM;
>   ns = kmalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
>   if (ns == NULL)
> - return ERR_PTR(-ENOMEM);
> + goto fail_dec;
>  
>   err = ns_alloc_inum(>ns);
> - if (err) {
> - kfree(ns);
> - return ERR_PTR(err);
> - }
> + if (err)
> + goto fail_free;
>   ns->ns.ops = _operations;
>  
>   atomic_set(>count, 1);
>   ns->user_ns = get_user_ns(user_ns);
>  
>   err = mq_init_ns(ns);
> - if (err) {
> - put_user_ns(ns->user_ns);
> - ns_free_inum(>ns);
> - kfree(ns);
> - return ERR_PTR(err);
> - }
> + if (err)
> + goto fail_put;
>   atomic_inc(_ipc_ns);
>  
>   sem_init_ns(ns);
> @@ -50,6 +59,16 @@ static struct ipc_namespace *create_ipc_ns(struct 
> user_namespace *user_ns,
>   shm_init_ns(ns);
>  
>   return ns;
> +
> +fail_put:
> + put_user_ns(ns->user_ns);
> + ns_free_inum(>ns);
> +fail_free:
> + kfree(ns);
> +fail_dec:
> + dec_ipc_namespaces(user_ns);
> +fail:
> + return ERR_PTR(err);
>  }
>  
>  struct ipc_namespace *copy_ipcs(unsigned long flags,
> @@ -98,6 +117,7 @@ static void free_ipc_ns(struct ipc_namespace *ns)
>   shm_exit_ns(ns);
>   atomic_dec(_ipc_ns);
>  
> + dec_ipc_namespaces(ns->user_ns);
>   put_user_ns(ns->user_ns);
>   ns_free_inum(>ns);
>   kfree(ns);
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 6b205c24e888..060d3e099f87 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -79,6 +79,7 @@ static struct ctl_table userns_table[] = {
>   UCOUNT_ENTRY("max_user_namespaces"),
>   UCOUNT_ENTRY("max_pid_namespaces"),
>   UCOUNT_ENTRY("max_uts_namespaces"),
> + UCOUNT_ENTRY("max_ipc_namespaces"),
>   { }
>  };
>  #endif /* CONFIG_SYSCTL */
> -- 
> 2.8.3


Re: [PATCH v2 06/10] utsns: Add a limit on the number of uts namespaces

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

> ---
>  include/linux/user_namespace.h |  1 +
>  kernel/user_namespace.c|  1 +
>  kernel/utsname.c   | 31 ++-
>  3 files changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 47733637741a..bed2506081fe 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -25,6 +25,7 @@ struct uid_gid_map {/* 64 bytes -- 1 cache line */
>  enum ucounts {
>   UCOUNT_USER_NAMESPACES,
>   UCOUNT_PID_NAMESPACES,
> + UCOUNT_UTS_NAMESPACES,
>   UCOUNT_COUNTS,
>  };
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 02a03ead7afc..6b205c24e888 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -78,6 +78,7 @@ static int count_max = COUNT_MAX;
>  static struct ctl_table userns_table[] = {
>   UCOUNT_ENTRY("max_user_namespaces"),
>   UCOUNT_ENTRY("max_pid_namespaces"),
> + UCOUNT_ENTRY("max_uts_namespaces"),
>   { }
>  };
>  #endif /* CONFIG_SYSCTL */
> diff --git a/kernel/utsname.c b/kernel/utsname.c
> index 831ea7108232..4f13c0419d64 100644
> --- a/kernel/utsname.c
> +++ b/kernel/utsname.c
> @@ -17,6 +17,16 @@
>  #include 
>  #include 
>  
> +static bool inc_uts_namespaces(struct user_namespace *ns)
> +{
> + return inc_ucount(ns, UCOUNT_UTS_NAMESPACES);
> +}
> +
> +static void dec_uts_namespaces(struct user_namespace *ns)
> +{
> + dec_ucount(ns, UCOUNT_UTS_NAMESPACES);
> +}
> +
>  static struct uts_namespace *create_uts_ns(void)
>  {
>   struct uts_namespace *uts_ns;
> @@ -38,15 +48,18 @@ static struct uts_namespace *clone_uts_ns(struct 
> user_namespace *user_ns,
>   struct uts_namespace *ns;
>   int err;
>  
> + err = -ENFILE;
> + if (!inc_uts_namespaces(user_ns))
> + goto fail;
> +
> + err = -ENOMEM;
>   ns = create_uts_ns();
>   if (!ns)
> - return ERR_PTR(-ENOMEM);
> + goto fail_dec;
>  
>   err = ns_alloc_inum(>ns);
> - if (err) {
> - kfree(ns);
> - return ERR_PTR(err);
> - }
> + if (err)
> + goto fail_free;
>  
>   ns->ns.ops = _operations;
>  
> @@ -55,6 +68,13 @@ static struct uts_namespace *clone_uts_ns(struct 
> user_namespace *user_ns,
>   ns->user_ns = get_user_ns(user_ns);
>   up_read(_sem);
>   return ns;
> +
> +fail_free:
> + kfree(ns);
> +fail_dec:
> + dec_uts_namespaces(user_ns);
> +fail:
> + return ERR_PTR(err);
>  }
>  
>  /*
> @@ -85,6 +105,7 @@ void free_uts_ns(struct kref *kref)
>   struct uts_namespace *ns;
>  
>   ns = container_of(kref, struct uts_namespace, kref);
> + dec_uts_namespaces(ns->user_ns);
>   put_user_ns(ns->user_ns);
>   ns_free_inum(>ns);
>   kfree(ns);
> -- 
> 2.8.3


Re: [PATCH v2 05/10] pidns: Add a limit on the number of pid namespaces

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

> ---
>  include/linux/user_namespace.h |  1 +
>  kernel/pid_namespace.c | 22 ++
>  kernel/user_namespace.c|  1 +
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index f74a0facc696..47733637741a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -24,6 +24,7 @@ struct uid_gid_map {/* 64 bytes -- 1 cache line */
>  
>  enum ucounts {
>   UCOUNT_USER_NAMESPACES,
> + UCOUNT_PID_NAMESPACES,
>   UCOUNT_COUNTS,
>  };
>  
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a65ba137fd15..049cc14ae37a 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -79,6 +79,16 @@ static void proc_cleanup_work(struct work_struct *work)
>  /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
>  #define MAX_PID_NS_LEVEL 32
>  
> +static bool inc_pid_namespaces(struct user_namespace *ns)
> +{
> + return inc_ucount(ns, UCOUNT_PID_NAMESPACES);
> +}
> +
> +static void dec_pid_namespaces(struct user_namespace *ns)
> +{
> + dec_ucount(ns, UCOUNT_PID_NAMESPACES);
> +}
> +
>  static struct pid_namespace *create_pid_namespace(struct user_namespace 
> *user_ns,
>   struct pid_namespace *parent_pid_ns)
>  {
> @@ -87,15 +97,16 @@ static struct pid_namespace *create_pid_namespace(struct 
> user_namespace *user_ns
>   int i;
>   int err;
>  
> - if (level > MAX_PID_NS_LEVEL) {
> - err = -EINVAL;
> + err = -EINVAL;
> + if (level > MAX_PID_NS_LEVEL)
> + goto out;
> + if (!inc_pid_namespaces(user_ns))
>   goto out;
> - }
>  
>   err = -ENOMEM;
>   ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
>   if (ns == NULL)
> - goto out;
> + goto out_dec;
>  
>   ns->pidmap[0].page = kzalloc(PAGE_SIZE, GFP_KERNEL);
>   if (!ns->pidmap[0].page)
> @@ -129,6 +140,8 @@ out_free_map:
>   kfree(ns->pidmap[0].page);
>  out_free:
>   kmem_cache_free(pid_ns_cachep, ns);
> +out_dec:
> + dec_pid_namespaces(user_ns);
>  out:
>   return ERR_PTR(err);
>  }
> @@ -146,6 +159,7 @@ static void destroy_pid_namespace(struct pid_namespace 
> *ns)
>   ns_free_inum(>ns);
>   for (i = 0; i < PIDMAP_ENTRIES; i++)
>   kfree(ns->pidmap[i].page);
> + dec_pid_namespaces(ns->user_ns);
>   put_user_ns(ns->user_ns);
>   call_rcu(>rcu, delayed_free_pidns);
>  }
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 728d7e4995ff..02a03ead7afc 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -77,6 +77,7 @@ static int count_max = COUNT_MAX;
>   }
>  static struct ctl_table userns_table[] = {
>   UCOUNT_ENTRY("max_user_namespaces"),
> + UCOUNT_ENTRY("max_pid_namespaces"),
>   { }
>  };
>  #endif /* CONFIG_SYSCTL */
> -- 
> 2.8.3


Re: [PATCH v2 04/10] userns: Generalize the user namespace count into ucount

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> The same kind of recursive sane default limit and policy
> countrol that has been implemented for the user namespace
> is desirable for the other namespaces, so generalize
> the user namespace refernce count into a ucount.
> 
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

> ---
>  include/linux/user_namespace.h | 32 ++--
>  kernel/fork.c  |  5 +++-
>  kernel/user_namespace.c| 55 
> +++---
>  3 files changed, 69 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index ba6a995178f9..f74a0facc696 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -22,6 +22,16 @@ struct uid_gid_map {   /* 64 bytes -- 1 cache line */
>  
>  #define USERNS_INIT_FLAGS USERNS_SETGROUPS_ALLOWED
>  
> +enum ucounts {
> + UCOUNT_USER_NAMESPACES,
> + UCOUNT_COUNTS,
> +};
> +
> +struct ucount {
> + int max;
> + atomic_t count;
> +};
> +
>  struct user_namespace {
>   struct uid_gid_map  uid_map;
>   struct uid_gid_map  gid_map;
> @@ -43,8 +53,7 @@ struct user_namespace {
>   struct ctl_table_setset;
>   struct ctl_table_header *sysctls;
>  #endif
> - int max_user_namespaces;
> - atomic_t user_namespaces;
> + struct ucount ucount[UCOUNT_COUNTS];
>  };
>  
>  extern struct user_namespace init_user_ns;
> @@ -79,6 +88,8 @@ extern ssize_t proc_setgroups_write(struct file *, const 
> char __user *, size_t,
>  extern int proc_setgroups_show(struct seq_file *m, void *v);
>  extern bool userns_may_setgroups(const struct user_namespace *ns);
>  extern bool current_in_userns(const struct user_namespace *target_ns);
> +extern bool inc_ucount(struct user_namespace *ns, enum ucounts type);
> +extern void dec_ucount(struct user_namespace *ns, enum ucounts type);
>  #else
>  
>  static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> @@ -112,6 +123,23 @@ static inline bool current_in_userns(const struct 
> user_namespace *target_ns)
>  {
>   return true;
>  }
> +
> +static inline bool inc_ucount(struct user_namespace *ns, enum ucounts type)
> +{
> + int max = READ_ONCE(init_user_ns.ucount[type].max);
> + int sum = atomic_inc_return(_user_ns.ucount[type].count);
> + if (sum > max) {
> + atomic_dec(_user_ns.ucount[type].count);
> + return false;
> + }
> + return true;
> +}
> +
> +static inline void dec_ucount(struct user_namespace *ns, enum ucounts type)
> +{
> + int dec = atomic_dec_if_positive(_user_ns.ucount[type].count);
> + WARN_ON_ONCE(dec < 0);
> +}
>  #endif
>  
>  #endif /* _LINUX_USER_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 95d5498c463f..30ca5eb4ca1e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -304,6 +304,7 @@ int arch_task_struct_size __read_mostly;
>  
>  void __init fork_init(void)
>  {
> + int i;
>  #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
>  #ifndef ARCH_MIN_TASKALIGN
>  #define ARCH_MIN_TASKALIGN   L1_CACHE_BYTES
> @@ -324,7 +325,9 @@ void __init fork_init(void)
>   init_task.signal->rlim[RLIMIT_SIGPENDING] =
>   init_task.signal->rlim[RLIMIT_NPROC];
>  
> - init_user_ns.max_user_namespaces = max_threads;
> + for (i = 0; i < UCOUNT_COUNTS; i++) {
> + init_user_ns.ucount[i].max = max_threads;
> + }
>  }
>  
>  int __weak arch_dup_task_struct(struct task_struct *dst,
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 0061550e3282..728d7e4995ff 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -66,16 +66,17 @@ static struct ctl_table_root set_root = {
>  
>  static int zero = 0;
>  static int count_max = COUNT_MAX;
> +#define UCOUNT_ENTRY(name)   \
> + {   \
> + .procname   = name, \
> + .maxlen = sizeof(int),  \
> + .mode   = 0644, \
> + .proc_handler   = proc_dointvec_minmax, \
> + .extra1 = ,\
> + .extra2 = _max,   \
> + }
>  static struct ctl_table userns_table[] = {
> - {
> - .procname   = "max_user_namespaces",
> - .data   = _user_ns.max_user_namespaces,
> - .maxlen = sizeof(init_user_ns.max_user_namespaces),
> - .mode   = 0644,
> - .proc_handler   = proc_dointvec_minmax,
> - .extra1 = ,
> - .extra2 = _max,
> - },
> + UCOUNT_ENTRY("max_user_namespaces"),
>   { }
>  };
>  #endif /* CONFIG_SYSCTL */
> @@ -87,8 +88,10 @@ static bool setup_userns_sysctls(struct user_namespace *ns)
>   

Re: [PATCH v2 03/10] userns: Add a limit on the number of user namespaces

2016-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman (ebied...@xmission.com):
> Export the export the maximum number of user namespaces as

^ note if you resend, duplicate "export the"

> /proc/sys/userns/max_user_namespaces.
> 
> Signed-off-by: "Eric W. Biederman" 

Acked-by: Serge Hallyn 

> ---
>  include/linux/user_namespace.h |  2 ++
>  kernel/fork.c  |  2 ++
>  kernel/user_namespace.c| 69 
> +-
>  3 files changed, 65 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 7d59af1f08f1..ba6a995178f9 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -43,6 +43,8 @@ struct user_namespace {
>   struct ctl_table_setset;
>   struct ctl_table_header *sysctls;
>  #endif
> + int max_user_namespaces;
> + atomic_t user_namespaces;
>  };
>  
>  extern struct user_namespace init_user_ns;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5c2c355aa97f..95d5498c463f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -323,6 +323,8 @@ void __init fork_init(void)
>   init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
>   init_task.signal->rlim[RLIMIT_SIGPENDING] =
>   init_task.signal->rlim[RLIMIT_NPROC];
> +
> + init_user_ns.max_user_namespaces = max_threads;
>  }
>  
>  int __weak arch_dup_task_struct(struct task_struct *dst,
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 10afbb55dfc2..0061550e3282 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -29,6 +29,7 @@ static DEFINE_MUTEX(userns_state_mutex);
>  static bool new_idmap_permitted(const struct file *file,
>   struct user_namespace *ns, int cap_setid,
>   struct uid_gid_map *map);
> +#define COUNT_MAX (INT_MAX - 1)
>  
>  #ifdef CONFIG_SYSCTL
>  static struct ctl_table_set *
> @@ -63,7 +64,18 @@ static struct ctl_table_root set_root = {
>   .permissions = set_permissions,
>  };
>  
> +static int zero = 0;
> +static int count_max = COUNT_MAX;
>  static struct ctl_table userns_table[] = {
> + {
> + .procname   = "max_user_namespaces",
> + .data   = _user_ns.max_user_namespaces,
> + .maxlen = sizeof(init_user_ns.max_user_namespaces),
> + .mode   = 0644,
> + .proc_handler   = proc_dointvec_minmax,
> + .extra1 = ,
> + .extra2 = _max,
> + },
>   { }
>  };
>  #endif /* CONFIG_SYSCTL */
> @@ -75,6 +87,8 @@ static bool setup_userns_sysctls(struct user_namespace *ns)
>   setup_sysctl_set(>set, _root, set_is_seen);
>   tbl = kmemdup(userns_table, sizeof(userns_table), GFP_KERNEL);
>   if (tbl) {
> + tbl[0].data = >max_user_namespaces;
> +
>   ns->sysctls = __register_sysctl_table(>set, "userns", tbl);
>   }
>   if (!ns->sysctls) {
> @@ -113,6 +127,34 @@ static void set_cred_user_ns(struct cred *cred, struct 
> user_namespace *user_ns)
>   cred->user_ns = user_ns;
>  }
>  
> +static bool inc_user_namespaces(struct user_namespace *ns)
> +{
> + struct user_namespace *pos, *bad;
> + for (pos = ns; pos; pos = pos->parent) {
> + int max = READ_ONCE(pos->max_user_namespaces);
> + int sum = atomic_inc_return(>user_namespaces);
> + if (sum > max)
> + goto fail;
> + }
> + return true;
> +fail:
> + bad = pos;
> + atomic_dec(>user_namespaces);
> + for (pos = ns; pos != bad; pos = pos->parent)
> + atomic_dec(>user_namespaces);
> +
> + return false;
> +}
> +
> +static void dec_user_namespaces(struct user_namespace *ns)
> +{
> + struct user_namespace *pos;
> + for (pos = ns; pos; pos = pos->parent) {
> + int dec = atomic_dec_if_positive(>user_namespaces);
> + WARN_ON_ONCE(dec < 0);
> + }
> +}
> +
>  /*
>   * Create a new user namespace, deriving the creator from the user in the
>   * passed credentials, and replacing that user with the new root user for the
> @@ -128,8 +170,12 @@ int create_user_ns(struct cred *new)
>   kgid_t group = new->egid;
>   int ret;
>  
> + ret = -EUSERS;
>   if (parent_ns->level > 32)
> - return -EUSERS;
> + goto fail;
> +
> + if (!inc_user_namespaces(parent_ns))
> + goto fail;
>  
>   /*
>* Verify that we can not violate the policy of which files
> @@ -137,26 +183,27 @@ int create_user_ns(struct cred *new)
>* by verifing that the root directory is at the root of the
>* mount namespace which allows all files to be accessed.
>*/
> + ret = -EPERM;
>   if (current_chrooted())
> - return -EPERM;
> + goto fail_dec;
>  
>   /* The creator needs a mapping in the parent user 

Re: [PATCH 3/8] cgroup: implement cgroup_get_from_path() and expose cgroup_put()

2015-12-21 Thread Serge E. Hallyn
On Mon, Dec 07, 2015 at 05:38:50PM -0500, Tejun Heo wrote:
> Implement cgroup_get_from_path() using kernfs_walk_and_get() which
> obtains a default hierarchy cgroup from its path.  This will be used
> to allow cgroup path based matching from outside cgroup proper -
> e.g. networking and perf.

Hi Tejun,

I'm trying to figure out how to handle this in the cgroup ns patchset.
Is this going to be purely used internally?  From the user i see in
this patchset it looks like I should leave it be (and have @path always
be absolute)  Is that right?

thanks,
-serge

> v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).
> 
> Signed-off-by: Tejun Heo 
> ---
>  include/linux/cgroup.h |  7 +++
>  kernel/cgroup.c| 39 ++-
>  2 files changed, 41 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b5ee2c4..4c3ffab 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -81,6 +81,8 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup 
> *cgroup,
>  struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
>  struct cgroup_subsys 
> *ss);
>  
> +struct cgroup *cgroup_get_from_path(const char *path);
> +
>  int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
>  int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
>  
> @@ -351,6 +353,11 @@ static inline void css_put_many(struct 
> cgroup_subsys_state *css, unsigned int n)
>   percpu_ref_put_many(>refcnt, n);
>  }
>  
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> + css_put(>self);
> +}
> +
>  /**
>   * task_css_set_check - obtain a task's css_set with extra access conditions
>   * @task: the task to obtain css_set for
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 3190040..3db5e8f 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -434,11 +434,6 @@ static bool cgroup_tryget(struct cgroup *cgrp)
>   return css_tryget(>self);
>  }
>  
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> - css_put(>self);
> -}
> -
>  struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
>  {
>   struct cgroup *cgrp = of->kn->parent->priv;
> @@ -5753,6 +5748,40 @@ struct cgroup_subsys_state *css_from_id(int id, struct 
> cgroup_subsys *ss)
>   return id > 0 ? idr_find(>css_idr, id) : NULL;
>  }
>  
> +/**
> + * cgroup_get_from_path - lookup and get a cgroup from its default hierarchy 
> path
> + * @path: path on the default hierarchy
> + *
> + * Find the cgroup at @path on the default hierarchy, increment its
> + * reference count and return it.  Returns pointer to the found cgroup on
> + * success, ERR_PTR(-ENOENT) if @path doens't exist and ERR_PTR(-ENOTDIR)
> + * if @path points to a non-directory.
> + */
> +struct cgroup *cgroup_get_from_path(const char *path)
> +{
> + struct kernfs_node *kn;
> + struct cgroup *cgrp;
> +
> + mutex_lock(_mutex);
> +
> + kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
> + if (kn) {
> + if (kernfs_type(kn) == KERNFS_DIR) {
> + cgrp = kn->priv;
> + cgroup_get(cgrp);
> + } else {
> + cgrp = ERR_PTR(-ENOTDIR);
> + }
> + kernfs_put(kn);
> + } else {
> + cgrp = ERR_PTR(-ENOENT);
> + }
> +
> + mutex_unlock(_mutex);
> + return cgrp;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_get_from_path);
> +
>  #ifdef CONFIG_CGROUP_DEBUG
>  static struct cgroup_subsys_state *
>  debug_css_alloc(struct cgroup_subsys_state *parent_css)
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] net: Implement the per network namespace sysctl infrastructure

2007-11-30 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
 
  Hey Eric,
 
  the patches look nice.
 
  The hand-forcing of the passed-in net_ns into a copy of current-nsproxy
  does make it seem like nsproxy may not be the best choice of what to
  pass in.  Doesn't only net_sysctl_root-lookup() look at the argument?
 
 Yes.  Although I call it from __register_sysctl_paths.
 
  But I assume you don't want to be more general than sending in a
  nsproxy so as to dissuade abuse of this interface for needlessly complex
  sysctl interfaces?
 
 A bit of that.  I would love to pass in a task_struct so you can use
 anything from a task.  The trouble is I don't have any task_structs or
 nsproxys with the proper value at the point where I am first setting
 this up.  Further I have to have the full sysctl lookup working or I
 could not call sysctl_check.
 
  (Well I expect that'll become clear once the the patches using this
  come out.)
 
  Are you planning to use this infrastructure for the uts and ipc
  sysctls as well?
 
 Yes.  Where it comes in especially useful, is I can move /proc/sys
 to /proc/sys/tgid/task/pid/sys.  And get a particular processes
 view of sysctl.  
 
 We also get a little more reuse of common functions.
 
 Otherwise Pavel does have a point that using this for uts and ipc
 is not a savings lines of code wise.
 
 After having seen Pavel changes I am asking myself if there is a sane
 way to remove the ctl_name argument from the ctl_path.
 
 Anyway where I am with the nsproxy question was that I don't
 see anything easily better.  What I have works and gets the job
 done, and doesn't have any module unload races or holes where a sloppy
 programmer can mess up the sysctl tree.  We needed a solution.
 Trying any harder to find something better would take ages.  So
 I figured this implementation was good enough.

I agree.  So it's already in -mm but still

Acked-by: Serge Hallyn [EMAIL PROTECTED]

thanks,
-serge
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] net: Implement the per network namespace sysctl infrastructure

2007-11-30 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 
 The user interface is: register_net_sysctl_table and
 unregister_net_sysctl_table.  Very much like the current
 interface except there is a network namespace parameter.
 
 With this any sysctl registered with register_net_sysctl_table
 will only show up to tasks in the same network namespace.
 
 All other sysctls continue to be globally visible.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  include/net/net_namespace.h |9 +++
  net/sysctl_net.c|   57 
 +++
  2 files changed, 66 insertions(+), 0 deletions(-)
 
 diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
 index 4d0d634..235214c 100644
 --- a/include/net/net_namespace.h
 +++ b/include/net/net_namespace.h
 @@ -25,6 +25,8 @@ struct net {
   struct proc_dir_entry   *proc_net_stat;
   struct proc_dir_entry   *proc_net_root;
 
 + struct list_headsysctl_table_headers;
 +
   struct net_device   *loopback_dev;  /* The loopback */
 
   struct list_headdev_base_head;
 @@ -144,4 +146,11 @@ extern void unregister_pernet_subsys(struct 
 pernet_operations *);
  extern int register_pernet_device(struct pernet_operations *);
  extern void unregister_pernet_device(struct pernet_operations *);
 
 +struct ctl_path;
 +struct ctl_table;
 +struct ctl_table_header;
 +extern struct ctl_table_header *register_net_sysctl_table(struct net *net,
 + const struct ctl_path *path, struct ctl_table *table);
 +extern void unregister_net_sysctl_table(struct ctl_table_header *header);
 +
  #endif /* __NET_NET_NAMESPACE_H */
 diff --git a/net/sysctl_net.c b/net/sysctl_net.c
 index cd4eafb..c50c793 100644
 --- a/net/sysctl_net.c
 +++ b/net/sysctl_net.c
 @@ -14,6 +14,7 @@
 
  #include linux/mm.h
  #include linux/sysctl.h
 +#include linux/nsproxy.h
 
  #include net/sock.h
 
 @@ -54,3 +55,59 @@ struct ctl_table net_table[] = {
  #endif
   { 0 },
  };
 +
 +static struct list_head *
 +net_ctl_header_lookup(struct ctl_table_root *root, struct nsproxy 
 *namespaces)
 +{
 + return namespaces-net_ns-sysctl_table_headers;
 +}
 +
 +static struct ctl_table_root net_sysctl_root = {
 + .lookup = net_ctl_header_lookup,
 +};
 +
 +static int sysctl_net_init(struct net *net)
 +{
 + INIT_LIST_HEAD(net-sysctl_table_headers);
 + return 0;
 +}
 +
 +static void sysctl_net_exit(struct net *net)
 +{
 + WARN_ON(!list_empty(net-sysctl_table_headers));
 + return;
 +}
 +
 +static struct pernet_operations sysctl_pernet_ops = {
 + .init = sysctl_net_init,
 + .exit = sysctl_net_exit,
 +};
 +
 +static __init int sysctl_init(void)
 +{
 + int ret;
 + ret = register_pernet_subsys(sysctl_pernet_ops);
 + if (ret)
 + goto out;
 + register_sysctl_root(net_sysctl_root);
 +out:
 + return ret;
 +}
 +subsys_initcall(sysctl_init);
 +
 +struct ctl_table_header *register_net_sysctl_table(struct net *net,
 + const struct ctl_path *path, struct ctl_table *table)
 +{
 + struct nsproxy namespaces;
 + namespaces = *current-nsproxy;
 + namespaces.net_ns = net;
 + return __register_sysctl_paths(net_sysctl_root,
 + namespaces, path, table);

Hey Eric,

the patches look nice.

The hand-forcing of the passed-in net_ns into a copy of current-nsproxy
does make it seem like nsproxy may not be the best choice of what to
pass in.  Doesn't only net_sysctl_root-lookup() look at the argument?

But I assume you don't want to be more general than sending in a
nsproxy so as to dissuade abuse of this interface for needlessly complex
sysctl interfaces?

(Well I expect that'll become clear once the the patches using this
come out.)

Are you planning to use this infrastructure for the uts and ipc
sysctls as well?

thanks,
-serge

 +}
 +EXPORT_SYMBOL_GPL(register_net_sysctl_table);
 +
 +void unregister_net_sysctl_table(struct ctl_table_header *header)
 +{
 + return unregister_sysctl_table(header);
 +}
 +EXPORT_SYMBOL_GPL(unregister_net_sysctl_table);
 -- 
 1.5.3.rc6.17.g1911
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Update get_net_ns_by_pid

2007-09-28 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 
 In the -mm tree the rules for access an nsproxy have changed,
 and in get_net_ns_by_pid we access the nsproxy, so update
 it to follow the new rules.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Yup, looks right.

I assume Pavel's Acked-by would actually matter, but still

Acked-by: Serge Hallyn [EMAIL PROTECTED]

thanks,
-serge

 ---
 
 diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
 index 739fbad..1caba10 100644
 --- a/net/core/rtnetlink.c
 +++ b/net/core/rtnetlink.c
 @@ -746,10 +746,10 @@ static struct net *get_net_ns_by_pid(pid_t pid)
   rcu_read_lock();
   tsk = find_task_by_pid(pid);
   if (tsk) {
 - task_lock(tsk);
 - if (tsk-nsproxy)
 - net = get_net(tsk-nsproxy-net_ns);
 - task_unlock(tsk);
 + struct nsproxy *nsproxy;
 + nsproxy = task_nsproxy(tsk);
 + if (nsproxy)
 + net = get_net(nsproxy-net_ns);
   }
   rcu_read_unlock();
   return net;
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/containers
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16] net: netlink support for moving devices between network namespaces.

2007-09-10 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 
 The simplest thing to implement is moving network devices between
 namespaces.  However with the same attribute IFLA_NET_NS_PID we can
 easily implement creating devices in the destination network
 namespace as well.  However that is a little bit trickier so this
 patch sticks to what is simple and easy.
 
 A pid is used to identify a process that happens to be a member
 of the network namespace we want to move the network device to.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  include/linux/if_link.h |1 +
  net/core/rtnetlink.c|   35 +++
  2 files changed, 36 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/if_link.h b/include/linux/if_link.h
 index 422084d..84c3492 100644
 --- a/include/linux/if_link.h
 +++ b/include/linux/if_link.h
 @@ -78,6 +78,7 @@ enum
   IFLA_LINKMODE,
   IFLA_LINKINFO,
  #define IFLA_LINKINFO IFLA_LINKINFO
 + IFLA_NET_NS_PID,
   __IFLA_MAX
  };
 
 diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
 index 44f91bb..1b9c32d 100644
 --- a/net/core/rtnetlink.c
 +++ b/net/core/rtnetlink.c
 @@ -35,6 +35,7 @@
  #include linux/security.h
  #include linux/mutex.h
  #include linux/if_addr.h
 +#include linux/nsproxy.h
 
  #include asm/uaccess.h
  #include asm/system.h
 @@ -727,6 +728,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
   [IFLA_WEIGHT]   = { .type = NLA_U32 },
   [IFLA_OPERSTATE]= { .type = NLA_U8 },
   [IFLA_LINKMODE] = { .type = NLA_U8 },
 + [IFLA_NET_NS_PID]   = { .type = NLA_U32 },
  };
 
  static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
 @@ -734,12 +736,45 @@ static const struct nla_policy 
 ifla_info_policy[IFLA_INFO_MAX+1] = {
   [IFLA_INFO_DATA]= { .type = NLA_NESTED },
  };
 
 +static struct net *get_net_ns_by_pid(pid_t pid)
 +{
 + struct task_struct *tsk;
 + struct net *net;
 +
 + /* Lookup the network namespace */
 + net = ERR_PTR(-ESRCH);
 + rcu_read_lock();
 + tsk = find_task_by_pid(pid);
 + if (tsk) {
 + task_lock(tsk);
 + if (tsk-nsproxy)
 + net = get_net(tsk-nsproxy-net_ns);
 + task_unlock(tsk);

Thinking...  Ok, I'm not sure this is 100% safe in the target tree, but
the long-term correct way probably isn't yet implemented in the net-
tree.  Eventually you will want to:

net_ns = NULL;
rcu_read_lock();
tsk = find_task_by_pid();  /* or _pidns equiv? */
nsproxy = task_nsproxy(tsk);
if (nsproxy)
net_ns = get_net(nsproxy-net_ns);
rcu_read_unlock;

What you have here is probably unsafe if tsk is the last task pointing
to it's nsproxy and it does an unshare, bc unshare isn't protected by
task_lock, and you're not rcu_dereferencing tsk-nsproxy (which
task_nsproxy does).  At one point we floated a patch to reuse the same
nsproxy in that case which would prevent you having to worry about it,
but that isn't being done in -mm now so i doubt it's in -net.

 + }
 + rcu_read_unlock();
 + return net;
 +}
 +
  static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 struct nlattr **tb, char *ifname, int modified)
  {
   int send_addr_notify = 0;
   int err;
 
 + if (tb[IFLA_NET_NS_PID]) {
 + struct net *net;
 + net = get_net_ns_by_pid(nla_get_u32(tb[IFLA_NET_NS_PID]));
 + if (IS_ERR(net)) {
 + err = PTR_ERR(net);
 + goto errout;
 + }
 + err = dev_change_net_namespace(dev, net, ifname);
 + put_net(net);
 + if (err)
 + goto errout;
 + modified = 1;
 + }
 +
   if (tb[IFLA_MAP]) {
   struct rtnl_link_ifmap *u_map;
   struct ifmap k_map;
 -- 
 1.5.3.rc6.17.g1911
 
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/containers
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/16] net: netlink support for moving devices between network namespaces.

2007-09-10 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:
  
  +static struct net *get_net_ns_by_pid(pid_t pid)
  +{
  +  struct task_struct *tsk;
  +  struct net *net;
  +
  +  /* Lookup the network namespace */
  +  net = ERR_PTR(-ESRCH);
  +  rcu_read_lock();
  +  tsk = find_task_by_pid(pid);
  +  if (tsk) {
  +  task_lock(tsk);
  +  if (tsk-nsproxy)
  +  net = get_net(tsk-nsproxy-net_ns);
  +  task_unlock(tsk);
 
  Thinking...  Ok, I'm not sure this is 100% safe in the target tree, but
  the long-term correct way probably isn't yet implemented in the net-
  tree.  Eventually you will want to:
 
  net_ns = NULL;
  rcu_read_lock();
  tsk = find_task_by_pid();  /* or _pidns equiv? */
  nsproxy = task_nsproxy(tsk);
  if (nsproxy)
  net_ns = get_net(nsproxy-net_ns);
  rcu_read_unlock;
 
  What you have here is probably unsafe if tsk is the last task pointing
  to it's nsproxy and it does an unshare, bc unshare isn't protected by
  task_lock, and you're not rcu_dereferencing tsk-nsproxy (which
  task_nsproxy does).  At one point we floated a patch to reuse the same
  nsproxy in that case which would prevent you having to worry about it,
  but that isn't being done in -mm now so i doubt it's in -net.
 
 
 That change isn't merged upstream yet, so it isn't in David's
 net-2.6.24 tree.  Currently task-nsproxy is protected but
 task_lock(current). So the code is fine.
 
 I am aware that removing the task_lock(current) for the setting
 of current-nsproxy is currently in the works, and I have planned
 to revisit this later when all of these pieces come together.
 
 For now the code is fine.
 
 If need be we can drop this patch to remove the potential merge
 conflict.

No, no.  Like you say it's correct at the moment.  Just something we
need to watch out for when it does get merged with the newer changes.

 But I figured it was useful

Absolutely.

 for this part of the user space
 interface to be available for review.

Agreed.  And the rest of the patchset looks good to me.

Thanks.

-serge
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-25 Thread Serge E. Hallyn
Quoting David Miller ([EMAIL PROTECTED]):
 From: [EMAIL PROTECTED] (Eric W. Biederman)
 Date: Sat, 23 Jun 2007 11:19:34 -0600
 
  Further and fundamentally all a global achieves is removing the need
  for the noise patches where you pass the pointer into the various
  functions.  For long term maintenance it doesn't help anything.
 
 I don't accept that we have to add another function argument
 to a bunch of core routines just to support this crap,
 especially since you give no way to turn it off and get
 that function argument slot back.
 
 To be honest I think this form of virtualization is a complete
 waste of time, even the openvz approach.
 
 We're protecting the kernel from itself, and that's an endless
 uphill battle that you will never win.  Let's do this kind of

Hi David,

just to be clear this isn't so much about security.  Security can be
provided using selinux, just as with the userid namespace.  But like
with the userid namespace, this provides usability for the virtual
servers, plus some support for restarting checkpointed applications.

That doesn't attempt to justify the extra argument - if you don't
like it, you don't like it  :)  Just wanted to clarify.

thanks,
-serge

 stuff properly with a real minimal hypervisor, hopefully with
 appropriate hardware level support and good virtualized device
 interfaces, instead of this namespace stuff.
 
 At least the hypervisor approach you have some chance to fully
 harden in some verifyable and truly protected way, with
 namespaces it's just a pipe dream and everyone who works on
 these namespace approaches knows that very well.
 
 The only positive thing that came out of this work is the
 great auditing that the openvz folks have done and the bugs
 they have found, but it basically ends right there.
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/containers
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] L2 Network namespace infrastructure

2007-06-25 Thread Serge E. Hallyn
Quoting Jeff Garzik ([EMAIL PROTECTED]):
 Eric W. Biederman wrote:
 Jeff Garzik [EMAIL PROTECTED] writes:
 
 David Miller wrote:
 I don't accept that we have to add another function argument
 to a bunch of core routines just to support this crap,
 especially since you give no way to turn it off and get
 that function argument slot back.
 
 To be honest I think this form of virtualization is a complete
 waste of time, even the openvz approach.
 
 We're protecting the kernel from itself, and that's an endless
 uphill battle that you will never win.  Let's do this kind of
 stuff properly with a real minimal hypervisor, hopefully with
 appropriate hardware level support and good virtualized device
 interfaces, instead of this namespace stuff.
 Strongly seconded.  This containerized virtualization approach just 
 bloats up
 the kernel for something that is inherently fragile and IMO less secure --
 protecting the kernel from itself.
 
 Plenty of other virt approaches don't stir the code like this, while
 simultaneously providing fewer, more-clean entry points for the 
 virtualization
 to occur.
 
 Wrong.  I really don't want to get into a my virtualization approach is 
 better
 then yours.  But this is flat out wrong.
 
 99% of the changes I'm talking about introducing are just:
 - variable 
 + ptr-variable
 
 There are more pieces mostly with when we initialize those variables but
 that is the essence of the change.
 
 You completely dodged the main objection.  Which is OK if you are 
 selling something to marketing departments, but not OK
 
 Containers introduce chroot-jail-like features that give one a false 
 sense of security, while still requiring one to poke holes in the 
 illusion to get hardware-specific tasks accomplished.
 
 The capable/not-capable model (i.e. superuser / normal user) is _still_ 
 being secured locally, even after decades of work and whitepapers and 
 audits.
 
 You are drinking Deep Kool-Aid if you think adding containers to the 
 myriad kernel subsystems does anything besides increasing fragility, and 
 decreasing security.  You are securing in-kernel subsystems against 
 other in-kernel subsystems.

No we're not.  As the name 'network namespaces' implies, we are
introducing namespaces for network-related variables.  That's it.

We are not trying to protect in-kernel subsystems from each other.  In
fact we're not even trying to protect userspace process from each other.
Though that will in part come free when user processes can't access each
other's data because they are in different namespaces.  But using an
LSM like selinux or a custom one to tag and enforce isolation would
still be encouraged.

 superuser/user model made that difficult 
 enough... now containers add exponential audit complexity to that.  Who 
 is to say that a local root does not also pierce the container model?

At the moment it does.

 And as opposed to other virtualization approaches so far no one has been
 able to measure the overhead.  I suspect there will be a few more cache
 line misses somewhere but they haven't shown up yet.
 
 If the only use was strong isolation which Dave complains about I would
 concur that the namespace approach is inappropriate.  However there are
 a lot other uses.
 
 Sure there are uses.  There are uses to putting the X server into the 
 kernel, too.  At some point complexity and featuritis has to take a back 
 seat to basic sanity.

Generally true, yes.

-serge
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] network namespaces

2006-08-16 Thread Serge E. Hallyn
Quoting Andrey Savochkin ([EMAIL PROTECTED]):
 Hi All,
 
 I'd like to resurrect our discussion about network namespaces.
 In our previous discussions it appeared that we have rather polar concepts
 which seemed hard to reconcile.
 Now I have an idea how to look at all discussed concepts to enable everyone's
 usage scenario.
 
 1. The most straightforward concept is complete separation of namespaces,
covering device list, routing tables, netfilter tables, socket hashes, and
everything else.
 
On input path, each packet is tagged with namespace right from the
place where it appears from a device, and is processed by each layer
in the context of this namespace.
Non-root namespaces communicate with the outside world in two ways: by
owning hardware devices, or receiving packets forwarded them by their 
 parent
namespace via pass-through device.
 
This complete separation of namespaces is very useful for at least two
purposes:
 - allowing users to create and manage by their own various tunnels and
   VPNs, and
 - enabling easier and more straightforward live migration of groups of
   processes with their environment.

I conceptually prefer this approach, but I seem to recall there were
actual problems in using this for checkpoint/restart of lightweight
(application) containers.  Performance aside, are there any reasons why
this approach would be problematic for c/r?

I'm afraid Daniel may be on vacation, and don't know who else other than
Eric might have thoughts on this.

 2. People expressed concerns that complete separation of namespaces
may introduce an undesired overhead in certain usage scenarios.
The overhead comes from packets traversing input path, then output path,
then input path again in the destination namespace if root namespace
acts as a router.
 
So, we may introduce short-cuts, when input packet starts to be processes
in one namespace, but changes it at some upper layer.
The places where packet can change namespace are, for example:
routing, post-routing netfilter hook, or even lookup in socket hash.
 
The cleanest example among them is post-routing netfilter hook.
Tagging of input packets there means that the packets is checked against
root namespace's routing table, found to be local, and go directly to
the socket hash lookup in the destination namespace.
In this scheme the ability to change routing tables or netfilter rules on
a per-namespace basis is traded for lower overhead.
 
All other optimized schemes where input packets do not travel
input-output-input paths in general case may be viewed as short-cuts in
scheme (1).  The remaining question is which exactly short-cuts make most
sense, and how to make them consistent from the interface point of view.
 
 My current idea is to reach some agreement on the basic concept, review
 patches, and then move on to implementing feasible short-cuts.
 
 Opinions?
 
 Next in this thread are patches introducing namespaces to device list,
 IPv4 routing, and socket hashes, and a pass-through device.
 Patches are against 2.6.18-rc4-mm1.

Just to provide the extreme other end of implementation options, here is
the bsdjail based version I've been using for some testing while waiting
for network namespaces to show up in -mm  :)

(Not intended for *any* sort of inclusion consideration :)

Example usage:
ifconfig eth0:0 192.168.1.16
echo -n ip 192.168.1.16  /proc/$$/attr/exec
exec /bin/sh

-serge

From: Serge E. Hallyn [EMAIL PROTECTED](none)
Date: Wed, 26 Jul 2006 21:47:13 -0500
Subject: [PATCH 1/1] bsdjail: define bsdjail lsm

Define the actual bsdjail LSM.

Signed-off-by: Serge E. Hallyn [EMAIL PROTECTED]
---
 security/Kconfig   |   11 
 security/Makefile  |1 
 security/bsdjail.c | 1351 
 3 files changed, 1363 insertions(+), 0 deletions(-)

diff --git a/security/Kconfig b/security/Kconfig
index 67785df..fa30e40 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -105,6 +105,17 @@ config SECURITY_SECLVL
 
  If you are unsure how to answer this question, answer N.
 
+config SECURITY_BSDJAIL
+   tristate BSD Jail LSM
+   depends on SECURITY
+   select SECURITY_NETWORK
+   help
+ Provides BSD Jail compartmentalization functionality.
+ See Documentation/bsdjail.txt for more information and
+ usage instructions.
+
+ If you are unsure how to answer this question, answer N.
+
 source security/selinux/Kconfig
 
 endmenu
diff --git a/security/Makefile b/security/Makefile
index 8cbbf2f..050b588 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_SECURITY_SELINUX)+= selin
 obj-$(CONFIG_SECURITY_CAPABILITIES)+= commoncap.o capability.o
 obj-$(CONFIG_SECURITY_ROOTPLUG)+= commoncap.o root_plug.o
 obj-$(CONFIG_SECURITY_SECLVL

Re: strict isolation of net interfaces

2006-06-30 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 This whole debate on network devices show up in multiple network namespaces
 is just silly.  The only reason for wanting that appears to be better 
 management.

A damned good reason.  Clearly we want the parent namespace to be able
to control what the child can do.  So whatever interface a child gets,
the parent should be able to somehow address.  Simple iptables rules
controlling traffic between it's own netdevice and the one it hands it's
children seem a good option.

 We have deeper issues like can we do a reasonable implementation without a
 network device showing up in multiple namespaces.

Isn't that the same issue?

 If we can get layer 2 level isolation working without measurable overhead
 with one namespace per device it may be worth revisiting things.  Until
 then it is a side issue at best.

Ok, and in the meantime we can all use the network part of the bsdjail
lsm?  :)

-serge
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: strict isolation of net interfaces

2006-06-29 Thread Serge E. Hallyn
Quoting Cedric Le Goater ([EMAIL PROTECTED]):
 Sam Vilain wrote:
  jamal wrote:
  note: personally I'm absolutely not against virtualizing
  the device names so that each guest can have a separate
  name space for devices, but there should be a way to
  'see' _and_ 'identify' the interfaces from outside
  (i.e. host or spectator context)
 
  
  Makes sense for the host side to have naming convention tied
  to the guest. Example as a prefix: guest0-eth0. Would it not
  be interesting to have the host also manage these interfaces
  via standard tools like ip or ifconfig etc? i.e if i admin up
  guest0-eth0, then the user in guest0 will see its eth0 going
  up.
  
  That particular convention only works if you have network namespaces and
  UTS namespaces tightly bound.  We plan to have them separate - so for
  that to work, each network namespace could have an arbitrary prefix
  that determines what the interface name will look like from the outside
  when combined.  We'd have to be careful about length limits.
  
  And guest0-eth0 doesn't necessarily make sense; it's not really an
  ethernet interface, more like a tun or something.
  
  So, an equally good convention might be to use sequential prefixes on
  the host, like tun, dummy, or a new prefix - then a property of that
  is what the name of the interface is perceived to be to those who are in
  the corresponding network namespace.
  
  Then the pragmatic question becomes how to correlate what you see from
  `ip addr list' to guests.
 
 
 we could work on virtualizing the net interfaces in the host, map them to
 eth0 or something in the guest and let the guest handle upper network layers ?
 
 lo0 would just be exposed relying on skbuff tagging to discriminate traffic
 between guests.

This seems to me the preferable way.  We create a full virtual net
device for each new container, and fully virtualize the device
namespace.

 host  |  guest 0  |  guest 1  |  guest2
 --+---+---+--
   |   |   |   |
   |- l0  ---+- lo0 ... | lo0   | lo0
   |   |   |   |
   |- bar0   +- eth0|   |
   |   |   |   |
   |- foo0   +---+---+- eth0
   |   |   |   |
   `- foo0:1  ---+---+- eth0|
   |   |   |
 
 
 is that clear ? stupid ? reinventing the wheel ?

The last one in your diagram confuses me - why foo0:1?  I would
have thought it'd be

host  |  guest 0  |  guest 1  |  guest2
--+---+---+--
  |   |   |   |
  |- l0  ---+- lo0 ... | lo0   | lo0
  |   |   |   |
  |- eth0|   |   |
  |   |   |   |
  |- veth0  +- eth0|   |
  |   |   |   |
  |- veth1  +---+---+- eth0
  |   |   |   |
  |- veth2   ---+---+- eth0|

I think we should avoid using device aliases, as trying to do
something like giving eth0:1 to guest1 and eth0:2 to guest2
while hiding eth0:1 from guest2 requires some uglier code (as
I recall) than working with full devices.  In other words,
if a namespace can see eth0, and eth0:2 exists, it should always
see eth0:2.

So conceptually using a full virtual net device per container
certainly seems cleaner to me, and it seems like it should be
simpler by way of statistics gathering etc, but are there actually
any real gains?  Or is the support for multiple IPs per device
actually enough?

Herbert, is this basically how ngnet is supposed to work?

-serge
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network namespaces a path to mergable code.

2006-06-28 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
  I think we're reaching the limits of namespaces. It would be much easier
  with a container id in each kernel object we want to isolate.
 
 Nope.  Except for the fact that names are peculiar (sockets, network
 device names, IP address, routes...) the network stack splits quite cleanly.
 
 I did all of this in a proof of concept mode several months ago and
 the code is still sitting in my git tree on kernel.org.  I even got
 the generic stack reference counting fixed.
 
 Eric

Which branch?
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][patch 1/4] Network namespaces: cleanup of dev_base list use

2006-06-27 Thread Serge E. Hallyn
Quoting Kirill Korotaev ([EMAIL PROTECTED]):
 Cleanup of dev_base list use, with the aim to make device list 
 per-namespace.
 In almost every occasion, use of dev_base variable and dev-next pointer
 could be easily replaced by for_each_netdev loop.
 A few most complicated places were converted to using
 first_netdev()/next_netdev().
 
 As a proof of concept patch this is ok.
 As a real world patch this is much too big, which prevents review.
 Plus it takes a few actions that are more than replace just
 iterators through the device list.
 
 Mmm, actually it is a whole changeset and should go as a one patch. I 
 didn't
 find it to be big and my review took only 5-10mins..
 I also don't think that mailing each driver maintainer is a good idea.
 Only if we want to make some buzz :)
 
 
 Thanks for supporting my case.  You reviewed it and missed the obvious 
 typo.
 I do agree that a patchset doing it all should happen at once.
 This doesn't support anything. e.g. I caught quite a lot of bugs after 
 Ingo Molnar, but this doesn't make his code poor. People are people.

Exactly - people are people, and they will do a better review of a
well-constructed set of small patches than of one large patch.  Eyes
glaze over, it just happens.

-serge
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html