Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-08-08 Thread Eric W. Biederman

I won't have any more time for this until I return from vacation at the
end of the month but after a little bit of thought I think I have fixed
all of the bugs (except arguably the return value).

I have further tweaked these and made the limits per user.  Because it
occured to me that if the limits are global it is possible for one
misbehaving user to DOS another which is undesirable in princible.

I have put my updated code at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-testing

Eric


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Kees Cook
On Tue, Jul 26, 2016 at 10:29 AM, Michael Kerrisk (man-pages)
 wrote:
> On 26 July 2016 at 18:52, Kees Cook  wrote:
>> On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biederman
>>  wrote:
>>> "Michael Kerrisk (man-pages)"  writes:
>>>
 Hello Eric,

 I realized I had a question after the last mail.

 On 07/21/2016 06:39 PM, Eric W. Biederman wrote:
>
> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.

 Can you say more about the second point? What exactly is the
 problem that is being addressed, and how does the patch series
 address it? (It would be good to have those details in the
 revised commit message...)
>>>
>>> At some point it was reported that seccomp was not sufficient to disable
>>> namespace creation.  I need to go back and look at that claim to see
>>> which set of circumstances that was referring to.  Seccomp doesn't stack
>>> so I can see why it is an issue.
>>
>> seccomp does stack. The trouble usually comes from a perception that
>> seccomp overhead is not trivial, so setting a system-wide policy is a
>> bit of a large hammer for such a limitiation. Also, at the time,
>> seccomp could be bypasses with ptrace, but this (as of v4.8) is no
>> longer true.
>
> Sounds like someone needs to send me a patch for the seccomp.2 man page?

It's on my TODO list, no worries. :) I'm waiting for it to land in
Linus's tree first. It's only been in -next so far.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)
On 26 July 2016 at 18:52, Kees Cook  wrote:
> On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biederman
>  wrote:
>> "Michael Kerrisk (man-pages)"  writes:
>>
>>> Hello Eric,
>>>
>>> I realized I had a question after the last mail.
>>>
>>> On 07/21/2016 06:39 PM, Eric W. Biederman wrote:

 This patchset addresses two use cases:
 - Implement a sane upper bound on the number of namespaces.
 - Provide a way for sandboxes to limit the attack surface from
   namespaces.
>>>
>>> Can you say more about the second point? What exactly is the
>>> problem that is being addressed, and how does the patch series
>>> address it? (It would be good to have those details in the
>>> revised commit message...)
>>
>> At some point it was reported that seccomp was not sufficient to disable
>> namespace creation.  I need to go back and look at that claim to see
>> which set of circumstances that was referring to.  Seccomp doesn't stack
>> so I can see why it is an issue.
>
> seccomp does stack. The trouble usually comes from a perception that
> seccomp overhead is not trivial, so setting a system-wide policy is a
> bit of a large hammer for such a limitiation. Also, at the time,
> seccomp could be bypasses with ptrace, but this (as of v4.8) is no
> longer true.

Sounds like someone needs to send me a patch for the seccomp.2 man page?

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Kees Cook
On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biederman
 wrote:
> "Michael Kerrisk (man-pages)"  writes:
>
>> Hello Eric,
>>
>> I realized I had a question after the last mail.
>>
>> On 07/21/2016 06:39 PM, Eric W. Biederman wrote:
>>>
>>> This patchset addresses two use cases:
>>> - Implement a sane upper bound on the number of namespaces.
>>> - Provide a way for sandboxes to limit the attack surface from
>>>   namespaces.
>>
>> Can you say more about the second point? What exactly is the
>> problem that is being addressed, and how does the patch series
>> address it? (It would be good to have those details in the
>> revised commit message...)
>
> At some point it was reported that seccomp was not sufficient to disable
> namespace creation.  I need to go back and look at that claim to see
> which set of circumstances that was referring to.  Seccomp doesn't stack
> so I can see why it is an issue.

seccomp does stack. The trouble usually comes from a perception that
seccomp overhead is not trivial, so setting a system-wide policy is a
bit of a large hammer for such a limitiation. Also, at the time,
seccomp could be bypasses with ptrace, but this (as of v4.8) is no
longer true.

> The general problem is that namespaces by their nature (and especially
> in combination with the user namespaces) allow unprivileged users to use
> more of the kernel than a user would have access to without them.  This
> in turn allows malicious users more kernel calls they can use in attempt
> to find an exploitable bug.
>
> So if you are building a sandbox/chroot jail/chromium tab or anything
> like that and you know you won't be needing a kernel feature having an
> easy way to disable the feature is useful for making the kernel
> marginally more secure, as certain attack vectors are no longer
> possible.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Eric W. Biederman
"Michael Kerrisk (man-pages)"  writes:

> Hello Eric,
>
> On 07/21/2016 06:39 PM, Eric W. Biederman wrote:
>>
>> This patchset addresses two use cases:
>> - Implement a sane upper bound on the number of namespaces.
>> - Provide a way for sandboxes to limit the attack surface from
>>   namespaces.
>>
>> The maximum sane case I can imagine is if every process is a fat
>> process, so I set the maximum number of namespaces to the maximum
>> number of threads.
>>
>> I make these limits recursive and per user namespace so that a
>> usernamespace root can reduce the limits further.  If a user namespace
>> root raises the limit the limit in the parent namespace will be honored.
>>
>> I have cut this implementation to the bare minimum needed to achieve
>> these objectives.
>>
>> Does anyone know if there is a proper error code to return for resource
>> limit exceeded?  I am currently using -EUSERS or -ENFILE but both of
>> those feel a little wrong.
>
> ENFILE certainly seems weird. I suppose my first question is: why two
> different errors?

EUSERS was there in the user namespace case for this condition (well
nesting depth but same principle) so it made sense to start with.  But
too many users doesn't really make sense.

ENFILE is actually slightly less insane.  It actually is about hitting a
resource limit, and seemed the most generic catchall we had.

> Some alternatives you might want to consider: E2BIG, EOVERFLOW,
> or (maybe) ERANGE.

My problem with those is typically those are about the values in fields,
not about creating too many things.  So those really don't feel right.
EACCESS or EPERM might be better but those are very very generic.

I really want ETOOMANY or something like that.

I am actually surprised given how common this pattern is, and the fact that
rlimits have existed forever, that we don't have a resource limit
exceeded error code.  I suppose it might be worth looking at Posix to
see if they have any suggestions.  Perhaps it may even be time to add an
appropriate error code to the list.

Not the most important thing out of this patch series but something
worth looking at.

Eric


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Eric W. Biederman
"Michael Kerrisk (man-pages)"  writes:

> Hello Eric,
>
> I realized I had a question after the last mail.
>
> On 07/21/2016 06:39 PM, Eric W. Biederman wrote:
>>
>> This patchset addresses two use cases:
>> - Implement a sane upper bound on the number of namespaces.
>> - Provide a way for sandboxes to limit the attack surface from
>>   namespaces.
>
> Can you say more about the second point? What exactly is the
> problem that is being addressed, and how does the patch series
> address it? (It would be good to have those details in the
> revised commit message...)

At some point it was reported that seccomp was not sufficient to disable
namespace creation.  I need to go back and look at that claim to see
which set of circumstances that was referring to.  Seccomp doesn't stack
so I can see why it is an issue.

The general problem is that namespaces by their nature (and especially
in combination with the user namespaces) allow unprivileged users to use
more of the kernel than a user would have access to without them.  This
in turn allows malicious users more kernel calls they can use in attempt
to find an exploitable bug.

So if you are building a sandbox/chroot jail/chromium tab or anything
like that and you know you won't be needing a kernel feature having an
easy way to disable the feature is useful for making the kernel
marginally more secure, as certain attack vectors are no longer
possible.

Eric






Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)

Hello Eric,

I realized I had a question after the last mail.

On 07/21/2016 06:39 PM, Eric W. Biederman wrote:


This patchset addresses two use cases:
- Implement a sane upper bound on the number of namespaces.
- Provide a way for sandboxes to limit the attack surface from
  namespaces.


Can you say more about the second point? What exactly is the
problem that is being addressed, and how does the patch series
address it? (It would be good to have those details in the
revised commit message...)

Cheers,

Michael




Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-26 Thread Michael Kerrisk (man-pages)

Hello Eric,

On 07/21/2016 06:39 PM, Eric W. Biederman wrote:


This patchset addresses two use cases:
- Implement a sane upper bound on the number of namespaces.
- Provide a way for sandboxes to limit the attack surface from
  namespaces.

The maximum sane case I can imagine is if every process is a fat
process, so I set the maximum number of namespaces to the maximum
number of threads.

I make these limits recursive and per user namespace so that a
usernamespace root can reduce the limits further.  If a user namespace
root raises the limit the limit in the parent namespace will be honored.

I have cut this implementation to the bare minimum needed to achieve
these objectives.

Does anyone know if there is a proper error code to return for resource
limit exceeded?  I am currently using -EUSERS or -ENFILE but both of
those feel a little wrong.


ENFILE certainly seems weird. I suppose my first question is: why two
different errors?

Some alternatives you might want to consider: E2BIG, EOVERFLOW,
or (maybe) ERANGE.

Cheers,

Michael








Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Eric W. Biederman
Kees Cook  writes:

> On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biederman
>  wrote:
>> Colin Walters  writes:
>>
>>> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:

 This patchset addresses two use cases:
 - Implement a sane upper bound on the number of namespaces.
 - Provide a way for sandboxes to limit the attack surface from
   namespaces.
>>>
>>> Perhaps this is obvious, but since you didn't quite explicitly state it;
>>> do you see this as obsoleting the existing downstream patches
>>> mentioned in:
>>> https://lwn.net/Articles/673597/
>>> It seems conceptually similar to Kees' original approach, right?
>>
>> Similar yes, and I expect it fills the need.  My primary difference is
>> that I believe this approach makes sense from a perspective of assuming
>> that user namespaces or other namespaces are not any buggier than any
>> other piece of kernel code and that people will use them.
>>
>> I don't see these limits making sense from a perspective that user
>> namespaces are flawed and distro kernels should not have enabled them in
>> the first place.  That was my perception right or wrong of Kees patches
>> and the related patches that landed in Ubuntu and Debian.
>>
>> With Kees approach I could not see how to handle the case where some
>> applications on the system wanted user namespaces and others don't.
>> Which made it very nasty for future evolution and more deployment of
>> user namespaces.  Being per user namespace these limits can be used to
>> sandbox applications without affecting the rest of the system.
>
> While it certainly works for my use-case (init ns
> max_usernamespaces=0), I don't see how this helps the case of "let
> user foobar open 1 userns, but everyone else is 0", which is likely
> the middle ground between "just turn it off" and "everyone gets to
> create usernamespaces". I'm personally not interested in that level of
> granularity, but in earlier discussions it sounded like this was
> something you wanted?

So the case I really care about is when there is limited use, so people
don't have to redesign their applications.  In this case if you want to
disable things in a sandbox like way you just create a user namespace
and set the count to 0 in that user namespace.

A whole system disable I tend to think is a stupid configuration for a
new system.  It gets into people negotiating for what they need, and I
don't see that as sustainable.  I prefer good usable defaults.

I would have loved to have done something with per user limits so it
could be disabled for a selection of users, but it turns out the kernel
doesn't have appropriate data structures for to hold limits for users
that have not logged in.

And in practice I don't care the case where 1 user is allowed but not
the others, I care about disallow this user/program that is in a
sandbox.  I also seem to recall people have problems with using seccomp
to disable things.  All of that said a per user policy is easily
implemented in pam by setting the size count for a specific user to 0.

I do think a limit to catch applications that go crazy is very sane, and
that is primarily what is implemented here.

Eric


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Kees Cook
On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biederman
 wrote:
> Colin Walters  writes:
>
>> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:
>>>
>>> This patchset addresses two use cases:
>>> - Implement a sane upper bound on the number of namespaces.
>>> - Provide a way for sandboxes to limit the attack surface from
>>>   namespaces.
>>
>> Perhaps this is obvious, but since you didn't quite explicitly state it;
>> do you see this as obsoleting the existing downstream patches
>> mentioned in:
>> https://lwn.net/Articles/673597/
>> It seems conceptually similar to Kees' original approach, right?
>
> Similar yes, and I expect it fills the need.  My primary difference is
> that I believe this approach makes sense from a perspective of assuming
> that user namespaces or other namespaces are not any buggier than any
> other piece of kernel code and that people will use them.
>
> I don't see these limits making sense from a perspective that user
> namespaces are flawed and distro kernels should not have enabled them in
> the first place.  That was my perception right or wrong of Kees patches
> and the related patches that landed in Ubuntu and Debian.
>
> With Kees approach I could not see how to handle the case where some
> applications on the system wanted user namespaces and others don't.
> Which made it very nasty for future evolution and more deployment of
> user namespaces.  Being per user namespace these limits can be used to
> sandbox applications without affecting the rest of the system.

While it certainly works for my use-case (init ns
max_usernamespaces=0), I don't see how this helps the case of "let
user foobar open 1 userns, but everyone else is 0", which is likely
the middle ground between "just turn it off" and "everyone gets to
create usernamespaces". I'm personally not interested in that level of
granularity, but in earlier discussions it sounded like this was
something you wanted?

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Eric W. Biederman
Colin Walters  writes:

> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:
>> 
>> This patchset addresses two use cases:
>> - Implement a sane upper bound on the number of namespaces.
>> - Provide a way for sandboxes to limit the attack surface from
>>   namespaces.
>
> Perhaps this is obvious, but since you didn't quite explicitly state it;
> do you see this as obsoleting the existing downstream patches
> mentioned in:
> https://lwn.net/Articles/673597/
> It seems conceptually similar to Kees' original approach, right?

Similar yes, and I expect it fills the need.  My primary difference is
that I believe this approach makes sense from a perspective of assuming
that user namespaces or other namespaces are not any buggier than any
other piece of kernel code and that people will use them.

I don't see these limits making sense from a perspective that user
namespaces are flawed and distro kernels should not have enabled them in
the first place.  That was my perception right or wrong of Kees patches
and the related patches that landed in Ubuntu and Debian.

With Kees approach I could not see how to handle the case where some
applications on the system wanted user namespaces and others don't.
Which made it very nasty for future evolution and more deployment of
user namespaces.  Being per user namespace these limits can be used to
sandbox applications without affecting the rest of the system.

> The high level makes sense to me...most interesting is
> per-userns sysctls.  I'll note most current container managers
> mount /proc/sys read-only, and Docker specifically drops
> CAP_SYS_RESOURCE by default, so they'd likely need to learn
> how to undo that if one wanted to support recursive container usage.
> We'd probably need to evaluate the safety of having /proc/sys
> writable generally.  (Also it's rather common to filter out CLONE_NEWUSER
> via seccomp, but that's easy to undo)

Just using a user namespace replaces most of those precautions.

> But that's the flip side - if we're aiming primarily for an upstreamable
> way to *limit* namespace usage, it seems sane to me.

Yes.  The primary target is to stop applications that have gone buggy
and allocated a crazy number of namespaces.  The secondary target
is to allow sandboxes to disable creation of additional namespaces.
Just set the limit to 0 and drop caps, or similarly set the limit
to 1 and create another fresh set of nested namespaces.

Eric





Re: [PATCH v2 00/10] userns: sysctl limits for namespaces

2016-07-22 Thread Colin Walters
On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote:
> 
> This patchset addresses two use cases:
> - Implement a sane upper bound on the number of namespaces.
> - Provide a way for sandboxes to limit the attack surface from
>   namespaces.

Perhaps this is obvious, but since you didn't quite explicitly state it;
do you see this as obsoleting the existing downstream patches
mentioned in:
https://lwn.net/Articles/673597/
It seems conceptually similar to Kees' original approach, right?

The high level makes sense to me...most interesting is
per-userns sysctls.  I'll note most current container managers
mount /proc/sys read-only, and Docker specifically drops
CAP_SYS_RESOURCE by default, so they'd likely need to learn
how to undo that if one wanted to support recursive container usage.
We'd probably need to evaluate the safety of having /proc/sys
writable generally.  (Also it's rather common to filter out CLONE_NEWUSER
via seccomp, but that's easy to undo)

But that's the flip side - if we're aiming primarily for an upstreamable
way to *limit* namespace usage, it seems sane to me.