Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
I won't have any more time for this until I return from vacation at the end of the month but after a little bit of thought I think I have fixed all of the bugs (except arguably the return value). I have further tweaked these and made the limits per user. Because it occured to me that if the limits are global it is possible for one misbehaving user to DOS another which is undesirable in princible. I have put my updated code at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing Eric
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
On Tue, Jul 26, 2016 at 10:29 AM, Michael Kerrisk (man-pages)wrote: > On 26 July 2016 at 18:52, Kees Cook wrote: >> On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biederman >> wrote: >>> "Michael Kerrisk (man-pages)" writes: >>> Hello Eric, I realized I had a question after the last mail. On 07/21/2016 06:39 PM, Eric W. Biederman wrote: > > This patchset addresses two use cases: > - Implement a sane upper bound on the number of namespaces. > - Provide a way for sandboxes to limit the attack surface from > namespaces. Can you say more about the second point? What exactly is the problem that is being addressed, and how does the patch series address it? (It would be good to have those details in the revised commit message...) >>> >>> At some point it was reported that seccomp was not sufficient to disable >>> namespace creation. I need to go back and look at that claim to see >>> which set of circumstances that was referring to. Seccomp doesn't stack >>> so I can see why it is an issue. >> >> seccomp does stack. The trouble usually comes from a perception that >> seccomp overhead is not trivial, so setting a system-wide policy is a >> bit of a large hammer for such a limitiation. Also, at the time, >> seccomp could be bypasses with ptrace, but this (as of v4.8) is no >> longer true. > > Sounds like someone needs to send me a patch for the seccomp.2 man page? It's on my TODO list, no worries. :) I'm waiting for it to land in Linus's tree first. It's only been in -next so far. -Kees -- Kees Cook Chrome OS & Brillo Security
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
On 26 July 2016 at 18:52, Kees Cookwrote: > On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biederman > wrote: >> "Michael Kerrisk (man-pages)" writes: >> >>> Hello Eric, >>> >>> I realized I had a question after the last mail. >>> >>> On 07/21/2016 06:39 PM, Eric W. Biederman wrote: This patchset addresses two use cases: - Implement a sane upper bound on the number of namespaces. - Provide a way for sandboxes to limit the attack surface from namespaces. >>> >>> Can you say more about the second point? What exactly is the >>> problem that is being addressed, and how does the patch series >>> address it? (It would be good to have those details in the >>> revised commit message...) >> >> At some point it was reported that seccomp was not sufficient to disable >> namespace creation. I need to go back and look at that claim to see >> which set of circumstances that was referring to. Seccomp doesn't stack >> so I can see why it is an issue. > > seccomp does stack. The trouble usually comes from a perception that > seccomp overhead is not trivial, so setting a system-wide policy is a > bit of a large hammer for such a limitiation. Also, at the time, > seccomp could be bypasses with ptrace, but this (as of v4.8) is no > longer true. Sounds like someone needs to send me a patch for the seccomp.2 man page? Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
On Tue, Jul 26, 2016 at 8:06 AM, Eric W. Biedermanwrote: > "Michael Kerrisk (man-pages)" writes: > >> Hello Eric, >> >> I realized I had a question after the last mail. >> >> On 07/21/2016 06:39 PM, Eric W. Biederman wrote: >>> >>> This patchset addresses two use cases: >>> - Implement a sane upper bound on the number of namespaces. >>> - Provide a way for sandboxes to limit the attack surface from >>> namespaces. >> >> Can you say more about the second point? What exactly is the >> problem that is being addressed, and how does the patch series >> address it? (It would be good to have those details in the >> revised commit message...) > > At some point it was reported that seccomp was not sufficient to disable > namespace creation. I need to go back and look at that claim to see > which set of circumstances that was referring to. Seccomp doesn't stack > so I can see why it is an issue. seccomp does stack. The trouble usually comes from a perception that seccomp overhead is not trivial, so setting a system-wide policy is a bit of a large hammer for such a limitiation. Also, at the time, seccomp could be bypasses with ptrace, but this (as of v4.8) is no longer true. > The general problem is that namespaces by their nature (and especially > in combination with the user namespaces) allow unprivileged users to use > more of the kernel than a user would have access to without them. This > in turn allows malicious users more kernel calls they can use in attempt > to find an exploitable bug. > > So if you are building a sandbox/chroot jail/chromium tab or anything > like that and you know you won't be needing a kernel feature having an > easy way to disable the feature is useful for making the kernel > marginally more secure, as certain attack vectors are no longer > possible. -Kees -- Kees Cook Chrome OS & Brillo Security
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
"Michael Kerrisk (man-pages)"writes: > Hello Eric, > > On 07/21/2016 06:39 PM, Eric W. Biederman wrote: >> >> This patchset addresses two use cases: >> - Implement a sane upper bound on the number of namespaces. >> - Provide a way for sandboxes to limit the attack surface from >> namespaces. >> >> The maximum sane case I can imagine is if every process is a fat >> process, so I set the maximum number of namespaces to the maximum >> number of threads. >> >> I make these limits recursive and per user namespace so that a >> usernamespace root can reduce the limits further. If a user namespace >> root raises the limit the limit in the parent namespace will be honored. >> >> I have cut this implementation to the bare minimum needed to achieve >> these objectives. >> >> Does anyone know if there is a proper error code to return for resource >> limit exceeded? I am currently using -EUSERS or -ENFILE but both of >> those feel a little wrong. > > ENFILE certainly seems weird. I suppose my first question is: why two > different errors? EUSERS was there in the user namespace case for this condition (well nesting depth but same principle) so it made sense to start with. But too many users doesn't really make sense. ENFILE is actually slightly less insane. It actually is about hitting a resource limit, and seemed the most generic catchall we had. > Some alternatives you might want to consider: E2BIG, EOVERFLOW, > or (maybe) ERANGE. My problem with those is typically those are about the values in fields, not about creating too many things. So those really don't feel right. EACCESS or EPERM might be better but those are very very generic. I really want ETOOMANY or something like that. I am actually surprised given how common this pattern is, and the fact that rlimits have existed forever, that we don't have a resource limit exceeded error code. I suppose it might be worth looking at Posix to see if they have any suggestions. Perhaps it may even be time to add an appropriate error code to the list. Not the most important thing out of this patch series but something worth looking at. Eric
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
"Michael Kerrisk (man-pages)"writes: > Hello Eric, > > I realized I had a question after the last mail. > > On 07/21/2016 06:39 PM, Eric W. Biederman wrote: >> >> This patchset addresses two use cases: >> - Implement a sane upper bound on the number of namespaces. >> - Provide a way for sandboxes to limit the attack surface from >> namespaces. > > Can you say more about the second point? What exactly is the > problem that is being addressed, and how does the patch series > address it? (It would be good to have those details in the > revised commit message...) At some point it was reported that seccomp was not sufficient to disable namespace creation. I need to go back and look at that claim to see which set of circumstances that was referring to. Seccomp doesn't stack so I can see why it is an issue. The general problem is that namespaces by their nature (and especially in combination with the user namespaces) allow unprivileged users to use more of the kernel than a user would have access to without them. This in turn allows malicious users more kernel calls they can use in attempt to find an exploitable bug. So if you are building a sandbox/chroot jail/chromium tab or anything like that and you know you won't be needing a kernel feature having an easy way to disable the feature is useful for making the kernel marginally more secure, as certain attack vectors are no longer possible. Eric
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
Hello Eric, I realized I had a question after the last mail. On 07/21/2016 06:39 PM, Eric W. Biederman wrote: This patchset addresses two use cases: - Implement a sane upper bound on the number of namespaces. - Provide a way for sandboxes to limit the attack surface from namespaces. Can you say more about the second point? What exactly is the problem that is being addressed, and how does the patch series address it? (It would be good to have those details in the revised commit message...) Cheers, Michael
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
Hello Eric, On 07/21/2016 06:39 PM, Eric W. Biederman wrote: This patchset addresses two use cases: - Implement a sane upper bound on the number of namespaces. - Provide a way for sandboxes to limit the attack surface from namespaces. The maximum sane case I can imagine is if every process is a fat process, so I set the maximum number of namespaces to the maximum number of threads. I make these limits recursive and per user namespace so that a usernamespace root can reduce the limits further. If a user namespace root raises the limit the limit in the parent namespace will be honored. I have cut this implementation to the bare minimum needed to achieve these objectives. Does anyone know if there is a proper error code to return for resource limit exceeded? I am currently using -EUSERS or -ENFILE but both of those feel a little wrong. ENFILE certainly seems weird. I suppose my first question is: why two different errors? Some alternatives you might want to consider: E2BIG, EOVERFLOW, or (maybe) ERANGE. Cheers, Michael
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
Kees Cookwrites: > On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biederman > wrote: >> Colin Walters writes: >> >>> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: This patchset addresses two use cases: - Implement a sane upper bound on the number of namespaces. - Provide a way for sandboxes to limit the attack surface from namespaces. >>> >>> Perhaps this is obvious, but since you didn't quite explicitly state it; >>> do you see this as obsoleting the existing downstream patches >>> mentioned in: >>> https://lwn.net/Articles/673597/ >>> It seems conceptually similar to Kees' original approach, right? >> >> Similar yes, and I expect it fills the need. My primary difference is >> that I believe this approach makes sense from a perspective of assuming >> that user namespaces or other namespaces are not any buggier than any >> other piece of kernel code and that people will use them. >> >> I don't see these limits making sense from a perspective that user >> namespaces are flawed and distro kernels should not have enabled them in >> the first place. That was my perception right or wrong of Kees patches >> and the related patches that landed in Ubuntu and Debian. >> >> With Kees approach I could not see how to handle the case where some >> applications on the system wanted user namespaces and others don't. >> Which made it very nasty for future evolution and more deployment of >> user namespaces. Being per user namespace these limits can be used to >> sandbox applications without affecting the rest of the system. > > While it certainly works for my use-case (init ns > max_usernamespaces=0), I don't see how this helps the case of "let > user foobar open 1 userns, but everyone else is 0", which is likely > the middle ground between "just turn it off" and "everyone gets to > create usernamespaces". I'm personally not interested in that level of > granularity, but in earlier discussions it sounded like this was > something you wanted? So the case I really care about is when there is limited use, so people don't have to redesign their applications. In this case if you want to disable things in a sandbox like way you just create a user namespace and set the count to 0 in that user namespace. A whole system disable I tend to think is a stupid configuration for a new system. It gets into people negotiating for what they need, and I don't see that as sustainable. I prefer good usable defaults. I would have loved to have done something with per user limits so it could be disabled for a selection of users, but it turns out the kernel doesn't have appropriate data structures for to hold limits for users that have not logged in. And in practice I don't care the case where 1 user is allowed but not the others, I care about disallow this user/program that is in a sandbox. I also seem to recall people have problems with using seccomp to disable things. All of that said a per user policy is easily implemented in pam by setting the size count for a specific user to 0. I do think a limit to catch applications that go crazy is very sane, and that is primarily what is implemented here. Eric
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
On Fri, Jul 22, 2016 at 11:45 AM, Eric W. Biedermanwrote: > Colin Walters writes: > >> On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: >>> >>> This patchset addresses two use cases: >>> - Implement a sane upper bound on the number of namespaces. >>> - Provide a way for sandboxes to limit the attack surface from >>> namespaces. >> >> Perhaps this is obvious, but since you didn't quite explicitly state it; >> do you see this as obsoleting the existing downstream patches >> mentioned in: >> https://lwn.net/Articles/673597/ >> It seems conceptually similar to Kees' original approach, right? > > Similar yes, and I expect it fills the need. My primary difference is > that I believe this approach makes sense from a perspective of assuming > that user namespaces or other namespaces are not any buggier than any > other piece of kernel code and that people will use them. > > I don't see these limits making sense from a perspective that user > namespaces are flawed and distro kernels should not have enabled them in > the first place. That was my perception right or wrong of Kees patches > and the related patches that landed in Ubuntu and Debian. > > With Kees approach I could not see how to handle the case where some > applications on the system wanted user namespaces and others don't. > Which made it very nasty for future evolution and more deployment of > user namespaces. Being per user namespace these limits can be used to > sandbox applications without affecting the rest of the system. While it certainly works for my use-case (init ns max_usernamespaces=0), I don't see how this helps the case of "let user foobar open 1 userns, but everyone else is 0", which is likely the middle ground between "just turn it off" and "everyone gets to create usernamespaces". I'm personally not interested in that level of granularity, but in earlier discussions it sounded like this was something you wanted? -Kees -- Kees Cook Chrome OS & Brillo Security
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
Colin Walterswrites: > On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: >> >> This patchset addresses two use cases: >> - Implement a sane upper bound on the number of namespaces. >> - Provide a way for sandboxes to limit the attack surface from >> namespaces. > > Perhaps this is obvious, but since you didn't quite explicitly state it; > do you see this as obsoleting the existing downstream patches > mentioned in: > https://lwn.net/Articles/673597/ > It seems conceptually similar to Kees' original approach, right? Similar yes, and I expect it fills the need. My primary difference is that I believe this approach makes sense from a perspective of assuming that user namespaces or other namespaces are not any buggier than any other piece of kernel code and that people will use them. I don't see these limits making sense from a perspective that user namespaces are flawed and distro kernels should not have enabled them in the first place. That was my perception right or wrong of Kees patches and the related patches that landed in Ubuntu and Debian. With Kees approach I could not see how to handle the case where some applications on the system wanted user namespaces and others don't. Which made it very nasty for future evolution and more deployment of user namespaces. Being per user namespace these limits can be used to sandbox applications without affecting the rest of the system. > The high level makes sense to me...most interesting is > per-userns sysctls. I'll note most current container managers > mount /proc/sys read-only, and Docker specifically drops > CAP_SYS_RESOURCE by default, so they'd likely need to learn > how to undo that if one wanted to support recursive container usage. > We'd probably need to evaluate the safety of having /proc/sys > writable generally. (Also it's rather common to filter out CLONE_NEWUSER > via seccomp, but that's easy to undo) Just using a user namespace replaces most of those precautions. > But that's the flip side - if we're aiming primarily for an upstreamable > way to *limit* namespace usage, it seems sane to me. Yes. The primary target is to stop applications that have gone buggy and allocated a crazy number of namespaces. The secondary target is to allow sandboxes to disable creation of additional namespaces. Just set the limit to 0 and drop caps, or similarly set the limit to 1 and create another fresh set of nested namespaces. Eric
Re: [PATCH v2 00/10] userns: sysctl limits for namespaces
On Thu, Jul 21, 2016, at 12:39 PM, Eric W. Biederman wrote: > > This patchset addresses two use cases: > - Implement a sane upper bound on the number of namespaces. > - Provide a way for sandboxes to limit the attack surface from > namespaces. Perhaps this is obvious, but since you didn't quite explicitly state it; do you see this as obsoleting the existing downstream patches mentioned in: https://lwn.net/Articles/673597/ It seems conceptually similar to Kees' original approach, right? The high level makes sense to me...most interesting is per-userns sysctls. I'll note most current container managers mount /proc/sys read-only, and Docker specifically drops CAP_SYS_RESOURCE by default, so they'd likely need to learn how to undo that if one wanted to support recursive container usage. We'd probably need to evaluate the safety of having /proc/sys writable generally. (Also it's rather common to filter out CLONE_NEWUSER via seccomp, but that's easy to undo) But that's the flip side - if we're aiming primarily for an upstreamable way to *limit* namespace usage, it seems sane to me.