Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
> (a) minimal: just use our existing default stack (and stack _only_) > limit value for suid binaries that actually get extra permissions: { > _STK_LIM, RLIM_INFINITY }. Even that is dangerous because a setuid binary can be transitioning between two users (none privileged) yet be subject to an rlimit attack. There's even less reason to believe that non root setuid binaries are properly hardened than obvious targets. CPU limit attacks in particular can be used to do some quite clever things. Also consider a binary that is gaining some minor right (eg network rights) being targetted because giving it extra permissions allows the attacker to gain access to infinite resources when that clearly isn't the intent. > (c) perhaps encourage people to annotate their suid binaries with > initial resource requirements (and for stack, I mean the existing > GNU_STACK ELF annotation in particular). Making this for setuid binaries only makes no sense. If a user can annotate required resources and the execve() fails if those resources are over the rlimit then that is a useful feature full stop, and there's no reason to even make it setuid dependent. Alan
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Mon, 2017-07-10 at 20:16 +0200, Michal Hocko wrote: > OK, I misread the code. 32b applications on 64b systems do top down > by > default and only if they override this by ADDR_COMPAT_LAYOUT > personality. For some reason I thought that 32b userspace goes a > different path and makes sure that they are always doing bottom up. > > Anyway even if somebody really needs to grow stack really large we > have > the personality to give them the legacy layout. I think what will happen when rlimit_stack is RLIMIT_INFINITY is that mmap_base will end up placing mm->mmap_base at 512MB (task_size / 6 * 5 below the top of address space) for 32 bit kernels, and we eventually fall back to a bottom-up search if the space below mmap_base is exhausted (if it ever is).
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Mon 10-07-17 18:27:51, Michal Hocko wrote: > On Mon 10-07-17 09:12:11, Kees Cook wrote: [...] > > >> do_execveat_common() -> > > >> exec_binprm() -> > > >> search_binary_handler() -> > > >> fmt->load_binary (load_elf_binary()) -> > > >> setup_new_exec() -> > > >> arch_pick_mmap_layout() -> > > >> mmap_is_legacy() -> > > >> rlimit(RLIMIT_STACK) == RLIM_INFINITY > > > > > > FWIW this is gone in tip tree. See > > > lkml.kernel.org/r/20170614082218.12450-1-mho...@kernel.org > > > > Sounds good to me, but won't large-memory users in 32-bit get annoyed? > > Why would they? 32b do bottom up layouts by default. OK, I misread the code. 32b applications on 64b systems do top down by default and only if they override this by ADDR_COMPAT_LAYOUT personality. For some reason I thought that 32b userspace goes a different path and makes sure that they are always doing bottom up. Anyway even if somebody really needs to grow stack really large we have the personality to give them the legacy layout. -- Michal Hocko SUSE Labs
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Mon, Jul 10, 2017 at 09:18:09AM -0700, Linus Torvalds wrote: > On Mon, Jul 10, 2017 at 9:12 AM, Kees Cook wrote: > > > > Sounds good to me, but won't large-memory users in 32-bit get annoyed? > > We'll see. > > I suspect that all large-memory users have long since upgraded to > x86-64 (rule of thumb: if you are upgrading kernels today, you > probably upgraded hardware ten years ago), and that this may be a > non-issue today. I tend to agree. We've been using 32-bit machines with "a lot" (=2GB) of RAM and haproxy using something like 1.3GB in the past, and it started to become a bit complex due to ASLR puching large holes between each and every shared object, forcing us to stop setting strict overcommit limits for example. We've abandonned them after kernel 3.10, when the new models had been migrated to 64 bits a few years ago already and I think anyone doing anything serious with memory doesn't use 32-bit at all. Well I know of one exception :-) My netbook has 3 GB and is 32-bit, running on 4.9 : willy@eeepc:~$ uname -a Linux eeepc 4.9.36-eeepc #1 SMP Mon Jul 10 07:33:29 CEST 2017 i686 Intel(R) Atom(TM) CPU N2800 @ 1.86GHz GenuineIntel GNU/Linux willy@eeepc:~$ free total used free sharedbuffers cached Mem: 3097840 6498162448024 0 52476 507216 -/+ buffers/cache: 901243007716 Swap: 1025440 01025440 It only runs end-user stuff (firefox) so it cannot be considered anything serious. Willy
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Mon 10-07-17 09:12:11, Kees Cook wrote: > On Mon, Jul 10, 2017 at 1:44 AM, Michal Hocko wrote: > > On Thu 06-07-17 12:12:55, Kees Cook wrote: > >> On Thu, Jul 6, 2017 at 10:52 AM, Linus Torvalds > >> wrote: > >> > On Thu, Jul 6, 2017 at 10:29 AM, Kees Cook wrote: > >> >>> > >> >>> (a) minimal: just use our existing default stack (and stack _only_) > >> >>> limit value for suid binaries that actually get extra permissions: { > >> >>> _STK_LIM, RLIM_INFINITY }. > >> >> > >> >> This would look a lot like the existing patch; it'd just not copy the > >> >> init process rlimits. > >> > > >> > Can't we just do the final rlimit setting so late in execve that we > >> > don't need that whole "saved_rlimit" thing? > >> > >> The stack rlimit defines the mmap layout too: > >> > >> do_execveat_common() -> > >> exec_binprm() -> > >> search_binary_handler() -> > >> fmt->load_binary (load_elf_binary()) -> > >> setup_new_exec() -> > >> arch_pick_mmap_layout() -> > >> mmap_is_legacy() -> > >> rlimit(RLIMIT_STACK) == RLIM_INFINITY > > > > FWIW this is gone in tip tree. See > > lkml.kernel.org/r/20170614082218.12450-1-mho...@kernel.org > > Sounds good to me, but won't large-memory users in 32-bit get annoyed? Why would they? 32b do bottom up layouts by default. -- Michal Hocko SUSE Labs
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Mon, Jul 10, 2017 at 9:12 AM, Kees Cook wrote: > > Sounds good to me, but won't large-memory users in 32-bit get annoyed? We'll see. I suspect that all large-memory users have long since upgraded to x86-64 (rule of thumb: if you are upgrading kernels today, you probably upgraded hardware ten years ago), and that this may be a non-issue today. But only time will tell. I certainly prefer "keep it simple" over theoretical concerns. It's why I prefer that unconditional stack limit too - we may have to make it conditional on suid'ness or something like the ELF PT_GNU_STACK setting, but before over-designing things, let's see if anybody even cares. Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Mon, Jul 10, 2017 at 1:44 AM, Michal Hocko wrote: > On Thu 06-07-17 12:12:55, Kees Cook wrote: >> On Thu, Jul 6, 2017 at 10:52 AM, Linus Torvalds >> wrote: >> > On Thu, Jul 6, 2017 at 10:29 AM, Kees Cook wrote: >> >>> >> >>> (a) minimal: just use our existing default stack (and stack _only_) >> >>> limit value for suid binaries that actually get extra permissions: { >> >>> _STK_LIM, RLIM_INFINITY }. >> >> >> >> This would look a lot like the existing patch; it'd just not copy the >> >> init process rlimits. >> > >> > Can't we just do the final rlimit setting so late in execve that we >> > don't need that whole "saved_rlimit" thing? >> >> The stack rlimit defines the mmap layout too: >> >> do_execveat_common() -> >> exec_binprm() -> >> search_binary_handler() -> >> fmt->load_binary (load_elf_binary()) -> >> setup_new_exec() -> >> arch_pick_mmap_layout() -> >> mmap_is_legacy() -> >> rlimit(RLIMIT_STACK) == RLIM_INFINITY > > FWIW this is gone in tip tree. See > lkml.kernel.org/r/20170614082218.12450-1-mho...@kernel.org Sounds good to me, but won't large-memory users in 32-bit get annoyed? A setuid/setgid exec will also get ADDR_COMPAT_LAYOUT cleared in bprm_fill_uid() (but not for fs-caps), so that'll keep things from getting layout-controlled. Thanks! (ADDR_COMPAT_LAYOUT isn't cleared for capability elevations, though, I think.) -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu 06-07-17 12:12:55, Kees Cook wrote: > On Thu, Jul 6, 2017 at 10:52 AM, Linus Torvalds > wrote: > > On Thu, Jul 6, 2017 at 10:29 AM, Kees Cook wrote: > >>> > >>> (a) minimal: just use our existing default stack (and stack _only_) > >>> limit value for suid binaries that actually get extra permissions: { > >>> _STK_LIM, RLIM_INFINITY }. > >> > >> This would look a lot like the existing patch; it'd just not copy the > >> init process rlimits. > > > > Can't we just do the final rlimit setting so late in execve that we > > don't need that whole "saved_rlimit" thing? > > The stack rlimit defines the mmap layout too: > > do_execveat_common() -> > exec_binprm() -> > search_binary_handler() -> > fmt->load_binary (load_elf_binary()) -> > setup_new_exec() -> > arch_pick_mmap_layout() -> > mmap_is_legacy() -> > rlimit(RLIMIT_STACK) == RLIM_INFINITY FWIW this is gone in tip tree. See lkml.kernel.org/r/20170614082218.12450-1-mho...@kernel.org -- Michal Hocko SUSE Labs
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Fri, Jul 7, 2017 at 9:06 AM, Linus Torvalds wrote: > On Thu, Jul 6, 2017 at 11:10 PM, Kees Cook wrote: >> On Thu, Jul 6, 2017 at 11:02 PM, Linus Torvalds >> wrote: >>> So 2+MB is still definitely something people can do (and probably *do* do). >> >> With the default 8MB stack, most people are already limited to 2MB >> here. I guess the question is, do people raise their stack rlimit to >> gain more arguments? Should I pick a different value for the args? > > So I would not be at all surprised if people just made the stack limit > higher when they hit the E2BIG issue in some script. > > So yes, I'd make the max args cutoff be higher than 2MB. > > I'd suggest we make the code do: > > (a) keep the existing rlimit/4 check (so *most* people will see the > exact same behavior) > > (b) add a static max arg check for something that is closer to 8MB > but leaves a somewhat reasonable stack size even if the stack size get > reset to 8MB > > I'd suggest that (b) case just be 6MB or something. Maybe make it > (_STK_LIM/4*3) or whatever, in case we ever end up changing that > value. Sounds good. I'll send a patch... -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Fri, Jul 7, 2017 at 9:22 AM, Linus Torvalds wrote: > On Thu, Jul 6, 2017 at 11:40 PM, Kees Cook wrote: > So I definitely like this approach, as long as we clarify that crazy > security_bprm_secureexec() model. That code really is insane. Yeah, it really seems like security_bprm_secureexec() should be checking the bprm cred, though there are comments in the LSMs about why this isn't done. I'll dig through the history there; it may be an artifact of the old cred installation ordering... It seems like an LSM would want to see "current creds" and "new creds" as distinct things for comparing the results, but I'll check it out. > In contrast, if the code just did: > > if (security_bprm_secureexec(bprm)) { > if (current->signal->rlim[RLIMIT_STACK].rlim_cur > _STK_LIM) > current->signal->rlim[RLIMIT_STACK].rlim_cur = _STK_LIM; > } > > and leave the hard limit alone entirely. At least that doesn't let > anybody escape the limits that the sysadmin has set. Sounds good to me. > > Hmm? Yes, this allows people to try to attack suid binaries with a > really small stack. But that's a pre-existing attack - do we have > worries about it? Agreed, I think that's out of scope for this. -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 11:40 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 10:49 PM, Kees Cook wrote: > > At Andy's suggestion I'm using security_bprm_secureexec() to test for > setuid-ness. However, this seems to expect the credentials to have > already been installed. That function actually takes a look at current creds, so yes, it currently works only after the creds have been installed. It works for you because it *also* checks if the current cred is not root, and then it looks at bprm->cap_effective too (indirectly - check the commoncap code), which has been set earlier. But it's crazy code. I literally don't know why it looks at current_creds(), when it's all about the bprm, which has its own creds - and then it would work any time. That code is confusing. Somebody should check it. That whole security_bprm_secureexec() hook seems mis-designed. > And yet ... the following patch still works correctly when I call it "early". So I definitely like this approach, as long as we clarify that crazy security_bprm_secureexec() model. That code really is insane. But I wonder: > + if (security_bprm_secureexec(bprm)) { > + struct rlimit default_stack = { _STK_LIM, RLIM_INFINITY }; > + > + current->signal->rlim[RLIMIT_STACK] = default_stack; Should we override the whole thing? And in particular, should we override the hard limit at all? If we have an existing lower stack limit, we might as well leave it be entirely. And if the soft stack limit is higher, why modify the hard limit? In particular, the scenario I'm thinking of is "system admin on a small machine has set everybodys stack limits to just 8M as a _hard_ limit". Now, Bob and Jill agree to help each other to escape that limit, so they make suid binaries for their own account, and let each other use them - boom, they now have _STK_LIM and RLIM_INFINITY stack limits. Not good. In contrast, if the code just did: if (security_bprm_secureexec(bprm)) { if (current->signal->rlim[RLIMIT_STACK].rlim_cur > _STK_LIM) current->signal->rlim[RLIMIT_STACK].rlim_cur = _STK_LIM; } and leave the hard limit alone entirely. At least that doesn't let anybody escape the limits that the sysadmin has set. Hmm? Yes, this allows people to try to attack suid binaries with a really small stack. But that's a pre-existing attack - do we have worries about it? Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 11:10 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 11:02 PM, Linus Torvalds > wrote: >> So 2+MB is still definitely something people can do (and probably *do* do). > > With the default 8MB stack, most people are already limited to 2MB > here. I guess the question is, do people raise their stack rlimit to > gain more arguments? Should I pick a different value for the args? So I would not be at all surprised if people just made the stack limit higher when they hit the E2BIG issue in some script. So yes, I'd make the max args cutoff be higher than 2MB. I'd suggest we make the code do: (a) keep the existing rlimit/4 check (so *most* people will see the exact same behavior) (b) add a static max arg check for something that is closer to 8MB but leaves a somewhat reasonable stack size even if the stack size get reset to 8MB I'd suggest that (b) case just be 6MB or something. Maybe make it (_STK_LIM/4*3) or whatever, in case we ever end up changing that value. Hmm? Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:45 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 10:36 PM, Andy Lutomirski wrote: >> Aren't there real use cases that use many megs of arguments? > > They'd be relatively new since the args were pretty limited before. > I'd be curious to see them. > >> We could probably get away with saying max(rlimit(RLIMIT_STACK), 2MB) >> as long as we make sure later on that we don't screw up if we've >> overallocated? > > min, not max, but yeah. Here's part of what I have for get_arg_page(): > > rlim = current->signal->rlim; > - if (size > READ_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) > + arg_stack = READ_ONCE(rlim[RLIMIT_STACK].rlim_cur); > + arg_stack = min_t(unsigned long, arg_stack, _STK_LIM) / 4; > + if (size > arg_stack) > goto fail; I really did mean max, the idea being that, if we're going to increase rlim_cur, it's a bit odd to fail the exec if it would have worked under the higher value. That being said, I see no real exploit vector here if just rlimit(RLIMIT_STACK) is used. (Can you just use rlimit()? The open-coding seems entirely useless.) I thought of another approach, though: change the rlimit macros so that a secureexec program always gets at least 8MB stack. Might be less regression-prone.
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:49 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 10:39 PM, Linus Torvalds >> And I think the credentials switch (which is the point of no return >> anyway) happens before we start mmap'ing the executable etc. We used >> to have some odd code there and do it in the completely wrong order >> (checking that the binary was executable for the *old* user, which >> makes no sense, iirc) > > Yeah, it all happens in setup_new_exec(). The first thing is layout > selection, then switching credentials. It could be made to take a hint > from GNU_STACK (which was parsed before setup_new_exec() is called), > check security_bprm_secureexec() and then make the rlimit changes, all > before the layout selection. At Andy's suggestion I'm using security_bprm_secureexec() to test for setuid-ness. However, this seems to expect the credentials to have already been installed. And yet ... the following patch still works correctly when I call it "early". I'm going to look again in the morning. diff --git a/fs/exec.c b/fs/exec.c index b60804216b59..a4d2433a44ec 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1334,9 +1334,20 @@ EXPORT_SYMBOL(would_dump); void setup_new_exec(struct linux_binprm * bprm) { + /* This is the point of no return */ + + /* +* If this is a setuid execution, reset the stack limit to +* a sane default to avoid bad behavior from the prior rlimits. +*/ + if (security_bprm_secureexec(bprm)) { + struct rlimit default_stack = { _STK_LIM, RLIM_INFINITY }; + + current->signal->rlim[RLIMIT_STACK] = default_stack; + } + arch_pick_mmap_layout(current->mm); - /* This is the point of no return */ current->sas_ss_sp = current->sas_ss_size = 0; if (uid_eq(current_euid(), current_uid()) && gid_eq(current_egid(), current_gid())) -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 11:02 PM, Linus Torvalds wrote: > So 2+MB is still definitely something people can do (and probably *do* do). With the default 8MB stack, most people are already limited to 2MB here. I guess the question is, do people raise their stack rlimit to gain more arguments? Should I pick a different value for the args? -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:45 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 10:36 PM, Andy Lutomirski wrote: >> >> Aren't there real use cases that use many megs of arguments? > > They'd be relatively new since the args were pretty limited before. > I'd be curious to see them. "megs" yes. "many megs" no. The traditional kernel limit was 32 pages (so 128kB on x86, explaining our MAX_ARG value). We moved to the much nider "two active VM's at the same time" model a fairly long time ago, though - it was back in v2.6.23 or so. So about 10 years ago. I would have expected lots of scripts to have been written since that just end up going *far* over the old 128kB limit, because it's really easy to do. Things like big directories and the shell expanding "*" can easily be a megabyte of arguments. I know I used to have scripts where I had to use "xargs" in the past, and with the > 128kB change I just stopped, because "a couple of megabytes" is enough for a lot of things where 128kB wasn't necessarily. Oh, one example is actually the kernel source tree. I don't do it any more (because "git grep" is much better), but I used to do things like grep something $(find . -name '*.[ch]') all the time. And that actually currently *just* overflows the 2MB argument size, but used to work (easily) ten years ago. Oh, how the kernel has grown.. Yes, yes, *portably* you should always have done find . -print0 -name '*.[ch]' | xargs -0 grep but be honest now: that first thing is what you actually write when you do some throw-away one-liner. So 2+MB is still definitely something people can do (and probably *do* do). Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:39 PM, Linus Torvalds wrote: > On Thu, Jul 6, 2017 at 10:15 PM, Kees Cook wrote: >> >> I always say this backwards. :P Default is top-down (allocate at high >> addresses and work down toward low). With unlimited stack, allocations >> start at low addresses and work up. Here's the results (shown with >> randomize_va_space sysctl set to 0): > > But this doesn't affect the stack layout itself. > > So we could do the stack copying without much caring, because that > happens first, right? > > So I think we can do all the envp/argv copying first, and then - as we > change the credentials, change the rlimit. And the string copies > wouldn't need to care much - although I guess they are also fine > checking against a possible *smaller* stack rlimit, which is actually > what we'd want. Yup, agreed. > And I think the credentials switch (which is the point of no return > anyway) happens before we start mmap'ing the executable etc. We used > to have some odd code there and do it in the completely wrong order > (checking that the binary was executable for the *old* user, which > makes no sense, iirc) Yeah, it all happens in setup_new_exec(). The first thing is layout selection, then switching credentials. It could be made to take a hint from GNU_STACK (which was parsed before setup_new_exec() is called), check security_bprm_secureexec() and then make the rlimit changes, all before the layout selection. > So I'm getting the sense that none of this should be a problem. > > But it's entirely possible that I missed something, and am just full > of shit. Our execve() path has traditionally been very hard to read. > It's actually gotten a bit better, but the whole "jump back and forth > between the generic fs/exec.c code and the binfmt code" is certainly > still there. Yeah, there might be something else lurking here, but so far, I'm satisfied this will work too. -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:36 PM, Andy Lutomirski wrote: > On Thu, Jul 6, 2017 at 10:15 PM, Kees Cook wrote: >> On Thu, Jul 6, 2017 at 10:10 PM, Kees Cook wrote: >>> On Thu, Jul 6, 2017 at 9:48 PM, Andy Lutomirski wrote: How about a much simpler solution: don't read rlimit at all in copy_strings(), let alone try to enforce it. Instead, just before the point of no return, check how much stack space is already used and, if it's more than an appropriate threshold (e.g. 1/4 of the rlimit), abort. Sure, this adds overhead if we're going to abort, but does that really matter? >>> >>> We should avoid using up tons of memory and then failing. Better to >>> cap it as we use it. Plumbing a sane value into this shouldn't be hard >>> at all. Just making this a hardcoded 2MB seems sane (1/4 of 8MB). > > Aren't there real use cases that use many megs of arguments? They'd be relatively new since the args were pretty limited before. I'd be curious to see them. > We could probably get away with saying max(rlimit(RLIMIT_STACK), 2MB) > as long as we make sure later on that we don't screw up if we've > overallocated? min, not max, but yeah. Here's part of what I have for get_arg_page(): rlim = current->signal->rlim; - if (size > READ_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) + arg_stack = READ_ONCE(rlim[RLIMIT_STACK].rlim_cur); + arg_stack = min_t(unsigned long, arg_stack, _STK_LIM) / 4; + if (size > arg_stack) goto fail; >>> IIUC, this is a big deal on 32-bit. Unlimited stack triggers top-down >>> mmap instead of bottom-up. I mean, I'd be delighted to get rid of >>> this, but I thought it was relied on by userspace. >> >> I always say this backwards. :P Default is top-down (allocate at high >> addresses and work down toward low). With unlimited stack, allocations >> start at low addresses and work up. Here's the results (shown with >> randomize_va_space sysctl set to 0): > > Uhh, crikey! Where's the code that does that? That was the call path I quoted earlier: > The stack rlimit defines the mmap layout too: > > do_execveat_common() -> > exec_binprm() -> > search_binary_handler() -> > fmt->load_binary (load_elf_binary()) -> > setup_new_exec() -> > arch_pick_mmap_layout() -> > mmap_is_legacy() -> > rlimit(RLIMIT_STACK) == RLIM_INFINITY i.e. arch_pick_mmap_layout(). -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:15 PM, Kees Cook wrote: > > I always say this backwards. :P Default is top-down (allocate at high > addresses and work down toward low). With unlimited stack, allocations > start at low addresses and work up. Here's the results (shown with > randomize_va_space sysctl set to 0): But this doesn't affect the stack layout itself. So we could do the stack copying without much caring, because that happens first, right? So I think we can do all the envp/argv copying first, and then - as we change the credentials, change the rlimit. And the string copies wouldn't need to care much - although I guess they are also fine checking against a possible *smaller* stack rlimit, which is actually what we'd want. And I think the credentials switch (which is the point of no return anyway) happens before we start mmap'ing the executable etc. We used to have some odd code there and do it in the completely wrong order (checking that the binary was executable for the *old* user, which makes no sense, iirc) So I'm getting the sense that none of this should be a problem. But it's entirely possible that I missed something, and am just full of shit. Our execve() path has traditionally been very hard to read. It's actually gotten a bit better, but the whole "jump back and forth between the generic fs/exec.c code and the binfmt code" is certainly still there. Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:15 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 10:10 PM, Kees Cook wrote: >> On Thu, Jul 6, 2017 at 9:48 PM, Andy Lutomirski wrote: >>> How about a much simpler solution: don't read rlimit at all in >>> copy_strings(), let alone try to enforce it. Instead, just before the >>> point of no return, check how much stack space is already used and, if >>> it's more than an appropriate threshold (e.g. 1/4 of the rlimit), >>> abort. Sure, this adds overhead if we're going to abort, but does >>> that really matter? >> >> We should avoid using up tons of memory and then failing. Better to >> cap it as we use it. Plumbing a sane value into this shouldn't be hard >> at all. Just making this a hardcoded 2MB seems sane (1/4 of 8MB). Aren't there real use cases that use many megs of arguments? We could probably get away with saying max(rlimit(RLIMIT_STACK), 2MB) as long as we make sure later on that we don't screw up if we've overallocated? >> >>> I don't see why using rlimit for layout control makes any sense >>> whatsoever. Is there some historical reason we need that? As far as >>> I can see (on insufficient inspection) is that the kernel is trying to >>> guarantee that, if we have so much arg crap that our remaining stack >>> is less than 128k, then we don't exceed our limit by a little bit. >> >> IIUC, this is a big deal on 32-bit. Unlimited stack triggers top-down >> mmap instead of bottom-up. I mean, I'd be delighted to get rid of >> this, but I thought it was relied on by userspace. > > I always say this backwards. :P Default is top-down (allocate at high > addresses and work down toward low). With unlimited stack, allocations > start at low addresses and work up. Here's the results (shown with > randomize_va_space sysctl set to 0): Uhh, crikey! Where's the code that does that?
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:10 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 9:48 PM, Andy Lutomirski wrote: >> How about a much simpler solution: don't read rlimit at all in >> copy_strings(), let alone try to enforce it. Instead, just before the >> point of no return, check how much stack space is already used and, if >> it's more than an appropriate threshold (e.g. 1/4 of the rlimit), >> abort. Sure, this adds overhead if we're going to abort, but does >> that really matter? > > We should avoid using up tons of memory and then failing. Better to > cap it as we use it. Plumbing a sane value into this shouldn't be hard > at all. Just making this a hardcoded 2MB seems sane (1/4 of 8MB). > >> I don't see why using rlimit for layout control makes any sense >> whatsoever. Is there some historical reason we need that? As far as >> I can see (on insufficient inspection) is that the kernel is trying to >> guarantee that, if we have so much arg crap that our remaining stack >> is less than 128k, then we don't exceed our limit by a little bit. > > IIUC, this is a big deal on 32-bit. Unlimited stack triggers top-down > mmap instead of bottom-up. I mean, I'd be delighted to get rid of > this, but I thought it was relied on by userspace. I always say this backwards. :P Default is top-down (allocate at high addresses and work down toward low). With unlimited stack, allocations start at low addresses and work up. Here's the results (shown with randomize_va_space sysctl set to 0): $ ulimit -s 8192 $ cat /proc/self/maps 08048000-0805 r-xp fc:01 843/bin/cat 0805-08051000 r--p 7000 fc:01 843/bin/cat 08051000-08052000 rw-p 8000 fc:01 843/bin/cat 08052000-08073000 rw-p 00:00 0 [heap] b7be7000-b7de7000 r--p fc:01 403307 /usr/lib/locale/locale-archive b7de7000-b7f9a000 r-xp fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so b7f9a000-b7f9b000 ---p 001b3000 fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so b7f9b000-b7f9d000 r--p 001b3000 fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so b7f9d000-b7f9e000 rw-p 001b5000 fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so b7f9e000-b7fa1000 rw-p 00:00 0 b7fb2000-b7fd7000 rw-p 00:00 0 b7fd7000-b7fd9000 r--p 00:00 0 [vvar] b7fd9000-b7fdb000 r-xp 00:00 0 [vdso] b7fdb000-b7ffd000 r-xp fc:01 276959 /lib/i386-linux-gnu/ld-2.24.so b7ffd000-b7ffe000 r--p 002d7000 fc:01 403307 /usr/lib/locale/locale-archive b7ffe000-b7fff000 r--p 00022000 fc:01 276959 /lib/i386-linux-gnu/ld-2.24.so b7fff000-b800 rw-p 00023000 fc:01 276959 /lib/i386-linux-gnu/ld-2.24.so bffdf000-c000 rw-p 00:00 0 [stack] $ ulimit -s unlimited $ cat /proc/self/maps 08048000-0805 r-xp fc:01 843/bin/cat 0805-08051000 r--p 7000 fc:01 843/bin/cat 08051000-08052000 rw-p 8000 fc:01 843/bin/cat 08052000-08073000 rw-p 00:00 0 [heap] 4000-40022000 r-xp fc:01 276959 /lib/i386-linux-gnu/ld-2.24.so 40022000-40023000 r--p 002d7000 fc:01 403307 /usr/lib/locale/locale-archive 40023000-40024000 r--p 00022000 fc:01 276959 /lib/i386-linux-gnu/ld-2.24.so 40024000-40025000 rw-p 00023000 fc:01 276959 /lib/i386-linux-gnu/ld-2.24.so 40025000-40027000 r--p 00:00 0 [vvar] 40027000-40029000 r-xp 00:00 0 [vdso] 40029000-4004e000 rw-p 00:00 0 4005f000-40212000 r-xp fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so 40212000-40213000 ---p 001b3000 fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so 40213000-40215000 r--p 001b3000 fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so 40215000-40216000 rw-p 001b5000 fc:01 276980 /lib/i386-linux-gnu/libc-2.24.so 40216000-40219000 rw-p 00:00 0 40219000-40419000 r--p fc:01 403307 /usr/lib/locale/locale-archive bffdf000-c000 rw-p 00:00 0 [stack] -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 9:48 PM, Andy Lutomirski wrote: > How about a much simpler solution: don't read rlimit at all in > copy_strings(), let alone try to enforce it. Instead, just before the > point of no return, check how much stack space is already used and, if > it's more than an appropriate threshold (e.g. 1/4 of the rlimit), > abort. Sure, this adds overhead if we're going to abort, but does > that really matter? We should avoid using up tons of memory and then failing. Better to cap it as we use it. Plumbing a sane value into this shouldn't be hard at all. Just making this a hardcoded 2MB seems sane (1/4 of 8MB). > I don't see why using rlimit for layout control makes any sense > whatsoever. Is there some historical reason we need that? As far as > I can see (on insufficient inspection) is that the kernel is trying to > guarantee that, if we have so much arg crap that our remaining stack > is less than 128k, then we don't exceed our limit by a little bit. IIUC, this is a big deal on 32-bit. Unlimited stack triggers top-down mmap instead of bottom-up. I mean, I'd be delighted to get rid of this, but I thought it was relied on by userspace. -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 9:48 PM, Andy Lutomirski wrote: > > How about a much simpler solution: don't read rlimit at all in > copy_strings(), let alone try to enforce it. People have historically relied on E2BIG and then splitting things into multiple chunks (ie do the whole 'xargs' thing). But I agree that *if* we abort cleanly with E2BIG, then it's fine if we copy a bit too much first. It's just that we need to check early enough that an E2BIG error is possible (ie not after the "point of no return"). And yes, I think we can get rid of the layout differences based on rlimit. After all, if I read that right the thing currently clamps that "gap" thing to 128MB anyway (and that's on the *low* side). Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 12:12 PM, Kees Cook wrote: > On Thu, Jul 6, 2017 at 10:52 AM, Linus Torvalds > wrote: >> On Thu, Jul 6, 2017 at 10:29 AM, Kees Cook wrote: (a) minimal: just use our existing default stack (and stack _only_) limit value for suid binaries that actually get extra permissions: { _STK_LIM, RLIM_INFINITY }. >>> >>> This would look a lot like the existing patch; it'd just not copy the >>> init process rlimits. >> >> Can't we just do the final rlimit setting so late in execve that we >> don't need that whole "saved_rlimit" thing? > > The stack rlimit defines the mmap layout too: > > do_execveat_common() -> > exec_binprm() -> > search_binary_handler() -> > fmt->load_binary (load_elf_binary()) -> > setup_new_exec() -> > arch_pick_mmap_layout() -> > mmap_is_legacy() -> > rlimit(RLIMIT_STACK) == RLIM_INFINITY > > exec_binprm() happens after the other stack setup (copy_strings()), so > if we wanted to avoid saved_rlimit, we'd have to replumb how > arch_pick_mmap_layout() works and how copy_strings() performs its > calculations (neither looks too terrible). How about a much simpler solution: don't read rlimit at all in copy_strings(), let alone try to enforce it. Instead, just before the point of no return, check how much stack space is already used and, if it's more than an appropriate threshold (e.g. 1/4 of the rlimit), abort. Sure, this adds overhead if we're going to abort, but does that really matter? I don't see why using rlimit for layout control makes any sense whatsoever. Is there some historical reason we need that? As far as I can see (on insufficient inspection) is that the kernel is trying to guarantee that, if we have so much arg crap that our remaining stack is less than 128k, then we don't exceed our limit by a little bit.
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:52 AM, Linus Torvalds wrote: > On Thu, Jul 6, 2017 at 10:29 AM, Kees Cook wrote: >>> >>> (a) minimal: just use our existing default stack (and stack _only_) >>> limit value for suid binaries that actually get extra permissions: { >>> _STK_LIM, RLIM_INFINITY }. >> >> This would look a lot like the existing patch; it'd just not copy the >> init process rlimits. > > Can't we just do the final rlimit setting so late in execve that we > don't need that whole "saved_rlimit" thing? The stack rlimit defines the mmap layout too: do_execveat_common() -> exec_binprm() -> search_binary_handler() -> fmt->load_binary (load_elf_binary()) -> setup_new_exec() -> arch_pick_mmap_layout() -> mmap_is_legacy() -> rlimit(RLIMIT_STACK) == RLIM_INFINITY exec_binprm() happens after the other stack setup (copy_strings()), so if we wanted to avoid saved_rlimit, we'd have to replumb how arch_pick_mmap_layout() works and how copy_strings() performs its calculations (neither looks too terrible). > If the issue is the "people can use argv/envp to already fill the > stack", then I'd actually be happier with just limiting that. > > We already claim that our ARG_MAX is just 128kB (old legacy). And I > was really happy when we changed our execve() to not have that nasty > array of pages, and we could expand on the array sizes. But we could > *easily* just say "limit execve arrays to 8MB", because while our code > can handle more, you do have latency issues and just memory use issues > too. That would address the argv/envp calculation but not the layout control. > So right now we already limit the stack size artificially to 1/4 the > stack rlimit (see get_arg_page()), and we could easily just further > cap it at 8M total - right now people obviously actually run in > practice with much less (ie for me that argument size is capped at a > quarter of that 8MB default rlimit). > > I have heard of people who want a big stack due to crazy recursion or > due to just doing otherwise insane things. But needing more than 8MB > of arg/envp? Not happening. > > So I think we could easily do that stack rlimit thing at the very last > minute, and not have to worry about restoring anything. We should double check there isn't more than just argv and layout, but I think both cases could be passed down a value from above instead of examining current's rlimits. -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 10:29 AM, Kees Cook wrote: >> >> (a) minimal: just use our existing default stack (and stack _only_) >> limit value for suid binaries that actually get extra permissions: { >> _STK_LIM, RLIM_INFINITY }. > > This would look a lot like the existing patch; it'd just not copy the > init process rlimits. Can't we just do the final rlimit setting so late in execve that we don't need that whole "saved_rlimit" thing? If the issue is the "people can use argv/envp to already fill the stack", then I'd actually be happier with just limiting that. We already claim that our ARG_MAX is just 128kB (old legacy). And I was really happy when we changed our execve() to not have that nasty array of pages, and we could expand on the array sizes. But we could *easily* just say "limit execve arrays to 8MB", because while our code can handle more, you do have latency issues and just memory use issues too. So right now we already limit the stack size artificially to 1/4 the stack rlimit (see get_arg_page()), and we could easily just further cap it at 8M total - right now people obviously actually run in practice with much less (ie for me that argument size is capped at a quarter of that 8MB default rlimit). I have heard of people who want a big stack due to crazy recursion or due to just doing otherwise insane things. But needing more than 8MB of arg/envp? Not happening. So I think we could easily do that stack rlimit thing at the very last minute, and not have to worry about restoring anything. Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 9:34 AM, Linus Torvalds wrote: > On Wed, Jul 5, 2017 at 9:32 PM, Kees Cook wrote: >> In an attempt to provide sensible rlimit defaults for setuid execs, this >> inherits the namespace's init rlimits: > > Yeah, so I have to admit to hating this patch. Yup! Totally fine. I just wanted to demonstrate the idea with some actual code. > I'd much rather see something like > > (a) minimal: just use our existing default stack (and stack _only_) > limit value for suid binaries that actually get extra permissions: { > _STK_LIM, RLIM_INFINITY }. This would look a lot like the existing patch; it'd just not copy the init process rlimits. > (c) perhaps encourage people to annotate their suid binaries with > initial resource requirements (and for stack, I mean the existing > GNU_STACK ELF annotation in particular). > > For an example of (a), that existing _STK_LIM define is what the > kernel defaults to, and it's a 8MB stack. And looking at my Fedora > install, I see that the default user rlimit is 8MB for the stack. > [...] > Is that just coincidence, or is that just a sign of "nobody ever even > modifies the default value"? So (a) feels like "nobody really cares, > and 8MB is fine, and nobody even bothers changing it - just do the > minimal thing". I would be terrified to see a setuid binary that needed >8MB stack. :P > And (c) would be the sane option, and what we already do for things > like GNU_STACK to enable/disable executable stacks. It really feels > like allowing the GNU_STACK segment to contain stack rlimit override > information would be the perfect tool for binaries to say "Yeah, I > need more stack than _STK_LIM". > > So I see many different approaches (that could be combined: I like > combining (a) and (c), for example), and absolutely none of them > involve the random "take some values from init". > > And yes, a large part of this may be that I no longer feel like I can > trust "init" to do the sane thing. You all presumably know why. Heh. Yeah, I think a+c could make sense. "if not GNU_STACK program header or 0 MemSiz GNU_STACK program header, use _STK_LIM, otherwise use GNU_STACK MemSiz". I'll send a patch... -Kees -- Kees Cook Pixel Security
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 9:34 AM, Linus Torvalds wrote: > > (a) minimal: just use our existing default stack (and stack _only_) > limit value for suid binaries that actually get extra permissions: { > _STK_LIM, RLIM_INFINITY }. > > (c) perhaps encourage people to annotate their suid binaries with > initial resource requirements (and for stack, I mean the existing > GNU_STACK ELF annotation in particular). The more I look at that combination, the more I like it. We already parse PT_GNU_STACK, and we already take the permission values from there. So taking the "p_memsz" field from there to mean the stack limit (if it is non-zero) would be not only simple, but logical and clean. And so if something ever wants more stack than 8MB, it would be trivial to just use elf tools to mark that segment. Ok, I admit that I don't know if there are any sane elf tools that do this easily, but it still sounds conceptually like the RightThing(tm) to do. Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Wed, Jul 5, 2017 at 9:32 PM, Kees Cook wrote: > In an attempt to provide sensible rlimit defaults for setuid execs, this > inherits the namespace's init rlimits: Yeah, so I have to admit to hating this patch. As already mentioned by others, it's not only not clear that we want to do this on every setuid exec, it's also not clear that init is the right source of limits, or even which limits we'd want to copy. I can easily see init doing a rlimit for its own use, and then when it goes through the fork/exec process does it set up some other rlimit for what it is going to run. You'd presumably want that for any non-system thing, so it's actually fairly natural to do it for system things too, so it's not at all obvious that "init" itself would run with some generic "system limits". So to me this feels like a bad hack that was brought on by this particular attack. I'd much rather see something like (a) minimal: just use our existing default stack (and stack _only_) limit value for suid binaries that actually get extra permissions: { _STK_LIM, RLIM_INFINITY }. or (b) fancier: per-namespace defaults that can be explicitly set by something, and enabled individually. or (c) perhaps encourage people to annotate their suid binaries with initial resource requirements (and for stack, I mean the existing GNU_STACK ELF annotation in particular). For an example of (a), that existing _STK_LIM define is what the kernel defaults to, and it's a 8MB stack. And looking at my Fedora install, I see that the default user rlimit is 8MB for the stack. Is that just coincidence, or is that just a sign of "nobody ever even modifies the default value"? So (a) feels like "nobody really cares, and 8MB is fine, and nobody even bothers changing it - just do the minimal thing". As to (b), we could just have that whole INIT_RLIMITS per-namespace, but only enable the stack limit by default. But then system admins could cvhange those limits and enable/disable individual rlimits to be used by suid binaries. That feels like the "give the admin tools" And (c) would be the sane option, and what we already do for things like GNU_STACK to enable/disable executable stacks. It really feels like allowing the GNU_STACK segment to contain stack rlimit override information would be the perfect tool for binaries to say "Yeah, I need more stack than _STK_LIM". So I see many different approaches (that could be combined: I like combining (a) and (c), for example), and absolutely none of them involve the random "take some values from init". And yes, a large part of this may be that I no longer feel like I can trust "init" to do the sane thing. You all presumably know why. Linus
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 5:38 AM, Eric W. Biederman wrote: > Kees Cook writes: > >> In an attempt to provide sensible rlimit defaults for setuid execs, this >> inherits the namespace's init rlimits: >> >> $ ulimit -s >> 8192 >> $ ulimit -s unlimited >> $ /bin/sh -c 'ulimit -s' >> unlimited >> $ sudo /bin/sh -c 'ulimit -s' >> 8192 >> >> This is modified from Brad Spengler/PaX Team's hard-coded setuid exec >> stack rlimit (8MB) in the last public patch of grsecurity/PaX based on >> my understanding of the code. Changes or omissions from the original >> code are mine and don't reflect the original grsecurity/PaX code. >> >> Signed-off-by: Kees Cook >> --- >> Instead of copying all rlimits, we could also pick specific ones to copy >> (e.g. RLIMIT_STACK, or ones from Andy's list) or exclude from copying >> (probably better to blacklist than whitelist). >> >> I think this is the right way to find the ns init task, but maybe it >> needs locking? >> --- >> fs/exec.c | 34 ++ >> 1 file changed, 30 insertions(+), 4 deletions(-) >> >> diff --git a/fs/exec.c b/fs/exec.c >> index 904199086490..80e8b2bd4284 100644 >> --- a/fs/exec.c >> +++ b/fs/exec.c >> @@ -1675,6 +1675,12 @@ static int exec_binprm(struct linux_binprm *bprm) >> return ret; >> } >> >> +static inline bool is_setuid_exec(struct linux_binprm *bprm) >> +{ >> + return (!uid_eq(bprm->cred->euid, current_euid()) || >> + !gid_eq(bprm->cred->egid, current_egid())); >> +} > > Awesome I can make an executable setuid to myself and get all of roots > rlimits! > > Scratch inheritable rlimits as useful for any kind of policy decision. Most of them are already fairly useless. But it's a fair point. We don't want Jane to run a setuid-John program and thereby escape whatever limits are imposed. We don't really have a concept of bona-fide-trustworthy-setuid-root, though.
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Thu, Jul 6, 2017 at 5:45 AM, Eric W. Biederman wrote: > > How this is handled elsewhere in the code is to put the new values in > bprm. Putting new rlimits in bprm and changing them in flush_old_exec > or or setup_new_exec seems very sensible. It also allows for them to be > accessed before de_thread when setting up the new mm and we can still > fail. > Makes sense to me. --Andy
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
Andy Lutomirski writes: > On Wed, Jul 5, 2017 at 9:32 PM, Kees Cook wrote: >> In an attempt to provide sensible rlimit defaults for setuid execs, this >> inherits the namespace's init rlimits: >> >> $ ulimit -s >> 8192 >> $ ulimit -s unlimited >> $ /bin/sh -c 'ulimit -s' >> unlimited >> $ sudo /bin/sh -c 'ulimit -s' >> 8192 >> >> This is modified from Brad Spengler/PaX Team's hard-coded setuid exec >> stack rlimit (8MB) in the last public patch of grsecurity/PaX based on >> my understanding of the code. Changes or omissions from the original >> code are mine and don't reflect the original grsecurity/PaX code. >> >> Signed-off-by: Kees Cook >> --- >> Instead of copying all rlimits, we could also pick specific ones to copy >> (e.g. RLIMIT_STACK, or ones from Andy's list) or exclude from copying >> (probably better to blacklist than whitelist). >> >> I think this is the right way to find the ns init task, but maybe it >> needs locking? >> --- >> fs/exec.c | 34 ++ >> 1 file changed, 30 insertions(+), 4 deletions(-) >> >> diff --git a/fs/exec.c b/fs/exec.c >> index 904199086490..80e8b2bd4284 100644 >> --- a/fs/exec.c >> +++ b/fs/exec.c >> @@ -1675,6 +1675,12 @@ static int exec_binprm(struct linux_binprm *bprm) >> return ret; >> } >> >> +static inline bool is_setuid_exec(struct linux_binprm *bprm) >> +{ >> + return (!uid_eq(bprm->cred->euid, current_euid()) || >> + !gid_eq(bprm->cred->egid, current_egid())); >> +} >> + > > How about skipping this and using security_bprm_secureexec()? > >> /* >> * sys_execve() executes a new program. >> */ >> @@ -1687,6 +1693,7 @@ static int do_execveat_common(int fd, struct filename >> *filename, >> struct linux_binprm *bprm; >> struct file *file; >> struct files_struct *displaced; >> + struct rlimit saved_rlim[RLIM_NLIMITS]; >> int retval; >> >> if (IS_ERR(filename)) >> @@ -1771,24 +1778,38 @@ static int do_execveat_common(int fd, struct >> filename *filename, >> if (retval < 0) >> goto out; >> >> + /* >> +* From here forward, we've got credentials set up and we're >> +* using resources, so do rlimit replacement before we start >> +* copying strings. (Note that the RLIMIT_NPROC check has >> +* already happened.) >> +*/ >> + BUILD_BUG_ON(sizeof(saved_rlim) != sizeof(current->signal->rlim)); >> + if (is_setuid_exec(bprm)) { >> + memcpy(saved_rlim, current->signal->rlim, >> sizeof(saved_rlim)); >> + memcpy(current->signal->rlim, >> + >> task_active_pid_ns(current)->child_reaper->signal->rlim, >> + sizeof(current->signal->rlim)); >> + } >> + > > Is there any locking needed here? Is it possible that another thread > with the same current->signal is running? tasklist_lock protects child_reaper, and child_reaper changes. I suppose rcu_read_lock should be enough for the case above if we just want a snapshot in time. It is 100% possible another thread is running and could take advantage of the new limits. > I think the answer is that this needs to be after de_thread(), which > is called from exec_binprm(), which is rather annoying since the stack > rlimit is used before that. It's not *that* bad, since I think you > could replace all the RLIMIT_STACK accesses with explciit code to look > at the rlimits that are actually in play, but you just can't actually > install them until you've flushed the old exec. How this is handled elsewhere in the code is to put the new values in bprm. Putting new rlimits in bprm and changing them in flush_old_exec or or setup_new_exec seems very sensible. It also allows for them to be accessed before de_thread when setting up the new mm and we can still fail. Eric
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
Kees Cook writes: > In an attempt to provide sensible rlimit defaults for setuid execs, this > inherits the namespace's init rlimits: > > $ ulimit -s > 8192 > $ ulimit -s unlimited > $ /bin/sh -c 'ulimit -s' > unlimited > $ sudo /bin/sh -c 'ulimit -s' > 8192 > > This is modified from Brad Spengler/PaX Team's hard-coded setuid exec > stack rlimit (8MB) in the last public patch of grsecurity/PaX based on > my understanding of the code. Changes or omissions from the original > code are mine and don't reflect the original grsecurity/PaX code. > > Signed-off-by: Kees Cook > --- > Instead of copying all rlimits, we could also pick specific ones to copy > (e.g. RLIMIT_STACK, or ones from Andy's list) or exclude from copying > (probably better to blacklist than whitelist). > > I think this is the right way to find the ns init task, but maybe it > needs locking? > --- > fs/exec.c | 34 ++ > 1 file changed, 30 insertions(+), 4 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index 904199086490..80e8b2bd4284 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1675,6 +1675,12 @@ static int exec_binprm(struct linux_binprm *bprm) > return ret; > } > > +static inline bool is_setuid_exec(struct linux_binprm *bprm) > +{ > + return (!uid_eq(bprm->cred->euid, current_euid()) || > + !gid_eq(bprm->cred->egid, current_egid())); > +} Awesome I can make an executable setuid to myself and get all of roots rlimits! Scratch inheritable rlimits as useful for any kind of policy decision. > /* > * sys_execve() executes a new program. > */ > @@ -1687,6 +1693,7 @@ static int do_execveat_common(int fd, struct filename > *filename, > struct linux_binprm *bprm; > struct file *file; > struct files_struct *displaced; > + struct rlimit saved_rlim[RLIM_NLIMITS]; > int retval; > > if (IS_ERR(filename)) > @@ -1771,24 +1778,38 @@ static int do_execveat_common(int fd, struct filename > *filename, > if (retval < 0) > goto out; > > + /* > + * From here forward, we've got credentials set up and we're > + * using resources, so do rlimit replacement before we start > + * copying strings. (Note that the RLIMIT_NPROC check has > + * already happened.) > + */ > + BUILD_BUG_ON(sizeof(saved_rlim) != sizeof(current->signal->rlim)); > + if (is_setuid_exec(bprm)) { > + memcpy(saved_rlim, current->signal->rlim, sizeof(saved_rlim)); > + memcpy(current->signal->rlim, > +task_active_pid_ns(current)->child_reaper->signal->rlim, > +sizeof(current->signal->rlim)); > + } > + Caerful. child_reaper can change if you are not holding the tasklist lock. It would be better if we could move any rlimit changes after de_thread. Otherwise there are some really fun races you can play with. After de_thread is past the point of no return so you would not need to worry about restoring the rlimits either. > retval = copy_strings_kernel(1, &bprm->filename, bprm); > if (retval < 0) > - goto out; > + goto out_restore; > > bprm->exec = bprm->p; > retval = copy_strings(bprm->envc, envp, bprm); > if (retval < 0) > - goto out; > + goto out_restore; > > retval = copy_strings(bprm->argc, argv, bprm); > if (retval < 0) > - goto out; > + goto out_restore; > > would_dump(bprm, bprm->file); > > retval = exec_binprm(bprm); > if (retval < 0) > - goto out; > + goto out_restore; > > /* execve succeeded */ > current->fs->in_exec = 0; > @@ -1802,6 +1823,11 @@ static int do_execveat_common(int fd, struct filename > *filename, > put_files_struct(displaced); > return retval; > > +out_restore: > + if (is_setuid_exec(bprm)) { > + memcpy(current->signal->rlim, saved_rlim, sizeof(saved_rlim)); > + } > + > out: > if (bprm->mm) { > acct_arg_size(bprm, 0); > -- > 2.7.4 Eric
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Wed, Jul 05, 2017 at 09:32:35PM -0700, Kees Cook wrote: > In an attempt to provide sensible rlimit defaults for setuid execs, this > inherits the namespace's init rlimits: > > $ ulimit -s > 8192 > $ ulimit -s unlimited > $ /bin/sh -c 'ulimit -s' > unlimited > $ sudo /bin/sh -c 'ulimit -s' > 8192 > > This is modified from Brad Spengler/PaX Team's hard-coded setuid exec > stack rlimit (8MB) in the last public patch of grsecurity/PaX based on > my understanding of the code. Changes or omissions from the original > code are mine and don't reflect the original grsecurity/PaX code. > > Signed-off-by: Kees Cook > --- > Instead of copying all rlimits, we could also pick specific ones to copy > (e.g. RLIMIT_STACK, or ones from Andy's list) or exclude from copying > (probably better to blacklist than whitelist). > > I think this is the right way to find the ns init task, but maybe it > needs locking? > --- > fs/exec.c | 34 ++ > 1 file changed, 30 insertions(+), 4 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index 904199086490..80e8b2bd4284 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1675,6 +1675,12 @@ static int exec_binprm(struct linux_binprm *bprm) > return ret; > } > > +static inline bool is_setuid_exec(struct linux_binprm *bprm) > +{ > + return (!uid_eq(bprm->cred->euid, current_euid()) || > + !gid_eq(bprm->cred->egid, current_egid())); > +} > + > /* > * sys_execve() executes a new program. > */ > @@ -1687,6 +1693,7 @@ static int do_execveat_common(int fd, struct filename > *filename, > struct linux_binprm *bprm; > struct file *file; > struct files_struct *displaced; > + struct rlimit saved_rlim[RLIM_NLIMITS]; > int retval; > > if (IS_ERR(filename)) > @@ -1771,24 +1778,38 @@ static int do_execveat_common(int fd, struct filename > *filename, > if (retval < 0) > goto out; > > + /* > + * From here forward, we've got credentials set up and we're > + * using resources, so do rlimit replacement before we start > + * copying strings. (Note that the RLIMIT_NPROC check has > + * already happened.) > + */ > + BUILD_BUG_ON(sizeof(saved_rlim) != sizeof(current->signal->rlim)); > + if (is_setuid_exec(bprm)) { > + memcpy(saved_rlim, current->signal->rlim, sizeof(saved_rlim)); > + memcpy(current->signal->rlim, > +task_active_pid_ns(current)->child_reaper->signal->rlim, > +sizeof(current->signal->rlim)); > + } > + But you're not replacing just RLIMIT_STACK but all rlimits here. Please don't do this, as I mentionned elsewhere in the thread it will really become a pain at least for RLIMIT_CORE and RLIMIT_NOFILE. Better only apply RLIMIT_STACK as the commit message mentions it. Willy
Re: [RFC][PATCH] exec: Use init rlimits for setuid exec
On Wed, Jul 5, 2017 at 9:32 PM, Kees Cook wrote: > In an attempt to provide sensible rlimit defaults for setuid execs, this > inherits the namespace's init rlimits: > > $ ulimit -s > 8192 > $ ulimit -s unlimited > $ /bin/sh -c 'ulimit -s' > unlimited > $ sudo /bin/sh -c 'ulimit -s' > 8192 > > This is modified from Brad Spengler/PaX Team's hard-coded setuid exec > stack rlimit (8MB) in the last public patch of grsecurity/PaX based on > my understanding of the code. Changes or omissions from the original > code are mine and don't reflect the original grsecurity/PaX code. > > Signed-off-by: Kees Cook > --- > Instead of copying all rlimits, we could also pick specific ones to copy > (e.g. RLIMIT_STACK, or ones from Andy's list) or exclude from copying > (probably better to blacklist than whitelist). > > I think this is the right way to find the ns init task, but maybe it > needs locking? > --- > fs/exec.c | 34 ++ > 1 file changed, 30 insertions(+), 4 deletions(-) > > diff --git a/fs/exec.c b/fs/exec.c > index 904199086490..80e8b2bd4284 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1675,6 +1675,12 @@ static int exec_binprm(struct linux_binprm *bprm) > return ret; > } > > +static inline bool is_setuid_exec(struct linux_binprm *bprm) > +{ > + return (!uid_eq(bprm->cred->euid, current_euid()) || > + !gid_eq(bprm->cred->egid, current_egid())); > +} > + How about skipping this and using security_bprm_secureexec()? > /* > * sys_execve() executes a new program. > */ > @@ -1687,6 +1693,7 @@ static int do_execveat_common(int fd, struct filename > *filename, > struct linux_binprm *bprm; > struct file *file; > struct files_struct *displaced; > + struct rlimit saved_rlim[RLIM_NLIMITS]; > int retval; > > if (IS_ERR(filename)) > @@ -1771,24 +1778,38 @@ static int do_execveat_common(int fd, struct filename > *filename, > if (retval < 0) > goto out; > > + /* > +* From here forward, we've got credentials set up and we're > +* using resources, so do rlimit replacement before we start > +* copying strings. (Note that the RLIMIT_NPROC check has > +* already happened.) > +*/ > + BUILD_BUG_ON(sizeof(saved_rlim) != sizeof(current->signal->rlim)); > + if (is_setuid_exec(bprm)) { > + memcpy(saved_rlim, current->signal->rlim, sizeof(saved_rlim)); > + memcpy(current->signal->rlim, > + > task_active_pid_ns(current)->child_reaper->signal->rlim, > + sizeof(current->signal->rlim)); > + } > + Is there any locking needed here? Is it possible that another thread with the same current->signal is running? I think the answer is that this needs to be after de_thread(), which is called from exec_binprm(), which is rather annoying since the stack rlimit is used before that. It's not *that* bad, since I think you could replace all the RLIMIT_STACK accesses with explciit code to look at the rlimits that are actually in play, but you just can't actually install them until you've flushed the old exec. Perhaps, if you fix this, the changelog should say: Changes, omissions, or bugfixes from the original code are mine and don't reflect the original grsecurity/PaX code. (Hmm, is the grsecurity/PaX code exploitable? If you can exec a setuid program and arrange for it to fail while you're in a threaded program, it looks like the patch you sent will result in corruption of the rlimits in use by your other threads. Admittedy, bypassing rlimits is not exactly a huge deal.) Also, should this whole thing have a sysctl? --Andy
[RFC][PATCH] exec: Use init rlimits for setuid exec
In an attempt to provide sensible rlimit defaults for setuid execs, this inherits the namespace's init rlimits: $ ulimit -s 8192 $ ulimit -s unlimited $ /bin/sh -c 'ulimit -s' unlimited $ sudo /bin/sh -c 'ulimit -s' 8192 This is modified from Brad Spengler/PaX Team's hard-coded setuid exec stack rlimit (8MB) in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: Kees Cook --- Instead of copying all rlimits, we could also pick specific ones to copy (e.g. RLIMIT_STACK, or ones from Andy's list) or exclude from copying (probably better to blacklist than whitelist). I think this is the right way to find the ns init task, but maybe it needs locking? --- fs/exec.c | 34 ++ 1 file changed, 30 insertions(+), 4 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 904199086490..80e8b2bd4284 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1675,6 +1675,12 @@ static int exec_binprm(struct linux_binprm *bprm) return ret; } +static inline bool is_setuid_exec(struct linux_binprm *bprm) +{ + return (!uid_eq(bprm->cred->euid, current_euid()) || + !gid_eq(bprm->cred->egid, current_egid())); +} + /* * sys_execve() executes a new program. */ @@ -1687,6 +1693,7 @@ static int do_execveat_common(int fd, struct filename *filename, struct linux_binprm *bprm; struct file *file; struct files_struct *displaced; + struct rlimit saved_rlim[RLIM_NLIMITS]; int retval; if (IS_ERR(filename)) @@ -1771,24 +1778,38 @@ static int do_execveat_common(int fd, struct filename *filename, if (retval < 0) goto out; + /* +* From here forward, we've got credentials set up and we're +* using resources, so do rlimit replacement before we start +* copying strings. (Note that the RLIMIT_NPROC check has +* already happened.) +*/ + BUILD_BUG_ON(sizeof(saved_rlim) != sizeof(current->signal->rlim)); + if (is_setuid_exec(bprm)) { + memcpy(saved_rlim, current->signal->rlim, sizeof(saved_rlim)); + memcpy(current->signal->rlim, + task_active_pid_ns(current)->child_reaper->signal->rlim, + sizeof(current->signal->rlim)); + } + retval = copy_strings_kernel(1, &bprm->filename, bprm); if (retval < 0) - goto out; + goto out_restore; bprm->exec = bprm->p; retval = copy_strings(bprm->envc, envp, bprm); if (retval < 0) - goto out; + goto out_restore; retval = copy_strings(bprm->argc, argv, bprm); if (retval < 0) - goto out; + goto out_restore; would_dump(bprm, bprm->file); retval = exec_binprm(bprm); if (retval < 0) - goto out; + goto out_restore; /* execve succeeded */ current->fs->in_exec = 0; @@ -1802,6 +1823,11 @@ static int do_execveat_common(int fd, struct filename *filename, put_files_struct(displaced); return retval; +out_restore: + if (is_setuid_exec(bprm)) { + memcpy(current->signal->rlim, saved_rlim, sizeof(saved_rlim)); + } + out: if (bprm->mm) { acct_arg_size(bprm, 0); -- 2.7.4 -- Kees Cook Pixel Security