As some of you are likely aware, Qualys has just published fairly detailed information on using stack/heap clashes as an attack vector. Eric B, Michael M -- sorry I couldn't say more when I contacted you about -fstack-check and some PPC-specific stuff. This has been under embargo for the last month.

--

http://www.openwall.com/lists/oss-security/2017/06/19/1

Obviously various vulnerabilities pointed out in that advisory are being mitigated, particularly those found within glibc. But those are really just scratching the surface of this issue.

At its core, this chained attack relies first upon using various techniques to bring the stack and heap close together. Then the exploits rely on large stack allocations to "jump the guard". Once the guard has been jumped, the stack and heap have collided and all hell breaks loose.

The "jump the guard" step can be mitigated with help from the compiler. We just have to ensure that, as we allocate chunks of stack space, we touch each allocated page. That ensures that the guard page is hit.

This sounds a whole lot like -fstack-check and initially that's what folks were hoping could be used to eliminate this class of problems.

--

Unfortunately, -fstack-check is actually not well suited for our purposes.

Some background. -fstack-check was designed primarily for Ada's needs. It assumes the whole program is compiled with -fstack-check and it is designed to ensure there is enough stack space left so that if the program hits the guard (say via infinite recursion) the program can safely call into a signal handler and raise an exception.

To ensure there's always enough space to meet that design requirement, -fstack-check probes stack space ahead of the actual need of the code.

The assumption that all code was compiled with -fstack-check allows for elision of some stack probes, as they are assumed to have been probed by earlier callers in the call chain. This elision is safe in an environment where all callers use -fstack-check, but fatally flawed in a mixed environment.

Most ports first probe by pages for whatever space is requested, then after all probing is done, they actually allocate space. This runs afoul of valgrind in various unpleasant ways (including crashing valgrind on two targets).
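
To make the probe-first ordering concrete, here is a minimal C sketch of it. This is a toy model, not what any port actually emits: `probe_then_allocate`, `struct probe_stats` and the 4k `PAGE_SIZE` are names and values made up for illustration, and the extra space -fstack-check probes ahead of the actual need is left out. Nothing is really touched; the function only records where each probe would land relative to the stack pointer.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size; real targets differ */

/* Hypothetical bookkeeping, for this illustration only. */
struct probe_stats {
    unsigned probes;            /* total probes issued */
    unsigned probes_below_sp;   /* probes landing below the current sp */
};

/* Model of "probe every page first, allocate afterwards" on a
   downward-growing stack.  Returns the stack pointer after the
   allocation. */
uintptr_t probe_then_allocate(uintptr_t sp, uintptr_t size,
                              struct probe_stats *st)
{
    assert(st != 0 && size <= sp);
    st->probes = 0;
    st->probes_below_sp = 0;

    /* Step 1: probe once per full page of the request while sp is
       still unchanged. */
    for (uintptr_t done = 0; size - done >= PAGE_SIZE; done += PAGE_SIZE) {
        uintptr_t addr = sp - done - PAGE_SIZE;
        st->probes++;
        if (addr < sp)          /* always true here: sp has not moved yet */
            st->probes_below_sp++;
    }

    /* Step 2: only now move the stack pointer. */
    return sp - size;
}
```

In this model every probe lands below the current stack pointer, i.e. in space that has not been allocated yet, which is the kind of access valgrind objects to.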
Only x86-linux currently uses a "moving sp" allocation and probing strategy, i.e., it actually allocates space, then probes the space.

--

After much poking around I concluded that we really need to implement allocation and probing via a "moving sp" strategy. Probing into unallocated areas runs afoul of valgrind, so that's a non-starter.

Allocating stack space, then probing the pages within the space is vulnerable to async signal delivery between the allocation point and the probe point. If that occurs the signal handler could end up running on a stack that has collided with the heap.

Ideally we would allocate and probe a page as an atomic unit (which is feasible on PPC). Alternatively, due to ISA restrictions, allocate a page, then probe the page as distinct instructions. The latter still has a race, but we'd have to take the async signal in a single instruction window.

A key point to remember is that you can never have an allocation (potentially using more than one allocation site) which is larger than a page without probing the page.

Furthermore, we cannot assume that earlier functions in the call stack were compiled with stack checking enabled. Thus we cannot make any assumptions about what pages other functions in the call stack have probed or not probed.

Finally, we need not ensure the ability to handle a signal at stack overflow. It is fine for the kernel to halt the process immediately if it detects a reference to the guard page.

--

With all that in mind, we also want to be as efficient as possible and I think we do pretty well on x86 and ppc. On x86, the call instruction itself stores into the stack, and on ppc stack is only supposed to be allocated via the store-with-base-register-modification instructions, which also store into *sp. Those "implicit probes" allow us to greatly reduce the amount of probing we do on those architectures. If a function allocates less than a page of space, no probing is needed -- this covers the vast majority of functions.
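
As a rough sketch of the "moving sp" scheme with an implicit probe at function entry, the C model below allocates one page at a time and probes each page right after allocating it. It is only an illustration under stated assumptions, not the code from the patches: `moving_sp_allocate`, `struct moving_sp_stats` and the 4k `PAGE_SIZE` are hypothetical, and the model tracks addresses without touching memory.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size; real targets differ */

/* Hypothetical bookkeeping, for this illustration only. */
struct moving_sp_stats {
    unsigned explicit_probes;   /* probes the prologue has to emit */
    uintptr_t max_unprobed;     /* largest distance ever seen between sp
                                   and the lowest probed address */
};

/* Model of "allocate a page, then probe it" on a downward-growing
   stack.  Returns the stack pointer after the allocation. */
uintptr_t moving_sp_allocate(uintptr_t sp, uintptr_t size,
                             struct moving_sp_stats *st)
{
    assert(st != 0 && size <= sp);

    /* Implicit probe: assume *sp was already written on entry, as the
       x86 call instruction does when it stores the return address. */
    uintptr_t last_probe = sp;
    st->explicit_probes = 0;
    st->max_unprobed = 0;

    while (size >= PAGE_SIZE) {
        sp -= PAGE_SIZE;                    /* allocate one page */
        size -= PAGE_SIZE;
        if (last_probe - sp > st->max_unprobed)
            st->max_unprobed = last_probe - sp;
        st->explicit_probes++;              /* probe *sp */
        last_probe = sp;
    }

    sp -= size;                             /* whatever is left, < 1 page */
    if (last_probe - sp > st->max_unprobed)
        st->max_unprobed = last_probe - sp;
    return sp;
}
```

For a 200-byte frame the loop never runs, so `explicit_probes` stays at 0 and the frame relies entirely on the implicit probe. In every case `max_unprobed` never exceeds one page, which is the invariant that matters.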
Furthermore, if we allocate N pages + M bytes of residuals, we need only explicitly probe the N pages, but not any of the residual allocation. On glibc, we end up creating probes in ~1.5% of the functions on those two architectures. We could probably do even better on PPC, but we currently assume 4k pages, which is overly conservative on that target.

aarch64 is significantly worse. There are no implicit probes we can exploit. Furthermore, the prologue may allocate stack space 3-4 times. So we have to track the distance to the most recent probe and, when that distance grows too large, emit a probe. Of course we have to make worst-case assumptions at function entry.

s390 is much like aarch64 in that it doesn't have implicit probes. However, it has simpler prologue code.

Dynamic (alloca) space is handled fairly generically, with simple code to allocate a page and probe the just-allocated page.

Michael Matz has suggested some generic support so that we don't have to write target-specific code for each and every target we support. The idea is to have a helper function which allocates and probes stack space. The port can then call that helper function from within its prologue generator. I think this is wise -- I wouldn't want to go through this exercise on every port.

--

So, time to open the discussion to questions & comments.

I've got patches I need to clean up and post for comments that implement this for x86, ppc, aarch64 and s390. x86 and ppc are IMHO in good shape. There's an unhandled case for s390. I've got evaluation still to do on aarch64.

Jeff