On 07/12/2018 11:39 AM, Tamar Christina wrote:
>>> +
>>> +  /* Round size to the nearest multiple of guard_size, and calculate the
>>> +     residual as the difference between the original size and the rounded
>>> +     size.  */
>>> +  HOST_WIDE_INT rounded_size = size & -guard_size;
>>> +  HOST_WIDE_INT residual = size - rounded_size;
>>> +
>>> +  /* We can handle a small number of allocations/probes inline.  Otherwise
>>> +     punt to a loop.  */
>>> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
>>> +    {
>>> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
>>> +	{
>>> +	  aarch64_sub_sp (NULL, temp2, guard_size, true);
>>> +	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
>>> +					   STACK_CLASH_CALLER_GUARD));
>>> +	}
>> So the only concern with this code is that it'll be inefficient and
>> possibly incorrect for probe sizes larger than ARITH_FACTOR.
>> Ultimately, that's a case I don't think we really care that much about.
>> I wouldn't lose sleep if the port clamped the requested probing interval
>> at ARITH_FACTOR.
>>
> I'm a bit confused here, the ARITH_FACTOR seems to have to do with the Ada
> stack probing implementation, which isn't used by this new code aside
> from the part that emits the actual probe when doing a variable or large
> allocation in aarch64_output_probe_stack_range.
>
> Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
> probing intervals.
It may have been a misunderstanding on my part.  My understanding is
that ARITH_FACTOR represents the largest immediate constant we could
handle in this code using a single insn.  Anything above ARITH_FACTOR
needed a scratch register and I couldn't convince myself that we had a
suitable scratch register available.
But I'm really not well versed on the aarch64 architecture or the
various implementation details in aarch64.c.  So I'm happy to defer to
you and others @ ARM on what's best to do here.

>> That can be problematical in a couple cases.  First it can run afoul of
>> combine-stack-adjustments.  Essentially that pass will combine a series
>> of stack adjustments into a single insn so your unrolled probing turns
>> into something like this:
>>
>>   multi-page stack adjust
>>   probe first page
>>   probe second page
>>   probe third page
>>   and so on..
>>
> This is something we actually do want, particularly in the case of 4KB pages
> as the immediates would fit in the store.  It's one of the things we were
> thinking about for future improvements.
>
>> That violates the design constraint that we never allocate more than a
>> page at a time.
> Would there be a big issue here if we didn't adhere to this constraint?
Yes, because it enlarges the window for exploitation.  Consider signal
delivery occurring after the adjustment but before the probes.  The
signal handler could then be running on a clashed heap/stack.
>
>> Do you happen to have a libc.so and ld.so compiled with this code?  I've
>> got a scanner here which will look at a DSO and flag obviously invalid
>> stack allocations.  If you've got a suitable libc.so and ld.so I can run
>> them through the scanner.
>>
>> Along similar lines, have you run the glibc testsuite with this stuff
>> enabled by default?  That proved useful to find codegen issues,
>> particularly with the register inheritance in the epilogue.
>>
> I'm running one now, I'll send the two binaries once I get the results back
> from the run.  Thanks!
Great.  Looking forward to getting those .so files so I can throw them
into the scanner.
>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -3072,7 +3072,7 @@
>>>
>>>  (define_insn "cmp<mode>"
>>>    [(set (reg:CC CC_REGNUM)
>>> -	(compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
>>> +	(compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
>>>		    (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
>>>    ""
>>>    "@
>> I don't think I needed this, but I can certainly understand how it might
>> be necessary.  The only one I know we needed was the probe_stack_range
>> constraint change.
>>
> It's not strictly needed, but it allows us to do the comparison with the
> stack pointer in the loop without needing to emit sp to a temporary first.
> So it's just a tiny optimization. :)
Understood.  A similar tweak for cmp<mode> was done in a patch from
Richard E in his patches for spectre v1 mitigation.  I'm certainly not
objecting :-)

Jeff