This patch introduces aarch64 -fstack-check=clash prologue support. aarch64 is the first target that does not have any implicit probes in the caller. Thus at prologue entry it must make conservative assumptions about the offset of the most recently probed address relative to the stack pointer.
However, it can and does use stores into the stack generated by the prologue to avoid some explicit probes. It's not as efficient as x86 or ppc, but it does try hard to be efficient.

We track the offset of the most recent probe relative to the stack pointer. The initial state is PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT) bytes away. Allocations increase that value; stores decrease it. When the value crosses PROBE_INTERVAL, we emit an explicit probe.

aarch64 prologues can allocate space in three different places. INITIAL_ADJUST and FINAL_ADJUST are simple adjustments and are trivial to handle: we just add the adjustment to the current offset and, if we cross PROBE_INTERVAL, we emit a probe before the stack adjustment.

Handling of CALLEE_ADJUST is more subtle because it is two actions in one instruction: the instruction both allocates space and stores into the two words at the lowest address in the allocated space. It's easiest to mentally model the instruction as two distinct operations.

First, the instruction adjusts the stack by CALLEE_ADJUST - 2 * WORD_SIZE and thus increases the offset to the most recent probe. If the offset to the most recent probe crosses PROBE_INTERVAL, then we must emit a probe prior to the CALLEE_ADJUST instruction. Note that CALLEE_ADJUST might be 0 or any multiple of 16; for 0/16 this step is a nop.

Second, the CALLEE_ADJUST instruction adjusts the stack by 2 * WORD_SIZE and stores into both just-allocated words. We never need a probe for this action. Furthermore, this action always zeros the offset to the most recent probe, which is very helpful in avoiding subsequent explicit probes.

Between CALLEE_ADJUST and FINAL_ADJUST are other register saves. We track the offsets for those and decrease the offset to the most recent probe appropriately.

I'm pretty sure this implementation is about as good as it can get with the ISA/ABI in place on aarch64.
I'm pretty sure that the loop used to allocate large stacks does not update CFIs correctly.

Again, you'll note that, like ppc, this does not add any new tests; it uses the tests that were added with the x86 support unmodified.

Thoughts/Comments? OK for the trunk?
	* config/aarch64/aarch64.c (aarch64_output_probe_stack_range): Handle
	-fstack-check=clash probing too.
	(aarch64_save_callee_saves): Return smallest offset from SP
	that was written.
	(aarch64_allocate_and_probe_stack_space): New function.
	(aarch64_expand_prologue): Track distance between SP and most
	recent probe.  Use aarch64_allocate_and_probe_stack_space
	when -fstack-check=clash rather than just adjusting sp.  Dump
	actions via dump_stack_clash_frame_info.  Handle probing and
	probe offset update for CALLEE_ADJUST space.  Use return value
	from aarch64_save_callee_saves to reduce last_probe_offset.

commit 86871749daad396cf825b3982dee8e80eda96d8a
Author: root <r...@gigabyte-r120-05.khw.lab.eng.bos.redhat.com>
Date:   Fri Jul 7 11:17:06 2017 -0400

    aarch64 support with dumping, adjust tests

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0a8b40a..40bc183 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2830,6 +2830,9 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   char loop_lab[32];
   rtx xops[2];
 
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    reg1 = stack_pointer_rtx;
+
   ASM_GENERATE_INTERNAL_LABEL (loop_lab, "LPSRL", labelno++);
 
   /* Loop.  */
@@ -2841,7 +2844,14 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   output_asm_insn ("sub\t%0, %0, %1", xops);
 
   /* Probe at TEST_ADDR.  */
-  output_asm_insn ("str\txzr, [%0]", xops);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    {
+      gcc_assert (xops[0] == stack_pointer_rtx);
+      xops[1] = GEN_INT (PROBE_INTERVAL - 8);
+      output_asm_insn ("str\txzr, [%0, %1]", xops);
+    }
+  else
+    output_asm_insn ("str\txzr, [%0]", xops);
 
   /* Test if TEST_ADDR == LAST_ADDR.  */
   xops[1] = reg2;
@@ -3245,9 +3255,11 @@ aarch64_return_address_signing_enabled (void)
 
 /* Emit code to save the callee-saved registers from register number START
    to LIMIT to the stack at the location starting at offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  */
+   skipping any write-back candidates if SKIP_WB is true.
 
-static void
+   Return the smallest offset from SP that is written.  */
+
+static HOST_WIDE_INT
 aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
			   unsigned start, unsigned limit, bool skip_wb)
 {
@@ -3257,6 +3269,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
   unsigned regno;
   unsigned regno2;
 
+  HOST_WIDE_INT lowest_offset = PROBE_INTERVAL;
+
   for (regno = aarch64_next_callee_save (start, limit);
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
@@ -3274,6 +3288,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
+      if (offset < lowest_offset)
+	lowest_offset = offset;
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
					       offset));
@@ -3289,6 +3305,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
	  rtx mem2;
 
	  offset = start_offset + cfun->machine->frame.reg_offset[regno2];
+	  if (offset < lowest_offset)
+	    lowest_offset = offset;
	  mem2 = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
						   offset));
	  insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
@@ -3306,6 +3324,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
	RTX_FRAME_RELATED_P (insn) = 1;
     }
+  return lowest_offset;
 }
 
 /* Emit code to restore the callee registers of mode MODE from register
@@ -3605,6 +3624,71 @@ aarch64_set_handled_components (sbitmap components)
       cfun->machine->reg_is_wrapped_separately[regno] = true;
 }
 
+/* Allocate SIZE bytes of stack space using SCRATCH_REG as a scratch
+   register.
+
+   LAST_PROBE_OFFSET contains the offset between the stack pointer and
+   the last known probe.  As LAST_PROBE_OFFSET crosses PROBE_INTERVAL
+   emit a probe and adjust LAST_PROBE_OFFSET.  */
+static void
+aarch64_allocate_and_probe_stack_space (int scratchreg, HOST_WIDE_INT size,
+					HOST_WIDE_INT *last_probe_offset)
+{
+  rtx temp = gen_rtx_REG (word_mode, scratchreg);
+
+  HOST_WIDE_INT rounded_size = size & -PROBE_INTERVAL;
+
+  /* We can handle a small number of allocations/probes inline.  Otherwise
+     punt to a loop.  */
+  if (rounded_size && rounded_size <= 4 * PROBE_INTERVAL)
+    {
+      for (HOST_WIDE_INT i = 0; i < rounded_size; i += PROBE_INTERVAL)
+	{
+	  /* We should never need a scratch register for this adjustment.  */
+	  aarch64_sub_sp (-1, PROBE_INTERVAL, true);
+
+	  /* We just allocated PROBE_INTERVAL bytes.  Thus, a probe is
+	     mandatory.  Note that LAST_PROBE_OFFSET does not change here.  */
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
					   (PROBE_INTERVAL
					    - GET_MODE_SIZE (word_mode))));
+	}
+      dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    }
+  else if (rounded_size)
+    {
+      /* Compute the ending address.  */
+      emit_move_insn (temp, GEN_INT (-rounded_size));
+      emit_insn (gen_add3_insn (temp, stack_pointer_rtx, temp));
+
+      /* This allocates and probes the stack.  Like the inline version above
+	 it does not need to change LAST_PROBE_OFFSET.
+
+	 It almost certainly does not update CFIs correctly.  */
+      emit_insn (gen_probe_stack_range (temp, temp, temp));
+      dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+    }
+  else
+    dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+
+  /* Handle any residuals.  */
+  HOST_WIDE_INT residual = size - rounded_size;
+  if (residual)
+    {
+      aarch64_sub_sp (-1, residual, true);
+      *last_probe_offset += residual;
+      if (*last_probe_offset >= PROBE_INTERVAL)
+	{
+	  *last_probe_offset -= PROBE_INTERVAL;
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
					   (residual
					    - GET_MODE_SIZE (word_mode))));
+	}
+    }
+  return;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
	+-------------------------------+
@@ -3661,6 +3745,11 @@ aarch64_expand_prologue (void)
   unsigned reg2 = cfun->machine->frame.wb_candidate2;
   rtx_insn *insn;
 
+  /* We know nothing about the last write into the stack, so we have to make
+     worst case assumptions here.  */
+  HOST_WIDE_INT last_probe_offset
+    = PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT);
+
   /* Sign return address for functions.  */
   if (aarch64_return_address_signing_enabled ())
     {
@@ -3686,16 +3775,64 @@ aarch64_expand_prologue (void)
       aarch64_emit_probe_stack_range (get_stack_check_protect (), frame_size);
     }
 
-  aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && initial_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP0_REGNUM, initial_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+
+  /* There are no explicit probes in this case.  */
+  if (initial_adjust == 0
+      && callee_adjust < 32
+      && final_adjust == 0)
+    dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, callee_adjust != 0);
 
   if (callee_adjust != 0)
-    aarch64_push_regs (reg1, reg2, callee_adjust);
+    {
+      aarch64_push_regs (reg1, reg2, callee_adjust);
+
+      /* For a small CALLEE_ADJUST we don't need a probe here as the store of
+	 REG1/REG2 will hit the page we want.  */
+      if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+	  && callee_adjust >= 32)
+	{
+	  /* callee_adjust has to fit in a 7 bit signed scaled offset,
+	     so it should never be near PROBE_INTERVAL.  */
+	  gcc_assert (callee_adjust < PROBE_INTERVAL);
+	  last_probe_offset += callee_adjust;
+	  /* Probe if necessary.  */
+	  if (last_probe_offset >= PROBE_INTERVAL)
+	    {
+	      emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
					       (callee_adjust
						- GET_MODE_SIZE (word_mode))));
+	      dump_stack_clash_frame_info (PROBE_INLINE, true);
+	    }
+
+	  /* Regardless of whether or not we explicitly probed, the register
+	     push constitutes an implicit probe at *sp.  */
+	  last_probe_offset = 0;
+	}
+    }
+
+  /* The calls to aarch64_save_callee_saves store into the stack and
+     can act as implicit probes.  So it tracks the offset closest to
+     SP that is written and returns it.  If that return value is less
+     than LAST_PROBE_OFFSET then we can adjust LAST_PROBE_OFFSET, possibly
+     saving an explicit probe for FINAL_ADJUST.  */
+  HOST_WIDE_INT tmp;
   if (frame_pointer_needed)
     {
       if (callee_adjust == 0)
-	aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
-				   R30_REGNUM, false);
+	{
+	  tmp = aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
+					   R30_REGNUM, false);
+	  if (tmp < last_probe_offset)
+	    last_probe_offset = tmp;
+	}
       insn = emit_insn (gen_add3_insn (hard_frame_pointer_rtx,
				       stack_pointer_rtx,
				       GEN_INT (callee_offset)));
@@ -3703,11 +3840,22 @@ aarch64_expand_prologue (void)
	emit_insn (gen_stack_tie (stack_pointer_rtx, hard_frame_pointer_rtx));
     }
 
-  aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
+  tmp = aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM,
+				   R30_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  tmp = aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM,
+				   V31_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && final_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP1_REGNUM, final_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
 }
 
 /* Return TRUE if we can use a simple_return insn.