> > From: Lili Cui <[email protected]>
> >
> > Hi Uros,
> >
> > Thank you very much for providing a detailed BKM to reproduce the Linux
> > kernel boot failure.  Both my patch and Matz's patch have this problem: we
> > inserted a SUB between TEST and JLE, and the SUB changes EFLAGS, so the
> > JLE branch goes the wrong way and a null pointer appears after passing
> > through many functions.  I created a small test case for it and added it
> > to my patch.  I use LEA instead of SUB/ADD to fix this issue; shrink-wrap
> > separate also benefits from the LEA instruction.
>
> Thanks for your thorough testing!  I think that introducing a boot test for
> the Linux kernel (besides the SPEC tests) in your testing is a great
> addition, and I would recommend performing it especially for an invasive
> change like this one.  In addition, Intel has continuous integration
> testing for Linux kernel patches; perhaps something similar can be used to
> boot and test Linux with GCC changes.
>
Agreed, your suggestion makes sense.
> > Real case:
> > Avoid inserting a SUB into the test/je/jle sequence, since it clobbers
> > EFLAGS; LEA should be used here instead:
> > xorl %eax, %eax
> > testl %edi, %edi
> > je .L11
> > sub $16, %rsp ------> leaq -16(%rsp), %rsp
>
> FAOD, you mean to insert SUB between TESTL and JE? The above case is not
> problematic, because flags are produced and consumed without being clobbered.
>
> > movq %r13, 8(%rsp)
> > movl $1, %r13d
> > jle .L4
> >
Not between TESTL and JE, but between TESTL and JLE.
Before shrink-wrapping the sequence is:

        push    %r13
        push    ...
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L11        ---> early return
        movl    $1, %r13d
        jle     .L4

Shrink-wrap separate replaces "push %r13" with "sub $16, %rsp" + "movq
%r13, 8(%rsp)" and inserts that pair before "movl $1, %r13d".  The
"sub $16, %rsp" clobbers EFLAGS (including SF), so the JLE then takes the
wrong branch.
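For context, a minimal C sketch of the kind of function shape that exposes
this (names and exact code are illustrative only; the reduced test case
added to the patch is different):

    /* Compile-only sketch, e.g. with -O2 -fshrink-wrap-separate.  The early
       return keeps the fast path free of register saves, and the second
       branch reuses the EFLAGS produced by the same test, so a SUB placed
       between them by shrink-wrap separate would change the JLE outcome.  */
    extern long use_callee_saved (long);

    long
    foo (int n, long x)
    {
      if (n == 0)        /* testl %edi, %edi; je .L11 -- early return.  */
        return 0;

      long acc = 1;      /* Lives across the call below, so it wants a
                            callee-saved register such as %r13.  */
      if (n <= 0)        /* May become a jle reusing the flags from the
                            test above (codegen dependent).  */
        acc = use_callee_saved (x);

      return acc + x;
    }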
> > GCC change:
> > Emit LEA in ix86_expand_prologue/ix86_expand_epilogue; peephole2 will then
> > decide whether to replace the LEA with SUB/ADD depending on the situation.
> >
> >
> > I studied Matz's patch; the effect we achieve is basically the same,
> > although there are some differences in the implementation.
> >
> > I incorporated some of Matz's good ideas into my patch:
> > 1. Matz added a lot of assertions to catch errors early, and I also added
> > some of them in this patch.
> > 2. Matz disables shrink-wrap separate in some special scenarios, and I
> > added those checks as well.
> >
> > Remaining differences:
> > 1. The position of the inserted SUB differs.  Matz inserts the SUB at the
> > end of the prologue; I put it on the first line of the prologue, and I
> > keep my approach here.
> > 2. Matz handles both SSE and general registers, while I only handle
> > general registers.  Enabling SSE registers is easy, but the SSE part seems
> > to be Windows ABI related.  I have no way to test it yet, so I won't add
> > it for now.
> > 3. After reload, some registers that were eliminated by optimization are
> > still marked in regs_ever_live, resulting in redundant push/pop when the
> > prologue is generated later.  My patch eliminates these redundant
> > registers and keeps the stack frame layout the same as before.
> >
> > Collected SPEC CPU 2017 performance on ZNVER5, EMR and ICELAKE; no
> > performance regression was observed.
> > For -O2 multi-copy:
> >   511.povray_r improved by 2.8% on ZNVER5.
> >   511.povray_r improved by 4% on EMR.
> >   511.povray_r improved by 3.3% to 4.6% on ICELAKE.
> >
> > Bootstrapped & regtested on x86_64-pc-linux-gnu.
> > The latest Linux kernel builds and boots successfully with this patch.
> >
> > Thanks,
> > Lili.
> >
> >
> > This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
> > enable separate shrink wrapping for function prologues/epilogues in
> > x86.
> >
> > When performing separate shrink wrapping, we choose MOV instead of
> > PUSH/POP: handling the RSP adjustment with PUSH/POP is more complicated
> > and may cost performance, so we use MOV, which has a small impact on code
> > size but preserves performance.
> >
> > Using MOV means we need SUB/ADD to maintain the stack frame.  In some
> > special cases we must use LEA instead, to avoid clobbering EFLAGS.
> >
> > For example, avoid inserting a SUB into the test/je/jle sequence below,
> > since it clobbers EFLAGS; LEA should be used there:
> >
> > foo:
> > xorl %eax, %eax
> > testl %edi, %edi
> > je .L11
> > sub $16, %rsp ------> leaq -16(%rsp), %rsp
>
> Also here, is the placement of SUB correct?
>
Yes, it is before JLE.
> > movq %r13, 8(%rsp)
> > movl $1, %r13d
> > jle .L4
> >
> > Tested against SPEC CPU 2017, this change always has a net-positive
> > effect on the dynamic instruction count.  See the following table for
> > the breakdown of how it reduces the number of dynamic instructions
> > per workload in a like-for-like comparison (with/without this commit):
> >
> > benchmark         base (instructions)  with commit   (commit-base)/base
> > 502.gcc_r 98666845943 96891561634 -1.80%
> > 526.blender_r 6.21226E+11 6.12992E+11 -1.33%
> > 520.omnetpp_r 1.1241E+11 1.11093E+11 -1.17%
> > 500.perlbench_r 1271558717 1263268350 -0.65%
> > 523.xalancbmk_r 2.20103E+11 2.18836E+11 -0.58%
> > 531.deepsjeng_r 2.73591E+11 2.72114E+11 -0.54%
> > 500.perlbench_r 64195557393 63881512409 -0.49%
> > 541.leela_r 2.99097E+11 2.98245E+11 -0.29%
> > 548.exchange2_r 1.27976E+11 1.27784E+11 -0.15%
> > 527.cam4_r 88981458425 88887334679 -0.11%
> > 554.roms_r 2.60072E+11 2.59809E+11 -0.10%
> >
> > Collected SPEC CPU 2017 performance on ZNVER5, EMR and ICELAKE; no
> > performance regression was observed.
> >
> > For -O2 multi-copy:
> >   511.povray_r improved by 2.8% on ZNVER5.
> >   511.povray_r improved by 4% on EMR.
> >   511.povray_r improved by 3.3% to 4.6% on ICELAKE.
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386-protos.h (ix86_get_separate_components):
> > New function.
> > (ix86_components_for_bb): Likewise.
> > (ix86_disqualify_components): Likewise.
> > (ix86_emit_prologue_components): Likewise.
> > (ix86_emit_epilogue_components): Likewise.
> > (ix86_set_handled_components): Likewise.
> > * config/i386/i386.cc (save_regs_using_push_pop):
> > Encapsulate code.
>
> Better say: "Split from ix86_compute_frame_layout."
Done.
>
> > (ix86_compute_frame_layout):
> > Handle save_regs_using_push_pop.
>
> Use save_regs_using_push_pop.
Done.
>
> > (ix86_emit_save_regs_using_mov):
> > Skip registers that are wrapped separately.
> > (pro_epilogue_adjust_stack): Emit insn without clobber.
> > (ix86_expand_prologue): Likewise.
>
> Looks like "Likewise" applies to ix86_emit_save_regs_using_mov.  There are
> several other changes in ix86_expand_prologue.  Please document them in the
> ChangeLog entry.
>
> > (ix86_emit_restore_regs_using_mov): Likewise.
> > (ix86_expand_epilogue): Likewise.
>
> Also in the above two functions.
>
Done.
> > (ix86_get_separate_components): New function.
> > (ix86_components_for_bb): Likewise.
> > (ix86_disqualify_components): Likewise.
> > (ix86_emit_prologue_components): Likewise.
> > (ix86_emit_epilogue_components): Likewise.
> > (ix86_set_handled_components): Likewise.
> > (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
> > (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
> > (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
> > (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
> > (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
> > (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
> > * config/i386/i386.h (struct machine_function): Add
> > reg_is_wrapped_separately array for register wrapping
> > information.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
> > * gcc.target/i386/interrupt-16.c: Likewise.
> > * gfortran.dg/guality/arg1.f90: Likewise.
> > * g++.target/i386/shrink_wrap_separate.C: New test.
> > * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.
> >
> > Co-authored-by: Michael Matz <[email protected]>
> > ---
> > gcc/config/i386/i386-protos.h | 7 +
> > gcc/config/i386/i386.cc | 321 +++++++++++++++---
> > gcc/config/i386/i386.h | 1 +
> > .../g++.target/i386/shrink_wrap_separate.C | 24 ++
> > gcc/testsuite/gcc.target/i386/interrupt-16.c | 4 +-
> > .../i386/shrink_wrap_separate_check_lea.c | 29 ++
> > .../gcc.target/x86_64/abi/callabi/leaf-2.c | 2 +-
> > gcc/testsuite/gfortran.dg/guality/arg1.f90 | 2 +-
> >  8 files changed, 343 insertions(+), 47 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> >  create mode 100644 gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
> >
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 10863ab9e9d..86194b3d319 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -437,6 +437,13 @@ extern rtl_opt_pass *make_pass_align_tight_loops (gcc::context *);
> >  extern bool ix86_has_no_direct_extern_access;
> >  extern bool ix86_rpad_gate ();
> >
> > +extern sbitmap ix86_get_separate_components (void);
> > +extern sbitmap ix86_components_for_bb (basic_block);
> > +extern void ix86_disqualify_components (sbitmap, edge, sbitmap, bool);
> > +extern void ix86_emit_prologue_components (sbitmap);
> > +extern void ix86_emit_epilogue_components (sbitmap);
> > +extern void ix86_set_handled_components (sbitmap);
> > +
> >  /* In i386-expand.cc.  */
> >  bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*,
> >                                     HOST_WIDE_INT*);
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d48654a729a..8f2193efe29 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -6905,6 +6905,26 @@ ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
> > && (nregs + aligned) >= 3;
> > }
> >
> > +/* Check if push/pop should be used to save/restore registers.  */
> > +static bool
> > +save_regs_using_push_pop (HOST_WIDE_INT to_allocate)
> > +{
> > +  return ((!to_allocate && cfun->machine->frame.nregs <= 1)
> > +          || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> > +          /* If static stack checking is enabled and done with probes,
> > +             the registers need to be saved before allocating the frame.  */
> > +          || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> > +          /* If stack clash probing needs a loop, then it needs a
> > +             scratch register.  But the returned register is only guaranteed
> > +             to be safe to use after register saves are complete.  So if
> > +             stack clash protections are enabled and the allocated frame is
> > +             larger than the probe interval, then use pushes to save
> > +             callee saved registers.  */
> > +          || (flag_stack_clash_protection
> > +              && !ix86_target_stack_probe ()
> > +              && to_allocate > get_probe_interval ()));
> > +}
> > +
> > /* Fill structure ix86_frame about frame of currently computed
> > function. */
> >
> > static void
> > @@ -7189,20 +7209,7 @@ ix86_compute_frame_layout (void)
> > /* Size prologue needs to allocate. */
> > to_allocate = offset - frame->sse_reg_save_offset;
> >
> > - if ((!to_allocate && frame->nregs <= 1)
> > - || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> > - /* If static stack checking is enabled and done with probes,
> > - the registers need to be saved before allocating the frame. */
> > - || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> > - /* If stack clash probing needs a loop, then it needs a
> > - scratch register. But the returned register is only guaranteed
> > - to be safe to use after register saves are complete. So if
> > - stack clash protections are enabled and the allocated frame is
> > - larger than the probe interval, then use pushes to save
> > - callee saved registers. */
> > - || (flag_stack_clash_protection
> > - && !ix86_target_stack_probe ()
> > - && to_allocate > get_probe_interval ()))
> > + if (save_regs_using_push_pop (to_allocate))
> > frame->save_regs_using_mov = false;
> >
> > if (ix86_using_red_zone ()
> > @@ -7660,7 +7667,9 @@ ix86_emit_save_regs_using_mov (HOST_WIDE_INT cfa_offset)
> > for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> > if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> > {
> > - ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> > + /* Skip registers already processed by shrink wrap separate.  */
> > + if (!cfun->machine->reg_is_wrapped_separately[regno])
> > + ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> > cfa_offset -= UNITS_PER_WORD;
> > }
> > }
> > @@ -7753,8 +7762,12 @@ pro_epilogue_adjust_stack (rtx dest, rtx src, rtx offset,
> > add_frame_related_expr = true;
> > }
> >
> > +  if (crtl->shrink_wrapped_separate)
> > +    insn = emit_insn (gen_rtx_SET (dest,
> > +                                   gen_rtx_PLUS (Pmode, src, addend)));
>
> Please use ix86_expand_binary_operator here; it will correctly generate LEA
> during prologue/epilogue generation.  Also, please check the indentation.
>
Ok.
> > + else
> > insn = emit_insn (gen_pro_epilogue_adjust_stack_add
> > (Pmode, dest, src, addend));
> > +
> > if (style >= 0)
> > ix86_add_queued_cfa_restore_notes (insn);
> >
> > @@ -9218,13 +9231,30 @@ ix86_expand_prologue (void)
> > the stack frame saving one cycle of the prologue. However, avoid
> > doing this if we have to probe the stack; at least on x86_64 the
> > stack probe can turn into a call that clobbers a red zone
> > location. */
> > - else if (ix86_using_red_zone ()
> > - && (! TARGET_STACK_PROBE
> > - || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > + else if ((ix86_using_red_zone ()
> > + && (! TARGET_STACK_PROBE
> > + || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > + || crtl->shrink_wrapped_separate)
> > {
> > +      HOST_WIDE_INT allocate_offset;
> > +      if (crtl->shrink_wrapped_separate)
> > +        {
> > +          allocate_offset = m->fs.sp_offset - frame.stack_pointer_offset;
> > +
> > +          /* Adjust the total offset at the beginning of the function.  */
> > +          pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
> > +                                     GEN_INT (allocate_offset), -1,
> > +                                     m->fs.cfa_reg == stack_pointer_rtx);
> > +          m->fs.sp_offset = cfun->machine->frame.stack_pointer_offset;
> > +        }
> > +
> > ix86_emit_save_regs_using_mov (frame.reg_save_offset);
> > - cfun->machine->red_zone_used = true;
> > int_registers_saved = true;
> > +
> > + if (ix86_using_red_zone ()
> > + && (! TARGET_STACK_PROBE
> > + || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > + cfun->machine->red_zone_used = true;
> > }
> > }
> >
> > @@ -9344,6 +9374,7 @@ ix86_expand_prologue (void)
> > && flag_stack_clash_protection
> > && !ix86_target_stack_probe ())
> > {
> > + gcc_assert (!crtl->shrink_wrapped_separate);
>
> Please put some vertical space after gcc_assert, so it will stand out more...
>
Done.
> > ix86_adjust_stack_and_probe (allocate, int_registers_saved, false);
> > allocate = 0;
> > }
> > @@ -9354,6 +9385,8 @@ ix86_expand_prologue (void)
> > {
> > const HOST_WIDE_INT probe_interval = get_probe_interval ();
> >
> > + gcc_assert (!crtl->shrink_wrapped_separate);
> > +
> > if (STACK_CHECK_MOVING_SP)
> > {
> > if (crtl->is_leaf
> > @@ -9410,12 +9443,14 @@ ix86_expand_prologue (void)
> > else if (!ix86_target_stack_probe ()
> > || frame.stack_pointer_offset < CHECK_STACK_LIMIT)
> > {
> > + gcc_assert (!crtl->shrink_wrapped_separate);
>
> ... also here and in other places.
Done.
> > +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> > +sbitmap
> > +ix86_get_separate_components (void)
> > +{
> > + HOST_WIDE_INT offset, to_allocate;
> > + sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
> > + bitmap_clear (components);
> > + struct machine_function *m = cfun->machine;
> > +
> > +  offset = m->frame.stack_pointer_offset;
> > +  to_allocate = offset - m->frame.sse_reg_save_offset;
> > +
> > +  /* Shrink wrap separate uses MOV, which means APX PPX cannot be used.
> > +     Experiments show that APX PPX can speed up the prologue.  If the
> > +     function does not exit early during actual execution, then using
> > +     APX PPX is faster.  If the function always exits early during actual
> > +     execution, then shrink wrap separate reduces the number of MOV
> > +     (PUSH/POP) instructions actually executed, thus speeding up execution.
> > + foo:
> > + movl $1, %eax
> > + testq %rdi, %rdi
> > + jne .L60
> > + ret ---> early return.
> > + .L60:
> > + subq $88, %rsp ---> belong to prologue.
> > + xorl %eax, %eax
> > + movq %rbx, 40 (%rsp) ---> belong to prologue.
> > + movq 8 (%rdi), %rbx
> > + movq %rbp, 48 (%rsp) ---> belong to prologue.
> > + movq %rdi, %rbp
> > + testq %rbx, %rbx
> > + jne .L61
> > + movq 40 (%rsp), %rbx
> > + movq 48 (%rsp), %rbp
> > + addq $88, %rsp
> > + ret
> > + .L61:
> > + movq %r12, 56 (%rsp) ---> belong to prologue.
> > + movq %r13, 64 (%rsp) ---> belong to prologue.
> > + movq %r14, 72 (%rsp) ---> belong to prologue.
>
> Please remove leading spaces before tab in the above assembly. "git diff"
> will mark
> this part in red.
>
Done.
> > + ... ...
> > +
> > + It is a trade-off. Disable shrink wrap separate when PPX is
> > + enabled. */
>
> No need to reiterate that this is a trade-off.  You already explained it
> above in great detail.
>
Done.
> Please put a one-line comment, as is the case with the above functions.
>
Done.
> > +void
> > +ix86_set_handled_components (sbitmap components)
> > +{
> > + for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> > + if (bitmap_bit_p (components, regno))
> > + {
> > + cfun->machine->reg_is_wrapped_separately[regno] = true;
> > + cfun->machine->use_fast_prologue_epilogue = true;
> > + cfun->machine->frame.save_regs_using_mov = true;
> > + }
> > +}
> > +
> > +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
> > +  ix86_get_separate_components
> > +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
> > +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB ix86_components_for_bb
> > +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
> > +  ix86_disqualify_components
> > +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
> > +  ix86_emit_prologue_components
> > +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
> > +  ix86_emit_epilogue_components
> > +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
> > +  ix86_set_handled_components
> > +
> > struct gcc_target targetm = TARGET_INITIALIZER;
> >
> > #include "gt-i386.h"
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index d32d9ad997e..5d4fd333169 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -2821,6 +2821,7 @@ struct GTY(()) machine_function {
> > /* Cached initial frame layout for the current function. */
> > struct ix86_frame frame;
>
> A short comment would be nice here...
Done.
> > + bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER];
>
> ... and some vertical space here.
>
Done.
> > /* For -fsplit-stack support: A stack local which holds a pointer to
> > the stack arguments for a function with a variable number of
> > arguments. This is set at the start of the function and is used
> > diff --git a/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > new file mode 100644
> > index 00000000000..31bd984df02
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > @@ -0,0 +1,24 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
>
> Does the new functionality also handle 32-bit targets?  If so, please use
> two scan-rtl-dumps, one for ia32 and the other for ! ia32.  Don't disable
> the test just because of a different string in the dump.
>
You are right, I will fix this.
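For reference, one possible shape of the split check (the option and the
scan-assembler directives below are standard GCC testsuite syntax, but the
exact regexes are assumptions and would need to match whatever the final
patch actually emits; scan-rtl-dump on the pro_and_epilogue dump with the
same target selectors would work the same way):

    /* { dg-do compile } */
    /* { dg-options "-O2 -fshrink-wrap-separate" } */

    /* ... body of shrink_wrap_separate_check_lea.c ... */

    /* 64-bit: the stack adjustment must be an LEA on %rsp, not a SUB.  */
    /* { dg-final { scan-assembler "leaq\[ \\t\]+-\[0-9\]+\\(%rsp\\), %rsp" { target { ! ia32 } } } } */
    /* 32-bit: same idea on %esp (assumed pattern).  */
    /* { dg-final { scan-assembler "leal\[ \\t\]+-\[0-9\]+\\(%esp\\), %esp" { target ia32 } } } */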
Thanks,
Lili.