Hi Richard, Ping. I have provided root cause analysis of two test failures on arm with my patch for PR111673. I have also provided a solution (a fix in LRA) to fix these failures. Please let me know if the LRA fix is fine. If so, I can ignore these two arm test failures for now, and checkin the LRA patch after checking in the patch for PR111673.
Regards, Surya On 15/12/23 10:34 pm, Surya Kumari Jangala wrote: > Hi Richard, > Here are more details about the testcase failure and my analysis/fix: > > Testcase: > > void f(int *i) > { > if (!i) > return; > else > { > __builtin_printf("Hi"); > *i=0; > } > } > > ---------- > > Assembly w/o patch: > cbz x0, .L7 > stp x29, x30, [sp, -32]! > mov x29, sp > str x19, [sp, 16] > mov x19, x0 > adrp x0, .LC0 > add x0, x0, :lo12:.LC0 > bl printf > str wzr, [x19] > ldr x19, [sp, 16] > ldp x29, x30, [sp], 32 > ret > .p2align 2,,3 > .L7: > ret > > ----------- > > Assembly w/ patch: > stp x29, x30, [sp, -32]! > mov x29, sp > str x0, [sp, 24] > cbz x0, .L1 > adrp x0, .LC0 > add x0, x0, :lo12:.LC0 > bl printf > ldr x1, [sp, 24] > str wzr, [x1] > .L1: > ldp x29, x30, [sp], 32 > ret > > > As we can see above, w/o patch the test case gets shrink wrapped. > > Input RTL to the LRA pass (the RTL is same both w/ and w/o patch): > > BB2: > set r95, x0 > set r92, r95 > if (r92 eq 0) jump BB4 > BB3: > set x0, symbol-ref("Hi") > x0 = call printf > set mem(r92), 0 > BB4: > ret > > > Register assignment by IRA: > w/o patch: > r92-->x19 > r95-->x0 > r94-->x0 > > w/ patch: > r92-->x1 > r95-->x0 > r94-->x0 > > > RTL after LRA: > > w/o patch: > BB2: > set x19, x0 > if (x19 eq 0) jump BB4 > BB3: > set x0, symbol-ref("Hi") > x0 = call printf > set mem(x19), 0 > BB4: > ret > > > w/ patch: > BB2: > set x1, x0 > set mem(sp+24), x1 > if (x1 eq 0) jump BB4 > BB3: > set x0, symbol-ref("Hi") > x0 = call printf > set x1, mem(sp+24) > set mem(x1), 0 > BB4: > ret > > > The difference between w/o patch and w/ patch is that w/o patch, a callee-save > register (x19) is chosen to hold the value of x0 (input parameter register). > While > w/ patch, a caller-save register (x1) is chosen. > > W/o patch, during the shrink wrap pass, first copy propagation is done and > the 'if' insn in BB2 is changed as follows: > set x19, x0 > if (x19 eq 0) jump BB4 > > changed to: > set x19, x0 > if (x0 eq 0) jump BB4 > > Next, the insn "set x19, x0" is moved down the cfg to BB3. Since x19 is a > callee-save register, prolog gets generated in BB3 thereby resulting in > successful shrink wrapping. > > W/ patch, during the shrink wrap pass, copy propagation changes BB2 as > follows: > set x1, x0 > set mem(sp+24), x1 > if (x1 eq 0) jump BB4 > > changed to: > set x1, x0 > set mem(sp+24), x0 > if (x0 eq 0) jump BB4 > > However the store insn (set mem[sp+24], x0) cannot be moved down to BB3. > hence prolog gets generated in BB2 itself due to the use of 'sp'. Thereby > shrink wrap fails. > > The store insn (which basically saves x1 to stack) is generated by the > LRA pass. This insn is needed because x1 is a caller-save register and we > have a call insn that will clobber this register. However, the store insn is > generated > in the entry BB (BB2) instead of in BB3 which has the call insn. If the store > is generated in BB3, then the testcase will be shrink wrapped successfully. > In fact, it is more efficient if the store occurs only in the path containing > the printf call instead of occurring in the entry bb. > > The reason why LRA generates the store insn in the entry bb is as follows: > LRA emits insns to save caller-save registers in the inheritance/splitting > pass. > In this pass, LRA builds EBBs (Extended Basic Block) and traverses the insns > in > the EBBs in reverse order from the last insn to the first insn. When LRA sees > a > write to a pseudo (that has been assigned a caller-save register), and there > is a > read following the write, with an intervening call insn between the write and > read, > then LRA generates a spill immediately after the write and a restore > immediately > before the read. The spill is needed because the call insn will clobber the > caller-save register. > > In the above testcase, LRA forms two EBBs: the first EBB contains BB2 & BB3 > while > the second EBB contains BB4. > > In BB2, there is a write to x1 in the insn : > set r92, r95 //r92 is assigned x1 and r95 is assigned x0 > > In BB3, there is a read of x1 after the call > insn. > set mem(r92), 0 // r92 is assigned x1 > > So LRA generates a spill in BB2 after the write to x1. > > I have a patch (bootstrapped and regtested on powerpc) that makes changes in > LRA to save caller-save registers before a call instead of after the write to > the > caller-save register. With this patch, both the above test gets successfully > shrink wrapped. After committing the patch for PR111673, I plan to get the > LRA fix reviewed. > > Please let me know if you need more information. > > Regards, > Surya > > > On 14/12/23 9:41 pm, Richard Earnshaw (lists) wrote: >> On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote: >>> Hi Richard, >>> Thanks a lot for your response! >>> >>> Another failure reported by the Linaro CI is as follows: >>> >>> Running gcc:gcc.dg/dg.exp ... >>> FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue >>> "Performing shrink-wrapping" >>> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing >>> shrink-wrapping" >>> >>> I analyzed the failures and the root cause is the same for both the >>> failures. >>> >>> The test pr10474.c is as follows: >>> >>> void f(int *i) >>> { >>> if (!i) >>> return; >>> else >>> { >>> __builtin_printf("Hi"); >>> *i=0; >>> } >>> } >>> >>> >>> With the patch (for PR111673), x1 (volatile) is being assigned to hold >>> value of >>> x0 (first parameter). Since it is a volatile, x1 is saved to the stack as >>> there >>> is a call later on. The save to the stack is generated in the LRA pass. The >>> save >>> is generated in the entry basic block. Due to the usage of the stack >>> pointer in >>> the entry bb, the testcase fails to be shrink wrapped. >> >> I'm not entirely sure I understand what you mean from a quick glance. Do >> you mean that X1 has the /v flag marked on it (ie it's printed in dumps as >> "reg/v")? If so, that's not volatile, it just means that the register is >> associated with a user variable (as opposed to a compiler-generated >> temporary variable): >> >> From the manual: >> >> @item REG_USERVAR_P (@var{x}) >> In a @code{reg}, nonzero if it corresponds to a variable present in >> the user's source code. Zero for temporaries generated internally by >> the compiler. Stored in the @code{volatil} field and printed as >> @samp{/v}. >> >> There are several other cases where we re-use this bit on different RTL >> constructs to mean things other than 'volatile': it pretty much only has the >> conventional meaning on MEM objects. >> >>> >>> The reason why LRA generates the store insn in the entry bb is as follows: >>> LRA emits insns to save volatile registers in the inheritance/splitting >>> pass. >>> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the >>> insns in >>> the EBBs in reverse order from the last insn to the first insn. When LRA >>> sees a >>> write to a pseudo (that has been assigned a volatile register), and there >>> is a >>> read following the write, with an intervening call insn between the write >>> and read, >>> then LRA generates a spill immediately after the write and a restore >>> immediately >>> before the read. In the above test, there is an EBB containing the entry bb >>> and >>> the bb with the printf call. In the entry bb, there is a write to x1 >>> (basically >>> a copy from x0 to x1) and in the printf bb, there is a read of x1 after the >>> call >>> insn. So LRA generates a spill in the entry bb. >>> >>> Without patch, x19 is chosen to hold the value of x0. Since x19 is a >>> non-volatile, >>> the input RTL to the shrink wrap pass does not have any code to save x19 to >>> the >>> stack. Only the insn that copies x0 to x19 is present in the entry bb. In >>> the >>> shrink wrap pass, this insn is moved down the cfg to the bb containing the >>> call >>> to printf, thereby allowing prolog to be allocated only where needed. Thus >>> shrink >>> wrap succeeds. >>> >>> >>> Shrink wrap can be made to succeed if the save of x1 occurs just before the >>> call >>> insn, instead of generating it after the write to x1. This will ensure that >>> the >>> spill does not occur in the entry bb. In fact, it is more efficient if the >>> save >>> occurs only in the path containing the printf call instead of occurring in >>> the >>> entry bb. >>> >>> I have a patch (bootstrapped and regtested on powerpc) that makes changes in >>> LRA to save volatile registers before a call instead of after the write to >>> the >>> volatile. With this patch, both the above tests pass. >>> >>> Since the patch for PR111673 has been approved by Vladimir, I plan to >>> commit the patch to trunk. And I will fix the test failures after doing the >>> commit. >>> >> >> I think I'd probably understand this better if you could give some example >> RTL (before and after). Do you have that? >> >> R. >> >>> Regards, >>> Surya >>> >>> >>> >>> On 28/11/23 7:18 pm, Richard Earnshaw wrote: >>>> >>>> >>>> On 28/11/2023 12:52, Surya Kumari Jangala wrote: >>>>> Hi Richard, >>>>> Thanks a lot for your response! >>>>> >>>>> Another failure reported by the Linaro CI is as follows : >>>>> (Note: I am planning to send a separate mail for each failure, as this >>>>> will make >>>>> the discussion easy to track) >>>>> >>>>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve >>>>> -moverride=tune=none check-function-bodies dup_x0_m >>>>> >>>>> Expected code: >>>>> >>>>> ... >>>>> add (x[0-9]+), x0, #?1 >>>>> mov (p[0-7])\.b, p15\.b >>>>> mov z0\.d, \2/m, \1 >>>>> ... >>>>> ret >>>>> >>>>> >>>>> Code obtained w/o patch: >>>>> addvl sp, sp, #-1 >>>>> str p15, [sp] >>>>> add x0, x0, 1 >>>>> mov p3.b, p15.b >>>>> mov z0.d, p3/m, x0 >>>>> ldr p15, [sp] >>>>> addvl sp, sp, #1 >>>>> ret >>>>> >>>>> Code obtained w/ patch: >>>>> addvl sp, sp, #-1 >>>>> str p15, [sp] >>>>> mov p3.b, p15.b >>>>> add x0, x0, 1 >>>>> mov z0.d, p3/m, x0 >>>>> ldr p15, [sp] >>>>> addvl sp, sp, #1 >>>>> ret >>>>> >>>>> As we can see, with the patch, the following two instructions are >>>>> interchanged: >>>>> add x0, x0, 1 >>>>> mov p3.b, p15.b >>>> >>>> Indeed, both look acceptable results to me, especially given that we don't >>>> schedule results at -O1. >>>> >>>> There's two ways of fixing this: >>>> 1) Simply swap the order to what the compiler currently generates (which >>>> is a little fragile, since it might flip back someday). >>>> 2) Write the test as >>>> >>>> >>>> ** ( >>>> ** add (x[0-9]+), x0, #?1 >>>> ** mov (p[0-7])\.b, p15\.b >>>> ** mov z0\.d, \2/m, \1 >>>> ** | >>>> ** mov (p[0-7])\.b, p15\.b >>>> ** add (x[0-9]+), x0, #?1 >>>> ** mov z0\.d, \1/m, \2 >>>> ** ) >>>> >>>> Note, we need to swap the match names in the third insn to account for the >>>> different order of the earlier instructions. >>>> >>>> Neither is ideal, but the second is perhaps a little more bomb proof. >>>> >>>> I don't really have a strong feeling either way, but perhaps the second is >>>> slightly preferable. >>>> >>>> Richard S: thoughts? >>>> >>>> R. >>>> >>>>> I believe that this is fine and the test can be modified to allow it to >>>>> pass on >>>>> aarch64. Please let me know what you think. >>>>> >>>>> Regards, >>>>> Surya >>>>> >>>>> >>>>> On 24/11/23 4:18 pm, Richard Earnshaw wrote: >>>>>> >>>>>> >>>>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote: >>>>>>> Hi Richard, >>>>>>> Ping. Please let me know if the test failure that I mentioned in the >>>>>>> mail below can be handled by changing the expected generated code. I am >>>>>>> not conversant with arm, and hence would appreciate your help. >>>>>>> >>>>>>> Regards, >>>>>>> Surya >>>>>>> >>>>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote: >>>>>>>> Hi Richard, >>>>>>>> I had submitted a patch for review >>>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html) >>>>>>>> regarding scaling save/restore costs of callee save registers with >>>>>>>> block >>>>>>>> frequency in the IRA pass (PR111673). >>>>>>>> >>>>>>>> This patch has been approved by VMakarov >>>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html). >>>>>>>> >>>>>>>> With this patch, we are seeing performance improvements with spec on >>>>>>>> x86 >>>>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%). >>>>>>>> >>>>>>>> I received a mail from Linaro about some failures seen in the CI >>>>>>>> pipeline with >>>>>>>> this patch. I have analyzed the failures and I wish to discuss the >>>>>>>> analysis with you. >>>>>>>> >>>>>>>> One failure reported by the Linaro CI is: >>>>>>>> >>>>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, >>>>>>>> r[0-9]+, \\[r[0-9]+\\] 2 >>>>>>>> >>>>>>>> The diff in the assembly between trunk and patch is: >>>>>>>> >>>>>>>> 93c93 >>>>>>>> < push {r4, r5} >>>>>>>> --- >>>>>>>>> push {fp} >>>>>>>> 95c95 >>>>>>>> < ldrexd r4, r5, [r0] >>>>>>>> --- >>>>>>>>> ldrexd fp, ip, [r0] >>>>>>>> 99c99 >>>>>>>> < pop {r4, r5} >>>>>>>> --- >>>>>>>>> ldr fp, [sp], #4 >>>>>>>> >>>>>>>> >>>>>>>> The test fails with patch because the ldrexd insn uses fp & ip >>>>>>>> registers instead >>>>>>>> of r[0-9]+ >>>>>>>> >>>>>>>> But the code produced by patch is better because it is pushing and >>>>>>>> restoring only >>>>>>>> one register (fp) instead of two registers (r4, r5). Hence, this test >>>>>>>> can be >>>>>>>> modified to allow it to pass on arm. Please let me know what you think. >>>>>>>> >>>>>>>> If you need more information, please let me know. I will be sending >>>>>>>> separate mails >>>>>>>> for the other test failures. >>>>>>>> >>>>>> >>>>>> Thanks for looking at this. >>>>>> >>>>>> >>>>>> The key part of this test is that the compiler generates LDREXD. The >>>>>> registers used for that are pretty much irrelevant as we don't match >>>>>> them to any other operations within the test. So I'd recommend just >>>>>> testing for the mnemonic and not for any of the operands (ie just match >>>>>> "ldrexd\t"). >>>>>> >>>>>> R. >>>>>> >>>>>>>> Regards, >>>>>>>> Surya >>>>>>>> >>>>>>>> >>>>>>>> >>