Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

Surya Kumari Jangala via Gcc Sun, 28 Jan 2024 22:15:38 -0800

Hi Richard,
Ping. 
I have provided root cause analysis of two test failures on arm with my patch 
for PR111673. I have also provided a solution (a fix in LRA) to fix these 
failures. 
Please let me know if the LRA fix is fine. If so, I can ignore these two arm 
test failures for now, and checkin the LRA patch after checking in the patch 
for PR111673.


Regards,
Surya

On 15/12/23 10:34 pm, Surya Kumari Jangala wrote:
> Hi Richard,
> Here are more details about the testcase failure and my analysis/fix:
> 
> Testcase:
> 
> void f(int *i)
> {
>         if (!i)
>                 return;
>         else
>         {
>                 __builtin_printf("Hi");
>                 *i=0;
>         }
> }
> 
> ----------
> 
> Assembly w/o patch:
>         cbz     x0, .L7
>         stp     x29, x30, [sp, -32]!
>         mov     x29, sp
>         str     x19, [sp, 16]
>         mov     x19, x0
>         adrp    x0, .LC0
>         add     x0, x0, :lo12:.LC0
>         bl      printf
>         str     wzr, [x19]
>         ldr     x19, [sp, 16]
>         ldp     x29, x30, [sp], 32
>         ret
>         .p2align 2,,3
> .L7:
>         ret
> 
> -----------
> 
> Assembly w/ patch:
>         stp     x29, x30, [sp, -32]!
>         mov     x29, sp
>         str     x0, [sp, 24]
>         cbz     x0, .L1
>         adrp    x0, .LC0
>         add     x0, x0, :lo12:.LC0
>         bl      printf
>         ldr     x1, [sp, 24]
>         str     wzr, [x1]
> .L1:
>         ldp     x29, x30, [sp], 32
>         ret
> 
> 
> As we can see above, w/o patch the test case gets shrink wrapped.
> 
> Input RTL to the LRA pass (the RTL is same both w/ and w/o patch):
> 
> BB2:
>   set r95, x0
>   set r92, r95
>   if (r92 eq 0) jump BB4
> BB3:
>   set x0, symbol-ref("Hi")
>   x0 = call printf
>   set mem(r92), 0
> BB4:
>   ret
> 
> 
> Register assignment by IRA:
> w/o patch:
>   r92-->x19
>   r95-->x0
>   r94-->x0
> 
> w/ patch:
>   r92-->x1
>   r95-->x0
>   r94-->x0
> 
> 
> RTL after LRA:
> 
> w/o patch:
> BB2:
>   set x19, x0
>   if (x19 eq 0) jump BB4
> BB3:
>   set x0, symbol-ref("Hi")
>   x0 = call printf
>   set mem(x19), 0
> BB4:
>   ret
> 
> 
> w/ patch:
> BB2:
>   set x1, x0
>   set mem(sp+24), x1
>   if (x1 eq 0) jump BB4
> BB3:
>   set x0, symbol-ref("Hi")
>   x0 = call printf
>   set x1, mem(sp+24)
>   set mem(x1), 0
> BB4:
>   ret
> 
> 
> The difference between w/o patch and w/ patch is that w/o patch, a callee-save
> register (x19) is chosen to hold the value of x0 (input parameter register). 
> While
> w/ patch, a caller-save register (x1) is chosen.
> 
> W/o patch, during the shrink wrap pass, first copy propagation is done and
> the 'if' insn in BB2 is changed as follows:
>   set x19, x0
>   if (x19 eq 0) jump BB4
> 
> changed to:
>   set x19, x0
>   if (x0 eq 0) jump BB4   
> 
> Next, the insn "set x19, x0" is moved down the cfg to BB3. Since x19 is a
> callee-save register, prolog gets generated in BB3 thereby resulting in
> successful shrink wrapping.
> 
> W/ patch, during the shrink wrap pass, copy propagation changes BB2 as 
> follows:
>   set x1, x0
>   set mem(sp+24), x1
>   if (x1 eq 0) jump BB4
> 
> changed to:
>   set x1, x0
>   set mem(sp+24), x0
>   if (x0 eq 0) jump BB4
> 
> However the store insn (set mem[sp+24], x0) cannot be moved down to BB3.
> hence prolog gets generated in BB2 itself due to the use of 'sp'. Thereby
> shrink wrap fails.
> 
> The store insn (which basically saves x1 to stack) is generated by the
> LRA pass. This insn is needed because x1 is a caller-save register and we
> have a call insn that will clobber this register. However, the store insn is 
> generated
> in the entry BB (BB2) instead of in BB3 which has the call insn. If the store
> is generated in BB3, then the testcase will be shrink wrapped successfully.
> In fact, it is more efficient if the store occurs only in the path containing
> the printf call instead of occurring in the entry bb.
> 
> The reason why LRA generates the store insn in the entry bb is as follows:
> LRA emits insns to save caller-save registers in the inheritance/splitting 
> pass.
> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the insns 
> in
> the EBBs in reverse order from the last insn to the first insn. When LRA sees 
> a
> write to a pseudo (that has been assigned a caller-save register), and there 
> is a
> read following the write, with an intervening call insn between the write and 
> read,
> then LRA generates a spill immediately after the write and a restore 
> immediately
> before the read. The spill is needed because the call insn will clobber the
> caller-save register.
> 
> In the above testcase, LRA forms two EBBs: the first EBB contains BB2 & BB3 
> while
> the second EBB contains BB4. 
> 
> In BB2, there is a write to x1 in the insn : 
> set r92, r95 //r92 is assigned x1 and r95 is assigned x0
> 
> In BB3, there is a read of x1 after the call
> insn.
> set mem(r92), 0   // r92 is assigned x1
> 
> So LRA generates a spill in BB2 after the write to x1.
> 
> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
> LRA to save caller-save registers before a call instead of after the write to 
> the
> caller-save register. With this patch, both the above test gets successfully
> shrink wrapped. After committing the patch for PR111673, I plan to get the 
> LRA fix reviewed.
> 
> Please let me know if you need more information.
> 
> Regards,
> Surya
> 
> 
> On 14/12/23 9:41 pm, Richard Earnshaw (lists) wrote:
>> On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote:
>>> Hi Richard,
>>> Thanks a lot for your response!
>>>
>>> Another failure reported by the Linaro CI is as follows:
>>>
>>> Running gcc:gcc.dg/dg.exp ...
>>> FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue 
>>> "Performing shrink-wrapping"
>>> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing 
>>> shrink-wrapping"
>>>
>>> I analyzed the failures and the root cause is the same for both the 
>>> failures.
>>>
>>> The test pr10474.c is as follows:
>>>
>>> void f(int *i)
>>> {
>>>         if (!i)
>>>                 return;
>>>         else
>>>         {
>>>                 __builtin_printf("Hi");
>>>                 *i=0;
>>>         }
>>> }
>>>
>>>
>>> With the patch (for PR111673), x1 (volatile) is being assigned to hold 
>>> value of
>>> x0 (first parameter). Since it is a volatile, x1 is saved to the stack as 
>>> there
>>> is a call later on. The save to the stack is generated in the LRA pass. The 
>>> save
>>> is generated in the entry basic block. Due to the usage of the stack 
>>> pointer in
>>> the entry bb, the testcase fails to be shrink wrapped.
>>
>> I'm not entirely sure I understand what you mean from a quick glance.  Do 
>> you mean that X1 has the /v flag marked on it (ie it's printed in dumps as 
>> "reg/v")?  If so, that's not volatile, it just means that the register is 
>> associated with a user variable (as opposed to a compiler-generated 
>> temporary variable):
>>
>> From the manual:
>>
>> @item REG_USERVAR_P (@var{x})
>> In a @code{reg}, nonzero if it corresponds to a variable present in
>> the user's source code.  Zero for temporaries generated internally by
>> the compiler.  Stored in the @code{volatil} field and printed as
>> @samp{/v}.
>>
>> There are several other cases where we re-use this bit on different RTL 
>> constructs to mean things other than 'volatile': it pretty much only has the 
>> conventional meaning on MEM objects.
>>
>>>
>>> The reason why LRA generates the store insn in the entry bb is as follows:
>>> LRA emits insns to save volatile registers in the inheritance/splitting 
>>> pass.
>>> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the 
>>> insns in
>>> the EBBs in reverse order from the last insn to the first insn. When LRA 
>>> sees a
>>> write to a pseudo (that has been assigned a volatile register), and there 
>>> is a
>>> read following the write, with an intervening call insn between the write 
>>> and read,
>>> then LRA generates a spill immediately after the write and a restore 
>>> immediately
>>> before the read. In the above test, there is an EBB containing the entry bb 
>>> and
>>> the bb with the printf call. In the entry bb, there is a write to x1 
>>> (basically
>>> a copy from x0 to x1) and in the printf bb, there is a read of x1 after the 
>>> call
>>> insn. So LRA generates a spill in the entry bb.
>>>
>>> Without patch, x19 is chosen to hold the value of x0. Since x19 is a 
>>> non-volatile,
>>> the input RTL to the shrink wrap pass does not have any code to save x19 to 
>>> the
>>> stack. Only the insn that copies x0 to x19 is present in the entry bb. In 
>>> the
>>> shrink wrap pass, this insn is moved down the cfg to the bb containing the 
>>> call
>>> to printf, thereby allowing prolog to be allocated only where needed. Thus 
>>> shrink
>>> wrap succeeds.
>>>
>>>
>>> Shrink wrap can be made to succeed if the save of x1 occurs just before the 
>>> call
>>> insn, instead of generating it after the write to x1. This will ensure that 
>>> the
>>> spill does not occur in the entry bb. In fact, it is more efficient if the 
>>> save
>>> occurs only in the path containing the printf call instead of occurring in 
>>> the
>>> entry bb.
>>>
>>> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
>>> LRA to save volatile registers before a call instead of after the write to 
>>> the
>>> volatile. With this patch, both the above tests pass.
>>>
>>> Since the patch for PR111673 has been approved by Vladimir, I plan to
>>> commit the patch to trunk. And I will fix the test failures after doing the
>>> commit.
>>>
>>
>> I think I'd probably understand this better if you could give some example 
>> RTL (before and after).  Do you have that?
>>
>> R.
>>
>>> Regards,
>>> Surya
>>>
>>>
>>>
>>> On 28/11/23 7:18 pm, Richard Earnshaw wrote:
>>>>
>>>>
>>>> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>>>>> Hi Richard,
>>>>> Thanks a lot for your response!
>>>>>
>>>>> Another failure reported by the Linaro CI is as follows :
>>>>> (Note: I am planning to send a separate mail for each failure, as this 
>>>>> will make
>>>>> the discussion easy to track)
>>>>>
>>>>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve 
>>>>> -moverride=tune=none  check-function-bodies dup_x0_m
>>>>>
>>>>> Expected code:
>>>>>
>>>>>        ...
>>>>>        add     (x[0-9]+), x0, #?1
>>>>>        mov     (p[0-7])\.b, p15\.b
>>>>>        mov     z0\.d, \2/m, \1
>>>>>        ...
>>>>>        ret
>>>>>
>>>>>
>>>>> Code obtained w/o patch:
>>>>>          addvl   sp, sp, #-1
>>>>>          str     p15, [sp]
>>>>>          add     x0, x0, 1
>>>>>          mov     p3.b, p15.b
>>>>>          mov     z0.d, p3/m, x0
>>>>>          ldr     p15, [sp]
>>>>>          addvl   sp, sp, #1
>>>>>          ret
>>>>>
>>>>> Code obtained w/ patch:
>>>>>     addvl   sp, sp, #-1
>>>>>          str     p15, [sp]
>>>>>          mov     p3.b, p15.b
>>>>>          add     x0, x0, 1
>>>>>          mov     z0.d, p3/m, x0
>>>>>          ldr     p15, [sp]
>>>>>          addvl   sp, sp, #1
>>>>>          ret
>>>>>
>>>>> As we can see, with the patch, the following two instructions are 
>>>>> interchanged:
>>>>>          add     x0, x0, 1
>>>>>          mov     p3.b, p15.b
>>>>
>>>> Indeed, both look acceptable results to me, especially given that we don't 
>>>> schedule results at -O1.
>>>>
>>>> There's two ways of fixing this:
>>>> 1) Simply swap the order to what the compiler currently generates (which 
>>>> is a little fragile, since it might flip back someday).
>>>> 2) Write the test as
>>>>
>>>>
>>>> ** (
>>>> **       add     (x[0-9]+), x0, #?1
>>>> **       mov     (p[0-7])\.b, p15\.b
>>>> **       mov     z0\.d, \2/m, \1
>>>> ** |
>>>> **       mov     (p[0-7])\.b, p15\.b
>>>> **       add     (x[0-9]+), x0, #?1
>>>> **       mov     z0\.d, \1/m, \2
>>>> ** )
>>>>
>>>> Note, we need to swap the match names in the third insn to account for the 
>>>> different order of the earlier instructions.
>>>>
>>>> Neither is ideal, but the second is perhaps a little more bomb proof.
>>>>
>>>> I don't really have a strong feeling either way, but perhaps the second is 
>>>> slightly preferable.
>>>>
>>>> Richard S: thoughts?
>>>>
>>>> R.
>>>>
>>>>> I believe that this is fine and the test can be modified to allow it to 
>>>>> pass on
>>>>> aarch64. Please let me know what you think.
>>>>>
>>>>> Regards,
>>>>> Surya
>>>>>
>>>>>
>>>>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>>>>
>>>>>>
>>>>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>>>>>>> Hi Richard,
>>>>>>> Ping. Please let me know if the test failure that I mentioned in the 
>>>>>>> mail below can be handled by changing the expected generated code. I am 
>>>>>>> not conversant with arm, and hence would appreciate your help.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Surya
>>>>>>>
>>>>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>>>>>>> Hi Richard,
>>>>>>>> I had submitted a patch for review 
>>>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>>>>>>> regarding scaling save/restore costs of callee save registers with 
>>>>>>>> block
>>>>>>>> frequency in the IRA pass (PR111673).
>>>>>>>>
>>>>>>>> This patch has been approved by VMakarov
>>>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>>>>>>
>>>>>>>> With this patch, we are seeing performance improvements with spec on 
>>>>>>>> x86
>>>>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>>>>>>
>>>>>>>> I received a mail from Linaro about some failures seen in the CI 
>>>>>>>> pipeline with
>>>>>>>> this patch. I have analyzed the failures and I wish to discuss the 
>>>>>>>> analysis with you.
>>>>>>>>
>>>>>>>> One failure reported by the Linaro CI is:
>>>>>>>>
>>>>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, 
>>>>>>>> r[0-9]+, \\[r[0-9]+\\] 2
>>>>>>>>
>>>>>>>> The diff in the assembly between trunk and patch is:
>>>>>>>>
>>>>>>>> 93c93
>>>>>>>> <       push    {r4, r5}
>>>>>>>> ---
>>>>>>>>>         push    {fp}
>>>>>>>> 95c95
>>>>>>>> <       ldrexd  r4, r5, [r0]
>>>>>>>> ---
>>>>>>>>>         ldrexd  fp, ip, [r0]
>>>>>>>> 99c99
>>>>>>>> <       pop     {r4, r5}
>>>>>>>> ---
>>>>>>>>>         ldr     fp, [sp], #4
>>>>>>>>
>>>>>>>>
>>>>>>>> The test fails with patch because the ldrexd insn uses fp & ip 
>>>>>>>> registers instead
>>>>>>>> of r[0-9]+
>>>>>>>>
>>>>>>>> But the code produced by patch is better because it is pushing and 
>>>>>>>> restoring only
>>>>>>>> one register (fp) instead of two registers (r4, r5). Hence, this test 
>>>>>>>> can be
>>>>>>>> modified to allow it to pass on arm. Please let me know what you think.
>>>>>>>>
>>>>>>>> If you need more information, please let me know. I will be sending 
>>>>>>>> separate mails
>>>>>>>> for the other test failures.
>>>>>>>>
>>>>>>
>>>>>> Thanks for looking at this.
>>>>>>
>>>>>>
>>>>>> The key part of this test is that the compiler generates LDREXD.  The 
>>>>>> registers used for that are pretty much irrelevant as we don't match 
>>>>>> them to any other operations within the test.  So I'd recommend just 
>>>>>> testing for the mnemonic and not for any of the operands (ie just match 
>>>>>> "ldrexd\t").
>>>>>>
>>>>>> R.
>>>>>>
>>>>>>>> Regards,
>>>>>>>> Surya
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>

Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

Reply via email to