Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

Richard Earnshaw (lists) via Gcc Thu, 14 Dec 2023 08:12:05 -0800

On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote:
> Hi Richard,
> Thanks a lot for your response!
> 
> Another failure reported by the Linaro CI is as follows:
> 
> Running gcc:gcc.dg/dg.exp ...
> FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
> 
> I analyzed the failures and the root cause is the same for both.
> 
> The test pr10474.c is as follows:
> 
> void f(int *i)
> {
>         if (!i)
>                 return;
>         else
>         {
>                 __builtin_printf("Hi");
>                 *i=0;
>         }
> }
> 
> 
> With the patch (for PR111673), x1 (volatile) is being assigned to hold value
> of x0 (first parameter). Since it is a volatile, x1 is saved to the stack as
> there is a call later on. The save to the stack is generated in the LRA pass.
> The save is generated in the entry basic block. Due to the usage of the stack
> pointer in the entry bb, the testcase fails to be shrink wrapped.


I'm not entirely sure I understand what you mean from a quick glance.  Do you 
mean that X1 has the /v flag marked on it (ie it's printed in dumps as 
"reg/v")?  If so, that's not volatile, it just means that the register is 
associated with a user variable (as opposed to a compiler-generated temporary 
variable):

From the manual:

@item REG_USERVAR_P (@var{x})
In a @code{reg}, nonzero if it corresponds to a variable present in
the user's source code.  Zero for temporaries generated internally by
the compiler.  Stored in the @code{volatil} field and printed as
@samp{/v}.

There are several other cases where we re-use this bit on different RTL 
constructs to mean things other than 'volatile': it pretty much only has the 
conventional meaning on MEM objects.
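
As a rough illustration of that reuse, here is a toy C model. It is not GCC's
actual rtx structure: `toy_rtx` and the `toy_*` helpers are made-up names for
this sketch, and only the idea (a single `volatil` bit whose meaning depends on
the RTL code) is taken from the manual text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an RTL node; NOT GCC's real data structure. */
enum toy_code { TOY_REG, TOY_MEM };

struct toy_rtx {
    enum toy_code code;
    unsigned volatil : 1;   /* one bit, interpreted according to `code` */
};

/* On a REG the bit means "corresponds to a user variable" (dumped as /v). */
static bool toy_reg_uservar_p(const struct toy_rtx *x)
{
    assert(x->code == TOY_REG);
    return x->volatil;
}

/* On a MEM the same bit has the conventional "volatile access" meaning. */
static bool toy_mem_volatile_p(const struct toy_rtx *x)
{
    assert(x->code == TOY_MEM);
    return x->volatil;
}

/* Describe what a set /v flag means for this node. */
static const char *toy_flag_meaning(const struct toy_rtx *x)
{
    if (!x->volatil)
        return "flag clear";
    return x->code == TOY_REG ? "user variable" : "volatile access";
}
```

So a "reg/v" in a dump would correspond to the TOY_REG case here: the bit is
set, but it says nothing about whether the register is call-clobbered.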

> 
> The reason why LRA generates the store insn in the entry bb is as follows:
> LRA emits insns to save volatile registers in the inheritance/splitting pass.
> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the insns
> in the EBBs in reverse order from the last insn to the first insn. When LRA
> sees a write to a pseudo (that has been assigned a volatile register), and
> there is a read following the write, with an intervening call insn between
> the write and read, then LRA generates a spill immediately after the write
> and a restore immediately before the read. In the above test, there is an EBB
> containing the entry bb and the bb with the printf call. In the entry bb,
> there is a write to x1 (basically a copy from x0 to x1) and in the printf bb,
> there is a read of x1 after the call insn. So LRA generates a spill in the
> entry bb.
> 
> Without patch, x19 is chosen to hold the value of x0. Since x19 is a
> non-volatile, the input RTL to the shrink wrap pass does not have any code to
> save x19 to the stack. Only the insn that copies x0 to x19 is present in the
> entry bb. In the shrink wrap pass, this insn is moved down the cfg to the bb
> containing the call to printf, thereby allowing prolog to be allocated only
> where needed. Thus shrink wrap succeeds.
> 
> 
> Shrink wrap can be made to succeed if the save of x1 occurs just before the
> call insn, instead of generating it after the write to x1. This will ensure
> that the spill does not occur in the entry bb. In fact, it is more efficient
> if the save occurs only in the path containing the printf call instead of
> occurring in the entry bb.
> 
> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
> LRA to save volatile registers before a call instead of after the write to the
> volatile. With this patch, both the above tests pass.
> 
> Since the patch for PR111673 has been approved by Vladimir, I plan to
> commit the patch to trunk. And I will fix the test failures after doing the
> commit.
> 
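
For what it's worth, the two save placements described above can be modelled
with a toy sketch like the one below. It is an illustration under assumptions
only: the insn kinds, `toy_save_block` and the simple forward scan are invented
for this example and are not how LRA's inheritance/splitting code is written
(per the description above, that walks each EBB in reverse).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical insn kinds for one caller-saved register (x1 in the test). */
enum toy_kind { TOY_DEF, TOY_CALL, TOY_USE };

struct toy_insn {
    int bb;               /* basic block index; 0 is the entry block */
    enum toy_kind kind;
};

enum toy_policy { SAVE_AFTER_DEF, SAVE_BEFORE_CALL };

/* Return the block that receives the save, or -1 if the register is never
   live across a call (no def ... call ... use sequence in the EBB). */
static int toy_save_block(const struct toy_insn *insns, size_t n,
                          enum toy_policy policy)
{
    int def_bb = -1, call_bb = -1;

    for (size_t i = 0; i < n; i++) {
        switch (insns[i].kind) {
        case TOY_DEF:
            def_bb = insns[i].bb;   /* new value, forget earlier calls */
            call_bb = -1;
            break;
        case TOY_CALL:
            if (def_bb >= 0 && call_bb < 0)
                call_bb = insns[i].bb;
            break;
        case TOY_USE:
            if (def_bb >= 0 && call_bb >= 0)
                return policy == SAVE_AFTER_DEF ? def_bb : call_bb;
            break;
        }
    }
    return -1;
}

/* Shape of pr10474.c as described: bb 0 copies x0 to x1, bb 1 calls printf
   and then reads x1 for the store. */
static const struct toy_insn toy_pr10474[] = {
    { 0, TOY_DEF }, { 1, TOY_CALL }, { 1, TOY_USE },
};
```

With that input, SAVE_AFTER_DEF puts the save in block 0 (the entry bb, which
then references the stack), while SAVE_BEFORE_CALL puts it in block 1, leaving
the entry bb free of stack accesses.
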

I think I'd probably understand this better if you could give some example RTL 
(before and after).  Do you have that?

R.

> Regards,
> Surya
> 
> 
> 
> On 28/11/23 7:18 pm, Richard Earnshaw wrote:
>>
>>
>> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>>> Hi Richard,
>>> Thanks a lot for your response!
>>>
>>> Another failure reported by the Linaro CI is as follows:
>>> (Note: I am planning to send a separate mail for each failure, as this will
>>> make the discussion easy to track)
>>>
>>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve -moverride=tune=none  check-function-bodies dup_x0_m
>>>
>>> Expected code:
>>>
>>>        ...
>>>        add     (x[0-9]+), x0, #?1
>>>        mov     (p[0-7])\.b, p15\.b
>>>        mov     z0\.d, \2/m, \1
>>>        ...
>>>        ret
>>>
>>>
>>> Code obtained w/o patch:
>>>          addvl   sp, sp, #-1
>>>          str     p15, [sp]
>>>          add     x0, x0, 1
>>>          mov     p3.b, p15.b
>>>          mov     z0.d, p3/m, x0
>>>          ldr     p15, [sp]
>>>          addvl   sp, sp, #1
>>>          ret
>>>
>>> Code obtained w/ patch:
>>>          addvl   sp, sp, #-1
>>>          str     p15, [sp]
>>>          mov     p3.b, p15.b
>>>          add     x0, x0, 1
>>>          mov     z0.d, p3/m, x0
>>>          ldr     p15, [sp]
>>>          addvl   sp, sp, #1
>>>          ret
>>>
>>> As we can see, with the patch, the following two instructions are
>>> interchanged:
>>>          add     x0, x0, 1
>>>          mov     p3.b, p15.b
>>
>> Indeed, both look acceptable results to me, especially given that we don't
>> schedule results at -O1.
>>
>> There are two ways of fixing this:
>> 1) Simply swap the order to what the compiler currently generates (which is
>> a little fragile, since it might flip back someday).
>> 2) Write the test as
>>
>>
>> ** (
>> **       add     (x[0-9]+), x0, #?1
>> **       mov     (p[0-7])\.b, p15\.b
>> **       mov     z0\.d, \2/m, \1
>> ** |
>> **       mov     (p[0-7])\.b, p15\.b
>> **       add     (x[0-9]+), x0, #?1
>> **       mov     z0\.d, \1/m, \2
>> ** )
>>
>> Note, we need to swap the match names in the third insn to account for the
>> different order of the earlier instructions.
>>
>> Neither is ideal, but the second is perhaps a little more bomb proof.
>>
>> I don't really have a strong feeling either way, but perhaps the second is
>> slightly preferable.
>>
>> Richard S: thoughts?
>>
>> R.
>>
>>> I believe that this is fine and the test can be modified to allow it to
>>> pass on aarch64. Please let me know what you think.
>>>
>>> Regards,
>>> Surya
>>>
>>>
>>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>>
>>>>
>>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>>>>> Hi Richard,
>>>>> Ping. Please let me know if the test failure that I mentioned in the mail 
>>>>> below can be handled by changing the expected generated code. I am not 
>>>>> conversant with arm, and hence would appreciate your help.
>>>>>
>>>>> Regards,
>>>>> Surya
>>>>>
>>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>>>>> Hi Richard,
>>>>>> I had submitted a patch for review 
>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>>>>> regarding scaling save/restore costs of callee save registers with block
>>>>>> frequency in the IRA pass (PR111673).
>>>>>>
>>>>>> This patch has been approved by VMakarov
>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>>>>
>>>>>> With this patch, we are seeing performance improvements with spec on x86
>>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>>>>
>>>>>> I received a mail from Linaro about some failures seen in the CI
>>>>>> pipeline with this patch. I have analyzed the failures and I wish to
>>>>>> discuss the analysis with you.
>>>>>>
>>>>>> One failure reported by the Linaro CI is:
>>>>>>
>>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\] 2
>>>>>>
>>>>>> The diff in the assembly between trunk and patch is:
>>>>>>
>>>>>> 93c93
>>>>>> <       push    {r4, r5}
>>>>>> ---
>>>>>> >       push    {fp}
>>>>>> 95c95
>>>>>> <       ldrexd  r4, r5, [r0]
>>>>>> ---
>>>>>> >       ldrexd  fp, ip, [r0]
>>>>>> 99c99
>>>>>> <       pop     {r4, r5}
>>>>>> ---
>>>>>> >       ldr     fp, [sp], #4
>>>>>>
>>>>>>
>>>>>> The test fails with patch because the ldrexd insn uses fp & ip registers
>>>>>> instead of r[0-9]+
>>>>>>
>>>>>> But the code produced by patch is better because it is pushing and
>>>>>> restoring only one register (fp) instead of two registers (r4, r5).
>>>>>> Hence, this test can be modified to allow it to pass on arm. Please let
>>>>>> me know what you think.
>>>>>>
>>>>>> If you need more information, please let me know. I will be sending
>>>>>> separate mails for the other test failures.
>>>>>>
>>>>
>>>> Thanks for looking at this.
>>>>
>>>>
>>>> The key part of this test is that the compiler generates LDREXD.  The 
>>>> registers used for that are pretty much irrelevant as we don't match them 
>>>> to any other operations within the test.  So I'd recommend just testing 
>>>> for the mnemonic and not for any of the operands (ie just match 
>>>> "ldrexd\t").
>>>>
>>>> R.
>>>>
>>>>>> Regards,
>>>>>> Surya
>>>>>>
>>>>>>
>>>>>>
