On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote:
> Hi Richard,
> Thanks a lot for your response!
>
> Another failure reported by the Linaro CI is as follows:
>
> Running gcc:gcc.dg/dg.exp ...
> FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue
> "Performing shrink-wrapping"
> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing
> shrink-wrapping"
>
> I analyzed the failures and the root cause is the same for both the failures.
>
> The test pr10474.c is as follows:
>
> void f(int *i)
> {
> if (!i)
> return;
> else
> {
> __builtin_printf("Hi");
> *i=0;
> }
> }
>
>
> With the patch (for PR111673), x1 (volatile) is being assigned to hold value
> of
> x0 (first parameter). Since it is a volatile, x1 is saved to the stack as
> there
> is a call later on. The save to the stack is generated in the LRA pass. The
> save
> is generated in the entry basic block. Due to the usage of the stack pointer
> in
> the entry bb, the testcase fails to be shrink wrapped.
I'm not entirely sure I understand what you mean from a quick glance. Do you
mean that X1 has the /v flag marked on it (ie it's printed in dumps as
"reg/v")? If so, that's not volatile, it just means that the register is
associated with a user variable (as opposed to a compiler-generated temporary
variable):
>From the manual:
@item REG_USERVAR_P (@var{x})
In a @code{reg}, nonzero if it corresponds to a variable present in
the user's source code. Zero for temporaries generated internally by
the compiler. Stored in the @code{volatil} field and printed as
@samp{/v}.
There are several other cases where we re-use this bit on different RTL
constructs to mean things other than 'volatile': it pretty much only has the
conventional meaning on MEM objects.
>
> The reason why LRA generates the store insn in the entry bb is as follows:
> LRA emits insns to save volatile registers in the inheritance/splitting pass.
> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the insns
> in
> the EBBs in reverse order from the last insn to the first insn. When LRA sees
> a
> write to a pseudo (that has been assigned a volatile register), and there is a
> read following the write, with an intervening call insn between the write and
> read,
> then LRA generates a spill immediately after the write and a restore
> immediately
> before the read. In the above test, there is an EBB containing the entry bb
> and
> the bb with the printf call. In the entry bb, there is a write to x1
> (basically
> a copy from x0 to x1) and in the printf bb, there is a read of x1 after the
> call
> insn. So LRA generates a spill in the entry bb.
>
> Without patch, x19 is chosen to hold the value of x0. Since x19 is a
> non-volatile,
> the input RTL to the shrink wrap pass does not have any code to save x19 to
> the
> stack. Only the insn that copies x0 to x19 is present in the entry bb. In the
> shrink wrap pass, this insn is moved down the cfg to the bb containing the
> call
> to printf, thereby allowing prolog to be allocated only where needed. Thus
> shrink
> wrap succeeds.
>
>
> Shrink wrap can be made to succeed if the save of x1 occurs just before the
> call
> insn, instead of generating it after the write to x1. This will ensure that
> the
> spill does not occur in the entry bb. In fact, it is more efficient if the
> save
> occurs only in the path containing the printf call instead of occurring in the
> entry bb.
>
> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
> LRA to save volatile registers before a call instead of after the write to the
> volatile. With this patch, both the above tests pass.
>
> Since the patch for PR111673 has been approved by Vladimir, I plan to
> commit the patch to trunk. And I will fix the test failures after doing the
> commit.
>
I think I'd probably understand this better if you could give some example RTL
(before and after). Do you have that?
R.
> Regards,
> Surya
>
>
>
> On 28/11/23 7:18 pm, Richard Earnshaw wrote:
>>
>>
>> On 28/11/2023 12:52, Surya Kumari Jangala wrote:
>>> Hi Richard,
>>> Thanks a lot for your response!
>>>
>>> Another failure reported by the Linaro CI is as follows :
>>> (Note: I am planning to send a separate mail for each failure, as this will
>>> make
>>> the discussion easy to track)
>>>
>>> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve
>>> -moverride=tune=none check-function-bodies dup_x0_m
>>>
>>> Expected code:
>>>
>>> ...
>>> add (x[0-9]+), x0, #?1
>>> mov (p[0-7])\.b, p15\.b
>>> mov z0\.d, \2/m, \1
>>> ...
>>> ret
>>>
>>>
>>> Code obtained w/o patch:
>>> addvl sp, sp, #-1
>>> str p15, [sp]
>>> add x0, x0, 1
>>> mov p3.b, p15.b
>>> mov z0.d, p3/m, x0
>>> ldr p15, [sp]
>>> addvl sp, sp, #1
>>> ret
>>>
>>> Code obtained w/ patch:
>>> addvl sp, sp, #-1
>>> str p15, [sp]
>>> mov p3.b, p15.b
>>> add x0, x0, 1
>>> mov z0.d, p3/m, x0
>>> ldr p15, [sp]
>>> addvl sp, sp, #1
>>> ret
>>>
>>> As we can see, with the patch, the following two instructions are
>>> interchanged:
>>> add x0, x0, 1
>>> mov p3.b, p15.b
>>
>> Indeed, both look acceptable results to me, especially given that we don't
>> schedule results at -O1.
>>
>> There's two ways of fixing this:
>> 1) Simply swap the order to what the compiler currently generates (which is
>> a little fragile, since it might flip back someday).
>> 2) Write the test as
>>
>>
>> ** (
>> ** add (x[0-9]+), x0, #?1
>> ** mov (p[0-7])\.b, p15\.b
>> ** mov z0\.d, \2/m, \1
>> ** |
>> ** mov (p[0-7])\.b, p15\.b
>> ** add (x[0-9]+), x0, #?1
>> ** mov z0\.d, \1/m, \2
>> ** )
>>
>> Note, we need to swap the match names in the third insn to account for the
>> different order of the earlier instructions.
>>
>> Neither is ideal, but the second is perhaps a little more bomb proof.
>>
>> I don't really have a strong feeling either way, but perhaps the second is
>> slightly preferable.
>>
>> Richard S: thoughts?
>>
>> R.
>>
>>> I believe that this is fine and the test can be modified to allow it to
>>> pass on
>>> aarch64. Please let me know what you think.
>>>
>>> Regards,
>>> Surya
>>>
>>>
>>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>>
>>>>
>>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>>>>> Hi Richard,
>>>>> Ping. Please let me know if the test failure that I mentioned in the mail
>>>>> below can be handled by changing the expected generated code. I am not
>>>>> conversant with arm, and hence would appreciate your help.
>>>>>
>>>>> Regards,
>>>>> Surya
>>>>>
>>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>>>>> Hi Richard,
>>>>>> I had submitted a patch for review
>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>>>>> regarding scaling save/restore costs of callee save registers with block
>>>>>> frequency in the IRA pass (PR111673).
>>>>>>
>>>>>> This patch has been approved by VMakarov
>>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>>>>
>>>>>> With this patch, we are seeing performance improvements with spec on x86
>>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>>>>
>>>>>> I received a mail from Linaro about some failures seen in the CI
>>>>>> pipeline with
>>>>>> this patch. I have analyzed the failures and I wish to discuss the
>>>>>> analysis with you.
>>>>>>
>>>>>> One failure reported by the Linaro CI is:
>>>>>>
>>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+,
>>>>>> r[0-9]+, \\[r[0-9]+\\] 2
>>>>>>
>>>>>> The diff in the assembly between trunk and patch is:
>>>>>>
>>>>>> 93c93
>>>>>> < push {r4, r5}
>>>>>> ---
>>>>>>> push {fp}
>>>>>> 95c95
>>>>>> < ldrexd r4, r5, [r0]
>>>>>> ---
>>>>>>> ldrexd fp, ip, [r0]
>>>>>> 99c99
>>>>>> < pop {r4, r5}
>>>>>> ---
>>>>>>> ldr fp, [sp], #4
>>>>>>
>>>>>>
>>>>>> The test fails with patch because the ldrexd insn uses fp & ip registers
>>>>>> instead
>>>>>> of r[0-9]+
>>>>>>
>>>>>> But the code produced by patch is better because it is pushing and
>>>>>> restoring only
>>>>>> one register (fp) instead of two registers (r4, r5). Hence, this test
>>>>>> can be
>>>>>> modified to allow it to pass on arm. Please let me know what you think.
>>>>>>
>>>>>> If you need more information, please let me know. I will be sending
>>>>>> separate mails
>>>>>> for the other test failures.
>>>>>>
>>>>
>>>> Thanks for looking at this.
>>>>
>>>>
>>>> The key part of this test is that the compiler generates LDREXD. The
>>>> registers used for that are pretty much irrelevant as we don't match them
>>>> to any other operations within the test. So I'd recommend just testing
>>>> for the mnemonic and not for any of the operands (ie just match
>>>> "ldrexd\t").
>>>>
>>>> R.
>>>>
>>>>>> Regards,
>>>>>> Surya
>>>>>>
>>>>>>
>>>>>>