> On 5 Mar 2025, at 11:14, Richard Biener <[email protected]> wrote:
>
> On Tue, Mar 4, 2025 at 10:01 PM Richard Sandiford
> <[email protected]> wrote:
>>
>> Kyrylo Tkachov <[email protected]> writes:
>>> Hi all,
>>>
>>> In this testcase late-combine was failing to merge:
>>> dup v31.4s, v31.s[3]
>>> fmla v30.4s, v31.4s, v29.4s
>>> into the lane-wise fmla form.
>>> This is because late-combine checks may_trap_p under the hood on the dup
>>> insn.
>>> This ended up returning true for the insn:
>>> (set (reg:V4SF 152 [ _32 ])
>>> (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
>>> (parallel:V4SF [
>>> (const_int 3 [0x3])]))))
>>>
>>> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
>>> floating-point modes can't trap, it assumed that the V4SF parallel can trap.
>>> The correct behaviour is to recurse into vector inside the PARALLEL and
>>> check
>>> the sub-expression. This patch adjusts may_trap_p_1 to do just that.
>>> With this check the above insn is not deemed to be trapping and is
>>> propagated
>>> into the FMLA giving:
>>> fmla vD.4s, vA.4s, vB.s[3]
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>> Apparently this also fixes a regression in
>>> gcc.target/aarch64/vmul_element_cost.c that I observed.
>>>
>>> Signed-off-by: Kyrylo Tkachov <[email protected]>
>>>
>>> gcc/
>>>
>>> PR rtl-optimization/119046
>>> * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping.
>>>
>>> gcc/testsuite/
>>>
>>> PR rtl-optimization/119046
>>> * gcc.target/aarch64/pr119046.c: New test.
>>>
>>> From 39fa1be73e8fed7fc673329e1854ba41587623c4 Mon Sep 17 00:00:00 2001
>>> From: Kyrylo Tkachov <[email protected]>
>>> Date: Thu, 27 Feb 2025 09:00:25 -0800
>>> Subject: [PATCH] PR rtl-optimization/119046: Don't mark PARALLEL RTXes with
>>> floating-point mode as trapping
>>>
>>> In this testcase late-combine was failing to merge:
>>> dup v31.4s, v31.s[3]
>>> fmla v30.4s, v31.4s, v29.4s
>>> into the lane-wise fmla form.
>>> This is because late-combine checks may_trap_p under the hood on the dup
>>> insn.
>>> This ended up returning true for the insn:
>>> (set (reg:V4SF 152 [ _32 ])
>>> (vec_duplicate:V4SF (vec_select:SF (reg:V4SF 111 [ rhs_panel.8_31 ])
>>> (parallel:V4SF [
>>> (const_int 3 [0x3])]))))
>>>
>>> Although mem_trap_p correctly reasoned that vec_duplicate and vec_select of
>>> floating-point modes can't trap, it assumed that the V4SF parallel can trap.
>>> The correct behaviour is to recurse into vector inside the PARALLEL and
>>> check
>>> the sub-expression. This patch adjusts may_trap_p_1 to do just that.
>>> With this check the above insn is not deemed to be trapping and is
>>> propagated
>>> into the FMLA giving:
>>> fmla vD.4s, vA.4s, vB.s[3]
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>> Apparently this also fixes a regression in
>>> gcc.target/aarch64/vmul_element_cost.c that I observed.
>>>
>>> Signed-off-by: Kyrylo Tkachov <[email protected]>
>>>
>>> gcc/
>>>
>>> PR rtl-optimization/119046
>>> * rtlanal.cc (may_trap_p_1): Don't mark FP-mode PARALLELs as trapping.
>>
>> I don't object to this, if someone else agrees that it's the right fix.
>> But the mode on the parallel looks iffy to me, in that the data is not
>> floating-point data. VOIDmode would probably be more accurate and
>> seems to be what x86 uses.
>
> I agree that looks iffy, the fix is still OK IMO - we shouldn't decide
> that a parallel
> traps based on its mode (which has no meaning - it's not even documented to
> have a mode).
Thanks, I’ve tracked down where the V4SF PARALLEL is being created and I’ll
post a separate patch to fix it after some testing.
Kyrill
>
> Richard.
>
>>
>> Thanks,
>> Richard
>>
>>>
>>> gcc/testsuite/
>>>
>>> PR rtl-optimization/119046
>>> * gcc.target/aarch64/pr119046.c: New test.
>>> ---
>>> gcc/rtlanal.cc | 1 +
>>> gcc/testsuite/gcc.target/aarch64/pr119046.c | 16 ++++++++++++++++
>>> 2 files changed, 17 insertions(+)
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/pr119046.c
>>>
>>> diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
>>> index 8caffafdaa4..7ad67afb9fe 100644
>>> --- a/gcc/rtlanal.cc
>>> +++ b/gcc/rtlanal.cc
>>> @@ -3252,6 +3252,7 @@ may_trap_p_1 (const_rtx x, unsigned flags)
>>> return true;
>>> break;
>>>
>>> + case PARALLEL:
>>> case NEG:
>>> case ABS:
>>> case SUBREG:
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr119046.c
>>> b/gcc/testsuite/gcc.target/aarch64/pr119046.c
>>> new file mode 100644
>>> index 00000000000..aa5fa7c848c
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/pr119046.c
>>> @@ -0,0 +1,16 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-additional-options "-O2" } */
>>> +
>>> +#include <arm_neon.h>
>>> +
>>> +float32x4_t madd_helper_1(float32x4_t a, float32x4_t b, float32x4_t d)
>>> +{
>>> + float32x4_t t = a;
>>> + t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
>>> + t = vfmaq_f32 (t, vdupq_n_f32(vgetq_lane_f32 (b, 1)), d);
>>> + return t;
>>> +}
>>> +
>>> +/* { dg-final { scan-assembler-not {\tdup\tv[0-9]+\.4s, v[0-9]+.s\[1\]\n}
>>> } } */
>>> +/* { dg-final { scan-assembler-times {\tfmla\tv[0-9]+\.4s, v[0-9]+\.4s,
>>> v[0-9]+\.s\[1\]\n} 2 } } */
>>> +