https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117072
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|tree-optimization |target
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Compared to gcc14, for cond_op_fma__Float16-1.c for example, I now get
foo1_fnms:
.LFB7:
.cfi_startproc
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L24:
vmovdqa b(%rax), %ymm1
vmovdqa d(%rax), %ymm0
addq $32, %rax
vcmpph $1, c-32(%rax), %ymm1, %k1
vmovdqa e-32(%rax), %ymm1
vfnmsub213ph a-32(%rax), %ymm0, %ymm1
vmovdqu16 %ymm1, %ymm0{%k1}
vmovdqa %ymm0, a-32(%rax)
cmpq $1600, %rax
jne .L24
vzeroupper
ret
instead of the expected
foo1_fnms:
.LFB7:
.cfi_startproc
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L24:
vmovdqa b(%rax), %ymm1
vmovdqa a(%rax), %ymm2
addq $32, %rax
vmovdqa d-32(%rax), %ymm0
vcmpph $1, c-32(%rax), %ymm1, %k1
vfnmsub132ph e-32(%rax), %ymm2, %ymm0{%k1}
vmovdqa %ymm0, a-32(%rax)
cmpq $1600, %rax
jne .L24
vzeroupper
ret
The .combine dump shows, with gcc14:
Trying 15 -> 16:
15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
16: r99:V16HF=vec_merge(r113:V16HF,r102:V16HF,r110:HI)
REG_DEAD r113:V16HF
REG_DEAD r110:HI
REG_DEAD r102:V16HF
Successfully matched this instruction:
(set (reg:V16HF 99 [ _37 ])
(vec_merge:V16HF (fma:V16HF (neg:V16HF (reg:V16HF 102 [ vect_pretmp_14.315 ]))
(mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
(symbol_ref:DI ("e") [flags 0x2] <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1]+0 S32 A256])
(neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.333 ])
(symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.333_9 * 1]+0 S32 A256])))
(reg:V16HF 102 [ vect_pretmp_14.315 ])
(reg:HI 110 [ mask__11.325_55 ])))
but now:
Trying 15 -> 16:
15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}
16: r99:V16HF=vec_merge(r113:V16HF,r104:V16HF,r110:HI)
REG_DEAD r113:V16HF
REG_DEAD r110:HI
REG_DEAD r104:V16HF
Failed to match this instruction:
(set (reg:V16HF 99 [ _37 ])
(vec_merge:V16HF (fma:V16HF (neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
(symbol_ref:DI ("e") [flags 0x2] <var_decl 0x7ffff6810ea0 e>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1]+0 S32 A256]))
(reg:V16HF 104 [ vect_pretmp_14.315 ])
(neg:V16HF (mem:V16HF (plus:DI (reg:DI 98 [ ivtmp.329 ])
(symbol_ref:DI ("a") [flags 0x2] <var_decl 0x7ffff6810c60 a>)) [1 MEM <vector(16) _Float16> [(_Float16 *)&a + ivtmp.329_9 * 1]+0 S32 A256])))
(reg:V16HF 104 [ vect_pretmp_14.315 ])
(reg:HI 110 [ mask__11.309_43 ])))
Note how the operand order of the commutative multiply in insn 15 differs,
which causes the match to fail:
good: 15: r113:V16HF={-r102:V16HF*[r98:DI+`e']+-[r98:DI+`a']}
bad: 15: r113:V16HF={-[r98:DI+`e']*r104:V16HF+-[r98:DI+`a']}
This ordering is already present on GIMPLE:
vect_pretmp_14.315_45 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.333_9 * 1];
vect__5.322_52 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.333_9 * 1];
_37 = .COND_FNMS (mask__11.325_55, vect_pretmp_14.315_45, vect__5.322_52, vect__3.318_48, vect_pretmp_14.315_45);
vs.
vect_pretmp_14.315_49 = MEM <vector(16) _Float16> [(_Float16 *)&d + ivtmp.329_9 * 1];
vect__5.312_46 = MEM <vector(16) _Float16> [(_Float16 *)&e + ivtmp.329_9 * 1];
_37 = .COND_FNMS (mask__11.309_43, vect__5.312_46, vect_pretmp_14.315_49, vect__3.319_53, vect_pretmp_14.315_49);
Both are canonicalized correctly (the multiply operands are ordered by SSA
name version).
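To illustrate why the two GIMPLE forms are equivalent, here is a minimal per-lane model of .COND_FNMS, assuming the usual semantics (mask ? -(a * b) - c : else). The function name and the sample values are made up for illustration, and the model uses plain Python floats rather than a fused _Float16 operation, so it only shows that swapping the two multiply operands cannot change the result.

```python
def cond_fnms(mask, a, b, c, else_val):
    """Toy per-lane model: where the mask is set, -(a * b) - c; else else_val."""
    return [
        -(x * y) - z if m else e
        for m, x, y, z, e in zip(mask, a, b, c, else_val)
    ]


# Hypothetical lane values, chosen to be exactly representable.
d = [1.5, 2.0, -3.0, 0.5]
e = [2.0, 4.0, 0.25, 8.0]
a = [1.0, 1.0, 1.0, 1.0]
mask = [True, False, True, False]

# gcc14 order: .COND_FNMS (mask, d, e, a, d)
old_order = cond_fnms(mask, d, e, a, d)
# new order:   .COND_FNMS (mask, e, d, a, d)
new_order = cond_fnms(mask, e, d, a, d)
```

Both calls produce the same lanes; only the position of the register operand relative to the memory operand differs, which is what the combine pattern is sensitive to.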
This is a spurious difference. If we rely on these combines for the now
missed micro-optimization, we need to beef up the patterns
(avx512vl_fnmsub_v16hf_mask) to allow both orders.
A target issue IMO?
Alternatively, make sure RTL canonicalizes (fma (neg non-reg) (reg) ...)
to (fma (neg reg) (non-reg) ...), or stop matching that as a pattern and
thus force RTL expansion + combine to arrive at the correct variant?
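The canonicalization idea can be sketched on a toy expression tree. This is not GCC code: the tuple encoding and the helper names (is_reg, canonicalize_fma) are hypothetical and only model the rewrite of (fma (neg non-reg) (reg) ...) into (fma (neg reg) (non-reg) ...), which is value-preserving because the negation can sit on either multiply operand.

```python
def is_reg(x):
    """True for a toy register operand such as ("reg", 104)."""
    return isinstance(x, tuple) and len(x) == 2 and x[0] == "reg"


def canonicalize_fma(expr):
    """Rewrite ("fma", ("neg", non_reg), reg, addend)
    into    ("fma", ("neg", reg), non_reg, addend).

    Any other shape is returned unchanged.
    """
    op, m0, m1, addend = expr
    if op != "fma":
        raise ValueError("expected an fma expression")
    if m0[0] == "neg" and not is_reg(m0[1]) and is_reg(m1):
        # -(mem) * reg == -(reg) * mem, so move the neg onto the register.
        return ("fma", ("neg", m1), m0[1], addend)
    return expr


# Shape of insn 15 that fails to match (memory operand negated first).
bad = ("fma", ("neg", ("mem", "e")), ("reg", 104), ("neg", ("mem", "a")))
# Shape that the existing pattern accepts (register operand negated first).
good = ("fma", ("neg", ("reg", 104)), ("mem", "e"), ("neg", ("mem", "a")))

canonical = canonicalize_fma(bad)
```

With such a rule in place, both operand orders coming out of GIMPLE would reach combine in the single form the masked pattern already expects, so the pattern itself would not need a second alternative.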