[Bug target/81904] FMA and addsub instructions

2023-08-01 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #8 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:f0b7a61d83534fc8f7aa593b1f0f0357a371a800

commit r14-2919-gf0b7a61d83534fc8f7aa593b1f0f0357a371a800
Author: liuhongt 
Date:   Mon Jul 31 16:03:45 2023 +0800

Support vec_fmaddsub/vec_fmsubadd for vector HFmode.

AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph.
Also remove scalar mode from fmaddsub/fmsubadd pattern since there's
no scalar instruction for that.

gcc/ChangeLog:

PR target/81904
* config/i386/sse.md (vec_fmaddsub<mode>4): Extend to vector
HFmode, use mode iterator VFH instead.
(vec_fmsubadd<mode>4): Ditto.
(<sd_mask_codefor>fma_fmaddsub_<mode><sd_maskz_name><round_name>):
Remove scalar mode from iterator, use VFH_AVX512VL instead.
(<sd_mask_codefor>fma_fmsubadd_<mode><sd_maskz_name><round_name>):
Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr81904.c: New test.
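
For reference, a reduced example in the spirit of the change (a sketch,
not the actual pr81904.c testcase; type names, the function name and the
flags are illustrative assumptions):

typedef _Float16 v8hf __attribute__ ((vector_size (16)));
typedef short v8hi __attribute__ ((vector_size (16)));

/* Hypothetical sketch: with -O2 -mavx512fp16 -mavx512vl -fno-trapping-math
   the mul/add/sub/permute below can now be matched to the vec_fmaddsub
   optab for vector HFmode and emitted as a single vfmaddsubXXXph.  */
v8hf
fmaddsub_hf (v8hf a, v8hf b, v8hf c)
{
  v8hf mul = a * b;
  v8hf add = mul + c;
  v8hf sub = mul - c;
  /* Even lanes take the subtraction, odd lanes the addition:
     the fmaddsub lane pattern.  */
  return __builtin_shuffle (sub, add, (v8hi) { 0, 9, 2, 11, 4, 13, 6, 15 });
}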

[Bug target/81904] FMA and addsub instructions

2023-07-31 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #7 from Hongtao.liu  ---

(In reply to Richard Biener from comment #6)
> Note matching [the add/sub + VEC_PERM sequence above]
> to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly
> creates .VEC_ADDSUB when possible).
Let's put it under -fno-trapping-math.
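
Restated as a standalone example (a sketch; the flags are assumptions
drawn from this thread):

/* Sketch: with -O2 -mfma -fno-trapping-math the addsub intrinsic can be
   folded to the internal function .VEC_ADDSUB, after which the multiply
   can be fused into .VEC_FMADDSUB, i.e. a single vfmaddsub132pd.  Under
   default (trapping) math the fold is not done, per the comment above,
   because it could lose FP exceptions.  */
#include <immintrin.h>

__m128d
f (__m128d x, __m128d y, __m128d z)
{
  return _mm_addsub_pd (_mm_mul_pd (x, y), z);
}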

[Bug target/81904] FMA and addsub instructions

2023-07-31 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #6 from Richard Biener  ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #1)
> > Hmm, I think the issue is we see
> > 
> > f (__m128d x, __m128d y, __m128d z)
> > {
> >   vector(2) double _4;
> >   vector(2) double _6;
> > 
>   <bb 2> [100.00%]:
> >   _4 = x_2(D) * y_3(D);
> >   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
> We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
> VEC_FMADDSUB in match.pd?

I think MUL + .VEC_ADDSUB can be handled in the FMA pass.  For my example
above we get, early (before FMA recognition):

  _4 = x_2(D) * y_3(D);
  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

we could recognize that as .VEC_ADDSUB.  I think we want to avoid doing
this too early, not sure if doing this within the FMA pass itself will
work since we key FMAs on the mult but would need to key the addsub
on the VEC_PERM (we are walking stmts from BB start to end).  Looking
at the code it seems changing the walking order should work.
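
In source form, the sequence to be recognized would look like this
(a sketch; the type and function names are illustrative):

typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* Open-coded addsub: under -O2 -mfma -ffp-contract=fast
   -fno-trapping-math this whole function could be recognized as
   .VEC_FMADDSUB and emitted as one vfmaddsub132pd.  */
v2df
g (v2df x, v2df y, v2df z)
{
  v2df mul = x * y;
  v2df add = mul + z;
  v2df sub = mul - z;
  /* Lane 0 from the subtraction, lane 1 from the addition:
     the addsub lane pattern.  */
  return __builtin_shuffle (sub, add, (v2di) { 0, 3 });
}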

Note matching

  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly
creates .VEC_ADDSUB when possible).

[Bug target/81904] FMA and addsub instructions

2023-07-30 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #5 from Hongtao.liu  ---
(In reply to Richard Biener from comment #1)
> Hmm, I think the issue is we see
> 
> f (__m128d x, __m128d y, __m128d z)
> {
>   vector(2) double _4;
>   vector(2) double _6;
> 
>   <bb 2> [100.00%]:
>   _4 = x_2(D) * y_3(D);
>   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
VEC_FMADDSUB in match.pd?

[Bug target/81904] FMA and addsub instructions

2023-07-30 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #4 from Hongtao.liu  ---
(In reply to Richard Biener from comment #2)
> __m128d h(__m128d x, __m128d y, __m128d z){
> __m128d tem = _mm_mul_pd (x,y);
> __m128d tem2 = tem + z;
> __m128d tem3 = tem - z;
> return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
> }
> 
> doesn't quite work (the combiner pattern for fmaddsub is missing).  Tried
> {0, 2} as well.
> 
> h:
> .LFB5021:
> .cfi_startproc
> vmovapd %xmm0, %xmm3
> vfmsub132pd %xmm1, %xmm2, %xmm0
> vfmadd132pd %xmm1, %xmm2, %xmm3
> vshufpd $2, %xmm0, %xmm3, %xmm0

  tem2_6 = .FMA (x_2(D), y_3(D), z_5(D));
  # DEBUG tem2 => tem2_6
  # DEBUG BEGIN_STMT
  tem3_7 = .FMS (x_2(D), y_3(D), z_5(D));
  # DEBUG tem3 => NULL
  # DEBUG BEGIN_STMT
  _8 = VEC_PERM_EXPR <tem2_6, tem3_7, { 0, 3 }>;

Can it be handled in match.pd?  Rewriting the fmaddsub pattern into a
vec_merge of fma and fms looks too complex.

Similar for VEC_ADDSUB + MUL -> VEC_FMADDSUB.
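
For reference, fmaddsub is semantically a lane blend of an all-lanes FMA
and an all-lanes FMS, with even lanes subtracting and odd lanes adding
(the dump above shows the mirrored fmsubadd orientation).  A sketch in C,
with illustrative names:

typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* Reference semantics only: the FMA/FMS/VEC_PERM shape a combine or
   match.pd pattern would have to match to form vfmaddsub.  */
v2df
fmaddsub_ref (v2df x, v2df y, v2df z)
{
  v2df fma = x * y + z;  /* all lanes multiply-add  */
  v2df fms = x * y - z;  /* all lanes multiply-sub  */
  /* even lane from the FMS result, odd lane from the FMA result  */
  return __builtin_shuffle (fms, fma, (v2di) { 0, 3 });
}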

[Bug target/81904] FMA and addsub instructions

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #3 from Richard Biener  ---
*** Bug 84361 has been marked as a duplicate of this bug. ***

[Bug target/81904] FMA and addsub instructions

2017-08-21 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #2 from Richard Biener  ---
__m128d h(__m128d x, __m128d y, __m128d z){
__m128d tem = _mm_mul_pd (x,y);
__m128d tem2 = tem + z;
__m128d tem3 = tem - z;
return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
}

doesn't quite work (the combiner pattern for fmaddsub is missing).  Tried
{0, 2} as well.

h:
.LFB5021:
.cfi_startproc
vmovapd %xmm0, %xmm3
vfmsub132pd %xmm1, %xmm2, %xmm0
vfmadd132pd %xmm1, %xmm2, %xmm3
vshufpd $2, %xmm0, %xmm3, %xmm0

[Bug target/81904] FMA and addsub instructions

2017-08-21 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-08-21
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener  ---
Hmm, I think the issue is we see

f (__m128d x, __m128d y, __m128d z)
{
  vector(2) double _4;
  vector(2) double _6;

  <bb 2> [100.00%]:
  _4 = x_2(D) * y_3(D);
  _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
  return _6;

the vectorizer will implement addsub as

  _6 = _4 + z_5(D);
  _7 = _4 - z_5(D);
  _8 = __builtin_shuffle (_6, _7, {2, 1});
  return _8;

which would then end up as (if FMA formation tolerates the non-single use
of the multiply result)

  _6 = FMA <x_2(D), y_3(D), z_5(D)>;
  _9 = -z_5(D);
  _7 = FMA <x_2(D), y_3(D), _9>;
  _8 = __builtin_shuffle (_6, _7, {2, 1});
  return _8;

a bit interesting for combine to figure out but theoretically possible?
(I think we expand both FMAs properly).

Look at the addsub patterns.

That is, handling this requires open-coding _mm_addsub_pd with add, sub
and shuffle ...
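
A sketch of that open-coding (the helper name is illustrative):

/* Open-coded replacement for _mm_addsub_pd: the middle end then sees
   plain add, sub and a permute instead of the opaque target builtin,
   so FMA formation has something to work with.  */
#include <immintrin.h>

static inline __m128d
my_addsub_pd (__m128d a, __m128d b)
{
  __m128d add = a + b;          /* GCC vector arithmetic on __m128d  */
  __m128d sub = a - b;
  /* addsubpd semantics: lane 0 = a[0] - b[0], lane 1 = a[1] + b[1]  */
  return __builtin_shuffle (sub, add, (__m128i) { 0, 3 });
}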