[Bug target/81904] FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #8 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:f0b7a61d83534fc8f7aa593b1f0f0357a371a800

commit r14-2919-gf0b7a61d83534fc8f7aa593b1f0f0357a371a800
Author: liuhongt
Date:   Mon Jul 31 16:03:45 2023 +0800

    Support vec_fmaddsub/vec_fmsubadd for vector HFmode.

    AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph.
    Also remove scalar mode from fmaddsub/fmsubadd pattern since
    there's no scalar instruction for that.

    gcc/ChangeLog:

            PR target/81904
            * config/i386/sse.md (vec_fmaddsub<mode>4): Extend to vector
            HFmode, use mode iterator VFH instead.
            (vec_fmsubadd<mode>4): Ditto.
            (<sd_mask_codefor>fma_fmaddsub_<mode><sd_maskz_name><round_name>):
            Remove scalar mode from iterator, use VFH_AVX512VL instead.
            (<sd_mask_codefor>fma_fmsubadd_<mode><sd_maskz_name><round_name>):
            Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr81904.c: New test.
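A reduced sketch of the kind of source this enables (hypothetical; the
function name and types are mine, not taken from the actual
gcc.target/i386/pr81904.c testcase).  With -O2 -mavx512fp16 -mavx512vl the
mul feeding the add/sub/blend below can now be fused into a single
vfmaddsubXXXph:

typedef _Float16 v8hf __attribute__ ((vector_size (16)));

v8hf
fmaddsub_hf (v8hf x, v8hf y, v8hf z)
{
  v8hf m = x * y;
  v8hf a = m + z;                       /* lanes that add */
  v8hf s = m - z;                       /* lanes that subtract */
  /* Even lanes from the sub result, odd lanes from the add result,
     matching the fmaddsub lane order.  */
  return __builtin_shufflevector (s, a, 0, 9, 2, 11, 4, 13, 6, 15);
}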
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #7 from Hongtao.liu ---
> to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly
> creates .VEC_ADDSUB when possible).

Let's put it under -fno-trapping-math.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #6 from Richard Biener ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #1)
> > Hmm, I think the issue is we see
> >
> > f (__m128d x, __m128d y, __m128d z)
> > {
> >   vector(2) double _4;
> >   vector(2) double _6;
> >
> >   <bb 2> [100.00%]:
> >   _4 = x_2(D) * y_3(D);
> >   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
> We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
> VEC_FMADDSUB in match.pd?

I think MUL + .VEC_ADDSUB can be handled in the FMA pass.  For my example
above we get, early (before FMA recog):

  _4 = x_2(D) * y_3(D);
  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

and we could recognize that as .VEC_ADDSUB.  I think we want to avoid doing
this too early.  I'm not sure doing it within the FMA pass itself will work,
since we key FMAs on the mult but would need to key the addsub on the
VEC_PERM (we walk stmts from BB start to end).  Looking at the code it seems
changing the walking order should work.

Note that matching

  tem2_7 = _4 + z_6(D);
  tem3_8 = _4 - z_6(D);
  _9 = VEC_PERM_EXPR <tem2_7, tem3_8, { 0, 3 }>;

to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly
creates .VEC_ADDSUB when possible).
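To make the exception concern concrete, here is a scalar reference model of
.VEC_ADDSUB (a sketch of mine, assuming the even-lanes-subtract /
odd-lanes-add order of _mm_addsub_pd).  The fused operation performs one FP
operation per lane, while the open-coded add + sub + permute form evaluates
both the sum and the difference in every lane and discards half of the
results, so a discarded lane can still raise an FP exception:

#include <stddef.h>

/* Scalar reference of .VEC_ADDSUB: even lanes subtract, odd lanes add,
   one FP operation per lane.  */
void
vec_addsub_ref (const double *x, const double *y, double *r, size_t n)
{
  for (size_t i = 0; i < n; i++)
    r[i] = (i & 1) ? x[i] + y[i] : x[i] - y[i];
}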
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #5 from Hongtao.liu ---
(In reply to Richard Biener from comment #1)
> Hmm, I think the issue is we see
>
> f (__m128d x, __m128d y, __m128d z)
> {
>   vector(2) double _4;
>   vector(2) double _6;
>
>   <bb 2> [100.00%]:
>   _4 = x_2(D) * y_3(D);
>   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]

We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB ->
VEC_FMADDSUB in match.pd?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #4 from Hongtao.liu ---
(In reply to Richard Biener from comment #2)
> __m128d h(__m128d x, __m128d y, __m128d z){
>   __m128d tem = _mm_mul_pd (x,y);
>   __m128d tem2 = tem + z;
>   __m128d tem3 = tem - z;
>   return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
> }
>
> doesn't quite work (the combiner pattern for fmaddsub is missing).  Tried
> {0, 2} as well.
>
> h:
> .LFB5021:
>         .cfi_startproc
>         vmovapd %xmm0, %xmm3
>         vfmsub132pd     %xmm1, %xmm2, %xmm0
>         vfmadd132pd     %xmm1, %xmm2, %xmm3
>         vshufpd $2, %xmm0, %xmm3, %xmm0

For this we get

  tem2_6 = .FMA (x_2(D), y_3(D), z_5(D));
  # DEBUG tem2 => tem2_6
  # DEBUG BEGIN_STMT
  tem3_7 = .FMS (x_2(D), y_3(D), z_5(D));
  # DEBUG tem3 => NULL
  # DEBUG BEGIN_STMT
  _8 = VEC_PERM_EXPR <tem2_6, tem3_7, { 0, 3 }>;

Can it be handled in match.pd?  Rewriting the fmaddsub pattern into
(vec_merge (fma ...) (fms ...)) looks too complex.  Similarly for
VEC_ADDSUB + MUL -> VEC_FMADDSUB.
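What the .FMA/.FMS pair plus the VEC_PERM_EXPR computes, as a scalar
reference model (my sketch; lane order as in vfmaddsubpd, i.e. even lanes
fused multiply-subtract, odd lanes fused multiply-add).  The unfused
sequence evaluates both fused results for every lane and throws half of
them away, which is what a single .VEC_FMADDSUB would avoid:

#include <math.h>
#include <stddef.h>

/* Scalar reference of .VEC_FMADDSUB: fma in odd lanes, fms in even
   lanes, one fused operation per lane.  */
void
vec_fmaddsub_ref (const double *x, const double *y, const double *z,
                  double *r, size_t n)
{
  for (size_t i = 0; i < n; i++)
    r[i] = (i & 1) ? fma (x[i], y[i], z[i]) : fma (x[i], y[i], -z[i]);
}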
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #3 from Richard Biener ---
*** Bug 84361 has been marked as a duplicate of this bug. ***
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

--- Comment #2 from Richard Biener ---
__m128d h(__m128d x, __m128d y, __m128d z){
  __m128d tem = _mm_mul_pd (x,y);
  __m128d tem2 = tem + z;
  __m128d tem3 = tem - z;
  return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
}

doesn't quite work (the combiner pattern for fmaddsub is missing).  Tried
{0, 2} as well.

h:
.LFB5021:
        .cfi_startproc
        vmovapd %xmm0, %xmm3
        vfmsub132pd     %xmm1, %xmm2, %xmm0
        vfmadd132pd     %xmm1, %xmm2, %xmm3
        vshufpd $2, %xmm0, %xmm3, %xmm0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-08-21
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener ---
Hmm, I think the issue is we see

f (__m128d x, __m128d y, __m128d z)
{
  vector(2) double _4;
  vector(2) double _6;

  <bb 2> [100.00%]:
  _4 = x_2(D) * y_3(D);
  _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]
  return _6;

the vectorizer will implement addsub as

  _6 = _4 + z_5(D);
  _7 = _4 - z_5(D);
  _8 = __builtin_shuffle (_6, _7, {0, 1});
  return _8;

which would then end up as (if the non-single use allows)

  _6 = FMA <x_2(D), y_3(D), z_5(D)>;
  _9 = -z_5(D);
  _7 = FMA <x_2(D), y_3(D), _9>;
  _8 = __builtin_shuffle (_6, _7, {0, 1});
  return _8;

a bit interesting for combine to figure out, but theoretically possible?
(I think we expand both FMAs properly.)  Look at the addsub patterns.

That is, handling this requires open-coding _mm_addsub_pd with add, sub
and shuffle ...
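The comment quotes only the GIMPLE; the presumed C source, plus the
open-coded variant suggested at the end, would be something like this (my
reconstruction; the shuffle indices are chosen to match _mm_addsub_pd's
subtract-even/add-odd lane order):

#include <immintrin.h>

/* Presumed source of the dump above: the builtin comes straight from
   _mm_addsub_pd, which is opaque to the FMA machinery.  */
__m128d
f (__m128d x, __m128d y, __m128d z)
{
  return _mm_addsub_pd (_mm_mul_pd (x, y), z);
}

/* Open-coded variant: add, sub and shuffle instead of the builtin, so
   the middle end sees the individual operations and can fuse them.
   Lane 0 comes from the sub result, lane 1 from the add result.  */
__m128d
f_open (__m128d x, __m128d y, __m128d z)
{
  __m128d m = _mm_mul_pd (x, y);
  __m128d a = m + z;
  __m128d s = m - z;
  return __builtin_shuffle (s, a, (__m128i) { 0, 3 });
}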