[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #19 from CVS Commits ---
The master branch has been updated by Uros Bizjak:

https://gcc.gnu.org/g:ade30fad6669e5f34ca4c587c724d74ecc953175

commit r14-2786-gade30fad6669e5f34ca4c587c724d74ecc953175
Author: Uros Bizjak
Date:   Wed Jul 26 11:10:46 2023 +0200

    i386: Clear upper half of XMM register for V2SFmode operations [PR110762]

    Clear the upper half of a V4SFmode operand register in front of all
    potentially trapping instructions. The testcase:

    --cut here--
    typedef float v2sf __attribute__((vector_size(8)));
    typedef float v4sf __attribute__((vector_size(16)));

    v2sf test(v4sf x, v4sf y)
    {
      v2sf x2, y2;

      x2 = __builtin_shufflevector (x, x, 0, 1);
      y2 = __builtin_shufflevector (y, y, 0, 1);

      return x2 + y2;
    }
    --cut here--

    now compiles to:

    movq    %xmm1, %xmm1    # 9     [c=4 l=4]  *vec_concatv4sf_0
    movq    %xmm0, %xmm0    # 10    [c=4 l=4]  *vec_concatv4sf_0
    addps   %xmm1, %xmm0    # 11    [c=12 l=3]  *addv4sf3/0

    This approach addresses issues with exceptions, as well as issues with
    denormal/invalid values. An obvious exception to the rule is division,
    where a value != 0.0 should be loaded into the upper half of the
    denominator to avoid a division-by-zero exception.

    The patch effectively tightens the solution from PR95046 by clearing
    the upper halves of all operand registers before every potentially
    trapping instruction. The testcase:

    --cut here--
    typedef float __attribute__((vector_size(8))) v2sf;

    v2sf test (v2sf a, v2sf b, v2sf c)
    {
      return a * b - c;
    }
    --cut here--

    compiles to:

    movq    %xmm1, %xmm1    # 8     [c=4 l=4]  *vec_concatv4sf_0
    movq    %xmm0, %xmm0    # 9     [c=4 l=4]  *vec_concatv4sf_0
    movq    %xmm2, %xmm2    # 12    [c=4 l=4]  *vec_concatv4sf_0
    mulps   %xmm1, %xmm0    # 10    [c=16 l=3]  *mulv4sf3/0
    movq    %xmm0, %xmm0    # 13    [c=4 l=4]  *vec_concatv4sf_0
    subps   %xmm2, %xmm0    # 14    [c=12 l=3]  *subv4sf3/0

    The implementation emits a V4SFmode operation, so we can remove all
    "emulated" SSE2 V2SFmode trapping instructions and remove the
    "emulated" SSE2 V2SFmode alternatives from 3dNOW! insn patterns.
    PR target/110762

    gcc/ChangeLog:

    * config/i386/i386.md (plusminusmult): New code iterator.
    * config/i386/mmx.md (mmxdoublevecmode): New mode attribute.
    (movq__to_sse): New expander.
    (v2sf3): Macroize expander from addv2sf3, subv2sf3 and
    mulv2sf3 using plusminusmult code iterator.  Rewrite as a
    wrapper around V4SFmode operation.
    (mmx_addv2sf3): Change operand 1 and operand 2
    predicates to nonimmediate_operand.
    (*mmx_addv2sf3): Remove SSE alternatives.  Change operand 1
    and operand 2 predicates to nonimmediate_operand.
    (mmx_subv2sf3): Change operand 2 predicate to
    nonimmediate_operand.
    (mmx_subrv2sf3): Change operand 1 predicate to
    nonimmediate_operand.
    (*mmx_subv2sf3): Remove SSE alternatives.  Change operand 1
    and operand 2 predicates to nonimmediate_operand.
    (mmx_mulv2sf3): Change operand 1 and operand 2 predicates
    to nonimmediate_operand.
    (*mmx_mulv2sf3): Remove SSE alternatives.  Change operand 1
    and operand 2 predicates to nonimmediate_operand.
    (divv2sf3): Rewrite as a wrapper around V4SFmode operation.
    (v2sf3): Ditto.
    (mmx_v2sf3): Change operand 1 and operand 2 predicates
    to nonimmediate_operand.
    (*mmx_v2sf3): Remove SSE alternatives.  Change operand 1
    and operand 2 predicates to nonimmediate_operand.
    (mmx_ieee_v2sf3): Ditto.
    (sqrtv2sf2): Rewrite as a wrapper around V4SFmode operation.
    (*mmx_haddv2sf3_low): Ditto.
    (*mmx_hsubv2sf3_low): Ditto.
    (vec_addsubv2sf3): Ditto.
    (*mmx_maskcmpv2sf3_comm): Remove.
    (*mmx_maskcmpv2sf3): Remove.
    (vec_cmpv2sfv2si): Rewrite as a wrapper around V4SFmode
    operation.
    (vcondv2sf): Ditto.
    (fmav2sf4): Ditto.
    (fmsv2sf4): Ditto.
    (fnmav2sf4): Ditto.
    (fnmsv2sf4): Ditto.
    (fix_truncv2sfv2si2): Ditto.
    (fixuns_truncv2sfv2si2): Ditto.
    (mmx_fix_truncv2sfv2si2): Remove SSE alternatives.  Change
    operand 1 predicate to nonimmediate_operand.
    (floatv2siv2sf2): Rewrite as a wrapper around V4SFmode
    operation.
    (floatunsv2siv2sf2): Ditto.
    (mmx_floatv2siv2sf2): Remove SSE alternatives.  Change
    operand 1 predicate to nonimmediate_operand.
    (nearbyintv2sf2): Rewrite as a wrapper around V4SFmode
    operation.
--- Comment #18 from Uroš Bizjak ---
(In reply to Richard Biener from comment #17)
> > compiles to:
> >
> > movq    %xmm1, %xmm1    # 8     [c=4 l=4]  *vec_concatv4sf_0
> > movq    %xmm0, %xmm0    # 9     [c=4 l=4]  *vec_concatv4sf_0
> > movq    %xmm2, %xmm2    # 12    [c=4 l=4]  *vec_concatv4sf_0
> > mulps   %xmm1, %xmm0    # 10    [c=16 l=3]  *mulv4sf3/0
> > movq    %xmm0, %xmm0    # 13    [c=4 l=4]  *vec_concatv4sf_0
>
> so this one is obviously redundant - I suppose at the RTL level we have
> no chance of noticing this.  I hope for integer vector operations we
> avoid these ops?  I think this will make epilog vectorization with
> V2SFmode a bad idea, we'd need to appropriately disqualify this in the
> costing hooks.

Yes, the redundant movq is emitted only in front of V2SFmode trapping
operations. So, all integer, V2SF logic and swizzling operations are still
implemented directly with "emulated" instructions.

> I wonder if combine could for example combine a v2sf load with the
> upper half zeroing for the next use?  Likewise for arithmetics.

The patch already does that. We know that a V2SF load zeroes the upper half,
so no additional MOVQ is emitted. To illustrate, the testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf m;

v2sf test (v2sf a)
{
  return a - m;
}
--cut here--

compiles to:

movq    m(%rip), %xmm1  # 6     [c=4 l=8]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 7     [c=4 l=4]  *vec_concatv4sf_0
subps   %xmm1, %xmm0    # 8     [c=12 l=3]  *subv4sf3/0

As far as arithmetic is concerned, perhaps some back-walking RTL optimization
pass could figure out that a preceding trapping V2SFmode operation guarantees
zeros in the upper half and remove the clearing insn. However, MOVQ xmm,xmm
is an extremely fast instruction with a latency of 1 and a reciprocal
throughput of 0.33, so I guess it is not of much concern.
--- Comment #17 from Richard Biener ---
(In reply to Uroš Bizjak from comment #16)
> Created attachment 55636 [details]
> Proposed patch
>
> Proposed patch clears the upper half of a V4SFmode operand register before
> all potentially trapping instructions. The testcase from comment #12 now
> compiles to:
>
> movq    %xmm1, %xmm1    # 9     [c=4 l=4]  *vec_concatv4sf_0
> movq    %xmm0, %xmm0    # 10    [c=4 l=4]  *vec_concatv4sf_0
> addps   %xmm1, %xmm0    # 11    [c=12 l=3]  *addv4sf3/0
>
> This approach addresses issues with traps (Comment #0), as well as with
> denormal/invalid values (Comment #14). An obvious exception to the rule is
> a division, where the value != 0.0 should be loaded into the upper half of
> the denominator.
>
> The patch effectively tightens the solution from PR95046 by clearing upper
> halves of all operand registers before every potentially trapping
> instruction. The testcase:
>
> --cut here--
> typedef float __attribute__((vector_size(8))) v2sf;
>
> v2sf test (v2sf a, v2sf b, v2sf c)
> {
>   return a * b - c;
> }
> --cut here--
>
> compiles to:
>
> movq    %xmm1, %xmm1    # 8     [c=4 l=4]  *vec_concatv4sf_0
> movq    %xmm0, %xmm0    # 9     [c=4 l=4]  *vec_concatv4sf_0
> movq    %xmm2, %xmm2    # 12    [c=4 l=4]  *vec_concatv4sf_0
> mulps   %xmm1, %xmm0    # 10    [c=16 l=3]  *mulv4sf3/0
> movq    %xmm0, %xmm0    # 13    [c=4 l=4]  *vec_concatv4sf_0

so this one is obviously redundant - I suppose at the RTL level we have no
chance of noticing this.  I hope for integer vector operations we avoid
these ops?  I think this will make epilog vectorization with V2SFmode a bad
idea, we'd need to appropriately disqualify this in the costing hooks.

I wonder if combine could for example combine a v2sf load with the upper
half zeroing for the next use?  Likewise for arithmetics.

> subps   %xmm2, %xmm0    # 14    [c=12 l=3]  *subv4sf3/0
>
> The implementation simply calls V4SFmode operation, so we can remove all
> "emulated" SSE2 V2SFmode instructions and SSE2 V2SFmode alternatives from
> 3dNOW! insn patterns.
Uroš Bizjak changed:

           What    |Removed                       |Added
----------------------------------------------------------------
         Assignee  |unassigned at gcc dot gnu.org |ubizjak at gmail dot com
           Status  |NEW                           |ASSIGNED

--- Comment #16 from Uroš Bizjak ---
Created attachment 55636
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55636&action=edit
Proposed patch

The proposed patch clears the upper half of a V4SFmode operand register
before all potentially trapping instructions. The testcase from comment #12
now compiles to:

movq    %xmm1, %xmm1    # 9     [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 10    [c=4 l=4]  *vec_concatv4sf_0
addps   %xmm1, %xmm0    # 11    [c=12 l=3]  *addv4sf3/0

This approach addresses issues with traps (comment #0) as well as with
denormal/invalid values (comment #14). An obvious exception to the rule is
division, where a value != 0.0 should be loaded into the upper half of the
denominator.

The patch effectively tightens the solution from PR95046 by clearing the
upper halves of all operand registers before every potentially trapping
instruction. The testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf test (v2sf a, v2sf b, v2sf c)
{
  return a * b - c;
}
--cut here--

compiles to:

movq    %xmm1, %xmm1    # 8     [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 9     [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm2, %xmm2    # 12    [c=4 l=4]  *vec_concatv4sf_0
mulps   %xmm1, %xmm0    # 10    [c=16 l=3]  *mulv4sf3/0
movq    %xmm0, %xmm0    # 13    [c=4 l=4]  *vec_concatv4sf_0
subps   %xmm2, %xmm0    # 14    [c=12 l=3]  *subv4sf3/0

The implementation simply calls the V4SFmode operation, so we can remove all
"emulated" SSE2 V2SFmode instructions and the SSE2 V2SFmode alternatives
from the 3dNOW! insn patterns.
--- Comment #15 from jbeulich at suse dot com ---
(In reply to Richard Biener from comment #12)
> _mm_storel_pi could be implemented using __builtin_shufflevector these
> days.  Which shows exactly the same issue:

(Also related to comment 10.) I don't think the problem is how the registers
are filled (and in my example I simply used the first approach that came to
mind and worked). The problem is that the arithmetic insn assumes the upper
parts do not hold certain special values (or pairs thereof). AIUI one could
create the exact same situation with inline assembly instead of any of the
builtins.

This isn't any different from using 512-bit operations for narrower vectors
when AVX512VL isn't enabled. AFAICT such uses are carefully avoided for
floating point vectors, and are used only in a limited number of cases on
integer vectors (Hongtao recently asked me to not go any further in that
direction either).
--- Comment #14 from Alexander Monakov ---
That seems undesirable in light of comment #4: you'd risk creating a
situation where -fno-trapping-math is unpredictably slower when denormals
appear in dirty upper halves.
--- Comment #13 from Uroš Bizjak ---
I think we should put all partial vector V2SF operations under
!flag_trapping_math.
--- Comment #12 from Richard Biener ---
_mm_storel_pi could be implemented using __builtin_shufflevector these days.
Which shows exactly the same issue:

typedef float __attribute__((vector_size(8))) v2sf_t;
typedef float __attribute__((vector_size(16))) v4sf_t;

v2sf_t test(v4sf_t x, v4sf_t y)
{
  v2sf_t x2, y2;

  x2 = __builtin_shufflevector (x, x, 0, 1);
  y2 = __builtin_shufflevector (y, x, 0, 1);

  return x2 + y2;
}

expands to

(insn 7 4 8 2 (set (reg:DI 88)
        (vec_select:DI (subreg:V2DI (reg/v:V4SF 85 [ x ]) 0)
            (parallel [
                    (const_int 0 [0])
                ]))) "t.c":7:5 -1
     (nil))
(insn 8 7 9 2 (set (reg:DI 89)
        (vec_select:DI (subreg:V2DI (reg/v:V4SF 86 [ y ]) 0)
            (parallel [
                    (const_int 0 [0])
                ]))) "t.c":8:5 -1
     (nil))
(insn 9 8 10 2 (set (reg:V2SF 87)
        (plus:V2SF (subreg:V2SF (reg:DI 88) 0)
            (subreg:V2SF (reg:DI 89) 0))) "t.c":12:12 -1
     (nil))

and is recognized by the same set_noop_p code. On GIMPLE we have

  x2_2 = BIT_FIELD_REF ;
  y2_4 = BIT_FIELD_REF ;
  _5 = x2_2 + y2_4;
--- Comment #11 from Richard Biener ---
(In reply to Richard Biener from comment #2)
> The
>
> (insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
>         (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
>             (parallel [
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "t.c":10:12 4394 {sse_storelps}
>      (nil))
>
> insns are gone in split after reload.

The opinion is that the above insn leaves the upper half of xmm0 undefined.
--- Comment #10 from Uroš Bizjak ---
(In reply to Richard Biener from comment #7)
> I guess for the specific usage we need to wrap this in an UNSPEC?

Probably. A MOVQ xmm, xmm insn should be emitted for __builtin_ia32_storelps
(AKA _mm_storel_pi), so the top 64 bits will be cleared. There is already
*vec_concatv4sf_0, which looks appropriate to implement the move.
--- Comment #9 from jbeulich at suse dot com ---
(In reply to Richard Biener from comment #1)
> So what's the issue?  That this is wrong for -ftrapping-math?

Even without that option, MXCSR may be modified for reasons confined to just
the upper halves of the registers.

> Or that the
> return value has undefined contents in the upper half?  (I don't think the
> ABI specifies how V2SF is returned)

This part is fine, AIUI.
--- Comment #8 from Richard Biener ---
OTOH the set isn't a no-op for the xmm0 hard register (it zeros the upper
parts).
--- Comment #7 from Richard Biener ---
I guess for the specific usage we need to wrap this in an UNSPEC?
--- Comment #6 from Richard Biener ---
(In reply to Segher Boessenkool from comment #5)
> (In reply to Richard Biener from comment #2)
> > The
> >
> > (insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
> >         (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
> >             (parallel [
> >                     (const_int 0 [0])
> >                     (const_int 1 [0x1])
> >                 ]))) "t.c":10:12 4394 {sse_storelps}
> >      (nil))
> >
> > insns are gone in split after reload.
>
> Insns 13 and 14 are deleted by split2, yes.  Although the very next insn
> (15) obviously uses the regs (20 and 21) those insns set?!

set_noop_p returns true for it ...
Segher Boessenkool changed:

           What    |Removed |Added
----------------------------------------------------------------
                CC |        |segher at gcc dot gnu.org

--- Comment #5 from Segher Boessenkool ---
(In reply to Richard Biener from comment #2)
> The
>
> (insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
>         (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
>             (parallel [
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "t.c":10:12 4394 {sse_storelps}
>      (nil))
>
> insns are gone in split after reload.

Insns 13 and 14 are deleted by split2, yes.  Although the very next insn
(15) obviously uses the regs (20 and 21) those insns set?!
Alexander Monakov changed:

           What    |Removed |Added
----------------------------------------------------------------
                CC |        |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov ---
In addition to the FPU exception issue, it's also a performance trap due to
the handling of accidental denormals in the upper halves.
--- Comment #3 from Uroš Bizjak ---
(In reply to Richard Biener from comment #1)
> So what's the issue?  That this is wrong for -ftrapping-math?  Or that the
> return value has undefined contents in the upper half?  (I don't think the
> ABI specifies how V2SF is returned)

__m64 is classified as SSE class and returned in an XMM register.
Richard Biener changed:

           What            |Removed     |Added
----------------------------------------------------------------
           Keywords        |            |wrong-code
   Last reconfirmed        |            |2023-07-21
     Ever confirmed        |0           |1
             Status        |UNCONFIRMED |NEW

--- Comment #2 from Richard Biener ---
The

(insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
        (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
            (parallel [
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) "t.c":10:12 4394 {sse_storelps}
     (nil))

insns are gone in split after reload.
--- Comment #1 from Richard Biener ---
So what's the issue?  That this is wrong for -ftrapping-math?  Or that the
return value has undefined contents in the upper half?  (I don't think the
ABI specifies how V2SF is returned)