[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-26 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #19 from CVS Commits  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:ade30fad6669e5f34ca4c587c724d74ecc953175

commit r14-2786-gade30fad6669e5f34ca4c587c724d74ecc953175
Author: Uros Bizjak 
Date:   Wed Jul 26 11:10:46 2023 +0200

i386: Clear upper half of XMM register for V2SFmode operations [PR110762]

Clear the upper half of a V4SFmode operand register in front of all
potentially trapping instructions. The testcase:

--cut here--
typedef float v2sf __attribute__((vector_size(8)));
typedef float v4sf __attribute__((vector_size(16)));

v2sf test(v4sf x, v4sf y)
{
  v2sf x2, y2;

  x2 = __builtin_shufflevector (x, x, 0, 1);
  y2 = __builtin_shufflevector (y, y, 0, 1);

  return x2 + y2;
}
--cut here--

now compiles to:

movq    %xmm1, %xmm1    # 9  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 10 [c=4 l=4]  *vec_concatv4sf_0
addps   %xmm1, %xmm0    # 11 [c=12 l=3]  *addv4sf3/0

This approach addresses issues with exceptions, as well as issues with
denormal/invalid values. An obvious exception to the rule is division,
where a value != 0.0 should be loaded into the upper half of the
denominator to avoid a division-by-zero exception.

The patch effectively tightens the solution from PR95046 by clearing upper
halves of all operand registers before every potentially trapping
instruction.
The testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf test (v2sf a, v2sf b, v2sf c)
{
  return a * b - c;
}
--cut here--

compiles to:

movq    %xmm1, %xmm1    # 8  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 9  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm2, %xmm2    # 12 [c=4 l=4]  *vec_concatv4sf_0
mulps   %xmm1, %xmm0    # 10 [c=16 l=3]  *mulv4sf3/0
movq    %xmm0, %xmm0    # 13 [c=4 l=4]  *vec_concatv4sf_0
subps   %xmm2, %xmm0    # 14 [c=12 l=3]  *subv4sf3/0

The implementation emits a V4SFmode operation, so we can remove all
"emulated" SSE2 V2SFmode trapping instructions and remove the "emulated"
SSE2 V2SFmode alternatives from 3dNOW! insn patterns.

PR target/110762

gcc/ChangeLog:

* config/i386/i386.md (plusminusmult): New code iterator.
* config/i386/mmx.md (mmxdoublevecmode): New mode attribute.
(movq__to_sse): New expander.
(v2sf3): Macroize expander from addv2sf3,
subv2sf3 and mulv2sf3 using plusminusmult code iterator.  Rewrite
as a wrapper around V4SFmode operation.
(mmx_addv2sf3): Change operand 1 and operand 2 predicates to
nonimmediate_operand.
(*mmx_addv2sf3): Remove SSE alternatives.  Change operand 1 and
operand 2 predicates to nonimmediate_operand.
(mmx_subv2sf3): Change operand 2 predicate to nonimmediate_operand.
(mmx_subrv2sf3): Change operand 1 predicate to
nonimmediate_operand.
(*mmx_subv2sf3): Remove SSE alternatives.  Change operand 1 and
operand 2 predicates to nonimmediate_operand.
(mmx_mulv2sf3): Change operand 1 and operand 2 predicates to
nonimmediate_operand.
(*mmx_mulv2sf3): Remove SSE alternatives.  Change operand 1 and
operand 2 predicates to nonimmediate_operand.
(divv2sf3): Rewrite as a wrapper around V4SFmode operation.
(v2sf3): Ditto.
(mmx_v2sf3): Change operand 1 and operand 2
predicates to nonimmediate_operand.
(*mmx_v2sf3): Remove SSE alternatives.  Change
operand 1 and operand 2 predicates to nonimmediate_operand.
(mmx_ieee_v2sf3): Ditto.
(sqrtv2sf2): Rewrite as a wrapper around V4SFmode operation.
(*mmx_haddv2sf3_low): Ditto.
(*mmx_hsubv2sf3_low): Ditto.
(vec_addsubv2sf3): Ditto.
(*mmx_maskcmpv2sf3_comm): Remove.
(*mmx_maskcmpv2sf3): Remove.
(vec_cmpv2sfv2si): Rewrite as a wrapper around V4SFmode operation.
(vcondv2sf): Ditto.
(fmav2sf4): Ditto.
(fmsv2sf4): Ditto.
(fnmav2sf4): Ditto.
(fnmsv2sf4): Ditto.
(fix_truncv2sfv2si2): Ditto.
(fixuns_truncv2sfv2si2): Ditto.
(mmx_fix_truncv2sfv2si2): Remove SSE alternatives.
Change operand 1 predicate to nonimmediate_operand.
(floatv2siv2sf2): Rewrite as a wrapper around V4SFmode operation.
(floatunsv2siv2sf2): Ditto.
(mmx_floatv2siv2sf2): Remove SSE alternatives.
Change operand 1 predicate to nonimmediate_operand.
(nearbyintv2sf2): Rewrite as a wrapper around V4SFmode operation.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-26 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #18 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #17)
> > compiles to:
> > 
> > movq    %xmm1, %xmm1    # 8  [c=4 l=4]  *vec_concatv4sf_0
> > movq    %xmm0, %xmm0    # 9  [c=4 l=4]  *vec_concatv4sf_0
> > movq    %xmm2, %xmm2    # 12 [c=4 l=4]  *vec_concatv4sf_0
> > mulps   %xmm1, %xmm0    # 10 [c=16 l=3]  *mulv4sf3/0
> > movq    %xmm0, %xmm0    # 13 [c=4 l=4]  *vec_concatv4sf_0
> 
> so this one is obviously redundant - I suppose at the RTL level we have
> no chance of noticing this.  I hope for integer vector operations we
> avoid these ops?  I think this will make epilog vectorization with V2SFmode
> a bad idea, we'd need to appropriately disqualify this in the costing
> hooks.

Yes, the redundant MOVQ is emitted only in front of V2SFmode trapping
operations, so all integer, V2SF logic, and swizzling operations are still
implemented directly with "emulated" instructions.
> 
> I wonder if combine could for example combine a v2sf load with the
> upper half zeroing for the next use?  Likewise for arithmetics.

The patch already does that. We know that a V2SF load zeroes the upper half,
so no additional MOVQ is emitted. To illustrate, the testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf m;

v2sf test (v2sf a)
{
  return a - m;
}
--cut here--

compiles to:

movq    m(%rip), %xmm1  # 6  [c=4 l=8]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 7  [c=4 l=4]  *vec_concatv4sf_0
subps   %xmm1, %xmm0    # 8  [c=12 l=3]  *subv4sf3/0

As far as arithmetic is concerned, perhaps some back-walking RTL optimization
pass could figure out that the preceding trapping V2SFmode operation
guarantees zeros in the upper half and remove the clearing insn. However,
MOVQ xmm,xmm is an extremely fast instruction with a latency of 1 and a
reciprocal throughput of 0.33, so I guess it is not of much concern.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-26 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #17 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #16)
> Created attachment 55636 [details]
> Proposed patch
> 
> Proposed patch clears the upper half of a V4SFmode operand register before
> all potentially trapping instructions. The testcase from comment #12 now
> compiles to:
> 
> movq    %xmm1, %xmm1    # 9  [c=4 l=4]  *vec_concatv4sf_0
> movq    %xmm0, %xmm0    # 10 [c=4 l=4]  *vec_concatv4sf_0
> addps   %xmm1, %xmm0    # 11 [c=12 l=3]  *addv4sf3/0
> 
> This approach addresses issues with traps (Comment #0), as well as with
> denormal/invalid values (Comment #14). An obvious exception to the rule is a
> division, where the value != 0.0 should be loaded into the upper half of the
> denominator.
> 
> The patch effectively tightens the solution from PR95046 by clearing upper
> halves of all operand registers before every potentially trapping
> instruction. The testcase:
> 
> --cut here--
> typedef float __attribute__((vector_size(8))) v2sf;
> 
> v2sf test (v2sf a, v2sf b, v2sf c)
> {
>   return a * b - c;
> }
> --cut here--
> 
> compiles to:
> 
> movq    %xmm1, %xmm1    # 8  [c=4 l=4]  *vec_concatv4sf_0
> movq    %xmm0, %xmm0    # 9  [c=4 l=4]  *vec_concatv4sf_0
> movq    %xmm2, %xmm2    # 12 [c=4 l=4]  *vec_concatv4sf_0
> mulps   %xmm1, %xmm0    # 10 [c=16 l=3]  *mulv4sf3/0
> movq    %xmm0, %xmm0    # 13 [c=4 l=4]  *vec_concatv4sf_0

so this one is obviously redundant - I suppose at the RTL level we have
no chance of noticing this.  I hope for integer vector operations we
avoid these ops?  I think this will make epilog vectorization with V2SFmode
a bad idea, we'd need to appropriately disqualify this in the costing
hooks.

I wonder if combine could for example combine a v2sf load with the
upper half zeroing for the next use?  Likewise for arithmetics.

> subps   %xmm2, %xmm0    # 14 [c=12 l=3]  *subv4sf3/0
> 
> The implementation simply calls V4SFmode operation, so we can remove all
> "emulated" SSE2 V2SFmode instructions and SSE2 V2SFmode alternatives from
> 3dNOW! insn patterns.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-26 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #16 from Uroš Bizjak  ---
Created attachment 55636
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55636&action=edit
Proposed patch

Proposed patch clears the upper half of a V4SFmode operand register before all
potentially trapping instructions. The testcase from comment #12 now compiles
to:

movq    %xmm1, %xmm1    # 9  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 10 [c=4 l=4]  *vec_concatv4sf_0
addps   %xmm1, %xmm0    # 11 [c=12 l=3]  *addv4sf3/0

This approach addresses issues with traps (Comment #0), as well as with
denormal/invalid values (Comment #14). An obvious exception to the rule is a
division, where the value != 0.0 should be loaded into the upper half of the
denominator.

The patch effectively tightens the solution from PR95046 by clearing upper
halves of all operand registers before every potentially trapping instruction.
The testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf test (v2sf a, v2sf b, v2sf c)
{
  return a * b - c;
}
--cut here--

compiles to:

movq    %xmm1, %xmm1    # 8  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 9  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm2, %xmm2    # 12 [c=4 l=4]  *vec_concatv4sf_0
mulps   %xmm1, %xmm0    # 10 [c=16 l=3]  *mulv4sf3/0
movq    %xmm0, %xmm0    # 13 [c=4 l=4]  *vec_concatv4sf_0
subps   %xmm2, %xmm0    # 14 [c=12 l=3]  *subv4sf3/0

The implementation simply calls the V4SFmode operation, so we can remove all
"emulated" SSE2 V2SFmode instructions and the SSE2 V2SFmode alternatives from
3dNOW! insn patterns.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread jbeulich at suse dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #15 from jbeulich at suse dot com ---
(In reply to Richard Biener from comment #12)
> _mm_storel_pi could be implemented using __builtin_shufflevector these days.
> Which shows exactly the same issue:

(also related to comment 10) I don't think the problem is how the registers
are filled (in my example I simply used the first approach that came to mind
and worked). The problem is that the arithmetic insn assumes that the upper
parts do not hold certain special values (or pairs thereof). Aiui one could
create the exact same situation with inline assembly instead of any of the
builtins.

This isn't any different from using 512-bit operations for narrower vectors
when AVX512VL isn't enabled. Afaict such uses are carefully avoided for
floating-point vectors, and are used only in a limited number of cases on
integer vectors (Hongtao recently asked me to not go any further in that
direction either).

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #14 from Alexander Monakov  ---
That seems undesirable in light of comment #4: you'd risk creating a
situation where -fno-trapping-math is unpredictably slower when denormals
appear in dirty upper halves.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #13 from Uroš Bizjak  ---
I think we should put all partial vector V2SF operations under
!flag_trapping_math.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #12 from Richard Biener  ---
_mm_storel_pi could be implemented using __builtin_shufflevector these days.
Which shows exactly the same issue:

typedef float __attribute__((vector_size(8))) v2sf_t;
typedef float __attribute__((vector_size(16))) v4sf_t;

v2sf_t test(v4sf_t x, v4sf_t y) {
  v2sf_t x2, y2;

  x2 = __builtin_shufflevector (x, x, 0, 1);
  y2 = __builtin_shufflevector (y, x, 0, 1);

  return x2 + y2;
}

expands to

(insn 7 4 8 2 (set (reg:DI 88)
        (vec_select:DI (subreg:V2DI (reg/v:V4SF 85 [ x ]) 0)
            (parallel [
                    (const_int 0 [0])
                ]))) "t.c":7:5 -1
     (nil))
(insn 8 7 9 2 (set (reg:DI 89)
        (vec_select:DI (subreg:V2DI (reg/v:V4SF 86 [ y ]) 0)
            (parallel [
                    (const_int 0 [0])
                ]))) "t.c":8:5 -1
     (nil))
(insn 9 8 10 2 (set (reg:V2SF 87)
        (plus:V2SF (subreg:V2SF (reg:DI 88) 0)
            (subreg:V2SF (reg:DI 89) 0))) "t.c":12:12 -1
     (nil))

and is recognized by the same set_noop_p code.  On GIMPLE we have

  x2_2 = BIT_FIELD_REF ;
  y2_4 = BIT_FIELD_REF ;
  _5 = x2_2 + y2_4;

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #11 from Richard Biener  ---
(In reply to Richard Biener from comment #2)
> The
> 
>    (insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
>         (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
>             (parallel [
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "t.c":10:12 4394 {sse_storelps}
>      (nil))
> 
> insns are gone in split after reload.

The opinion is that the above insn leaves the upper half of xmm0 undefined.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #10 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #7)
> I guess for the specific usage we need to wrap this in an UNSPEC?

Probably. A MOVQ xmm, xmm insn should be emitted for __builtin_ia32_storelps
(AKA _mm_storel_pi) so that the top 64 bits are cleared. There is already
*vec_concatv4sf_0, which looks appropriate to implement the move.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread jbeulich at suse dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #9 from jbeulich at suse dot com ---
(In reply to Richard Biener from comment #1)
> So what's the issue?  That this is wrong for -ftrapping-math?

Even without that option MXCSR may be modified for reasons contained to just
the upper halves of the registers.

>  Or that the
> return value has undefined contents in the upper half?  (I don't think the
> ABI specifies how V2SF is returned)

This part is fine, aiui.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #8 from Richard Biener  ---
OTOH the set isn't a no-op for the xmm0 hardreg (it zeroes the upper parts)

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #7 from Richard Biener  ---
I guess for the specific usage we need to wrap this in an UNSPEC?

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #6 from Richard Biener  ---
(In reply to Segher Boessenkool from comment #5)
> (In reply to Richard Biener from comment #2)
> > The
> > 
> >    (insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
> >         (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
> >             (parallel [
> >                     (const_int 0 [0])
> >                     (const_int 1 [0x1])
> >                 ]))) "t.c":10:12 4394 {sse_storelps}
> >      (nil))
> > 
> > insns are gone in split after reload.
> 
> Insns 13 and 14 are deleted by split2, yes.  Although the very next insn
> (15) obviously uses the regs (20 and 21) those insns set?!

set_noop_p returns true for it ...

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Segher Boessenkool  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #5 from Segher Boessenkool  ---
(In reply to Richard Biener from comment #2)
> The
> 
>    (insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
>         (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
>             (parallel [
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) "t.c":10:12 4394 {sse_storelps}
>      (nil))
> 
> insns are gone in split after reload.

Insns 13 and 14 are deleted by split2, yes.  Although the very next insn
(15) obviously uses the regs (20 and 21) those insns set?!

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
In addition to the FPU exception issue, it's also a performance trap due to
the handling of accidental denormals in the upper halves.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #3 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #1)
> So what's the issue?  That this is wrong for -ftrapping-math?  Or that the
> return value has undefined contents in the upper half?  (I don't think the
> ABI specifies how V2SF is returned)

__m64 is classified as SSE class and returned in an XMM register.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Richard Biener  changed:

   What|Removed |Added

   Keywords||wrong-code
   Last reconfirmed||2023-07-21
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #2 from Richard Biener  ---
The

   (insn 13 4 14 2 (set (reg:V2SF 20 xmm0 [orig:91 x2 ] [91])
        (vec_select:V2SF (reg:V4SF 20 xmm0 [94])
            (parallel [
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) "t.c":10:12 4394 {sse_storelps}
     (nil))

insns are gone in split after reload.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #1 from Richard Biener  ---
So what's the issue?  That this is wrong for -ftrapping-math?  Or that the
return value has undefined contents in the upper half?  (I don't think the
ABI specifies how V2SF is returned)