[Bug target/88278] Fails to elide zeroing of upper vector register

2018-12-03 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Richard Biener  ---
Seems to be fixed.

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-12-02 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #8 from Jakub Jelinek  ---
Author: jakub
Date: Sun Dec  2 20:43:49 2018
New Revision: 266728

URL: https://gcc.gnu.org/viewcvs?rev=266728=gcc=rev
Log:
PR target/88278
* config/i386/sse.md (*vec_concatv4sf_0, *vec_concatv4si_0): New insns.

* gcc.target/i386/pr88278.c: New test.
* gcc.target/i386/pr53759.c: Don't expect vmovlps insn, expect vmovq
instead.
* gcc.target/i386/pr53759-2.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr53759-2.c
trunk/gcc/testsuite/gcc.target/i386/pr88278.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/sse.md
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.target/i386/pr53759.c

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-11-30 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #7 from rguenther at suse dot de  ---
On November 30, 2018 4:28:54 PM GMT+01:00, "jakub at gcc dot gnu.org"
 wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278
>
>--- Comment #6 from Jakub Jelinek  ---
>I wonder about something like:
>--- gcc/config/i386/sse.md.jj   2018-11-29 23:16:06.481301632 +0100
>+++ gcc/config/i386/sse.md  2018-11-30 16:21:21.480379008 +0100
>@@ -7248,6 +7248,17 @@
>(set_attr "prefix" "orig,maybe_evex,orig,maybe_evex")
>(set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])
>
>+(define_insn "*vec_concatv4sf_0"
>+  [(set (match_operand:V4SF 0 "register_operand"   "=v")
>+   (vec_concat:V4SF
>+ (match_operand:V2SF 1 "nonimmediate_operand" "xm")
>+ (match_operand:V2SF 2 "const0_operand"   " C")))]
>+  "TARGET_SSE2"
>+  "%vmovq\t{%1, %0|%0, %1}"
>+  [(set_attr "type" "ssemov")
>+   (set_attr "prefix" "maybe_vex")
>+   (set_attr "mode" "DF")])
>+
>;; Avoid combining registers from different units in a single
>alternative,
> ;; see comment above inline_secondary_memory_needed function in i386.c
> (define_insn "vec_set_0"
>@@ -14409,6 +14420,23 @@
>(set_attr "prefix" "orig,maybe_evex,orig,orig,maybe_evex")
>(set_attr "mode" "TI,TI,V4SF,V2SF,V2SF")])
>
>+(define_insn "*vec_concatv4si_0"
>+  [(set (match_operand:V4SI 0 "register_operand"   "=v,x")
>+   (vec_concat:V4SI
>+ (match_operand:V2SI 1 "nonimmediate_operand" "vm,?!*y")
>+ (match_operand:V2SI 2 "const0_operand"   " C,C")))]
>+  "TARGET_SSE2"
>+  "@
>+   %vmovq\t{%1, %0|%0, %1}
>+   movq2dq\t{%1, %0|%0, %1}"
>+  [(set_attr "type" "ssemov")
>+   (set_attr "prefix" "maybe_vex,orig")
>+   (set_attr "mode" "TI")
>+   (set (attr "preferred_for_speed")
>+ (if_then_else (eq_attr "alternative" "1")
>+   (symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
>+   (symbol_ref "true")))])
>+
> ;; movd instead of movq is required to handle broken assemblers.
> (define_insn "vec_concatv2di"
>   [(set (match_operand:V2DI 0 "register_operand"
>
>but the #c0 testcases don't compile for me with -O2 -msse2 -fgimple
>(nor
>-mavx), so I can't easily verify.

I have committed the prerequisite for that now. 

>I don't see how we could get rid of those for the v <- v,C cases,
>unless we
>analyze whatever instruction generated it and prove that it leaves all
>the
>higher bits set to zero.  E.g. one could have a v4si to v2si downcast
>(just
>picking the lowpart subreg) followed by concatenating it with zero, and
>if we
>blindly drop the movq instruction, the upper bits might be non-zero.

Yeah, I'm looking for a way to do low part loads from memory with zeroing the
rest.

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-11-30 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #6 from Jakub Jelinek  ---
I wonder about something like:
--- gcc/config/i386/sse.md.jj   2018-11-29 23:16:06.481301632 +0100
+++ gcc/config/i386/sse.md  2018-11-30 16:21:21.480379008 +0100
@@ -7248,6 +7248,17 @@
(set_attr "prefix" "orig,maybe_evex,orig,maybe_evex")
(set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])

+(define_insn "*vec_concatv4sf_0"
+  [(set (match_operand:V4SF 0 "register_operand"   "=v")
+   (vec_concat:V4SF
+ (match_operand:V2SF 1 "nonimmediate_operand" "xm")
+ (match_operand:V2SF 2 "const0_operand"   " C")))]
+  "TARGET_SSE2"
+  "%vmovq\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssemov")
+   (set_attr "prefix" "maybe_vex")
+   (set_attr "mode" "DF")])
+
 ;; Avoid combining registers from different units in a single alternative,
 ;; see comment above inline_secondary_memory_needed function in i386.c
 (define_insn "vec_set_0"
@@ -14409,6 +14420,23 @@
(set_attr "prefix" "orig,maybe_evex,orig,orig,maybe_evex")
(set_attr "mode" "TI,TI,V4SF,V2SF,V2SF")])

+(define_insn "*vec_concatv4si_0"
+  [(set (match_operand:V4SI 0 "register_operand"   "=v,x")
+   (vec_concat:V4SI
+ (match_operand:V2SI 1 "nonimmediate_operand" "vm,?!*y")
+ (match_operand:V2SI 2 "const0_operand"   " C,C")))]
+  "TARGET_SSE2"
+  "@
+   %vmovq\t{%1, %0|%0, %1}
+   movq2dq\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssemov")
+   (set_attr "prefix" "maybe_vex,orig")
+   (set_attr "mode" "TI")
+   (set (attr "preferred_for_speed")
+ (if_then_else (eq_attr "alternative" "1")
+   (symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
+   (symbol_ref "true")))])
+
 ;; movd instead of movq is required to handle broken assemblers.
 (define_insn "vec_concatv2di"
   [(set (match_operand:V2DI 0 "register_operand"

but the #c0 testcases don't compile for me with -O2 -msse2 -fgimple (nor
-mavx), so I can't easily verify.

I don't see how we could get rid of those for the v <- v,C cases, unless we
analyze whatever instruction generated it and prove that it leaves all the
higher bits set to zero.  E.g. one could have a v4si to v2si downcast (just
picking the lowpart subreg) followed by concatenating it with zero, and if we
blindly drop the movq instruction, the upper bits might be non-zero.

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-11-30 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #5 from Jakub Jelinek  ---
Note, we also have vec_concatv2di pattern that handles:
(set (match_operand:V2DI 0 ("register_operand") ("=Yr,*x,x ,v ,v,v ,x  
,x,v ,x,x,v"))
(vec_concat:V2DI (match_operand:DI 1 ("nonimmediate_operand") (" 
0, 0,x ,Yv,r,vm,?!*y,0,Yv,0,0,v"))
(match_operand:DI 2 ("nonimm_or_0_operand") (" rm,rm,rm,rm,C ,C
,C ,x,Yv,x,m,m"
and then *vec_concatv4si which handles far less than that, just the last 5
alternatives.  Conceptually, both operations do the same thing, on the other
side not sure if it is a good idea to say to RA that it could put V2SImode
pseudos into GPRs.  So, that leaves us at the first pinsrq alternatives with
just "m" for them, not really sure they are worth it.  Then there is one v <-
r, C alternative, again, I think we shouldn't put V2SImode into GPRs, v <- vm,
C alternative that we definitely want, but it could be a separate
*vec_concatv4si_0
pattern two, as nonimmediate_operand and const0_operand don't overlap and last
x <- ?!*y, C, that's MMXish, maybe, maybe not.

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-11-30 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #4 from Marc Glisse  ---
(In reply to Jakub Jelinek from comment #2)
> All these use something like:
> (insn 7 6 13 2 (set (reg:V8SI 87)
> (unspec:V8SI [
> (mem:V4SI (reg:DI 90) [0 *x_3(D)+0 S16 A128])
> ] UNSPEC_CAST)) "include/avxintrin.h":1484:20 4813 {avx_si256_si}
>  (expr_list:REG_DEAD (reg:DI 90)
> (nil)))
> Not really sure why UNSPEC_CAST rather than representing it with something
> natural like VEC_CONCAT of nonimmediate_operand and const0_operand.

I tried to get rid of the UNSPEC_CAST in PR 50829... VEC_CONCAT with 0 may be
fine when loading from memory, but we don't want it when the argument is in a
register.

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-11-30 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #3 from rguenther at suse dot de  ---
On Fri, 30 Nov 2018, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  Status|UNCONFIRMED |NEW
>Last reconfirmed||2018-11-30
>  Ever confirmed|0   |1
> 
> --- Comment #2 from Jakub Jelinek  ---
> I guess
> #include 
> 
> __m128i
> foo (__m64 *x)
> {
>   return _mm_movpi64_epi64 (*x);
> }
> is what intrinsic users would write for this case, and that is optimized
> properly:
> (insn 7 6 12 2 (set (reg:V2DI 87)
> (vec_concat:V2DI (mem:DI (reg:DI 89) [0 *x_3(D)+0 S8 A64])
> (const_int 0 [0]))) "include/emmintrin.h":592:24 3956
> {vec_concatv2di}
>  (expr_list:REG_DEAD (reg:DI 89)
> (nil)))
> 
> Similarly e.g.
> #include 
> 
> __m256
> foo (__m128 *x)
> {
>   return _mm256_castps128_ps256 (*x);
> }
> which is conceptually closest to this case.
> Or
> #include 
> 
> __m256i
> foo (__m128i *x)
> {
>   return _mm256_castsi128_si256 (*x);
> }
> 
> All these use something like:
> (insn 7 6 13 2 (set (reg:V8SI 87)
> (unspec:V8SI [
> (mem:V4SI (reg:DI 90) [0 *x_3(D)+0 S16 A128])
> ] UNSPEC_CAST)) "include/avxintrin.h":1484:20 4813 {avx_si256_si}
>  (expr_list:REG_DEAD (reg:DI 90)
> (nil)))
> Not really sure why UNSPEC_CAST rather than representing it with something
> natural like VEC_CONCAT of nonimmediate_operand and const0_operand.

OK, it indeed seems to "work" when punning via integers:

typedef unsigned long v2di __attribute__((vector_size(16)));

v2di __GIMPLE baz (unsigned long *p)
{
  unsigned long _2;
  v2di _3;

bb_2:
  _2 = __MEM  (p_1(D));
  _3 = _Literal (v2di) { _2, _Literal (unsigned long) 0 };
  return _3;
}

looks like for this combine can do

Successfully matched this instruction:
(set (reg:V2DI 87)
(vec_concat:V2DI (mem:DI (reg:DI 89) [1 *p_1(D)+0 S8 A64])
(const_int 0 [0])))

which means the vector variants could be handled similarly
by macroizing on vector modes with matching sizes?  Or is
this undesirable?  If we declare the above canonical RTL
for zero-"extending" loads into vector registers then we
can handle this during RTL expansion I guess.

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-11-30 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

Jakub Jelinek  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2018-11-30
 Ever confirmed|0   |1

--- Comment #2 from Jakub Jelinek  ---
I guess
#include 

__m128i
foo (__m64 *x)
{
  return _mm_movpi64_epi64 (*x);
}
is what intrinsic users would write for this case, and that is optimized
properly:
(insn 7 6 12 2 (set (reg:V2DI 87)
(vec_concat:V2DI (mem:DI (reg:DI 89) [0 *x_3(D)+0 S8 A64])
(const_int 0 [0]))) "include/emmintrin.h":592:24 3956
{vec_concatv2di}
 (expr_list:REG_DEAD (reg:DI 89)
(nil)))

Similarly e.g.
#include 

__m256
foo (__m128 *x)
{
  return _mm256_castps128_ps256 (*x);
}
which is conceptually closest to this case.
Or
#include 

__m256i
foo (__m128i *x)
{
  return _mm256_castsi128_si256 (*x);
}

All these use something like:
(insn 7 6 13 2 (set (reg:V8SI 87)
(unspec:V8SI [
(mem:V4SI (reg:DI 90) [0 *x_3(D)+0 S16 A128])
] UNSPEC_CAST)) "include/avxintrin.h":1484:20 4813 {avx_si256_si}
 (expr_list:REG_DEAD (reg:DI 90)
(nil)))
Not really sure why UNSPEC_CAST rather than representing it with something
natural like VEC_CONCAT of nonimmediate_operand and const0_operand.

[Bug target/88278] Fails to elide zeroing of upper vector register

2018-11-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-*-* i?86-*-*
 CC||jakub at gcc dot gnu.org,
   ||uros at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
Hmm, it looks like *movdi_internal and friends to not represent the implicit
zeroing of the upper part?

I guess RTL in general (before reload) doesn't know that V2SI is the low
part of a V4SI register.  But after reload we see (split2)

(insn 7 5 8 2 (set (reg:V8QI 20 xmm0 [orig:88 MEM[(unsigned char *)p_1(D)] ]
[88])
(mem:V8QI (reg:DI 5 di [91]) [0 MEM[(unsigned char *)p_1(D)]+0 S8 A8]))
1078 {*movv8qi_internal}
 (nil))
(insn 8 7 9 2 (set (reg:V2SI 21 xmm1 [90])
(const_vector:V2SI [
(const_int 0 [0]) repeated x2
])) 1080 {*movv2si_internal}
 (expr_list:REG_EQUIV (const_vector:V2SI [
(const_int 0 [0]) repeated x2
])
(nil)))
(insn 9 8 16 2 (set (reg:V4SI 20 xmm0 [89])
(vec_concat:V4SI (reg:V2SI 20 xmm0 [orig:88 MEM[(unsigned char
*)p_1(D)] ] [88])
(reg:V2SI 21 xmm1 [90]))) 3955 {*vec_concatv4si}
 (expr_list:REG_EQUAL (vec_concat:V4SI (subreg:V2SI (reg:V8QI 20 xmm0
[orig:88 MEM[(unsigned char *)p_1(D)] ] [88]) 0)
(const_vector:V2SI [
(const_int 0 [0]) repeated x2
]))
(nil)))

where the pattern is probably easier to optimize (but we then fail to elide
the xmm1 register as not needed eventually)?