[Bug target/88278] Fails to elide zeroing of upper vector register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------
         Status    |NEW                         |RESOLVED
     Resolution    |---                         |FIXED

--- Comment #9 from Richard Biener ---
Seems to be fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #8 from Jakub Jelinek ---
Author: jakub
Date: Sun Dec  2 20:43:49 2018
New Revision: 266728

URL: https://gcc.gnu.org/viewcvs?rev=266728&root=gcc&view=rev

Log:
	PR target/88278
	* config/i386/sse.md (*vec_concatv4sf_0, *vec_concatv4si_0): New
	insns.
	* gcc.target/i386/pr88278.c: New test.
	* gcc.target/i386/pr53759.c: Don't expect vmovlps insn, expect
	vmovq instead.
	* gcc.target/i386/pr53759-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr53759-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr88278.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr53759.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #7 from rguenther at suse dot de ---
On November 30, 2018 4:28:54 PM GMT+01:00, "jakub at gcc dot gnu.org" wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278
>
>--- Comment #6 from Jakub Jelinek ---
>I wonder about something like:
>--- gcc/config/i386/sse.md.jj	2018-11-29 23:16:06.481301632 +0100
>+++ gcc/config/i386/sse.md	2018-11-30 16:21:21.480379008 +0100
>@@ -7248,6 +7248,17 @@
>    (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex")
>    (set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])
>
>+(define_insn "*vec_concatv4sf_0"
>+  [(set (match_operand:V4SF 0 "register_operand"       "=v")
>+	(vec_concat:V4SF
>+	  (match_operand:V2SF 1 "nonimmediate_operand" "xm")
>+	  (match_operand:V2SF 2 "const0_operand"       " C")))]
>+  "TARGET_SSE2"
>+  "%vmovq\t{%1, %0|%0, %1}"
>+  [(set_attr "type" "ssemov")
>+   (set_attr "prefix" "maybe_vex")
>+   (set_attr "mode" "DF")])
>+
> ;; Avoid combining registers from different units in a single
>alternative,
> ;; see comment above inline_secondary_memory_needed function in i386.c
> (define_insn "vec_set_0"
>@@ -14409,6 +14420,23 @@
>    (set_attr "prefix" "orig,maybe_evex,orig,orig,maybe_evex")
>    (set_attr "mode" "TI,TI,V4SF,V2SF,V2SF")])
>
>+(define_insn "*vec_concatv4si_0"
>+  [(set (match_operand:V4SI 0 "register_operand"       "=v,x")
>+	(vec_concat:V4SI
>+	  (match_operand:V2SI 1 "nonimmediate_operand" "vm,?!*y")
>+	  (match_operand:V2SI 2 "const0_operand"       " C,C")))]
>+  "TARGET_SSE2"
>+  "@
>+   %vmovq\t{%1, %0|%0, %1}
>+   movq2dq\t{%1, %0|%0, %1}"
>+  [(set_attr "type" "ssemov")
>+   (set_attr "prefix" "maybe_vex,orig")
>+   (set_attr "mode" "TI")
>+   (set (attr "preferred_for_speed")
>+     (if_then_else (eq_attr "alternative" "1")
>+       (symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
>+       (symbol_ref "true")))])
>+
> ;; movd instead of movq is required to handle broken assemblers.
> (define_insn "vec_concatv2di"
>   [(set (match_operand:V2DI 0 "register_operand"
>
>but the #c0 testcases don't compile for me with -O2 -msse2 -fgimple
>(nor
>-mavx), so I can't easily verify.

I have committed the prerequisite for that now.

>I don't see how we could get rid of those for the v <- v,C cases,
>unless we
>analyze whatever instruction generated it and prove that it leaves all
>the
>higher bits set to zero.  E.g. one could have a v4si to v2si downcast
>(just
>picking the lowpart subreg) followed by concatenating it with zero, and
>if we
>blindly drop the movq instruction, the upper bits might be non-zero.

Yeah, I'm looking for a way to do low part loads from memory with zeroing
the rest.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #6 from Jakub Jelinek ---
I wonder about something like:
--- gcc/config/i386/sse.md.jj	2018-11-29 23:16:06.481301632 +0100
+++ gcc/config/i386/sse.md	2018-11-30 16:21:21.480379008 +0100
@@ -7248,6 +7248,17 @@
    (set_attr "prefix" "orig,maybe_evex,orig,maybe_evex")
    (set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])

+(define_insn "*vec_concatv4sf_0"
+  [(set (match_operand:V4SF 0 "register_operand"       "=v")
+	(vec_concat:V4SF
+	  (match_operand:V2SF 1 "nonimmediate_operand" "xm")
+	  (match_operand:V2SF 2 "const0_operand"       " C")))]
+  "TARGET_SSE2"
+  "%vmovq\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssemov")
+   (set_attr "prefix" "maybe_vex")
+   (set_attr "mode" "DF")])
+
 ;; Avoid combining registers from different units in a single alternative,
 ;; see comment above inline_secondary_memory_needed function in i386.c
 (define_insn "vec_set_0"
@@ -14409,6 +14420,23 @@
    (set_attr "prefix" "orig,maybe_evex,orig,orig,maybe_evex")
    (set_attr "mode" "TI,TI,V4SF,V2SF,V2SF")])

+(define_insn "*vec_concatv4si_0"
+  [(set (match_operand:V4SI 0 "register_operand"       "=v,x")
+	(vec_concat:V4SI
+	  (match_operand:V2SI 1 "nonimmediate_operand" "vm,?!*y")
+	  (match_operand:V2SI 2 "const0_operand"       " C,C")))]
+  "TARGET_SSE2"
+  "@
+   %vmovq\t{%1, %0|%0, %1}
+   movq2dq\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssemov")
+   (set_attr "prefix" "maybe_vex,orig")
+   (set_attr "mode" "TI")
+   (set (attr "preferred_for_speed")
+     (if_then_else (eq_attr "alternative" "1")
+       (symbol_ref "TARGET_INTER_UNIT_MOVES_FROM_VEC")
+       (symbol_ref "true")))])
+
 ;; movd instead of movq is required to handle broken assemblers.
 (define_insn "vec_concatv2di"
   [(set (match_operand:V2DI 0 "register_operand"

but the #c0 testcases don't compile for me with -O2 -msse2 -fgimple (nor
-mavx), so I can't easily verify.

I don't see how we could get rid of those for the v <- v,C cases, unless we
analyze whatever instruction generated it and prove that it leaves all the
higher bits set to zero.  E.g. one could have a v4si to v2si downcast (just
picking the lowpart subreg) followed by concatenating it with zero, and if we
blindly drop the movq instruction, the upper bits might be non-zero.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #5 from Jakub Jelinek ---
Note, we also have vec_concatv2di pattern that handles:
(set (match_operand:V2DI 0 ("register_operand")
     ("=Yr,*x,x ,v ,v,v ,x ,x,v ,x,x,v"))
    (vec_concat:V2DI (match_operand:DI 1 ("nonimmediate_operand")
         (" 0, 0,x ,Yv,r,vm,?!*y,0,Yv,0,0,v"))
        (match_operand:DI 2 ("nonimm_or_0_operand")
         (" rm,rm,rm,rm,C ,C ,C ,x,Yv,x,m,m"
and then *vec_concatv4si which handles far less than that, just the last 5
alternatives.  Conceptually, both operations do the same thing; on the other
hand, I'm not sure it is a good idea to tell the RA that it could put V2SImode
pseudos into GPRs.  So, that leaves us at the first pinsrq alternatives with
just "m" for them, not really sure they are worth it.  Then there is one
v <- r, C alternative (again, I think we shouldn't put V2SImode into GPRs),
the v <- vm, C alternative that we definitely want, but it could be a separate
*vec_concatv4si_0 pattern too, as nonimmediate_operand and const0_operand
don't overlap, and last x <- ?!*y, C, that's MMXish, maybe, maybe not.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #4 from Marc Glisse ---
(In reply to Jakub Jelinek from comment #2)
> All these use something like:
> (insn 7 6 13 2 (set (reg:V8SI 87)
>         (unspec:V8SI [
>                 (mem:V4SI (reg:DI 90) [0 *x_3(D)+0 S16 A128])
>             ] UNSPEC_CAST)) "include/avxintrin.h":1484:20 4813
> {avx_si256_si}
>      (expr_list:REG_DEAD (reg:DI 90)
>         (nil)))
> Not really sure why UNSPEC_CAST rather than representing it with something
> natural like VEC_CONCAT of nonimmediate_operand and const0_operand.

I tried to get rid of the UNSPEC_CAST in PR 50829...  VEC_CONCAT with 0 may
be fine when loading from memory, but we don't want it when the argument is
in a register.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

--- Comment #3 from rguenther at suse dot de ---
On Fri, 30 Nov 2018, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278
>
> Jakub Jelinek changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------
>          Status    |UNCONFIRMED                 |NEW
>    Last reconfirmed|                            |2018-11-30
>      Ever confirmed|0                           |1
>
> --- Comment #2 from Jakub Jelinek ---
> I guess
> #include
>
> __m128i
> foo (__m64 *x)
> {
>   return _mm_movpi64_epi64 (*x);
> }
> is what intrinsic users would write for this case, and that is optimized
> properly:
> (insn 7 6 12 2 (set (reg:V2DI 87)
>         (vec_concat:V2DI (mem:DI (reg:DI 89) [0 *x_3(D)+0 S8 A64])
>             (const_int 0 [0]))) "include/emmintrin.h":592:24 3956
> {vec_concatv2di}
>      (expr_list:REG_DEAD (reg:DI 89)
>         (nil)))
>
> Similarly e.g.
> #include
>
> __m256
> foo (__m128 *x)
> {
>   return _mm256_castps128_ps256 (*x);
> }
> which is conceptually closest to this case.
> Or
> #include
>
> __m256i
> foo (__m128i *x)
> {
>   return _mm256_castsi128_si256 (*x);
> }
>
> All these use something like:
> (insn 7 6 13 2 (set (reg:V8SI 87)
>         (unspec:V8SI [
>                 (mem:V4SI (reg:DI 90) [0 *x_3(D)+0 S16 A128])
>             ] UNSPEC_CAST)) "include/avxintrin.h":1484:20 4813
> {avx_si256_si}
>      (expr_list:REG_DEAD (reg:DI 90)
>         (nil)))
> Not really sure why UNSPEC_CAST rather than representing it with something
> natural like VEC_CONCAT of nonimmediate_operand and const0_operand.

OK, it indeed seems to "work" when punning via integers:

typedef unsigned long v2di __attribute__((vector_size(16)));

v2di __GIMPLE baz (unsigned long *p)
{
  unsigned long _2;
  v2di _3;

bb_2:
  _2 = __MEM (p_1(D));
  _3 = _Literal (v2di) { _2, _Literal (unsigned long) 0 };
  return _3;
}

looks like for this combine can do

Successfully matched this instruction:
(set (reg:V2DI 87)
    (vec_concat:V2DI (mem:DI (reg:DI 89) [1 *p_1(D)+0 S8 A64])
        (const_int 0 [0])))

which means the vector variants could be handled similarly by macroizing on
vector modes with matching sizes?  Or is this undesirable?

If we declare the above canonical RTL for zero-"extending" loads into vector
registers then we can handle this during RTL expansion I guess.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------
         Status    |UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-11-30
     Ever confirmed|0                           |1

--- Comment #2 from Jakub Jelinek ---
I guess

#include

__m128i
foo (__m64 *x)
{
  return _mm_movpi64_epi64 (*x);
}

is what intrinsic users would write for this case, and that is optimized
properly:
(insn 7 6 12 2 (set (reg:V2DI 87)
        (vec_concat:V2DI (mem:DI (reg:DI 89) [0 *x_3(D)+0 S8 A64])
            (const_int 0 [0]))) "include/emmintrin.h":592:24 3956
{vec_concatv2di}
     (expr_list:REG_DEAD (reg:DI 89)
        (nil)))

Similarly e.g.

#include

__m256
foo (__m128 *x)
{
  return _mm256_castps128_ps256 (*x);
}

which is conceptually closest to this case.  Or

#include

__m256i
foo (__m128i *x)
{
  return _mm256_castsi128_si256 (*x);
}

All these use something like:
(insn 7 6 13 2 (set (reg:V8SI 87)
        (unspec:V8SI [
                (mem:V4SI (reg:DI 90) [0 *x_3(D)+0 S16 A128])
            ] UNSPEC_CAST)) "include/avxintrin.h":1484:20 4813 {avx_si256_si}
     (expr_list:REG_DEAD (reg:DI 90)
        (nil)))
Not really sure why UNSPEC_CAST rather than representing it with something
natural like VEC_CONCAT of nonimmediate_operand and const0_operand.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------
       Keywords    |                            |missed-optimization
         Target    |                            |x86_64-*-* i?86-*-*
             CC    |                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org

--- Comment #1 from Richard Biener ---
Hmm, it looks like *movdi_internal and friends do not represent the implicit
zeroing of the upper part?  I guess RTL in general (before reload) doesn't
know that V2SI is the low part of a V4SI register.  But after reload we see
(split2)

(insn 7 5 8 2 (set (reg:V8QI 20 xmm0 [orig:88 MEM[(unsigned char *)p_1(D)] ] [88])
        (mem:V8QI (reg:DI 5 di [91]) [0 MEM[(unsigned char *)p_1(D)]+0 S8 A8]))
     1078 {*movv8qi_internal}
     (nil))
(insn 8 7 9 2 (set (reg:V2SI 21 xmm1 [90])
        (const_vector:V2SI [
                (const_int 0 [0]) repeated x2
            ])) 1080 {*movv2si_internal}
     (expr_list:REG_EQUIV (const_vector:V2SI [
                (const_int 0 [0]) repeated x2
            ])
        (nil)))
(insn 9 8 16 2 (set (reg:V4SI 20 xmm0 [89])
        (vec_concat:V4SI (reg:V2SI 20 xmm0 [orig:88 MEM[(unsigned char *)p_1(D)] ] [88])
            (reg:V2SI 21 xmm1 [90]))) 3955 {*vec_concatv4si}
     (expr_list:REG_EQUAL (vec_concat:V4SI (subreg:V2SI (reg:V8QI 20 xmm0 [orig:88 MEM[(unsigned char *)p_1(D)] ] [88]) 0)
            (const_vector:V2SI [
                (const_int 0 [0]) repeated x2
            ]))
        (nil)))

where the pattern is probably easier to optimize (but we then fail to elide
the xmm1 register as not needed eventually)?