https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278
--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 30 Nov 2018, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88278
>
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |NEW
>    Last reconfirmed|                            |2018-11-30
>      Ever confirmed|0                           |1
>
> --- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> I guess
> #include <x86intrin.h>
>
> __m128i
> foo (__m64 *x)
> {
>   return _mm_movpi64_epi64 (*x);
> }
> is what intrinsic users would write for this case, and that is optimized
> properly:
> (insn 7 6 12 2 (set (reg:V2DI 87)
>         (vec_concat:V2DI (mem:DI (reg:DI 89) [0 *x_3(D)+0 S8 A64])
>             (const_int 0 [0]))) "include/emmintrin.h":592:24 3956
> {vec_concatv2di}
>      (expr_list:REG_DEAD (reg:DI 89)
>         (nil)))
>
> Similarly e.g.
> #include <x86intrin.h>
>
> __m256
> foo (__m128 *x)
> {
>   return _mm256_castps128_ps256 (*x);
> }
> which is conceptually closest to this case.
> Or
> #include <x86intrin.h>
>
> __m256i
> foo (__m128i *x)
> {
>   return _mm256_castsi128_si256 (*x);
> }
>
> All these use something like:
> (insn 7 6 13 2 (set (reg:V8SI 87)
>         (unspec:V8SI [
>                 (mem:V4SI (reg:DI 90) [0 *x_3(D)+0 S16 A128])
>             ] UNSPEC_CAST)) "include/avxintrin.h":1484:20 4813 {avx_si256_si}
>      (expr_list:REG_DEAD (reg:DI 90)
>         (nil)))
> Not really sure why UNSPEC_CAST rather than representing it with something
> natural like VEC_CONCAT of nonimmediate_operand and const0_operand.

OK, it indeed seems to "work" when punning via integers:

typedef unsigned long v2di __attribute__((vector_size(16)));

v2di __GIMPLE baz (unsigned long *p)
{
  unsigned long _2;
  v2di _3;

bb_2:
  _2 = __MEM <unsigned long, 64> (p_1(D));
  _3 = _Literal (v2di) { _2, _Literal (unsigned long) 0 };
  return _3;
}

looks like for this combine can do

Successfully matched this instruction:
(set (reg:V2DI 87)
    (vec_concat:V2DI (mem:DI (reg:DI 89) [1 *p_1(D)+0 S8 A64])
        (const_int 0 [0])))

which means the vector variants could be handled similarly by
macroizing on vector modes with matching sizes?  Or is this
undesirable?  If we declare the above canonical RTL for
zero-"extending" loads into vector registers then we can handle
this during RTL expansion I guess.
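
For readers who want to experiment, here is a minimal plain-C sketch of the two shapes discussed above, written with GNU vector extensions instead of the GIMPLE front end. It is only an illustration under stated assumptions: the function names load_low_zero_high and widen_v2si_to_v4si are hypothetical, unsigned long long is used instead of unsigned long so that the element is 64 bits on every target, and nothing here claims which RTL or instructions GCC actually emits for the second (vector) variant, which is the open question in this bug.

```c
/* Sketch only: assumes GCC (or a compatible compiler) with GNU vector
   extensions.  All names below are hypothetical, not from the bug report.  */

typedef unsigned long long v2di __attribute__((vector_size(16)));
typedef int v2si __attribute__((vector_size(8)));
typedef int v4si __attribute__((vector_size(16)));

/* Scalar variant: load one 64-bit integer into element 0 and zero the
   upper element.  This mirrors the __GIMPLE function baz above.  */
static inline v2di
load_low_zero_high (const unsigned long long *p)
{
  v2di r = { *p, 0 };
  return r;
}

/* Vector variant: a 64-bit vector (two ints) placed in the low half of a
   128-bit vector, with the upper half zeroed.  V2SI has the same size as
   DI, which is the "vector modes with matching sizes" case.  */
static inline v4si
widen_v2si_to_v4si (const v2si *x)
{
  v4si r = { (*x)[0], (*x)[1], 0, 0 };
  return r;
}
```

Compiling this with -O2 -fdump-rtl-combine and looking for vec_concat with a const_int 0 (or const_vector zero) operand would be one way to check whether both functions end up in the same canonical form.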