On Sat, May 3, 2014 at 2:39 AM, Cong Hou <co...@google.com> wrote:
> On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
>> On Thu, 24 Apr 2014, Cong Hou wrote:
>>
>>> Given the following loop:
>>>
>>> int a[N];
>>> short b[N*2];
>>>
>>> for (int i = 0; i < N; ++i)
>>>   a[i] = b[i*2];
>>>
>>> After being vectorized, the access to b[i*2] will be compiled into
>>> several packing statements, while the type promotion from short to
>>> int will be compiled into several unpacking statements. With this
>>> patch, each pair of pack/unpack statements will be replaced by less
>>> expensive statements (with shift or bit-and operations).
>>>
>>> On x86_64, the loop above will be compiled into the following
>>> assembly (with -O2 -ftree-vectorize):
>>>
>>>   movdqu     0x10(%rcx),%xmm3
>>>   movdqu     -0x20(%rcx),%xmm0
>>>   movdqa     %xmm0,%xmm2
>>>   punpcklwd  %xmm3,%xmm0
>>>   punpckhwd  %xmm3,%xmm2
>>>   movdqa     %xmm0,%xmm3
>>>   punpcklwd  %xmm2,%xmm0
>>>   punpckhwd  %xmm2,%xmm3
>>>   movdqa     %xmm1,%xmm2
>>>   punpcklwd  %xmm3,%xmm0
>>>   pcmpgtw    %xmm0,%xmm2
>>>   movdqa     %xmm0,%xmm3
>>>   punpckhwd  %xmm2,%xmm0
>>>   punpcklwd  %xmm2,%xmm3
>>>   movups     %xmm0,-0x10(%rdx)
>>>   movups     %xmm3,-0x20(%rdx)
>>>
>>> With this patch, the generated assembly is shown below:
>>>
>>>   movdqu     0x10(%rcx),%xmm0
>>>   movdqu     -0x20(%rcx),%xmm1
>>>   pslld      $0x10,%xmm0
>>>   psrad      $0x10,%xmm0
>>>   pslld      $0x10,%xmm1
>>>   movups     %xmm0,-0x10(%rdx)
>>>   psrad      $0x10,%xmm1
>>>   movups     %xmm1,-0x20(%rdx)
>>>
>>> Bootstrapped and tested on x86-64. OK for trunk?
>>
>> This is an odd place to implement such a transform. Also, whether it
>> is faster or not depends on the exact ISA you target - for example,
>> ppc has constraints on the maximum number of shifts carried out in
>> parallel, and the above has 4 in very short succession, especially
>> on the sign-extend path.
>
> Thank you for the information about ppc. If this is an issue, I think
> we can do it in a target-dependent way.
>
>> So this looks more like an opportunity for a post-vectorizer
>> transform on RTL, or for the vectorizer special-casing widening
>> loads with a vectorizer pattern.
>
> I am not sure whether the RTL transform would be more difficult to
> implement. I prefer the widening-load method, which can be detected
> in a pattern recognizer. The target-related issue would be resolved
> by only expanding the widening load on those targets where this
> pattern is beneficial. But this requires new tree operations to be
> defined. What is your suggestion?
>
> I apologize for the delayed reply.
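[As an aside, not part of the original thread: the per-lane effect of
the pslld $0x10 / psrad $0x10 pair in the patched output can be
sketched in scalar C. After the vector load, each 32-bit lane holds
b[2*i] in its low 16 bits and b[2*i+1] in its high 16 bits; the even
element is sign-extended by a left shift followed by an arithmetic
right shift. The function name below is illustrative only.]

```c
#include <stdint.h>

/* Sign-extend the low 16 bits (the even-indexed short) of a 32-bit
   lane, discarding the odd-indexed short in the high 16 bits.  This
   is the scalar analogue of one lane of pslld $0x10 / psrad $0x10.
   Little-endian layout is assumed, as on x86_64. */
static int32_t even_element_sext(uint32_t lane)
{
    /* Shift the low half up to the sign position, then arithmetic
       shift back down to replicate its sign bit. */
    return (int32_t)(lane << 16) >> 16;
}
```

For example, a lane packing b[1] = 7 above b[0] = -3 is 0x0007FFFD,
and the function recovers -3 regardless of the high half.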
Likewise ;)

I suggest implementing this optimization in vector lowering in
tree-vect-generic.c. For your example, it sees

  vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
  vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
  vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34,
                                     { 0, 2, 4, 6, 8, 10, 12, 14 }>;
  vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
  vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

where you can apply the pattern matching and transform (after checking
with the target, of course).

Richard.

>
> thanks,
> Cong
>
>>
>> Richard.
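[Editorial sketch, not part of the original thread: the rewrite Richard
describes relies on the even-index permute plus sign-extending unpacks
producing the same ints as the shift-based lowering. The two helper
functions below model both paths element-wise in scalar C; their names
and the little-endian lane layout are assumptions for illustration.]

```c
#include <stdint.h>

/* Path 1: VEC_PERM_EXPR selecting even elements, then
   vec_unpack_lo/hi_expr (sign-extending widen), modeled per element. */
static void widen_even_perm(const int16_t *b, int32_t *a, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = (int32_t)b[2 * i];
}

/* Path 2: shift-based lowering.  Reinterpret each adjacent pair of
   shorts as one 32-bit lane (little endian) and sign-extend its low
   half, as pslld/psrad do per lane. */
static void widen_even_shift(const int16_t *b, int32_t *a, int n)
{
    for (int i = 0; i < n; ++i) {
        uint32_t lane = ((uint32_t)(uint16_t)b[2 * i + 1] << 16)
                      | (uint16_t)b[2 * i];
        a[i] = (int32_t)(lane << 16) >> 16;
    }
}
```

A lowering pass may substitute one form for the other only when both
yield identical results, which the element-wise model above makes easy
to check.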