On Sat, May 3, 2014 at 2:39 AM, Cong Hou <co...@google.com> wrote:
> On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
>> On Thu, 24 Apr 2014, Cong Hou wrote:
>>
>>> Given the following loop:
>>>
>>> int a[N];
>>> short b[N*2];
>>>
>>> for (int i = 0; i < N; ++i)
>>>   a[i] = b[i*2];
>>>
>>>
>>> After being vectorized, the access to b[i*2] will be compiled into
>>> several packing statements, while the type promotion from short to int
>>> will be compiled into several unpacking statements. With this patch,
>>> each pair of pack/unpack statements will be replaced by less expensive
>>> statements (with shift or bit-and operations).
>>>
>>> On x86_64, the loop above will be compiled into the following assembly
>>> (with -O2 -ftree-vectorize):
>>>
>>> movdqu 0x10(%rcx),%xmm3
>>> movdqu -0x20(%rcx),%xmm0
>>> movdqa %xmm0,%xmm2
>>> punpcklwd %xmm3,%xmm0
>>> punpckhwd %xmm3,%xmm2
>>> movdqa %xmm0,%xmm3
>>> punpcklwd %xmm2,%xmm0
>>> punpckhwd %xmm2,%xmm3
>>> movdqa %xmm1,%xmm2
>>> punpcklwd %xmm3,%xmm0
>>> pcmpgtw %xmm0,%xmm2
>>> movdqa %xmm0,%xmm3
>>> punpckhwd %xmm2,%xmm0
>>> punpcklwd %xmm2,%xmm3
>>> movups %xmm0,-0x10(%rdx)
>>> movups %xmm3,-0x20(%rdx)
>>>
>>>
>>> With this patch, the generated assembly is shown below:
>>>
>>> movdqu 0x10(%rcx),%xmm0
>>> movdqu -0x20(%rcx),%xmm1
>>> pslld  $0x10,%xmm0
>>> psrad  $0x10,%xmm0
>>> pslld  $0x10,%xmm1
>>> movups %xmm0,-0x10(%rdx)
>>> psrad  $0x10,%xmm1
>>> movups %xmm1,-0x20(%rdx)
>>>
>>>
>>> Bootstrapped and tested on x86-64. OK for trunk?
>>
>> This is an odd place to implement such a transform.  Also, whether it
>> is faster or not depends on the exact ISA you target - for
>> example ppc has constraints on the maximum number of shifts
>> carried out in parallel, and the above has 4 in very short
>> succession.  Esp. for the sign-extend path.
>
> Thank you for the information about ppc. If this is an issue, I think
> we can do it in a target-dependent way.
>
>
>>
>> So this looks more like an opportunity for a post-vectorizer
>> transform on RTL or for the vectorizer special-casing
>> widening loads with a vectorizer pattern.
>
> I am not sure whether the RTL transform is more difficult to implement. I
> prefer the widening-load method, which can be detected in a pattern
> recognizer. The target-related issue would be resolved by only
> expanding the widening load on those targets where this pattern is
> beneficial. But this requires new tree operations to be defined. What
> is your suggestion?
>
> I apologize for the delayed reply.

Likewise ;)

I suggest implementing this optimization in vector lowering in
tree-vect-generic.c.  For your example this pass sees

  vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
  vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
  vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0, 2, 4, 6, 8, 10, 12, 14 }>;
  vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
  vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;

where you can apply the pattern matching and transform (after checking
with the target, of course).

Richard.

>
> thanks,
> Cong
>
>>
>> Richard.
