On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguent...@suse.de> wrote:
> On Thu, 24 Apr 2014, Cong Hou wrote:
>
>> Given the following loop:
>>
>> int a[N];
>> short b[N*2];
>>
>> for (int i = 0; i < N; ++i)
>>   a[i] = b[i*2];
>>
>>
>> After being vectorized, the access to b[i*2] will be compiled into
>> several packing statements, while the type promotion from short to int
>> will be compiled into several unpacking statements. With this patch,
>> each pair of pack/unpack statements will be replaced by less expensive
>> statements (with shift or bit-and operations).
>>
>> On x86_64, the loop above will be compiled into the following assembly
>> (with -O2 -ftree-vectorize):
>>
>> movdqu 0x10(%rcx),%xmm3
>> movdqu -0x20(%rcx),%xmm0
>> movdqa %xmm0,%xmm2
>> punpcklwd %xmm3,%xmm0
>> punpckhwd %xmm3,%xmm2
>> movdqa %xmm0,%xmm3
>> punpcklwd %xmm2,%xmm0
>> punpckhwd %xmm2,%xmm3
>> movdqa %xmm1,%xmm2
>> punpcklwd %xmm3,%xmm0
>> pcmpgtw %xmm0,%xmm2
>> movdqa %xmm0,%xmm3
>> punpckhwd %xmm2,%xmm0
>> punpcklwd %xmm2,%xmm3
>> movups %xmm0,-0x10(%rdx)
>> movups %xmm3,-0x20(%rdx)
>>
>>
>> With this patch, the generated assembly is shown below:
>>
>> movdqu 0x10(%rcx),%xmm0
>> movdqu -0x20(%rcx),%xmm1
>> pslld  $0x10,%xmm0
>> psrad  $0x10,%xmm0
>> pslld  $0x10,%xmm1
>> movups %xmm0,-0x10(%rdx)
>> psrad  $0x10,%xmm1
>> movups %xmm1,-0x20(%rdx)
>>
>>
>> Bootstrapped and tested on x86-64. OK for trunk?
>
> This is an odd place to implement such transform.  Also if it
> is faster or not depends on the exact ISA you target - for
> example ppc has constraints on the maximum number of shifts
> carried out in parallel and the above has 4 in very short
> succession.  Esp. for the sign-extend path.

Thank you for the information about ppc. If this is an issue, I think
we can do it in a target-dependent way.


>
> So this looks more like an opportunity for a post-vectorizer
> transform on RTL or for the vectorizer special-casing
> widening loads with a vectorizer pattern.

I am not sure whether the RTL transform would be more difficult to
implement. I prefer the widening-load approach, which can be detected
in a pattern recognizer. The target-related issue can be resolved by
expanding the widening load only on those targets where this pattern
is beneficial. However, this requires new tree operations to be
defined. What is your suggestion?
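
To make the intended semantics concrete, here is a rough scalar sketch
of what such a widening load would compute for each 32-bit lane. It is
only an illustration, not code from the patch: the names low_half_sext
and widen_even are hypothetical, and it assumes a little-endian target
and GCC's behavior for converting to a signed type and for right
shifts of negative values.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sign-extend the low 16 bits of a 32-bit lane,
   mirroring "pslld $16" followed by "psrad $16" on one element.  */
static int32_t low_half_sext(uint32_t lane)
{
    /* Shift on the unsigned value to avoid signed-overflow UB.  */
    uint32_t shifted = lane << 16;
    /* The conversion to int32_t and >> on a negative value are
       implementation-defined; GCC wraps and shifts arithmetically.  */
    return (int32_t)shifted >> 16;
}

/* Hypothetical scalar model of "a[i] = b[i*2]".  Assumes little-endian,
   so b[2*i] sits in the low half of each 32-bit lane.  */
static void widen_even(int32_t *a, const int16_t *b, int n)
{
    for (int i = 0; i < n; ++i) {
        uint32_t lane;
        /* Load b[2*i] and b[2*i+1] together as one 32-bit lane.  */
        memcpy(&lane, &b[2 * i], sizeof lane);
        a[i] = low_half_sext(lane);
    }
}
```

For an unsigned element type, the same lane could instead be masked
with 0xffff, which corresponds to the bit-and variant mentioned in the
original description.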

I apologize for the delayed reply.


thanks,
Cong

>
> Richard.
