On Thu, May 9, 2013 at 10:24 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
>> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>> > 2013-05-09  Jakub Jelinek  <ja...@redhat.com>
>> >
>> >         * config/i386/i386.md (rotateinv): New code attr.
>> >         (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
>> >         *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
>> >         roll $31, %eax, etc.
>>
>> OK.
>
> Thanks.  While we're talking about rotates, it seems they are quite
> expensive on SandyBridge if the destination (== source1) is memory.
>
> With -O3 -mavx:
>
> unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };
>
> __attribute__((noinline, noclone)) void
> foo (void)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     {
>       int j = i & 31;
>       a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
>     }
> }
>
> int
> main ()
> {
>   int i;
>   for (i = 0; i < 1000000; i++)
>     foo ();
>   return 0;
> }
>
> the timing for vanilla gcc, gcc with the rotate recognizer patch, and
> the latter hand-edited to use rotate on a register gives:
>
>                 vanilla         patched         hand-edited
> Intel i7-2600   0m1.421s        0m1.914s        0m0.826s
> AMD 8354        0m2.353s        0m0.988s        0m1.407s
>
> The vanilla gcc loop is long:
> .L2:
>         movl    a(,%rax,4), %edi
>         movl    %eax, %esi
>         andl    $31, %esi
>         movl    %esi, %ecx
>         negl    %ecx
>         movl    %edi, %edx
>         shrl    %cl, %edx
>         movl    %esi, %ecx
>         sall    %cl, %edi
>         xorl    %edi, %edx
>         movl    %edx, a(,%rax,4)
>         addq    $1, %rax
>         cmpq    $1024, %rax
>         jne     .L2
> With rotate using mem:
> .L2:
>         roll    %cl, a(,%rcx,4)
>         addq    $1, %rcx
>         cmpq    $1024, %rcx
>         jne     .L2
> and with the hand-tweaked rotate:
> .L2:
>         movl    a(,%rcx,4), %eax
>         roll    %cl, %eax
>         addq    $1, %rcx
>         cmpq    $1024, %rcx
>         movl    %eax, a-4(,%rcx,4)
>         jne     .L2
>
> So, on SandyBridge (I haven't tried other Intel CPUs) it might be
> beneficial to disallow rotate with =m, 0 constraints, while for the
> above-mentioned AMD it is best to use it.  I guess the above isn't
> sufficient data for that; we'd need info on more CPUs and more
> rotation sizes.
>
> BTW, with -mavx2 we used to vectorize the loop without the rotate
> recognition patch, but with it we don't vectorize anymore; I'll need
> to adjust the pattern recognizer to handle those.  The question is
> whether, for a missing vector rotate optab, we can replace it with
> the probably faster and shorter
>   x r<< y  ->  (x << y) | (x >> (32 - y))
> which is invalid for y = 0 (though perhaps if the only thing the
> vector shift would do for a shift by 32 is return either 0 or x, it
> would still work), or whether we want to go for
>   (x << y) | (x >> ((-y) & 31)).
> The i?86/x86_64 vector shifts don't truncate the shift count, so
> perhaps best would be a vector rotate expander that expands it into
> some special insn like a vectorized y ? x >> (32 - y) : 0.
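>
> To make that concrete in scalar terms, here is a minimal C sketch of
> the two expansions (the helper names are just illustrative):
>
> #include <stdint.h>
>
> /* x r<< y as (x << y) | (x >> (32 - y)):
>    invalid for y == 0, since x >> 32 is undefined behavior.  */
> static inline uint32_t
> rotl32_sub (uint32_t x, unsigned int y)
> {
>   return (x << y) | (x >> (32 - y));
> }
>
> /* x r<< y as (x << y) | (x >> ((-y) & 31)):
>    for y in 1..31, (-y) & 31 == 32 - y, and for y == 0 the right
>    shift count becomes 0, so the result is x as expected.  */
> static inline uint32_t
> rotl32_neg (uint32_t x, unsigned int y)
> {
>   return (x << y) | (x >> ((-y) & 31));
> }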
>
> One question is whether our LROTATE_EXPR or RROTATE_EXPR is only
> defined for rotation counts 0 through bitsize - 1 (as is the case
> when we pattern detect it; do we create *ROTATE_EXPR any other way?),
> or whether we allow arbitrary rotation counts (in which case I would
> assume the rotation is always done modulo the bitsize).  Because if
> the rotation count could be arbitrary, we'd need to vectorize it as
>   (x << (y & 31)) | (x >> ((-y) & 31))
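>
> For completeness, a scalar C sketch of that fully masked form (again
> just an illustrative helper, defined for any rotation count):
>
> #include <stdint.h>
>
> /* x r<< y with both shift counts masked: well defined for
>    arbitrary y, i.e. rotation done modulo the bitsize.  */
> static inline uint32_t
> rotl32_any (uint32_t x, unsigned int y)
> {
>   return (x << (y & 31)) | (x >> ((-y) & 31));
> }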

The rotation count should be 0 to bitsize - 1, similar to shifts.

Richard.

>         Jakub
