On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> > 2013-05-09  Jakub Jelinek  <ja...@redhat.com>
> >
> >         * config/i386/i386.md (rotateinv): New code attr.
> >         (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
> >         *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
> >         roll $31, %eax, etc.
> 
> OK.

Thanks.  While we are talking about rotates, it seems they are quite
expensive on SandyBridge if the destination (== source1) is memory.

With -O3 -mavx:

unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };

__attribute__((noinline, noclone)) void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      int j = i & 31;
      /* Left rotate a[i] by j bits; the two shifted halves have no
         bits in common, so ^ acts the same as |.  */
      a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
    }
}

int
main ()
{
  int i;
  for (i = 0; i < 1000000; i++)
    foo ();
  return 0;
}

the timings for vanilla gcc, gcc with the rotate recognizer patch, and
the latter hand-edited to use rotate on a register are:
                vanilla         with patch      hand-edited
Intel i7-2600   0m1.421s        0m1.914s        0m0.826s
AMD 8354        0m2.353s        0m0.988s        0m1.407s

The vanilla gcc loop is long:
.L2:
        movl    a(,%rax,4), %edi
        movl    %eax, %esi
        andl    $31, %esi
        movl    %esi, %ecx
        negl    %ecx
        movl    %edi, %edx
        shrl    %cl, %edx
        movl    %esi, %ecx
        sall    %cl, %edi
        xorl    %edi, %edx
        movl    %edx, a(,%rax,4)
        addq    $1, %rax
        cmpq    $1024, %rax
        jne     .L2
With rotate using mem:
.L2:
        roll    %cl, a(,%rcx,4)
        addq    $1, %rcx
        cmpq    $1024, %rcx
        jne     .L2
and with the hand-tweaked rotate:
.L2:
        movl    a(,%rcx,4), %eax
        roll    %cl, %eax
        addq    $1, %rcx
        cmpq    $1024, %rcx
        movl    %eax, a-4(,%rcx,4)
        jne     .L2

So, on SandyBridge (I haven't tried other Intel CPUs) it might be
beneficial to disallow rotates with the =m , 0 constraints, while for
the above-mentioned AMD it is best to use them.  I guess the above
isn't sufficient data to decide that; we'd need info on more CPUs and
more rotation sizes.

BTW, with -mavx2 we used to vectorize the loop without the rotate
recognition patch, but with the patch we don't vectorize it anymore;
I'll need to adjust the pattern recognizer to handle rotates.  The
question is whether, for a missing vector rotate optab, we can replace
the rotate with the probably faster and shorter
x r<< y  -> (x << y) | (x >> (32 - y))
which is invalid for y = 0, but perhaps it would still work if the only
thing the vector shift by 32 would do is return either 0 or x; or
whether we want to go for the (x << y) | (x >> ((-y) & 31)) form.  The
i?86/x86_64 vector shifts don't truncate the shift count, so perhaps
best would be a vector rotate expander that expands it into some
special insn like a vectorized y ? x >> (32 - y) : 0.
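
For reference, here is a scalar C sketch of the two candidate
expansions (the function names are just illustrative):

/* Invalid for y == 0: for 32-bit x, x >> 32 is undefined.  */
static inline unsigned int
rotl_unsafe (unsigned int x, unsigned int y)
{
  return (x << y) | (x >> (32 - y));
}

/* Well defined for y in [0, 31]: for y == 0, (-y) & 31 is 0 and
   x >> 0 is x, so the result is x | x == x.  */
static inline unsigned int
rotl_safe (unsigned int x, unsigned int y)
{
  return (x << y) | (x >> ((-y) & 31));
}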

One question is whether our LROTATE_EXPR or RROTATE_EXPR is only
defined for rotation counts 0 through bitsize - 1 (as is the case when
we pattern detect it; do we create *ROTATE_EXPR any other way?), or
whether we allow arbitrary rotation counts (which would assume the
rotation is always done modulo the bitsize).  Because if the rotation
count could be arbitrary, we'd need to vectorize it as
(x << (y & 31)) | (x >> ((-y) & 31)).
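
A minimal scalar sketch of that arbitrary-count variant (again just
for illustration); masking y first keeps both shift counts in the
range [0, 31] whatever y is, which gives the modulo-bitsize semantics:

/* Rotate left by an arbitrary count, reduced modulo 32.  */
static inline unsigned int
rotl_any (unsigned int x, unsigned int y)
{
  return (x << (y & 31)) | (x >> ((-y) & 31));
}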

        Jakub
