On Thu, May 9, 2013 at 10:24 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
>> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>> > 2013-05-09  Jakub Jelinek  <ja...@redhat.com>
>> >
>> >         * config/i386/i386.md (rotateinv): New code attr.
>> >         (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
>> >         *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
>> >         roll $31, %eax, etc.
>>
>> OK.
>
> Thanks.  While talking about rotates, it seems they are quite expensive
> on SandyBridge if the destination (== source1) is memory.
>
> With -O3 -mavx:
>
> unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };
>
> __attribute__((noinline, noclone)) void
> foo (void)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     {
>       int j = i & 31;
>       a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
>     }
> }
>
> int
> main ()
> {
>   int i;
>   for (i = 0; i < 1000000; i++)
>     foo ();
>   return 0;
> }
>
> the timing for vanilla gcc, gcc with the rotate recognizer patch, and
> the latter hand-edited to use rotate on a register gives:
> Intel i7-2600   0m1.421s   0m1.914s   0m0.826s
> AMD 8354        0m2.353s   0m0.988s   0m1.407s
>
> The vanilla gcc loop is long:
> .L2:
>         movl    a(,%rax,4), %edi
>         movl    %eax, %esi
>         andl    $31, %esi
>         movl    %esi, %ecx
>         negl    %ecx
>         movl    %edi, %edx
>         shrl    %cl, %edx
>         movl    %esi, %ecx
>         sall    %cl, %edi
>         xorl    %edi, %edx
>         movl    %edx, a(,%rax,4)
>         addq    $1, %rax
>         cmpq    $1024, %rax
>         jne     .L2
> With rotate using mem:
> .L2:
>         roll    %cl, a(,%rcx,4)
>         addq    $1, %rcx
>         cmpq    $1024, %rcx
>         jne     .L2
> and with the hand-tweaked rotate:
> .L2:
>         movl    a(,%rcx,4), %eax
>         roll    %cl, %eax
>         addq    $1, %rcx
>         cmpq    $1024, %rcx
>         movl    %eax, a-4(,%rcx,4)
>         jne     .L2
>
> So, on SandyBridge (I haven't tried other Intel CPUs) it perhaps might
> be beneficial to disallow rotate with =m, 0 constraints, while for the
> above mentioned AMD it is best to use it.  I guess the above isn't
> sufficient data for that; we'd need info on more CPUs and more rotation
> sizes.
>
> BTW, with -mavx2 we vectorize the loop without the rotate recognition
> patch and don't vectorize it anymore with the patch, so I'll need to
> adjust the pattern recognizer to handle those.  The question is
> whether, for a missing vector rotate optab, we can replace it with the
> probably faster and shorter
> x r<< y -> (x << y) | (x >> (32 - y)) - which is invalid for y = 0,
> but perhaps if the only thing the vector shift would do for a shift by
> 32 is either return 0, or x, it would still work; or if we want to go
> for the (x << y) | (x >> ((-y) & 31)) form.  The i?86/x86_64 vector
> shifts don't truncate the shift count, so perhaps best would be a
> vector rotate expander that expands it into some special insn like a
> vectorized y ? x >> (32 - y) : 0.
>
> One question is whether our LROTATE_EXPR or RROTATE_EXPR is only
> defined for rotation counts 0 through bitsize - 1 (as is the case when
> we pattern detect it; do we create *ROTATE_EXPR some other way?), or
> whether we allow arbitrary rotation counts (I would assume the
> rotation is always done modulo bitsize).  Because if the rotation
> count could be arbitrary, we'd need to vectorize it as
> (x << (y & 31)) | (x >> ((-y) & 31)).
The rotation count should be 0 to bitsize - 1, similar to shifts.

Richard.

> 	Jakub