On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote: > On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <ja...@redhat.com> wrote: > > 2013-05-09 Jakub Jelinek <ja...@redhat.com> > > > > * config/i386/i386.md (rotateinv): New code attr. > > (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext, > > *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of > > roll $31, %eax, etc. > > OK.
Thanks. While talking about rotates, seems they are quite expensive on SandyBridge if the destination (== source1) is memory. With -O3 -mavx: unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 }; __attribute__((noinline, noclone)) void foo (void) { int i; for (i = 0; i < 1024; i++) { int j = i & 31; a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31)); } } int main () { int i; for (i = 0; i < 1000000; i++) foo (); return 0; } the timing for vanilla gcc, gcc with the rotate recognizer patch and the latter hand editted to use rotate on register gives: Intel i7-2600 0m1.421s 0m1.914s 0m0.826s AMD 8354 0m2.353s 0m0.988s 0m1.407s The vanilla gcc loop is long: .L2: movl a(,%rax,4), %edi movl %eax, %esi andl $31, %esi movl %esi, %ecx negl %ecx movl %edi, %edx shrl %cl, %edx movl %esi, %ecx sall %cl, %edi xorl %edi, %edx movl %edx, a(,%rax,4) addq $1, %rax cmpq $1024, %rax jne .L2 With rotate using mem: .L2: roll %cl, a(,%rcx,4) addq $1, %rcx cmpq $1024, %rcx jne .L2 and with the hand tweaked rotate: .L2: movl a(,%rcx,4), %eax roll %cl, %eax addq $1, %rcx cmpq $1024, %rcx movl %eax, a-4(,%rcx,4) jne .L2 So, on SandyBridge (haven't tried other Intel CPUs) it perhaps might be beneficial to disallow rotate with =m , 0 constraints, while for the above mentioned AMD it is best to use it. Guess the above isn't sufficient data for that, we'd need info on more CPUs and more rotation sizes. BTW, with -mavx2 we vectorize the loop without the rotate recognition patch and don't vectorize anymore, I'll need to adjust the pattern recognizer to handle those. The question is if for missing vector rotate optab we can replace it with the probably faster and shorter x r<< y -> (x << y) | (x >> (32 - y)) - which is invalid for y = 0, but perhaps if the only thing the vector shift would do for shift by 32 would be either return 0, or x, it would still work, or if we want to go for the (x << y) | (x >> ((-y) & 31)). The i?86/x86_64 vector shifts don't truncate the shift count, so perhaps best would be a vector rotate expander that expands it into some some special insn like vectorized y ? x >> (32 - y) : 0 One question is whether our LROTATE_EXPR or RROTATE_EXPR is only defined for rotation counts 0 through bitsize - 1 (as is the case when we pattern detect it, do we create *ROTATE_EXPR other way?), or whether we allow arbitrary rotation counts (would assume the rotation be done always with modulo bitsize). Because if the rotation count could be arbitrary, we'd need to vectorize it as (x << (y & 31)) | (x >> ((-y) & 31)) Jakub