Re: [PATCH] Improve rotation by mode bitsize - 1 (take 2)

2013-05-14 Thread Richard Biener
On Mon, 13 May 2013, Jakub Jelinek wrote:

> On Fri, May 10, 2013 at 07:15:38PM +0200, Jan Hubicka wrote:
> > It seems to me that it is not different from normalizing reg-10 into
> > reg+(-10), which we have done for years (and for good reason).  It is
> > still a target preference when to use add and when to use sub to perform
> > the arithmetic, but it makes sense to keep a single canonical form of
> > the expression both in RTL and Gimple.
> > 
> > For example we may want to be able to prove that 
> >   (rotate reg 31) == (rotatert reg 1)
> > is true or
> >   (rotate reg 30) == (rotatert reg 2)
> > is also true or cross jump both variants into one instruction.
> 
> Ok, this patch reverts my earlier patch and does the canonicalization, for
> now for RTL only.  Bootstrapped/regtested on x86_64-linux and i686-linux, ok
> for trunk?

Ok.

Thanks,
Richard.

> 2013-05-13  Jakub Jelinek  
> 
>   * expmed.c (expand_shift_1): Canonicalize rotates by
>   constant bitsize / 2 to bitsize - 1.
>   * simplify-rtx.c (simplify_binary_operation_1) <case ROTATE,
>   case ROTATERT>: Likewise.
> 
>   Revert:
>   2013-05-10  Jakub Jelinek  
> 
>   * config/i386/i386.md (rotateinv): New code attr.
>   (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
>   *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
>   roll $31, %eax, etc.
> 
> --- gcc/expmed.c.jj   2013-05-13 13:03:31.0 +0200
> +++ gcc/expmed.c  2013-05-13 15:22:39.456194286 +0200
> @@ -2122,6 +2122,20 @@ expand_shift_1 (enum tree_code code, enu
>   op1 = SUBREG_REG (op1);
>  }
>  
> +  /* Canonicalize rotates by constant amount.  If op1 is bitsize / 2,
> + prefer left rotation, if op1 is from bitsize / 2 + 1 to
> + bitsize - 1, use other direction of rotate with 1 .. bitsize / 2 - 1
> + amount instead.  */
> +  if (rotate
> +  && CONST_INT_P (op1)
> +  && IN_RANGE (INTVAL (op1), GET_MODE_BITSIZE (mode) / 2 + left,
> +GET_MODE_BITSIZE (mode) - 1))
> +{
> +  op1 = GEN_INT (GET_MODE_BITSIZE (mode) - INTVAL (op1));
> +  left = !left;
> +  code = left ? LROTATE_EXPR : RROTATE_EXPR;
> +}
> +
>if (op1 == const0_rtx)
>  return shifted;
>  
> --- gcc/simplify-rtx.c.jj 2013-05-02 12:42:25.0 +0200
> +++ gcc/simplify-rtx.c2013-05-13 15:48:31.171182716 +0200
> @@ -3250,6 +3250,18 @@ simplify_binary_operation_1 (enum rtx_co
>  
>  case ROTATERT:
>  case ROTATE:
> +  /* Canonicalize rotates by constant amount.  If op1 is bitsize / 2,
> +  prefer left rotation, if op1 is from bitsize / 2 + 1 to
> +  bitsize - 1, use other direction of rotate with 1 .. bitsize / 2 - 1
> +  amount instead.  */
> +  if (CONST_INT_P (trueop1)
> +   && IN_RANGE (INTVAL (trueop1),
> +GET_MODE_BITSIZE (mode) / 2 + (code == ROTATE),
> +GET_MODE_BITSIZE (mode) - 1))
> + return simplify_gen_binary (code == ROTATE ? ROTATERT : ROTATE,
> + mode, op0, GEN_INT (GET_MODE_BITSIZE (mode)
> + - INTVAL (trueop1)));
> +  /* FALLTHRU */
>  case ASHIFTRT:
>if (trueop1 == CONST0_RTX (mode))
>   return op0;
> --- gcc/config/i386/i386.md.jj2013-05-13 09:44:51.675494325 +0200
> +++ gcc/config/i386/i386.md   2013-05-13 15:09:37.461637593 +0200
> @@ -762,9 +762,6 @@ (define_code_attr rotate_insn [(rotate "
>  ;; Base name for insn mnemonic.
>  (define_code_attr rotate [(rotate "rol") (rotatert "ror")])
>  
> -;; Base name for insn mnemonic of rotation in the other direction.
> -(define_code_attr rotateinv [(rotate "ror") (rotatert "rol")])
> -
>  ;; Mapping of abs neg operators
>  (define_code_iterator absneg [abs neg])
>  
> @@ -9755,15 +9752,11 @@ (define_insn "*<rotate_insn><mode>3_1"
>        return "#";
>
>      default:
> -      if (TARGET_SHIFT1 || optimize_function_for_size_p (cfun))
> -	{
> -	  if (operands[2] == const1_rtx)
> -	    return "<rotate>{<imodesuffix>}\t%0";
> -	  if (CONST_INT_P (operands[2])
> -	      && INTVAL (operands[2]) == GET_MODE_BITSIZE (<MODE>mode) - 1)
> -	    return "<rotateinv>{<imodesuffix>}\t%0";
> -	}
> -      return "<rotate>{<imodesuffix>}\t{%2, %0|%0, %2}";
> +      if (operands[2] == const1_rtx
> +	  && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
> +	return "<rotate>{<imodesuffix>}\t%0";
> +      else
> +	return "<rotate>{<imodesuffix>}\t{%2, %0|%0, %2}";
>  }
>  }
>[(set_attr "isa" "*,bmi2")
> @@ -9825,14 +9818,11 @@ (define_insn "*<rotate_insn>si3_1_zext"
>        return "#";
>
>      default:
> -      if (TARGET_SHIFT1 || optimize_function_for_size_p (cfun))
> -	{
> -	  if (operands[2] == const1_rtx)
> -	    return "<rotate>{l}\t%k0";
> -	  if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 31)
> -	    return "<rotateinv>{l}\t%k0";
> -	}
> -      return "<rotate>{l}\t{%2, %k0|%k0, %2}";
> +      if (operands[2] == const1_rtx
> +	  && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
> +	return "<rotate>{l}\t%k0";
> +      else
> +	return "<rotate>{l}\t{%2, %k0|%k0, %2}";
>  }
>  }
>[(set_attr "isa"

Re: [PATCH] Improve rotation by mode bitsize - 1 (take 2)

2013-05-13 Thread Uros Bizjak
On Mon, May 13, 2013 at 6:43 PM, Jakub Jelinek  wrote:
> On Fri, May 10, 2013 at 07:15:38PM +0200, Jan Hubicka wrote:
>> It seems to me that it is not different from normalizing reg-10 into
>> reg+(-10), which we have done for years (and for good reason).  It is
>> still a target preference when to use add and when to use sub to perform
>> the arithmetic, but it makes sense to keep a single canonical form of
>> the expression both in RTL and Gimple.
>>
>> For example we may want to be able to prove that
>>   (rotate reg 31) == (rotatert reg 1)
>> is true or
>>   (rotate reg 30) == (rotatert reg 2)
>> is also true or cross jump both variants into one instruction.
>
> Ok, this patch reverts my earlier patch and does the canonicalization, for
> now for RTL only.  Bootstrapped/regtested on x86_64-linux and i686-linux, ok
> for trunk?
>
> 2013-05-13  Jakub Jelinek  
>
> * expmed.c (expand_shift_1): Canonicalize rotates by
> constant bitsize / 2 to bitsize - 1.
> * simplify-rtx.c (simplify_binary_operation_1) <case ROTATE,
> case ROTATERT>: Likewise.
>
> Revert:
> 2013-05-10  Jakub Jelinek  
>
> * config/i386/i386.md (rotateinv): New code attr.
> (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
> *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
> roll $31, %eax, etc.

You can revert your own patch without approval, so the patch approval
depends solely on the approval from the middle-end maintainer.

Thanks,
Uros.


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-10 Thread Jan Hubicka
> Hi!
> 
> Note, the patch has already been committed.
> 
> On Fri, May 10, 2013 at 03:48:50PM +0200, Jan Hubicka wrote:
> > Going this route you will also need to update
> > 
> >   [(set_attr "type" "rotate1")
> >    (set (attr "length_immediate")
> >      (if_then_else
> >        (and (match_operand 1 "const1_operand")
> >             (ior (match_test "TARGET_SHIFT1")
> >                  (match_test "optimize_function_for_size_p (cfun)")))
> >        (const_string "0")
> >        (const_string "*")))
> >    (set_attr "mode" "QI")])
> > 
> > computing the immediate size.  Why we can't canonicalize this in
> > folding/simplify_rtx/combiner? I do not see much benefit allowing both
> > forms of instructions in our insn streams...
> 
> Both directions of rotation make sense for variable length rotation, and
> why would the generic simplify-rtx.c code want to prefer one rotate over
> the other direction for constant rotation counts?  That is clearly a
> target-specific preference.

It seems to me that it is not different from normalizing reg-10 into
reg+(-10), which we have done for years (and for good reason).  It is
still a target preference when to use add and when to use sub to perform
the arithmetic, but it makes sense to keep a single canonical form of
the expression both in RTL and Gimple.

For example we may want to be able to prove that 
  (rotate reg 31) == (rotatert reg 1)
is true or
  (rotate reg 30) == (rotatert reg 2)
is also true or cross jump both variants into one instruction.

So it seems to me that canonicalizing constant rotations to get the operand
into the range 0 ... bitsize/2 makes sense in general and will make the extra
special case in i386.md unnecessary.
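
To make the equivalences above concrete, here is a minimal C sketch (the
rotl/rotr helpers are invented for illustration and assume a 32-bit
unsigned int; they are not code from the patch):

#include <assert.h>

/* Rotate a 32-bit value left/right; the (-n) & 31 form keeps both shift
   counts in 0..31, so the expressions are well defined even for n == 0.  */
static unsigned int rotl (unsigned int x, unsigned int n)
{ return (x << n) | (x >> ((-n) & 31)); }
static unsigned int rotr (unsigned int x, unsigned int n)
{ return (x >> n) | (x << ((-n) & 31)); }

int main (void)
{
  unsigned int reg = 0xdeadbeef;
  assert (rotl (reg, 31) == rotr (reg, 1)); /* (rotate reg 31) == (rotatert reg 1) */
  assert (rotl (reg, 30) == rotr (reg, 2)); /* (rotate reg 30) == (rotatert reg 2) */
  return 0;
}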
> 
> What we could do instead of the patch I've committed is just tweak the
> expansion of the patterns, assuming it is unlikely RTL optimizations
> materialize a constant rotation out of non-constant rotations after
> expansion time.
> 
> Or the length_immediate attributes could be adjusted, shouldn't be too hard
> (just replace that (match_operand 1 "const1_operand") in there with
> (and (match_operand 1 "const_int_operand")
>  (ior (match_operand 1 "const1_operand")
> (match_test "INTVAL (operands[1]) == GET_MODE_BITSIZE (<MODE>mode) - 1")))
> or similar (untested).

Yep, we should either update the pattern or add canonicalization.  I still think
the second variant is better...

Honza


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-10 Thread Jakub Jelinek
Hi!

Note, the patch has already been committed.

On Fri, May 10, 2013 at 03:48:50PM +0200, Jan Hubicka wrote:
> Going this route you will also need to update
> 
>   [(set_attr "type" "rotate1")
>    (set (attr "length_immediate")
>      (if_then_else
>        (and (match_operand 1 "const1_operand")
>             (ior (match_test "TARGET_SHIFT1")
>                  (match_test "optimize_function_for_size_p (cfun)")))
>        (const_string "0")
>        (const_string "*")))
>    (set_attr "mode" "QI")])
> 
> computing the immediate size.  Why we can't canonicalize this in
> folding/simplify_rtx/combiner? I do not see much benefit allowing both
> forms of instructions in our insn streams...

Both directions of rotation make sense for variable length rotation, and
why would the generic simplify-rtx.c code want to prefer one rotate over
the other direction for constant rotation counts?  That is clearly a
target-specific preference.

What we could do instead of the patch I've committed is just tweak the
expansion of the patterns, assuming it is unlikely RTL optimizations
materialize a constant rotation out of non-constant rotations after
expansion time.

Or the length_immediate attributes could be adjusted, shouldn't be too hard
(just replace that (match_operand 1 "const1_operand") in there with
(and (match_operand 1 "const_int_operand")
 (ior (match_operand 1 "const1_operand")
  (match_test "INTVAL (operands[1]) == GET_MODE_BITSIZE (<MODE>mode) - 1")))
or similar (untested).

Jakub


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-10 Thread Jan Hubicka
> Hi!
> 
> This is something I've noticed while working on the rotate recognizer
> patch I've just posted.  We emit say
>   roll %eax
> instead of
>   roll $1, %eax
> because the former is shorter, but emit
>   roll $31, %eax
> instead of the equivalent, but shorter
>   rorl %eax
> The following patch lets us optimize even those.  Bootstrapped/regtested
> on x86_64-linux and i686-linux, ok for trunk?
> 
> 2013-05-09  Jakub Jelinek  
> 
>   * config/i386/i386.md (rotateinv): New code attr.
>   (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
>   *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
>   roll $31, %eax, etc.

Going this route you will also need to update

  [(set_attr "type" "rotate1")
   (set (attr "length_immediate")
     (if_then_else
       (and (match_operand 1 "const1_operand")
            (ior (match_test "TARGET_SHIFT1")
                 (match_test "optimize_function_for_size_p (cfun)")))
       (const_string "0")
       (const_string "*")))
   (set_attr "mode" "QI")])

computing the immediate size.  Why we can't canonicalize this in
folding/simplify_rtx/combiner? I do not see much benefit allowing both
forms of instructions in our insn streams...

Honza


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-10 Thread Richard Biener
On Thu, May 9, 2013 at 10:24 PM, Jakub Jelinek  wrote:
> On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
>> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek  wrote:
>> > 2013-05-09  Jakub Jelinek  
>> >
>> > * config/i386/i386.md (rotateinv): New code attr.
>> > (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
>> > *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
>> > roll $31, %eax, etc.
>>
>> OK.
>
> Thanks.  While talking about rotates, it seems they are quite expensive
> on SandyBridge if the destination (== source1) is memory.
>
> With -O3 -mavx:
>
> unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };
>
> __attribute__((noinline, noclone)) void
> foo (void)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
> {
>   int j = i & 31;
>   a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
> }
> }
>
> int
> main ()
> {
>   int i;
>   for (i = 0; i < 100; i++)
> foo ();
>   return 0;
> }
>
> the timing for vanilla gcc, gcc with the rotate recognizer patch, and
> the latter hand-edited to use rotate on register gives:
> Intel i7-2600   0m1.421s    0m1.914s    0m0.826s
> AMD 8354        0m2.353s    0m0.988s    0m1.407s
>
> The vanilla gcc loop is long:
> .L2:
> movl    a(,%rax,4), %edi
> movl    %eax, %esi
> andl    $31, %esi
> movl    %esi, %ecx
> negl    %ecx
> movl    %edi, %edx
> shrl    %cl, %edx
> movl    %esi, %ecx
> sall    %cl, %edi
> xorl    %edi, %edx
> movl    %edx, a(,%rax,4)
> addq    $1, %rax
> cmpq    $1024, %rax
> jne     .L2
> With rotate using mem:
> .L2:
> roll    %cl, a(,%rcx,4)
> addq    $1, %rcx
> cmpq    $1024, %rcx
> jne     .L2
> and with the hand tweaked rotate:
> .L2:
> movl    a(,%rcx,4), %eax
> roll    %cl, %eax
> addq    $1, %rcx
> cmpq    $1024, %rcx
> movl    %eax, a-4(,%rcx,4)
> jne     .L2
>
> So, on SandyBridge (haven't tried other Intel CPUs) it might be
> beneficial to disallow rotate with =m, 0 constraints, while for the
> above-mentioned AMD it is best to use it.  I guess the above isn't
> sufficient data for that; we'd need info on more CPUs and more rotation sizes.
>
> BTW, with -mavx2 we vectorize the loop without the rotate recognition patch,
> but with it we don't vectorize anymore; I'll need to adjust the pattern
> recognizer to handle those.  The question is if for a missing vector rotate optab
> we can replace it with the probably faster and shorter
> x r<< y  -> (x << y) | (x >> (32 - y)) - which is invalid for y = 0,
> but perhaps if the only thing the vector shift would do for shift by 32
> would be either return 0, or x, it would still work, or if we want to go
> for the (x << y) | (x >> ((-y) & 31)).  The i?86/x86_64 vector shifts don't
> truncate the shift count, so perhaps best would be a vector rotate expander
> that expands it into some special insn like vectorized y ? x >> (32 - y) : 0.
>
> One question is whether our LROTATE_EXPR or RROTATE_EXPR is only defined for
> rotation counts 0 through bitsize - 1 (as is the case when we pattern detect
> it; do we create *ROTATE_EXPR any other way?), or whether we allow arbitrary
> rotation counts (which would assume the rotation is always done modulo
> bitsize).  Because if the rotation count could be arbitrary, we'd need
> to vectorize it as (x << (y & 31)) | (x >> ((-y) & 31))

The rotation count should be 0 to bitsize - 1 similar to shifts.

Richard.

> Jakub


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-09 Thread Jakub Jelinek
On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek  wrote:
> > 2013-05-09  Jakub Jelinek  
> >
> > * config/i386/i386.md (rotateinv): New code attr.
> > (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
> > *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
> > roll $31, %eax, etc.
> 
> OK.

Thanks.  While talking about rotates, it seems they are quite expensive
on SandyBridge if the destination (== source1) is memory.

With -O3 -mavx:

unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };

__attribute__((noinline, noclone)) void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
{
  int j = i & 31;
  a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
}
}

int
main ()
{
  int i;
  for (i = 0; i < 100; i++)
foo ();
  return 0;
}

the timing for vanilla gcc, gcc with the rotate recognizer patch, and
the latter hand-edited to use rotate on register gives:
Intel i7-2600   0m1.421s    0m1.914s    0m0.826s
AMD 8354        0m2.353s    0m0.988s    0m1.407s

The vanilla gcc loop is long:
.L2:
movl    a(,%rax,4), %edi
movl    %eax, %esi
andl    $31, %esi
movl    %esi, %ecx
negl    %ecx
movl    %edi, %edx
shrl    %cl, %edx
movl    %esi, %ecx
sall    %cl, %edi
xorl    %edi, %edx
movl    %edx, a(,%rax,4)
addq    $1, %rax
cmpq    $1024, %rax
jne     .L2
With rotate using mem:
.L2:
roll    %cl, a(,%rcx,4)
addq    $1, %rcx
cmpq    $1024, %rcx
jne     .L2
and with the hand tweaked rotate:
.L2:
movl    a(,%rcx,4), %eax
roll    %cl, %eax
addq    $1, %rcx
cmpq    $1024, %rcx
movl    %eax, a-4(,%rcx,4)
jne     .L2

So, on SandyBridge (haven't tried other Intel CPUs) it might be
beneficial to disallow rotate with =m, 0 constraints, while for the
above-mentioned AMD it is best to use it.  I guess the above isn't
sufficient data for that; we'd need info on more CPUs and more rotation sizes.

BTW, with -mavx2 we vectorize the loop without the rotate recognition patch,
but with it we don't vectorize anymore; I'll need to adjust the pattern
recognizer to handle those.  The question is if for a missing vector rotate optab
we can replace it with the probably faster and shorter
x r<< y  -> (x << y) | (x >> (32 - y)) - which is invalid for y = 0,
but perhaps if the only thing the vector shift would do for shift by 32
would be either return 0, or x, it would still work, or if we want to go
for the (x << y) | (x >> ((-y) & 31)).  The i?86/x86_64 vector shifts don't
truncate the shift count, so perhaps best would be a vector rotate expander
that expands it into some special insn like vectorized y ? x >> (32 - y) : 0.

One question is whether our LROTATE_EXPR or RROTATE_EXPR is only defined for
rotation counts 0 through bitsize - 1 (as is the case when we pattern detect
it; do we create *ROTATE_EXPR any other way?), or whether we allow arbitrary
rotation counts (which would assume the rotation is always done modulo
bitsize).  Because if the rotation count could be arbitrary, we'd need
to vectorize it as (x << (y & 31)) | (x >> ((-y) & 31))
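
As a minimal C model of that last form (the helper name is invented for
illustration and assumes a 32-bit unsigned int; it is not code from any
patch in this thread): masking both counts keeps every shift in 0..31, so
the expression is well defined even for y == 0, where both masked counts
are 0 and the OR simply reproduces x; no shift by 32 ever occurs, which is
what makes this variant suitable for vector shifts that do not truncate
the count.

#include <stdio.h>

/* (x << (y & 31)) | (x >> ((-y) & 31)), well defined for any y;
   assumes unsigned int is 32 bits wide.  */
static unsigned int rotl32_any (unsigned int x, unsigned int y)
{
  return (x << (y & 31)) | (x >> ((-y) & 31));
}

int main (void)
{
  /* y == 0 is the case that breaks the (32 - y) variant; here it is fine.  */
  printf ("%x %x\n", rotl32_any (0xdeadbeef, 0), rotl32_any (0xdeadbeef, 4));
  return 0;
}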

Jakub


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-09 Thread Uros Bizjak
On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek  wrote:

> This is something I've noticed while working on the rotate recognizer
> patch I've just posted.  We emit say
>   roll %eax
> instead of
>   roll $1, %eax
> because the former is shorter, but emit
>   roll $31, %eax
> instead of the equivalent, but shorter
>   rorl %eax
> The following patch lets us optimize even those.  Bootstrapped/regtested
> on x86_64-linux and i686-linux, ok for trunk?
>
> 2013-05-09  Jakub Jelinek  
>
> * config/i386/i386.md (rotateinv): New code attr.
> (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
> *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
> roll $31, %eax, etc.

OK.

Thanks,
Uros.


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-09 Thread Jakub Jelinek
On Thu, May 09, 2013 at 08:50:59PM +0200, Steven Bosscher wrote:
> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek wrote:
> > This is something I've noticed while working on the rotate recognizer
> > patch I've just posted.  We emit say
> >   roll %eax
> > instead of
> >   roll $1, %eax
> > because the former is shorter, but emit
> >   roll $31, %eax
> > instead of the equivalent, but shorter
> >   rorl %eax
> 
> Wouldn't this be better done as one or more peephole2s?

Given that the output routine is in all cases already C code,
I think peephole2s would only be slower and more code.

Jakub


Re: [PATCH] Improve rotation by mode bitsize - 1

2013-05-09 Thread Steven Bosscher
On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek wrote:
> Hi!
>
> This is something I've noticed while working on the rotate recognizer
> patch I've just posted.  We emit say
>   roll %eax
> instead of
>   roll $1, %eax
> because the former is shorter, but emit
>   roll $31, %eax
> instead of the equivalent, but shorter
>   rorl %eax

Wouldn't this be better done as one or more peephole2s?

Ciao!
Steven