[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-16 Thread lloyd at randombit dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #17 from Jack Lloyd <lloyd at randombit dot net> ---
Thank you!

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-16 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #16 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Fixed.

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-14 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Author: jakub
Date: Sat Oct 14 18:48:38 2017
New Revision: 253761

URL: https://gcc.gnu.org/viewcvs?rev=253761&root=gcc&view=rev
Log:
PR middle-end/62263
PR middle-end/82498
* tree-ssa-phiopt.c (value_replacement): Comment fix.  Handle
up to 2 preparation statements for ASSIGN in MIDDLE_BB.

* c-c++-common/rotate-8.c: Expect no PHIs in optimized dump.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/c-c++-common/rotate-8.c
trunk/gcc/tree-ssa-phiopt.c

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-14 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #14 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Author: jakub
Date: Sat Oct 14 18:47:14 2017
New Revision: 253760

URL: https://gcc.gnu.org/viewcvs?rev=253760&root=gcc&view=rev
Log:
PR middle-end/62263
PR middle-end/82498
* tree-ssa-forwprop.c (simplify_rotate): Allow def_arg1[N]
to be any operand_equal_p operands.  For & (B - 1) require
B to be power of 2.  Recognize
(X << (Y & (B - 1))) | (X >> ((-Y) & (B - 1))) and similar patterns.

* c-c++-common/rotate-5.c (f2): New function.  Move old
function to ...
(f4): ... this.  Use 127 instead of 128.
(f3, f5, f6): New functions.
(main): Test all f[1-6] functions, with both 0 and 1 as
second arguments.
* c-c++-common/rotate-6.c: New test.
* c-c++-common/rotate-6a.c: New test.
* c-c++-common/rotate-7.c: New test.
* c-c++-common/rotate-7a.c: New test.
* c-c++-common/rotate-8.c: New test.

Added:
trunk/gcc/testsuite/c-c++-common/rotate-6.c
trunk/gcc/testsuite/c-c++-common/rotate-6a.c
trunk/gcc/testsuite/c-c++-common/rotate-7.c
trunk/gcc/testsuite/c-c++-common/rotate-7a.c
trunk/gcc/testsuite/c-c++-common/rotate-8.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/c-c++-common/rotate-5.c
trunk/gcc/tree-ssa-forwprop.c
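
For illustration, a minimal example (hypothetical function name, not one of
the committed testcases) of the shape simplify_rotate now recognizes, with
both shift counts masked by B - 1 for the power-of-2 bit width B:

unsigned
rotl_masked (unsigned x, unsigned y)
{
  /* B = 32 here: y & 31 and -y & 31 are the masked counts, so this is
     well defined for every y and matches
     (X << (Y & (B - 1))) | (X >> ((-Y) & (B - 1))).  */
  return (x << (y & 31)) | (x >> (-y & 31));
}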

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-13 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Author: jakub
Date: Fri Oct 13 07:28:46 2017
New Revision: 253709

URL: https://gcc.gnu.org/viewcvs?rev=253709&root=gcc&view=rev
Log:
PR target/82498
* fold-const.c (fold_binary_loc) <bit_rotate>: Code cleanups,
instead of handling MINUS_EXPR twice (once for each argument),
canonicalize operand order and handle just once, use rtype where
possible.  Handle (A << B) | (A >> (-B & (Z - 1))).

* gcc.dg/tree-ssa/pr82498.c: New test.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr82498.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/fold-const.c
trunk/gcc/testsuite/ChangeLog
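
As a sketch (hypothetical name, not the committed pr82498.c test), the newly
handled shape with only the right-shift count masked:

unsigned
rotl_fold (unsigned a, unsigned b)
{
  /* Matches (A << B) | (A >> (-B & (Z - 1))) with Z = 32; well defined
     as long as b itself stays below 32.  */
  return (a << b) | (a >> (-b & 31));
}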

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-13 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Author: jakub
Date: Fri Oct 13 07:17:06 2017
New Revision: 253708

URL: https://gcc.gnu.org/viewcvs?rev=253708&root=gcc&view=rev
Log:
PR target/82498
* config/i386/ia32intrin.h (__rold, __rord, __rolq, __rorq): Allow
any values of __C while still being pattern recognizable as a simple
rotate instruction.

* gcc.dg/ubsan/pr82498.c: New test.

Added:
trunk/gcc/testsuite/gcc.dg/ubsan/pr82498.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/ia32intrin.h
trunk/gcc/testsuite/ChangeLog

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Author: jakub
Date: Thu Oct 12 19:10:34 2017
New Revision: 253695

URL: https://gcc.gnu.org/viewcvs?rev=253695&root=gcc&view=rev
Log:
PR target/82498
* config/i386/i386.md (*ashl<mode>3_mask_1,
*<shift_insn><mode>3_mask_1, *<rotate_insn><mode>3_mask_1,
*<btsc><mode>_mask_1, *btr<mode>_mask_1): New define_insn_and_split
patterns.

* gcc.target/i386/pr82498-1.c: New test.
* gcc.target/i386/pr82498-2.c: New test.

Added:
trunk/gcc/testsuite/gcc.target/i386/pr82498-1.c
trunk/gcc/testsuite/gcc.target/i386/pr82498-2.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.md
trunk/gcc/testsuite/ChangeLog

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-12 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #10 from Marc Glisse <glisse at gcc dot gnu.org> ---
f1...f6 already have a LROTATE_EXPR in the .original dump. The others don't
get one until forwprop1, which runs after einline, so there is a small chance
that inlining causes other optimizations that mess with rotate detection (or
that the large-ish code present before the rotate is recognized prevents
early inlining, missing optimizations). I guess that, without going through
the large job of moving the rotate code from forwprop to match.pd, it would
be possible to add one basic transform recognizing precisely the case in
those intrinsics, if we pick one of f7...f11.

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 42343
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42343&action=edit
gcc8-pr82498-intrin.patch

Untested patch to fix the intrinsics.

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #6)
> (In reply to Jack Lloyd from comment #5)
> > Jakub, thank you very much for your comments; this was helpful for me in
> > getting consistent rol/ror generation.
> > 
> > Speaking as a user, it's frustrating that Clang and GCC don't just have a
> > builtin for rotations like MSVC does; instead you have to guess what
> > expressions the optimizer(s) know about. That said, there are a lot of
> > strange ways to write a rotate, and probably GCC doesn't need to know all
> > of them.
> 
> You can use __rol{b,w,d,q} and __ror{b,w,d,q} (and their aliases) from
> ia32intrin.h. These are standardized; you have to include the x86intrin.h
> header.

Well, at least as currently implemented, __ro{l,r}{d,q} require the shift
count to be in the range 1 to bitsize - 1.  So I think we should at least
change their implementation to be like f8 (with & instead of %), which we
know generates optimal code for any shift count.

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-12 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #7 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #6)
> You can use __rol{b,w,d,q} and __ror{b,w,d,q} (and their aliases) from
> ia32intrin.h. These are standardized; you have to include the x86intrin.h
> header.

Some of those break if you use -fsanitize=undefined.

#include <x86intrin.h>

int main(){
  unsigned i = 0;
  return __rold(i,0);
}

/usr/lib/gcc-snapshot/lib/gcc/x86_64-linux-gnu/8/include/ia32intrin.h:150:30:
runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
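
The failure comes from the pre-fix shape of the intrinsic, sketched here from
the error message (not the verbatim header source):

unsigned int
old_rold (unsigned int x, int c)
{
  /* For c == 0 the right-shift count is 32, which is undefined for a
     32-bit type -- exactly what -fsanitize=undefined reports above.  */
  return (x << c) | (x >> (32 - c));
}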

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-12 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #6 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Jack Lloyd from comment #5)
> Jakub, thank you very much for your comments; this was helpful for me in
> getting consistent rol/ror generation.
> 
> Speaking as a user, it's frustrating that Clang and GCC don't just have a
> builtin for rotations like MSVC does; instead you have to guess what
> expressions the optimizer(s) know about. That said, there are a lot of
> strange ways to write a rotate, and probably GCC doesn't need to know all
> of them.

You can use __rol{b,w,d,q} and __ror{b,w,d,q} (and their aliases) from
ia32intrin.h. These are standardized; you have to include the x86intrin.h
header.
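
For example, a minimal usage sketch:

#include <x86intrin.h>

unsigned int
rotl_by_8 (unsigned int v)
{
  return __rold (v, 8);  /* rotate v left by 8 bits */
}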

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-11 Thread lloyd at randombit dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #5 from Jack Lloyd <lloyd at randombit dot net> ---
Jakub, thank you very much for your comments; this was helpful for me in
getting consistent rol/ror generation.

Speaking as a user, it's frustrating that Clang and GCC don't just have a
builtin for rotations like MSVC does; instead you have to guess what
expressions the optimizer(s) know about. That said, there are a lot of
strange ways to write a rotate, and probably GCC doesn't need to know all of
them.

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Two further cases:
unsigned
f10 (unsigned x, unsigned char y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y & ((__CHAR_BIT__ * __SIZEOF_INT__) - 1)));
}

unsigned
f11 (unsigned x, unsigned short y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y & ((__CHAR_BIT__ * __SIZEOF_INT__) - 1)));
}

For f11 GCC also generates efficient code; for f10 it emits a useless &
(the hardware rotate truncates the count modulo the operand size anyway, so
the extra masking is redundant).
Guess the f10 case would be improved by the addition of a
*<rotate_insn><mode>3_mask_1 define_insn_and_split (and similarly the
inefficient/nonportable f1 code would be slightly improved).

Looking at LLVM, f1/f3/f5 are worse in LLVM than in GCC, and in all cases it
uses branching instead of cmov; f7/f8/f9/f10/f11 all generate efficient code
though, the same as GCC in the case of f8 and f11.

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Not to mention that the #c0 code has undefined behavior if rot is not 0 but
is a multiple of 8 * sizeof(uint32_t), such as 32, 64, ...
If you insisted on the rot == 0 check, it would need to be done after the
rot %= ..., and then GCC wouldn't recognize that the and operation is useless.

So, on a more complete testcase where f1, f2, f3 have UB on multiples of 32
other than 0, we get the most efficient code for f5 and f8:
unsigned
f1 (unsigned x, unsigned char y)
{
  if (y == 0)
    return x;
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f2 (unsigned x, unsigned y)
{
  if (y == 0)
    return x;
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f3 (unsigned x, unsigned short y)
{
  if (y == 0)
    return x;
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f4 (unsigned x, unsigned char y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  if (y == 0)
    return x;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f5 (unsigned x, unsigned y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  if (y == 0)
    return x;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f6 (unsigned x, unsigned short y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  if (y == 0)
    return x;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f7 (unsigned x, unsigned char y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y % (__CHAR_BIT__ * __SIZEOF_INT__)));
}

unsigned
f8 (unsigned x, unsigned int y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y % (__CHAR_BIT__ * __SIZEOF_INT__)));
}

unsigned
f9 (unsigned x, unsigned short y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y % (__CHAR_BIT__ * __SIZEOF_INT__)));
}

where f8 is pattern-recognized as x r<< (y & 31) and f5 is pattern-recognized
and phiopt-optimized into that.
The rest produce inefficient code: f1/f4/f6 with a useless & plus a useless
comparison + cmov, f2/f3 with just the useless comparison/cmov, and f7/f9
aren't even pattern-recognized as rotates.

The real question is how many different portable and non-portable ways of
writing this we really need to pattern-recognize; there are many weird ways
this can be expressed.
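
A quick self-check sketch (hypothetical driver, assuming the f1-f9
definitions above sit in the same file), exercising counts where every
variant is well defined:

#include <stdio.h>

int
main (void)
{
  unsigned x = 0x12345678U;
  for (unsigned y = 1; y < 32; y++)
    if (f1 (x, y) != f8 (x, y) || f5 (x, y) != f8 (x, y))
      {
        printf ("mismatch at y = %u\n", y);
        return 1;
      }
  return 0;
}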

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The pattern is written for what people normally do, i.e. use int/unsigned int
shift count rather than unsigned char.

BTW, in any case, that isn't a portable rotate that you get efficient code
for anyway; you get much better code if you don't special-case 0:
#include <stdint.h>
#include <limits.h>

uint32_t
baz (uint32_t x, unsigned y)
{
  y %= CHAR_BIT * sizeof (uint32_t);
  return (x << y) | (x >> (-y % (CHAR_BIT * sizeof (uint32_t))));
}

is what the compiler will pattern-match and generate optimal code for, and
even when it doesn't, the result is still efficient and branch-free.

[Bug target/82498] Missed optimization for x86 rotate instruction

2017-10-10 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

Marc Glisse <glisse at gcc dot gnu.org> changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-10-10
 Ever confirmed|0   |1

--- Comment #1 from Marc Glisse <glisse at gcc dot gnu.org> ---
Looks like https://stackoverflow.com/q/44000956/1918193 . During combine, we
try to match

(set (reg:SI 97)
    (rotate:SI (reg/v:SI 90 [ input ])
        (and:QI (subreg:QI (reg:SI 92 [ rot ]) 0)
            (const_int 31 [0x1f]))))

But the pattern in i386.md has 'and' and 'subreg' reversed.

For the other part, we have a very limited transform that removes the test in
this case:
uint32_t rotate_left(uint32_t input, int rot)
{
  if(rot == 0)
    return input;
  return static_cast<uint32_t>((input << rot) |
                               (input >> (8*sizeof(uint32_t)-rot)));
}
But it only works when there is a single gimple insn involved, not
and+cast+rotate.
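
For contrast, a branch-free C variant (hypothetical name) in the masked form
the rest of the thread converges on; with the rot == 0 test gone, only a
single statement is left to match:

#include <stdint.h>

uint32_t
rotate_left_nobranch (uint32_t input, unsigned rot)
{
  rot %= 32;
  /* -rot % 32 equals (32 - rot) % 32 for unsigned rot, so rot == 0
     degenerates to two zero-bit shifts instead of undefined behavior.  */
  return (input << rot) | (input >> (-rot % 32));
}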