[Bug target/82498] Missed optimization for x86 rotate instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498 --- Comment #17 from Jack Lloyd --- Thank you!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

Jakub Jelinek changed:

           What    |Removed      |Added
             Status|NEW          |RESOLVED
         Resolution|---          |FIXED

--- Comment #16 from Jakub Jelinek ---
Fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #15 from Jakub Jelinek ---
Author: jakub
Date: Sat Oct 14 18:48:38 2017
New Revision: 253761

URL: https://gcc.gnu.org/viewcvs?rev=253761&root=gcc&view=rev
Log:
	PR middle-end/62263
	PR middle-end/82498
	* tree-ssa-phiopt.c (value_replacement): Comment fix.  Handle up
	to 2 preparation statements for ASSIGN in MIDDLE_BB.

	* c-c++-common/rotate-8.c: Expect no PHIs in optimized dump.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/rotate-8.c
    trunk/gcc/tree-ssa-phiopt.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #14 from Jakub Jelinek ---
Author: jakub
Date: Sat Oct 14 18:47:14 2017
New Revision: 253760

URL: https://gcc.gnu.org/viewcvs?rev=253760&root=gcc&view=rev
Log:
	PR middle-end/62263
	PR middle-end/82498
	* tree-ssa-forwprop.c (simplify_rotate): Allow def_arg1[N] to be
	any operand_equal_p operands.  For & (B - 1) require B to be
	power of 2.  Recognize
	(X << (Y & (B - 1))) | (X >> ((-Y) & (B - 1))) and similar
	patterns.

	* c-c++-common/rotate-5.c (f2): New function.  Move old function
	to ...
	(f4): ... this.  Use 127 instead of 128.
	(f3, f5, f6): New functions.
	(main): Test all f[1-6] functions, with both 0 and 1 as second
	arguments.
	* c-c++-common/rotate-6.c: New test.
	* c-c++-common/rotate-6a.c: New test.
	* c-c++-common/rotate-7.c: New test.
	* c-c++-common/rotate-7a.c: New test.
	* c-c++-common/rotate-8.c: New test.

Added:
    trunk/gcc/testsuite/c-c++-common/rotate-6.c
    trunk/gcc/testsuite/c-c++-common/rotate-6a.c
    trunk/gcc/testsuite/c-c++-common/rotate-7.c
    trunk/gcc/testsuite/c-c++-common/rotate-7a.c
    trunk/gcc/testsuite/c-c++-common/rotate-8.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/c-c++-common/rotate-5.c
    trunk/gcc/tree-ssa-forwprop.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #13 from Jakub Jelinek ---
Author: jakub
Date: Fri Oct 13 07:28:46 2017
New Revision: 253709

URL: https://gcc.gnu.org/viewcvs?rev=253709&root=gcc&view=rev
Log:
	PR target/82498
	* fold-const.c (fold_binary_loc): Code cleanups, instead of
	handling MINUS_EXPR twice (once for each argument), canonicalize
	operand order and handle just once, use rtype where possible.
	Handle (A << B) | (A >> (-B & (Z - 1))).

	* gcc.dg/tree-ssa/pr82498.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr82498.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/testsuite/ChangeLog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #12 from Jakub Jelinek ---
Author: jakub
Date: Fri Oct 13 07:17:06 2017
New Revision: 253708

URL: https://gcc.gnu.org/viewcvs?rev=253708&root=gcc&view=rev
Log:
	PR target/82498
	* config/i386/ia32intrin.h (__rold, __rord, __rolq, __rorq):
	Allow any values of __C while still being pattern recognizable as
	a simple rotate instruction.

	* gcc.dg/ubsan/pr82498.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/ubsan/pr82498.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/ia32intrin.h
    trunk/gcc/testsuite/ChangeLog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #11 from Jakub Jelinek ---
Author: jakub
Date: Thu Oct 12 19:10:34 2017
New Revision: 253695

URL: https://gcc.gnu.org/viewcvs?rev=253695&root=gcc&view=rev
Log:
	PR target/82498
	* config/i386/i386.md (*ashl<mode>3_mask_1,
	*<shift_insn><mode>3_mask_1, *<rotate_insn><mode>3_mask_1,
	*<btsc><mode>_mask_1, *btr<mode>_mask_1): New
	define_insn_and_split patterns.

	* gcc.target/i386/pr82498-1.c: New test.
	* gcc.target/i386/pr82498-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr82498-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr82498-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498 --- Comment #10 from Marc Glisse --- f1...f6 already have a LROTATE_EXPR in the .original dump. The others don't get one until forwprop1, which is after einline, so there is a small chance of inlining causing other optimizations that mess with rotate detection (or the large-ish code before rotate is recognized may prevent early inlining, missing optimizations). I guess without going through the large job of moving the rotate code from forwprop to match.pd it would be possible to add one basic transform to recognize precisely the case in those intrinsics, if we pick one in f7...f11.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #9 from Jakub Jelinek ---
Created attachment 42343
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42343&action=edit
gcc8-pr82498-intrin.patch

Untested patch to fix the intrinsics.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #8 from Jakub Jelinek ---
(In reply to Uroš Bizjak from comment #6)
> (In reply to Jack Lloyd from comment #5)
> > Jakub thank you very much for your comments, this was helpful for me in
> > getting consistent rol/ror generation.
> >
> > Speaking as a user it's frustrating that Clang and GCC don't just have a
> > builtin for rotations like MSVC does; instead you have to guess what
> > expressions the optimizer(s) know about. That said, there are a lot of
> > strange ways to write a rotate, and GCC probably doesn't need to know
> > all of them.
>
> You can use __rol{b,w,d,q} and __ror{b,w,d,q} (and their aliases) from
> ia32intrin.h. These are standardized; you have to include the x86intrin.h
> header.

Well, at least as currently implemented, __ro{l,r}{d,q} require the shift
count to be in the range 1 to bitsize - 1. So I think we should change their
implementation to be like f8 (with & instead of %), which we know generates
optimal code for any shift count.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #7 from Marc Glisse ---
(In reply to Uroš Bizjak from comment #6)
> You can use __rol{b,w,d,q} and __ror{b,w,d,q} (and their aliases) from
> ia32intrin.h. These are standardized; you have to include the x86intrin.h
> header.

Some of those break if you use -fsanitize=undefined.

#include <x86intrin.h>
int main(){
  unsigned i = 0;
  return __rold(i, 0);
}

/usr/lib/gcc-snapshot/lib/gcc/x86_64-linux-gnu/8/include/ia32intrin.h:150:30:
runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #6 from Uroš Bizjak ---
(In reply to Jack Lloyd from comment #5)
> Jakub thank you very much for your comments, this was helpful for me in
> getting consistent rol/ror generation.
>
> Speaking as a user it's frustrating that Clang and GCC don't just have a
> builtin for rotations like MSVC does; instead you have to guess what
> expressions the optimizer(s) know about. That said, there are a lot of
> strange ways to write a rotate, and GCC probably doesn't need to know all
> of them.

You can use __rol{b,w,d,q} and __ror{b,w,d,q} (and their aliases) from
ia32intrin.h. These are standardized; you have to include the x86intrin.h
header.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #5 from Jack Lloyd ---
Jakub, thank you very much for your comments; this was helpful for me in
getting consistent rol/ror generation.

Speaking as a user it's frustrating that Clang and GCC don't just have a
builtin for rotations like MSVC does; instead you have to guess what
expressions the optimizer(s) know about. That said, there are a lot of
strange ways to write a rotate, and GCC probably doesn't need to know all of
them.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #4 from Jakub Jelinek ---
Two further cases:

unsigned
f10 (unsigned x, unsigned char y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y & ((__CHAR_BIT__ * __SIZEOF_INT__) - 1)));
}

unsigned
f11 (unsigned x, unsigned short y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y & ((__CHAR_BIT__ * __SIZEOF_INT__) - 1)));
}

On f11 GCC also generates efficient code; on f10 there is a useless &.  I
guess the f10 case would be improved by the addition of a *3_mask_1
define_insn_and_split (and similarly the inefficient/non-portable f1 code
would be slightly improved).

Looking at LLVM: f1/f3/f5 are worse in LLVM than in GCC, and in all cases it
uses branching instead of cmov; f7/f8/f9/f10/f11 all generate efficient code
though, the same as GCC in the case of f8 and f11.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

--- Comment #3 from Jakub Jelinek ---
Not to mention that the #c0 code has undefined behavior if rot is not 0 but
a multiple of 8 * sizeof(uint32_t), like 32, 64, ...  If you insisted on the
rot == 0 check, it would need to be done after the rot %= ..., and then GCC
wouldn't recognize the and operation to be useless.

So, on a more complete testcase, where f1, f2, f3 have UB on multiples of 32
other than 0, we get the most efficient code on f5 and f8:

unsigned
f1 (unsigned x, unsigned char y)
{
  if (y == 0)
    return x;
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f2 (unsigned x, unsigned y)
{
  if (y == 0)
    return x;
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f3 (unsigned x, unsigned short y)
{
  if (y == 0)
    return x;
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f4 (unsigned x, unsigned char y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  if (y == 0)
    return x;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f5 (unsigned x, unsigned y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  if (y == 0)
    return x;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f6 (unsigned x, unsigned short y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  if (y == 0)
    return x;
  return (x << y) | (x >> (__CHAR_BIT__ * __SIZEOF_INT__ - y));
}

unsigned
f7 (unsigned x, unsigned char y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y % (__CHAR_BIT__ * __SIZEOF_INT__)));
}

unsigned
f8 (unsigned x, unsigned int y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y % (__CHAR_BIT__ * __SIZEOF_INT__)));
}

unsigned
f9 (unsigned x, unsigned short y)
{
  y %= __CHAR_BIT__ * __SIZEOF_INT__;
  return (x << y) | (x >> (-y % (__CHAR_BIT__ * __SIZEOF_INT__)));
}

Here f8 is pattern recognized as x r<< (y & 31), and f5 is pattern
recognized and phiopt optimized into that.  The rest produce inefficient
code: f1/f4/f6 with a useless & and a useless comparison + cmov, f2/f3 just
with a useless comparison/cmov, and f7/f9 aren't even pattern recognized as
rotates.  The real question is how many different portable and non-portable
ways of doing this we really need to pattern recognize; there are many
weirdo ways this can be written.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

Jakub Jelinek changed:

           What    |Removed      |Added
                 CC|             |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek ---
The pattern is written for what people normally do, i.e. use an int/unsigned
int shift count rather than unsigned char.  BTW, in any case, this isn't a
portable rotate that you get efficient code for anyway; you get much better
code if you don't special-case the 0:

#include <stdint.h>
#include <limits.h>

uint32_t
baz (uint32_t x, unsigned y)
{
  y %= CHAR_BIT * sizeof (uint32_t);
  return (x << y) | (x >> (-y % (CHAR_BIT * sizeof (uint32_t))));
}

This is what the compiler will pattern match and generate optimal code for,
but even if it doesn't, the code is still efficient and without branches.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498

Marc Glisse changed:

           What            |Removed      |Added
                     Status|UNCONFIRMED  |NEW
           Last reconfirmed|             |2017-10-10
             Ever confirmed|0            |1

--- Comment #1 from Marc Glisse ---
Looks like https://stackoverflow.com/q/44000956/1918193 .

During combine, we try to match

(set (reg:SI 97)
    (rotate:SI (reg/v:SI 90 [ input ])
        (and:QI (subreg:QI (reg:SI 92 [ rot ]) 0)
            (const_int 31 [0x1f]))))

but the pattern in i386.md has 'and' and 'subreg' reversed.

For the other part, we have a very limited transform that removes the test
in this case:

uint32_t rotate_left(uint32_t input, int rot)
{
  if (rot == 0)
    return input;
  return static_cast<uint32_t>((input << rot) |
                               (input >> (8 * sizeof(uint32_t) - rot)));
}

But it only works when there is a single gimple insn involved, not
and+cast+rotate.