[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

Roger Sayle changed:

           What    |Removed     |Added
   ----------------------------------------------------------
                 CC|            |roger at nextmovesoftware dot com
         Resolution|---         |FIXED
             Status|UNCONFIRMED |RESOLVED
   Target Milestone|---         |12.0

--- Comment #13 from Roger Sayle ---
All of the (bswap-related) optimizations mentioned in
PR tree-optimization/40210 are now implemented in mainline GCC.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #12 from CVS Commits ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:4c619132b3f14dc5e672a7f2f0e09cb784193559

commit r12-2137-g4c619132b3f14dc5e672a7f2f0e09cb784193559
Author: Roger Sayle
Date:   Thu Jul 8 11:46:14 2021 +0100

    PR tree-optimization/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in match.pd

    All of the optimizations/transformations mentioned in bugzilla for
    PR tree-optimization/40210 are already implemented in mainline GCC,
    with one exception.  In comment #5, there's a suggestion that
    (bswap64(x)>>56)&0xff can be implemented without the bswap as
    (unsigned char)x, or equivalently x&0xff.

    This patch implements the above optimization, and closely related
    variants.  For any single bit, (bswap(X)>>C1)&1 can be simplified
    to (X>>C2)&1, where bit position C2 is the appropriate permutation
    of C1.  Similarly, the bswap can be eliminated if the desired set
    of bits all lie within the same byte, hence (bswap(x)>>8)&255 can
    always be optimized, as can (bswap(x)>>8)&123.

    Previously,

    int foo(long long x) {
      return (__builtin_bswap64(x) >> 56) & 0xff;
    }

    compiled with -O2 to

    foo:    movq    %rdi, %rax
            bswap   %rax
            shrq    $56, %rax
            ret

    with this patch, it now compiles to

    foo:    movzbl  %dil, %eax
            ret

    2021-07-08  Roger Sayle
                Richard Biener

    gcc/ChangeLog
            PR tree-optimization/40210
            * match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2
            as (x>>C3)&C2 when possible.  Simplify bswap(x)>>C1 as
            ((T)x)>>C2 when possible.  Simplify bswap(x)&C1 as (x>>C2)&C1
            when 0<=C1<=255.

    gcc/testsuite/ChangeLog
            PR tree-optimization/40210
            * gcc.dg/builtin-bswap-13.c: New test.
            * gcc.dg/builtin-bswap-14.c: New test.
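The byte-selection identity this fold relies on can be checked in plain C. Below is a minimal sketch with a portable stand-in for __builtin_bswap64; the helper names are ours for illustration, not from the patch:

```c
#include <stdint.h>

/* Portable model of __builtin_bswap64: reverse the byte order. */
static uint64_t bswap64(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r = (r << 8) | ((x >> (8 * i)) & 0xff);
    return r;
}

/* The fold: when the selected bits all lie in one byte, a shift of the
   unswapped value reaches the same byte.  Byte k of bswap64(x) is byte
   7-k of x, so (bswap64(x) >> 56) & 0xff is simply x & 0xff. */
static uint64_t high_byte_via_bswap(uint64_t x)
{
    return (bswap64(x) >> 56) & 0xff;   /* what the testcase writes */
}

static uint64_t high_byte_direct(uint64_t x)
{
    return x & 0xff;                    /* what GCC 12 now emits    */
}
```

On x86-64 the second form is a single movzbl, which is exactly the code shown in the commit message.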
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #11 from Pawel Sikora 2011-10-07 18:45:57 UTC ---
(In reply to comment #10)
> (In reply to comment #9)
> > Created attachment 25442 [details]
> > testcase
>
> I think those are hard to optimize really since those are inline-asm
> really.  And the unsigned short one gets optimized correctly.

ahhh, glibc uses a generic i386 implementation (ror+xchg) while it has
an optimized one for i486+ :(
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #10 from Andrew Pinski 2011-10-07 18:29:07 UTC ---
(In reply to comment #9)
> Created attachment 25442 [details]
> testcase

I think those are hard to optimize since those are really inline asm.
And the unsigned short one gets optimized correctly.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #9 from Pawel Sikora 2011-10-07 18:22:25 UTC ---
Created attachment 25442
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25442
testcase
--- Comment #8 from hp at gcc dot gnu dot org  2009-06-10 17:36 ---
(In reply to comment #7)
> I do think that bit operations in general could be handled a lot
> better, and that would help out a whole lot of code.

If you add (compilable) test-cases with enhancement requests, carefully
noting target and gcc version where you observed the lack of
optimization, this would help a lot.

> Once that framework was in place optimizing the byteswap builtin would
> be trivial.

We already have a "framework".

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
--- Comment #7 from eric-bugs at omnifarious dot org  2009-05-20 20:22 ---
I've been playing around a bit more, and I've noticed that gcc in
general does not do a spectacular job of optimizing bitwise operations
of any kind.  Some kind of general framework for tracking the movements
of individual bits, and details like "16-bit values only have 16 bits,
so using & to ensure this in various ways is a null operation", might
actually do a lot to speed up a lot of code.

I distinctly remember a time long past when a co-worker and I fiddled
some complex bit operations this way and that to get the assembly out
we knew was close to optimal for a tight inner loop.  The resulting
expression was significantly less clear than the most obvious way of
stating the same thing, and I also knew that if DEC changed their
compiler in certain ways we'd have to do it all over again.

As an example, there is no reason that:

  (x << 8) | (x >> 8)

should result in better code than:

  ((x & 0xffu) << 8) | ((x & 0xff00u) >> 8)

when x is of type uint16_t, but it does.  And recognizing that either
can be done in one instruction on an x86 would be even better.

So, while I think you are likely correct that the byteswap builtins do
not need a lot of extensive optimization, I do think that bit operations
in general could be handled a lot better, and that would help out a
whole lot of code.  Once that framework was in place, optimizing the
byteswap builtin would be trivial.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
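The 16-bit equivalence claimed in comment #7 is easy to verify directly. A minimal sketch (the helper names are hypothetical, chosen for illustration):

```c
#include <stdint.h>

/* For a uint16_t the masks are redundant: the value already has only
   16 bits, so x >> 8 cannot bring in stray bits and x & 0xff00u keeps
   exactly the byte that the shift will move.  Both functions compute
   the same byte swap; neither should compile to worse code, and on x86
   either can become a single rotate (rolw $8). */
static uint16_t swap16_unmasked(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

static uint16_t swap16_masked(uint16_t x)
{
    return (uint16_t)(((x & 0xffu) << 8) | ((x & 0xff00u) >> 8));
}
```

The cast back to uint16_t matters in C: both operands are promoted to int before the shifts, and the cast discards the bits above bit 15 that the unmasked left shift produces.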
--- Comment #6 from jakub at gcc dot gnu dot org  2009-05-20 20:05 ---
There are plenty of other possible builtin bswap optimizations.  E.g.

extern void bar (void);

void foo (int x)
{
  if (__builtin_bswap32 (x) == __builtin_bswap32 (0x1234567))
    bar ();
}

should be optimized into if (x == 0x1234567) (only for EQ_EXPR/NE_EXPR).
Similarly, __builtin_bswap32 (x) == 0x1234567 should be optimized into
x == __builtin_bswap32 (0x1234567), because the latter can be swapped at
compile time.  Similarly, __builtin_bswap32 (__builtin_bswap32 (x) | 0x1234)
could be optimized into x | __builtin_bswap32 (0x1234) (similarly for ^,
& or ~), etc.

The question is whether enough projects start using these builtins to
make it worth spending compile time on it, and what exact optimizations
are useful in real-world code.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
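The identities behind comment #6 follow from bswap being a bijection that distributes over bitwise operations. A minimal sketch checking them with a portable stand-in for __builtin_bswap32 (the helper name is ours):

```c
#include <stdint.h>

/* Portable model of __builtin_bswap32. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0xff00u)
         | ((x << 8) & 0xff0000u) | (x << 24);
}

/* Since bswap32 is a bijection (in fact its own inverse):
 *   bswap32(x) == bswap32(C)   <=>   x == C
 *   bswap32(x) == C            <=>   x == bswap32(C)
 * and since it merely permutes bytes, it distributes over |, ^, & and ~:
 *   bswap32(bswap32(x) | C) == x | bswap32(C)
 * All right-hand sides need no runtime bswap when C is a constant. */
```

Note these equivalences are exact only for equality and inequality (EQ_EXPR/NE_EXPR in GCC's terms); ordered comparisons like < are not preserved by a byte permutation.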
--- Comment #5 from eric-bugs at omnifarious dot org  2009-05-20 19:39 ---
This code:

#include <stdint.h>
#include <stddef.h>

inline uint64_t byteswap_64(const uint64_t x) { return __builtin_bswap64(x); }
inline uint32_t byteswap_32(const uint32_t x) { return __builtin_bswap32(x); }

extern void random_function(uint32_t a, uint64_t b, uint32_t c, uint64_t d);

void swapping(const uint32_t x32, const uint64_t x64)
{
   random_function(byteswap_32(x32), byteswap_64(x64),
                   byteswap_32(byteswap_32(x32)),
                   byteswap_64(byteswap_64(x64)));
}

void swaparray(uint64_t outvals[], char outtop[],
               const uint64_t invals[], const size_t size)
{
   size_t i = 0;
   for (i = 0; i < size; ++i) {
      outvals[i] = byteswap_64(invals[i]);
      outtop[i] = (byteswap_64(invals[i]) >> 56) & 0xffull;
   }
}

results in this assembly:

        .globl  swaparray
        .type   swaparray, @function
swaparray:
.LFB5:
        testq   %rcx, %rcx
        je      .L8
        xorl    %r8d, %r8d
        .p2align 4,,7
        .p2align 3
.L7:
        movq    (%rdx,%r8,8), %rax
        bswap   %rax
        movq    %rax, (%rdi,%r8,8)
        movq    (%rdx,%r8,8), %rax
        bswap   %rax
        shrq    $56, %rax
        movb    %al, (%rsi,%r8)
        incq    %r8
        cmpq    %r8, %rcx
        ja      .L7
.L8:
        rep ret
.LFE5:
        .size   swaparray, .-swaparray
        .p2align 4,,15
        .globl  swapping
        .type   swapping, @function
swapping:
.LFB4:
        bswap   %rsi
        bswap   %edi
        movq    %rsi, %rcx
        movl    %edi, %edx
        bswap   %rcx
        bswap   %edx
        jmp     random_function
.LFE4:
        .size   swapping, .-swapping

when compiled with gcc -O3 -mtune=native -march=native on an Opteron
system.

Notice that in swapping, each argument is bswapped twice rather than
bswapped once with the result copied by two move instructions.  The
optimizer is apparently unaware that bswap is its own inverse.

In swaparray, the bswap operation is not subject to an obvious CSE
optimization, nor is it realized that the latter line might be more
efficiently implemented by movb %al, (%rsi,%r8) before the bswap
operation.

--
eric-bugs at omnifarious dot org changed:

           What    |Removed                     |Added
   ----------------------------------------------------------
            Summary|gcc needs byte swap builtins|gcc byte swap builtins
                   |                            |inadequately optimized

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210
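The missed optimization in swapping rests on bswap being an involution: swapping twice restores the input, so byteswap_64(byteswap_64(x64)) is just x64 and the second pair of bswap instructions is dead. A minimal sketch with a portable model of the builtin (helper names are ours):

```c
#include <stdint.h>

/* Portable model of __builtin_bswap64: reverse the byte order. */
static uint64_t bswap64(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r = (r << 8) | ((x >> (8 * i)) & 0xff);
    return r;
}

/* Applying a byte reversal twice is the identity, so an optimizer
   that knows this could replace the whole call with its argument. */
static uint64_t double_swap(uint64_t x)
{
    return bswap64(bswap64(x));
}
```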