[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2021-07-10 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

Roger Sayle  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot com
 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
   Target Milestone|--- |12.0

--- Comment #13 from Roger Sayle  ---
All of the (bswap-related) optimizations mentioned in PR
tree-optimization/40210 are now implemented in mainline GCC.

[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2021-07-08 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #12 from CVS Commits  ---
The master branch has been updated by Roger Sayle :

https://gcc.gnu.org/g:4c619132b3f14dc5e672a7f2f0e09cb784193559

commit r12-2137-g4c619132b3f14dc5e672a7f2f0e09cb784193559
Author: Roger Sayle 
Date:   Thu Jul 8 11:46:14 2021 +0100

PR tree-optimization/40210: Fold (bswap(X)>>C1)&C2 to (X>>C3)&C2 in
match.pd

All of the optimizations/transformations mentioned in bugzilla for
PR tree-optimization/40210 are already implemented in mainline GCC,
with one exception.  In comment #5, there's a suggestion that
(bswap64(x)>>56)&0xff can be implemented without the bswap as
(unsigned char)x, or equivalently x&0xff.

This patch implements the above optimization, and closely related
variants.  For any single bit, (bswap(X)>>C1)&1 can be simplified
to (X>>C2)&1, where bit position C2 is the appropriate permutation
of C1.  Similarly, the bswap can be eliminated if the desired set of
bits all lie within the same byte, hence (bswap(x)>>8)&255 can
always be optimized, as can (bswap(x)>>8)&123.

Previously,
int foo(long long x) {
  return (__builtin_bswap64(x) >> 56) & 0xff;
}

compiled with -O2 to
foo:    movq    %rdi, %rax
        bswap   %rax
        shrq    $56, %rax
        ret

with this patch, it now compiles to
foo:    movzbl  %dil, %eax
        ret

2021-07-08  Roger Sayle  
Richard Biener  

gcc/ChangeLog
PR tree-optimization/40210
* match.pd (bswap optimizations): Simplify (bswap(x)>>C1)&C2 as
(x>>C3)&C2 when possible.  Simplify bswap(x)>>C1 as ((T)x)>>C2
when possible.  Simplify bswap(x)&C1 as (x>>C2)&C1 when 0<=C1<=255.

gcc/testsuite/ChangeLog
PR tree-optimization/40210
* gcc.dg/builtin-bswap-13.c: New test.
* gcc.dg/builtin-bswap-14.c: New test.

[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2011-10-07 Thread pluto at agmk dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #11 from Pawel Sikora  2011-10-07 18:45:57 UTC ---
(In reply to comment #10)
> (In reply to comment #9)
> > Created attachment 25442 [details]
> > testcase
> 
> I think those are hard to optimize since they are really inline-asm.
> And the unsigned short one gets optimized correctly.

ahhh,
glibc uses a generic i386 implementation (ror+xchg) in its byteswap header
while it has an optimized version for i486+ :(


[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2011-10-07 Thread pinskia at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #10 from Andrew Pinski  2011-10-07 18:29:07 UTC ---
(In reply to comment #9)
> Created attachment 25442 [details]
> testcase

I think those are hard to optimize since they are really inline-asm.
And the unsigned short one gets optimized correctly.


[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2011-10-07 Thread pluto at agmk dot net
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210

--- Comment #9 from Pawel Sikora  2011-10-07 18:22:25 UTC ---
Created attachment 25442
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25442
testcase


[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2009-06-10 Thread hp at gcc dot gnu dot org


--- Comment #8 from hp at gcc dot gnu dot org  2009-06-10 17:36 ---
(In reply to comment #7)
> I do think that bit operations in general
> could be handled a lot better, and that would help out a whole lot of code. 

If you add (compilable) test-cases with enhancement requests, carefully noting
target and gcc version where you observed the lack of optimization, this would
help a lot.

> Once that framework was in place optimizing the byteswap builtin would be
> trivial.

We already have a "framework".


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210



[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2009-05-20 Thread eric-bugs at omnifarious dot org


--- Comment #7 from eric-bugs at omnifarious dot org  2009-05-20 20:22 ---
I've been playing around a bit more, and I've noticed that gcc in general does
not do a spectacular job of optimizing bitwise operations of any kind.

Some kind of general framework for tracking the movements of individual bits,
together with facts like "16-bit values only have 16 bits, so using & to
enforce this in various ways is a no-op", might actually do a lot to speed up
a lot of code.

I distinctly remember a time long past when I and a co-worker fiddled some
complex bit operations this way and that to get the assembly out we knew was
close to optimal for a tight inner loop.  The resulting expression was
significantly less clear than the most obvious way of stating the same thing
and I also knew that if DEC changed their compiler in certain ways we'd have to
do it all over again.

As an example, there is no reason that (x << 8) | (x >> 8) should result in
better code than ((x & 0xffu) << 8) | ((x & 0xff00u) >> 8) when x is of type
uint16_t, but it does.  And recognizing that either can be done in one
instruction on an x86 would be even better.

So, while I think you are likely correct that the byteswap builtins do not need
a lot of extensive optimization, I do think that bit operations in general
could be handled a lot better, and that would help out a whole lot of code. 
Once that framework was in place optimizing the byteswap builtin would be
trivial.



[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2009-05-20 Thread jakub at gcc dot gnu dot org


--- Comment #6 from jakub at gcc dot gnu dot org  2009-05-20 20:05 ---
There are plenty other possible builtin bswap optimizations.  E.g.
extern void bar (void);

void foo (int x)
{
  if (__builtin_bswap32 (x) == __builtin_bswap32 (0x1234567))
    bar ();
}

should be optimized into if (x == 0x1234567) (only for EQ_EXPR/NE_EXPR),
similarly __builtin_bswap32 (x) == 0x1234567 should be optimized into
x == __builtin_bswap32 (0x1234567) because the latter can be swapped at compile
time, similarly __builtin_bswap32 (__builtin_bswap32 (x) | 0x1234) could
be optimized into x | __builtin_bswap32 (0x1234) (similarly for ^ or & or ~),
etc.  The question is if enough projects start using these builtins to make it
worth spending compile time on it and what exact optimizations are useful on
real-world code.



[Bug tree-optimization/40210] gcc byte swap builtins inadequately optimized

2009-05-20 Thread eric-bugs at omnifarious dot org


--- Comment #5 from eric-bugs at omnifarious dot org  2009-05-20 19:39 ---
This code:

#include <stdint.h>
#include <stddef.h>

inline uint64_t byteswap_64(const uint64_t x)
{
   return __builtin_bswap64(x);
}

inline uint32_t byteswap_32(const uint32_t x)
{
   return __builtin_bswap32(x);
}

extern void random_function(uint32_t a, uint64_t b, uint32_t c, uint64_t d);

void swapping(const uint32_t x32, const uint64_t x64)
{
   random_function(byteswap_32(x32), byteswap_64(x64),
                   byteswap_32(byteswap_32(x32)), byteswap_64(byteswap_64(x64)));
}

void swaparray(uint64_t outvals[], char outtop[], const uint64_t invals[],
               const size_t size)
{
   size_t i = 0;
   for (i = 0; i < size; ++i) {
  outvals[i] = byteswap_64(invals[i]);
  outtop[i] = (byteswap_64(invals[i]) >> 56) & 0xffull;
   }
}

results in this assembly:

        .globl  swaparray
        .type   swaparray, @function
swaparray:
.LFB5:
        testq   %rcx, %rcx
        je      .L8
        xorl    %r8d, %r8d
        .p2align 4,,7
        .p2align 3
.L7:
        movq    (%rdx,%r8,8), %rax
        bswap   %rax
        movq    %rax, (%rdi,%r8,8)
        movq    (%rdx,%r8,8), %rax
        bswap   %rax
        shrq    $56, %rax
        movb    %al, (%rsi,%r8)
        incq    %r8
        cmpq    %r8, %rcx
        ja      .L7
.L8:
        rep
        ret
.LFE5:
        .size   swaparray, .-swaparray
        .p2align 4,,15
        .globl  swapping
        .type   swapping, @function
swapping:
.LFB4:
        bswap   %rsi
        bswap   %edi
        movq    %rsi, %rcx
        movl    %edi, %edx
        bswap   %rcx
        bswap   %edx
        jmp     random_function
.LFE4:
        .size   swapping, .-swapping

when compiled with gcc -O3 -mtune=native -march=native on an Opteron system.

Notice that in swapping, bswap is used four times rather than just two move
instructions and two bswap instructions.  The optimizer is apparently unaware
that bswap is its own inverse.

In swaparray the bswap operation is not subject to an obvious CSE optimization,
nor is it realized that the latter line might be more efficiently implemented
by movb   %al, (%rsi,%r8) before the bswap operation.


-- 

eric-bugs at omnifarious dot org changed:

   What|Removed |Added

Summary|gcc needs byte swap builtins|gcc byte swap builtins
   ||inadequately optimized


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40210