https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114338
Bug ID: 114338
Summary: (x & (-1 << y)) should be optimized to ((x >> y) << y) or vice versa
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: Explorer09 at gmail dot com
Target Milestone: ---

### Test code

```c
#include <stdint.h>

unsigned int func1(unsigned int x, unsigned char count) {
    return (x >> count) << count;
}

unsigned int func2(unsigned int x, unsigned char count) {
    return x & (~0U << count);
}

uint16_t func3(uint16_t x, unsigned char count) {
    return (x >> count) << count;
}

uint16_t func4(uint16_t x, unsigned char count) {
    return x & (0xFFFF << count);
}
```

### Expected result

func1 and func2 should compile to identical code. The compiler should pick whichever pattern produces the smallest or fastest code for the target processor.

func3 and func4 should also compile to identical code. If the ABI doesn't require the upper bits of the register to be zeroed, the func3 and func4 code could be as small as that of func1 and func2.

### Current result (gcc)

With x86-64 gcc 13.2.0 and the "-Os" option, func2 generates code that is one byte larger than func1. func3 generates a MOVZX instruction (one byte larger than func1) that ideally could be replaced with a MOV. For func4, you can put the test code into Compiler Explorer (godbolt.org) and see the result yourself.

(This is a real case I was facing. I only need to work with a 16-bit input value, and I don't want to internally promote to uint32_t just for optimization purposes.)

### Current result (clang)

With clang 18.1.0 and the "-Os" option (tested in Compiler Explorer (godbolt.org)), func1, func2 and func3 produce identical code on x86-64. It seems that clang does recognize the pattern and optimizes accordingly: for the x86-64 target, ((x >> count) << count) is used; for the ARM and RISC-V targets, ((-1 << count) & x) is used. However, clang doesn't recognize func4 as identical to func3, so there is still room for improvement.