https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124625
Bug ID: 124625
Summary: Missed ubfx + shifted addressing for (x >> N) & MASK
when MASK has trailing zeros
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ptomsich at gcc dot gnu.org
Target Milestone: ---
Description
When the result of (x >> N) & MASK is used as a byte offset into memory and
MASK has trailing zeros (i.e., MASK = C << B), GCC does not decompose the
expression into ((x >> (N+B)) & C) << B. This
prevents the AArch64 backend from using ubfx (bitfield extract) and folding
the shift into the ldr addressing mode.
Reproducer:
long f(long occ, long base, long offset) {
    long idx = (occ >> 54) & 504;  /* 504 = 63 << 3 */
    return *(long *)(base + idx + offset);
}
Current output (GCC 16.0.1, -O2):
f:
        asr     x0, x0, 54
        add     x1, x1, x2
        and     x0, x0, 504
        ldr     x0, [x1, x0]
        ret
4 instructions. The asr + and pair extracts bits 57-62 of occ but leaves them
shifted left by 3 (the field value scaled by 8), preventing the backend from
recognizing a bitfield extract or folding the scale into the load.
Expected output:
f:
        add     x1, x1, x2
        ubfx    x0, x0, 57, 6
        ldr     x0, [x1, x0, lsl 3]
        ret
3 instructions. Decomposing (x >> 54) & 504 into ((x >> 57) & 63) << 3
exposes two opportunities:
1. (x >> 57) & 63 is a 6-bit field extraction at position 57 -- maps to ubfx
x0, x0, 57, 6 (1 instruction instead of asr + and = 2).
2. The << 3 folds into the ldr addressing mode as ldr x0, [x1, x0, lsl 3] (no
extra instruction).
Verification that GCC can generate the optimal code:
Writing the decomposed form by hand confirms the backend is ready:
long f_manual(long occ, long base, long offset) {
    int idx = (occ >> 57) & 63;
    return *(long *)(base + offset + (long)idx * 8);
}
produces:
f_manual:
        add     x1, x1, x2
        ubfx    x0, x0, 57, 6
        ldr     x0, [x1, w0, sxtw 3]
        ret
3 instructions -- the backend already handles the decomposed form correctly.
Root cause:
The combine pass (combine.cc, make_compound_operation_int()) handles (and
(lshiftrt X A) MASK) but only when MASK is 2^N - 1 (consecutive 1's from bit
0):
  if (GET_CODE (XEXP (x, 0)) == LSHIFTRT
      && (i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0)
For MASK = 504 = 0x1f8: exact_log2(505) returns -1, so no decomposition is
attempted. The missing case is masks with trailing zeros where factoring them
out would expose a bitfield extract plus a
shift that the backend can fold into addressing modes or shift-add
instructions.
Affected targets:
This is a target-independent issue in combine. Other targets also benefit
from the decomposition:
- RISC-V (Zba): exposes sh3add (shift-add) opportunities
- x86-64: enables scaled index addressing ([base + idx*8]), though the
instruction count is the same
AArch64 benefits the most because ubfx replaces two instructions and the
shifted ldr addressing mode absorbs the scale.
Real-world occurrence:
This pattern appears in chess engines (deepsjeng's BishopAttacks) where
bitboard lookup tables are indexed by masked shift results. The hot function
contains two instances of the pattern.