https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106594
--- Comment #8 from Roger Sayle <roger at nextmovesoftware dot com> ---
Time for a status update.  The PR title is a little misleading; sign-extensions aren't really the problem, but it turns out that zero-extensions aren't always optimized as well as the corresponding sign-extensions (due to alternate possible RTL representations), even on symmetric targets like aarch64.  The following patch solves the underlying issue reported in the PR:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index a5fabf3..d739cae 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7288,7 +7288,17 @@ expand_compound_operation (rtx x)
 	  && (STORE_FLAG_VALUE & ~GET_MODE_MASK (inner_mode)) == 0)
 	return SUBREG_REG (XEXP (x, 0));
 
+      /* If ZERO_EXTEND is cheap on this target, do nothing,
+	 i.e. don't attempt to convert it to a pair of shifts.  */
+      if (set_src_cost (x, mode, optimize_this_for_speed_p)
+	  <= COSTS_N_INSNS (1))
+	return x;
     }
+  /* Likewise, if SIGN_EXTEND is cheap, do nothing.  */
+  else if (GET_CODE (x) == SIGN_EXTEND
+	   && set_src_cost (x, mode, optimize_this_for_speed_p)
+	      <= COSTS_N_INSNS (1))
+    return x;
 
   /* If we reach here, we want to return a pair of shifts.  The inner
      shift is a left shift of BITSIZE - POS - LEN bits.  The outer

The explanation/motivation for this fix is in the code immediately above in expand_compound_operation, which considers converting SIGN_EXTEND into ZERO_EXTEND, but checks set_src_cost to confirm that it'd be cheaper than the pair of shifts mentioned at the bottom.  When SIGN_EXTEND is converted to ZERO_EXTEND earlier (for example during the tree-ssa passes), the new ZERO_EXTEND now passes through the above logic and always gets converted to a pair of shifts, regardless of the cost, triggering the regression.

Thanks to Tamar for confirming that the above change results in a significant performance benefit (sometimes ~5%) on aarch64.

The complication is that the above change, which preserves ZERO_EXTEND/SIGN_EXTEND as GCC's preferred canonical RTL when these operations are cheap, exposes a number of latent bugs and missed optimization opportunities (on at least x86 and aarch64).  I'm currently bootstrapping and regression testing a number of fixes to prevent these and to minimize disruption should the above change be approved.  Hence we're making progress, but a clean "fix" isn't ready just yet.

p.s. HP is right that REG_EQUAL notes may help (when x & 0xff is cheaper than x & 0x23): if we're really doing an AND, then including nonzero_bits is free, but if this is a zero_extension, the AND is implicit and will only be recognized as an addressing mode if the precision mask isn't modified by nonzero_bits.
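
To illustrate the trade-off the patch is guarding, here is a minimal C sketch (hypothetical helper names, not from the PR's testcase, and assuming a 32-bit unsigned int): both functions compute the same low-byte zero-extension, one via the mask/implicit-AND form that a cheap zero_extend corresponds to, the other via the left/right shift pair that expand_compound_operation currently falls back to regardless of cost.

/* Hypothetical sketch, not from the PR: both functions perform the same
   low-byte zero-extension (assuming 32-bit unsigned int).  combine's
   expand_compound_operation chooses between the RTL equivalents of these
   two forms; the patch above makes that choice respect set_src_cost.  */
unsigned int
low_byte_via_mask (unsigned int x)
{
  return x & 0xff;		/* zero_extend:QI->SI, implicit AND */
}

unsigned int
low_byte_via_shifts (unsigned int x)
{
  return (x << 24) >> 24;	/* (lshiftrt (ashift x 24) 24) */
}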