> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]> wrote:
>
> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
>>
>> Ping.
>>
>> I split the files from the previous mail so it's hopefully easier to review.
>
> I can review this but the approval won't be until stage1. This
> pass is too risky at this point of the release cycle.
Thanks for any feedback you can give. FWIW we've been testing this internally for a few months without any issues.

> Though I also wonder how much of this can/should be done on the gimple
> level in a generic way.

GIMPLE does have powerful ranger infrastructure for this, but I was concerned about doing this earlier because it's very likely that some later pass could introduce extra extend operations, which would likely undo the benefit of the narrowing.

Thanks,
Kyrill

> And if there is a way to get the zero-bits from the gimple level down
> to the RTL level still so we don't need to keep on recomputing them
> (this is useful for other passes too).
>
> Thanks,
> Andrew Pinski
>
>>
>> Also CC'ing Alex Coplan to this thread.
>>
>> Thanks,
>> Soumya
>>
>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
>>>
>>> Hi Tamar,
>>>
>>> Attaching an updated version of this patch that enables the pass at O2
>>> and above on aarch64; it can be disabled with -mno-narrow-gp-writes.
>>>
>>> Enabling it by default at O2 touched quite a large number of tests,
>>> which I have updated in this patch.
>>>
>>> Most of the updates are straightforward, involving changing x registers
>>> to (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
>>>
>>> There are some tests (e.g. aarch64/int_mov_immediate_1.c) where the
>>> representation of the immediate changes:
>>>
>>> mov w0, 4294927974 -> mov w0, -39322
>>>
>>> This is because the following RTL is narrowed to SImode:
>>> (set (reg/i:DI 0 x0)
>>>      (const_int 4294927974 [0xffff6666]))
>>>
>>> The MSB becomes bit 31, which is set, so the immediate is printed as
>>> signed.
>>>
>>> Thanks,
>>> Soumya
>>>
>>>
>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
>>>>
>>>> Ping.
>>>>
>>>> Thanks,
>>>> Soumya
>>>>
>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
>>>>>
>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
>>>>>
>>>>> This patch adds a new AArch64 RTL pass that narrows 64-bit
>>>>> general-purpose register operations to 32-bit W-register operations
>>>>> when the upper 32 bits of the register are known to be zero.
>>>>>
>>>>> This is beneficial for the Olympus core, which prefers 32-bit
>>>>> W-registers over 64-bit X-registers where possible. This is
>>>>> recommended by the updated Olympus Software Optimization Guide, which
>>>>> will be published soon.
>>>>>
>>>>> This pass can be controlled with -mnarrow-gp-writes and is active at
>>>>> -O2 and above, but it is not enabled by default, except for
>>>>> -mcpu=olympus.
>>>>>
>>>>> ---
>>>>>
>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W
>>>>> register that maps to its lower half. When we can guarantee that the
>>>>> upper 32 bits are never used, we can safely narrow operations to use
>>>>> W registers instead.
>>>>>
>>>>> For example, this code:
>>>>> uint64_t foo(uint64_t a) {
>>>>>   return (a & 255) + 3;
>>>>> }
>>>>>
>>>>> currently compiles to:
>>>>> and x8, x0, #0xff
>>>>> add x0, x8, #3
>>>>>
>>>>> but with this pass enabled, it optimizes to:
>>>>> and x8, x0, #0xff
>>>>> add w0, w8, #3 // Using W register instead of X
>>>>>
>>>>> ---
>>>>>
>>>>> The pass operates in two phases:
>>>>>
>>>>> 1) Analysis phase:
>>>>>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>>>>>    - Computes nonzero-bit masks for each register definition
>>>>>    - Recursively processes PHI nodes
>>>>>    - Identifies candidates for narrowing
>>>>> 2) Transformation phase:
>>>>>    - Applies narrowing to validated candidates
>>>>>    - Converts DImode operations to SImode where safe
>>>>>
>>>>> The pass runs late in the RTL pipeline, after register allocation, to
>>>>> ensure stable def-use chains and avoid interfering with earlier
>>>>> optimizations.
>>>>>
>>>>> ---
>>>>>
>>>>> nonzero_bits (src, DImode) is a function defined in rtlanal.cc that
>>>>> recursively analyzes RTL expressions to compute a bitmask of the bits
>>>>> that may be nonzero. However, nonzero_bits has a limitation: when it
>>>>> encounters a register, it conservatively returns the mode mask (all
>>>>> bits potentially set). Since this pass analyzes all defs in an
>>>>> instruction, that information can be used to refine the mask. The
>>>>> pass maintains a hash map of computed bit masks and installs a custom
>>>>> RTL hooks callback to consult this map when a register is
>>>>> encountered.
>>>>>
>>>>> ---
>>>>>
>>>>> PHI nodes require special handling to merge masks from all inputs.
>>>>> This is done by combine_mask_from_phi, which handles three cases:
>>>>> 1. The input edge has a definition: this is the simplest case. For
>>>>>    each input edge to the PHI, the def information is retrieved and
>>>>>    its mask is looked up.
>>>>> 2. The input edge has no definition: a conservative mask is assumed
>>>>>    for that input.
>>>>> 3.
>>>>> The input edge is a PHI: recursively call combine_mask_from_phi to
>>>>>    merge the masks of all incoming values.
>>>>>
>>>>> ---
>>>>>
>>>>> When processing regular instructions, the pass first handles SET and
>>>>> PARALLEL patterns with compare instructions.
>>>>>
>>>>> Single SET instructions:
>>>>>
>>>>> If the upper 32 bits of the source are known to be zero, the
>>>>> instruction qualifies for narrowing. Instead of just using
>>>>> lowpart_subreg for the source, we define narrow_dimode_src to attempt
>>>>> further optimizations:
>>>>>
>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via
>>>>>   simplify_gen_binary
>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
>>>>>
>>>>> PARALLEL instructions (compare + SET):
>>>>>
>>>>> The pass handles flag-setting operations (ADDS, SUBS, ANDS, etc.)
>>>>> where the SET source equals the first operand of the COMPARE.
>>>>> Depending on the condition-code mode of the compare, the pass
>>>>> requires different bits to be zero:
>>>>>
>>>>> - CC_Zmode/CC_NZmode: upper 32 bits
>>>>> - CC_NZVmode: upper 32 bits and bit 31 (for overflow)
>>>>>
>>>>> If the instruction does not match the above patterns (or matches but
>>>>> cannot be optimized), the pass still analyzes all its definitions,
>>>>> ensuring every definition has an entry in nzero_map.
>>>>>
>>>>> ---
>>>>>
>>>>> When transforming the qualified instructions, the pass uses
>>>>> rtl_ssa::recog and rtl_ssa::change_is_worthwhile to verify the new
>>>>> pattern and determine whether the transformation is worthwhile.
>>>>>
>>>>> ---
>>>>>
>>>>> As an additional benefit, testing on Neoverse-V2 shows that instances
>>>>> of 'and x1, x2, #0xffffffff' are converted to zero-latency
>>>>> 'mov w1, w2' instructions after this pass narrows them.
>>>>>
>>>>> ---
>>>>>
>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu with no
>>>>> regressions. OK for mainline?
>>>>>
>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
>>>>> Signed-off-by: Soumya AR <[email protected]>
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>> * config.gcc: Add aarch64-narrow-gp-writes.o.
>>>>> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
>>>>> pass_narrow_gp_writes before pass_cleanup_barriers.
>>>>> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
>>>>> * config/aarch64/tuning_models/olympus.h:
>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
>>>>> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): Declare.
>>>>> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
>>>>> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
>>>>> * doc/invoke.texi: Document -mnarrow-gp-writes.
>>>>> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
>>>>> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
>>>>> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
>>>>> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
>>>>> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
>>>>> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
>>>>> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
>>>>>
>>>>>
>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
>>>>
>>>
