> On 20 Jan 2026, at 10:06, Kyrylo Tkachov <[email protected]> wrote:
>
>> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]> wrote:
>>
>> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
>>>
>>> Ping.
>>>
>>> I split the files from the previous mail so it's hopefully easier to review.
>>
>> I can review this but the approval won't be until stage 1. This
>> pass is too risky at this point of the release cycle.
>
> Thanks for any feedback you can give. FWIW we’ve been testing this internally
> for a few months without any issues.
One option to reduce the risk, which Soumya’s initial patch implemented, was to
enable this only for -mcpu=olympus. We initially developed and tested it on that
target. That way it wouldn’t affect most aarch64 targets and we’d still have the
-mno-* option to disable it as a workaround for users if it causes trouble.
Would that be okay with you?
Thanks,
Kyrill

>
>> Though I also wonder how much of this can/should be done on the gimple
>> level in a generic way.
>
> GIMPLE does have powerful ranger infrastructure for this, but I was concerned
> about doing this earlier because it’s very likely that some later pass could
> introduce extra extend operations, which would likely undo the benefit of the
> narrowing.
>
> Thanks,
> Kyrill
>
>> And if there is a way to get the zero-bits from the gimple level down
>> to the RTL level still so we don't need to keep on recomputing them
>> (this is useful for other passes too).
>>
>> Thanks,
>> Andrew Pinski
>>
>>> Also CC'ing Alex Coplan to this thread.
>>>
>>> Thanks,
>>> Soumya
>>>
>>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
>>>>
>>>> Hi Tamar,
>>>>
>>>> Attaching an updated version of this patch that enables the pass at O2 and
>>>> above on aarch64; it can optionally be disabled with -mno-narrow-gp-writes.
>>>>
>>>> Enabling it by default at O2 touched quite a large number of tests, which I
>>>> have updated in this patch.
>>>>
>>>> Most of the updates are straightforward and involve changing x registers to
>>>> (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
>>>>
>>>> There are some tests (e.g., aarch64/int_mov_immediate_1.c) where the
>>>> representation of the immediate changes:
>>>>
>>>> mov w0, 4294927974 -> mov w0, -39322
>>>>
>>>> This is because when the following RTL is narrowed to SI:
>>>> (set (reg/i:DI 0 x0)
>>>>      (const_int 4294927974 [0xffff6666]))
>>>>
>>>> the most significant bit becomes bit 31, which is set, so the immediate is
>>>> printed as a signed value.
>>>>
>>>> Thanks,
>>>> Soumya
>>>>
>>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
>>>>>
>>>>> Ping.
>>>>>
>>>>> Thanks,
>>>>> Soumya
>>>>>
>>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
>>>>>>
>>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
>>>>>>
>>>>>> This patch adds a new AArch64 RTL pass that optimizes 64-bit
>>>>>> general-purpose register operations to use 32-bit W-registers when the
>>>>>> upper 32 bits of the register are known to be zero.
>>>>>>
>>>>>> This is beneficial for the Olympus core, which prefers 32-bit
>>>>>> W-registers over 64-bit X-registers where possible, as recommended by
>>>>>> the updated Olympus Software Optimization Guide, which will be
>>>>>> published soon.
>>>>>>
>>>>>> The pass can be controlled with -mnarrow-gp-writes and is active at -O2
>>>>>> and above, but it is not enabled by default except for -mcpu=olympus.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W register
>>>>>> that maps to its lower half. When we can guarantee that the upper 32
>>>>>> bits are never used, we can safely narrow operations to use W registers
>>>>>> instead.
>>>>>>
>>>>>> For example, this code:
>>>>>>
>>>>>>   uint64_t foo(uint64_t a) {
>>>>>>     return (a & 255) + 3;
>>>>>>   }
>>>>>>
>>>>>> currently compiles to:
>>>>>>
>>>>>>   and x8, x0, #0xff
>>>>>>   add x0, x8, #3
>>>>>>
>>>>>> but with this pass enabled, it optimizes to:
>>>>>>
>>>>>>   and x8, x0, #0xff
>>>>>>   add w0, w8, #3   // Using W register instead of X
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> The pass operates in two phases:
>>>>>>
>>>>>> 1) Analysis phase:
>>>>>>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>>>>>>    - Computes nonzero-bit masks for each register definition
>>>>>>    - Recursively processes PHI nodes
>>>>>>    - Identifies candidates for narrowing
>>>>>> 2) Transformation phase:
>>>>>>    - Applies narrowing to validated candidates
>>>>>>    - Converts DImode operations to SImode where safe
>>>>>>
>>>>>> The pass runs late in the RTL pipeline, after register allocation, to
>>>>>> ensure stable def-use chains and avoid interfering with earlier
>>>>>> optimizations.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> nonzero_bits(src, DImode) is a function defined in rtlanal.cc that
>>>>>> recursively analyzes RTL expressions to compute a bitmask. However,
>>>>>> nonzero_bits has a limitation: when it encounters a register, it
>>>>>> conservatively returns the mode mask (all bits potentially set). Since
>>>>>> this pass analyzes all defs in an instruction, that information can be
>>>>>> used to refine the mask. The pass maintains a hash map of computed bit
>>>>>> masks and installs a custom RTL hooks callback to consult this map when
>>>>>> it encounters a register.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> PHI nodes require special handling to merge masks from all inputs. This
>>>>>> is done by combine_mask_from_phi. Three cases are handled here:
>>>>>> 1. The input edge has a definition: this is the simplest case. For each
>>>>>>    input edge to the PHI, the def information is retrieved and its mask
>>>>>>    is looked up.
>>>>>> 2. The input edge has no definition: a conservative mask is assumed for
>>>>>>    that input.
>>>>>> 3. The input edge is a PHI: recursively call combine_mask_from_phi to
>>>>>>    merge the masks of all incoming values.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> When processing regular instructions, the pass first tackles SET and
>>>>>> PARALLEL patterns containing compare instructions.
>>>>>>
>>>>>> Single SET instructions:
>>>>>>
>>>>>> If the upper 32 bits of the source are known to be zero, then the
>>>>>> instruction qualifies for narrowing. Instead of just using
>>>>>> lowpart_subreg for the source, we define narrow_dimode_src to attempt
>>>>>> further optimizations:
>>>>>>
>>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via
>>>>>>   simplify_gen_binary
>>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
>>>>>>
>>>>>> PARALLEL instructions (compare + SET):
>>>>>>
>>>>>> The pass tackles flag-setting operations (ADDS, SUBS, ANDS, etc.) where
>>>>>> the SET source equals the first operand of the COMPARE. Depending on
>>>>>> the condition-code mode of the compare, the pass checks that the
>>>>>> required bits are zero:
>>>>>>
>>>>>> - CC_Zmode/CC_NZmode: upper 32 bits
>>>>>> - CC_NZVmode: upper 32 bits and bit 31 (for overflow)
>>>>>>
>>>>>> If the instruction does not match the above patterns (or matches but
>>>>>> cannot be optimized), the pass still analyzes all its definitions, so
>>>>>> that every definition has an entry in nzero_map and the map stays
>>>>>> complete.
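
(Aside, not part of the quoted description: the mask bookkeeping above can be
illustrated with a small stand-alone C model. The names and structure below
are made up purely for illustration and are not the code in the patch; the
real pass works on RTL-SSA defs and rtx expressions, but the PHI merging and
the "upper 32 bits known zero" test follow the same idea.)

  /* Stand-alone model of the per-definition nonzero-bit masks.  */
  #include <stdint.h>
  #include <stdio.h>

  /* Conservative mask for an input with no visible definition.  */
  #define UNKNOWN_MASK  UINT64_MAX
  #define UPPER32_MASK  0xffffffff00000000ULL

  /* A PHI merges several incoming values: the possibly-nonzero bits of the
     result are the union (bitwise OR) of the input masks.  */
  static uint64_t
  merge_phi_masks (const uint64_t *masks, int n)
  {
    uint64_t merged = 0;
    for (int i = 0; i < n; i++)
      merged |= masks[i];
    return merged;
  }

  /* A DImode write qualifies for narrowing to a W register when the upper
     32 bits of its source are known to be zero.  */
  static int
  qualifies_for_narrowing (uint64_t mask)
  {
    return (mask & UPPER32_MASK) == 0;
  }

  int
  main (void)
  {
    /* E.g. one incoming value is (a & 255), the other the constant 3.  */
    uint64_t incoming[] = { 0xff, 0x3 };
    uint64_t merged = merge_phi_masks (incoming, 2);
    printf ("merged mask 0x%llx, narrowable: %d\n",
            (unsigned long long) merged, qualifies_for_narrowing (merged));

    /* An input with no definition forces the conservative answer.  */
    uint64_t with_unknown[] = { 0xff, UNKNOWN_MASK };
    printf ("narrowable with unknown input: %d\n",
            qualifies_for_narrowing (merge_phi_masks (with_unknown, 2)));
    return 0;
  }
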
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> When transforming the qualified instructions, the pass uses
>>>>>> rtl_ssa::recog and rtl_ssa::change_is_worthwhile to verify the new
>>>>>> pattern and determine if the transformation is worthwhile.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> As an additional benefit, testing on Neoverse-V2 shows that instances
>>>>>> of 'and x1, x2, #0xffffffff' are converted to zero-latency
>>>>>> 'mov w1, w2' instructions after this pass narrows them.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>>>>>> regression.
>>>>>> OK for mainline?
>>>>>>
>>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
>>>>>> Signed-off-by: Soumya AR <[email protected]>
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>> * config.gcc: Add aarch64-narrow-gp-writes.o.
>>>>>> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
>>>>>> pass_narrow_gp_writes before pass_cleanup_barriers.
>>>>>> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
>>>>>> * config/aarch64/tuning_models/olympus.h: Add
>>>>>> AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
>>>>>> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): Declare.
>>>>>> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
>>>>>> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
>>>>>> * doc/invoke.texi: Document -mnarrow-gp-writes.
>>>>>> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
>>>>>>
>>>>>> gcc/testsuite/ChangeLog:
>>>>>>
>>>>>> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
>>>>>> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
>>>>>> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
>>>>>> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
>>>>>> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
>>>>>> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
>>>>>> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
>>>>>>
>>>>>>
>>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
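
(Aside: for anyone who wants to experiment with the option, a test in the
style of the new narrow-gp-writes tests would look roughly like the sketch
below. It is based on the example from the description above and is
illustrative only, not one of the actual testsuite files; the exact
scan-assembler pattern may need adjusting.)

  /* { dg-do compile } */
  /* { dg-options "-O2 -mnarrow-gp-writes" } */

  unsigned long long
  masked_add (unsigned long long a)
  {
    return (a & 255) + 3;
  }

  /* With narrowing enabled, the add should operate on a W register.  */
  /* { dg-final { scan-assembler {add\tw[0-9]+, w[0-9]+, #?3} } } */
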
