Re: [PING][PATCH v2] AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit

Soumya AR Mon, 19 Jan 2026 20:11:14 -0800

Ping.

I split the files from the previous mail so it's hopefully easier to review.


Also CC'ing Alex Coplan to this thread.

Thanks,
Soumya

> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
> 
> Hi Tamar,
> 
> Attaching an updated version of this patch that enables the pass at O2 and
> above on aarch64, and can be optionally disabled with -mno-narrow-gp-writes.
> 
> Enabling it by default at O2 touched quite a large number of tests, which I
> have updated in this patch.
> 
> Most of the updates are straightforward: they change x registers to
> (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
> 
> There are some tests (e.g., aarch64/int_mov_immediate_1.c) where the
> representation of the immediate changes:
> 
>         mov w0, 4294927974 -> mov w0, -39322
> 
> This happens when the following RTL is narrowed to SI:
>         (set (reg/i:DI 0 x0)
>                 (const_int 4294927974 [0xffff6666]))
> 
> After narrowing, the MSB becomes bit 31, which is set, so the immediate is
> printed as a signed value.
> 
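To make the change in the printed immediate concrete, here is a minimal C
sketch of the arithmetic. It assumes only that a constant is kept sign-extended
from the top bit of its mode; the helper name as_simode_const is hypothetical
and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): view the low 32 bits of a
   64-bit constant as a signed 32-bit value, i.e. sign-extend from
   bit 31 instead of bit 63.  */
int64_t as_simode_const(uint64_t value)
{
    uint32_t low = (uint32_t) value;

    /* Subtract 2^32 when bit 31 is set, avoiding implementation-defined
       unsigned-to-signed conversions.  */
    if (low & UINT32_C(0x80000000))
        return (int64_t) low - ((int64_t) 1 << 32);
    return (int64_t) low;
}
```

With the value from the test above, 4294927974 (0xffff6666) has bit 31 set, so
it reads as 4294927974 - 4294967296 = -39322 once it is treated as a 32-bit
quantity. The bit pattern written to w0 is unchanged.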
> Thanks,
> Soumya
> 
> 
> 
> > On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
> > 
> > Ping.
> > 
> > Thanks,
> > Soumya
> > 
> >> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
> >> 
> >> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
> >> 
> >> This patch adds a new AArch64 RTL pass that optimizes 64-bit
> >> general purpose register operations to use 32-bit W-registers when the
> >> upper 32 bits of the register are known to be zero.
> >> 
> >> The Olympus core benefits from using 32-bit W-registers over 64-bit
> >> X-registers where possible. This is recommended by the updated Olympus
> >> Software Optimization Guide, which will be published soon.
> >> 
> >> The pass is controlled by -mnarrow-gp-writes and runs only at -O2 and
> >> above. It is not enabled by default, except for -mcpu=olympus.
> >> 
> >> ---
> >> 
> >> In AArch64, each 64-bit X register has a corresponding 32-bit W register
> >> that maps to its lower half.  When we can guarantee that the upper 32 bits
> >> are never used, we can safely narrow operations to use W registers instead.
> >> 
> >> For example, this code:
> >> #include <stdint.h>
> >> 
> >> uint64_t foo(uint64_t a) {
> >>     return (a & 255) + 3;
> >> }
> >> 
> >> Currently compiles to:
> >> and x8, x0, #0xff
> >> add x0, x8, #3
> >> 
> >> But with this pass enabled, it optimizes to:
> >> and x8, x0, #0xff
> >> add w0, w8, #3      // Using W register instead of X
> >> 
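As a sanity check on the example above, the following C sketch models both
forms. It relies on the architectural fact that a write to a W register clears
bits 63:32 of the corresponding X register, which is modelled here as a
zero-extension. The function names foo_wide and foo_narrow are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit form:  and x8, x0, #0xff ; add x0, x8, #3  */
uint64_t foo_wide(uint64_t a)
{
    return (a & 255) + 3;
}

/* Narrowed form:  and x8, x0, #0xff ; add w0, w8, #3
   The add is done in 32 bits and the result is zero-extended, which is
   how a W-register write behaves.  */
uint64_t foo_narrow(uint64_t a)
{
    uint32_t masked = (uint32_t) (a & 255);  /* at most 255 */
    uint32_t sum = masked + 3u;              /* at most 258, cannot wrap */

    return (uint64_t) sum;
}
```

The two functions agree for every input because the masked value has no bits
set above bit 7, so the 32-bit add can neither wrap nor lose information.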
> >> ---
> >> 
> >> The pass operates in two phases:
> >> 
> >> 1) Analysis Phase:
> >> - Using RTL-SSA, iterates through extended basic blocks (EBBs)
> >> - Computes nonzero bit masks for each register definition
> >> - Recursively processes PHI nodes
> >> - Identifies candidates for narrowing
> >> 2) Transformation Phase:
> >> - Applies narrowing to validated candidates
> >> - Converts DImode operations to SImode where safe
> >> 
> >> The pass runs late in the RTL pipeline, after register allocation, to ensure
> >> stable def-use chains and avoid interfering with earlier optimizations.
> >> 
> >> ---
> >> 
> >> nonzero_bits(src, DImode) is a function defined in rtlanal.cc that
> >> recursively analyzes RTL expressions to compute a bitmask. However,
> >> nonzero_bits has a limitation: when it encounters a register, it
> >> conservatively returns the mode mask (all bits potentially set). Since this
> >> pass analyzes all defs in an instruction, this information can be used to
> >> refine the mask. The pass maintains a hash map of computed bit masks and
> >> installs a custom RTL hooks callback to consult this mask when encountering
> >> a register.
> >> 
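The idea of refining register masks can be illustrated with a toy model. This
is only a sketch: the expression type, the mask table keyed on register number,
and the function names are all hypothetical simplifications, and none of this
is GCC's nonzero_bits or the RTL hooks interface. The actual pass records masks
per definition, not per register number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy expression tree (hypothetical, not GCC RTL).  */
enum expr_kind { EXPR_CONST, EXPR_REG, EXPR_AND, EXPR_IOR };

struct expr {
    enum expr_kind kind;
    uint64_t value;                /* constant value, or register number */
    const struct expr *lhs, *rhs;  /* operands of AND / IOR */
};

#define NUM_REGS 32

/* Masks recorded so far: a bit is set if it may be nonzero.  */
struct mask_table {
    uint64_t mask[NUM_REGS];
    int known[NUM_REGS];
};

/* Return the bits of E that may be nonzero.  With no table, or no entry
   for a register, every bit is assumed to be possibly set.  */
uint64_t possibly_nonzero(const struct expr *e, const struct mask_table *t)
{
    switch (e->kind) {
    case EXPR_CONST:
        return e->value;
    case EXPR_REG:
        if (t != NULL && e->value < NUM_REGS && t->known[e->value])
            return t->mask[e->value];
        return UINT64_MAX;
    case EXPR_AND:
        return possibly_nonzero(e->lhs, t) & possibly_nonzero(e->rhs, t);
    case EXPR_IOR:
        return possibly_nonzero(e->lhs, t) | possibly_nonzero(e->rhs, t);
    }
    return UINT64_MAX;
}

/* A value is a narrowing candidate when bits 63:32 cannot be set.  */
int upper_half_known_zero(uint64_t mask)
{
    return (mask >> 32) == 0;
}
```

For an expression such as (ior (reg 1) (const_int 3)), the result without a
table is all ones, so nothing can be concluded. Once the table records that
register 1 only has its low 8 bits possibly set, the result becomes 0xff and
the upper half is known to be zero.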
> >> ---
> >> 
> >> PHI nodes require special handling to merge masks from all inputs. This is
> >> done by combine_mask_from_phi, which handles three cases:
> >> 1. Input Edge has a Definition: This is the simplest case. For each input
> >> edge to the PHI, the def information is retrieved and its mask is looked up.
> >> 2. Input Edge has no Definition: A conservative mask is assumed for that
> >> input.
> >> 3. Input Edge is a PHI: combine_mask_from_phi is called recursively to
> >> merge the masks of all incoming values.
> >> 
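The three cases can be sketched as follows. This is a simplified model, not
the code in the patch. It assumes that merging means taking the bitwise OR of
the input masks (a bit may be nonzero if it may be nonzero on any incoming
edge). The types, the name merge_phi_masks, and the recursion depth limit are
all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALL_BITS UINT64_MAX
#define MAX_PHI_DEPTH 8   /* hypothetical guard against PHI cycles */

struct phi_node;

struct phi_input {
    int has_def;                        /* edge carries a known definition */
    uint64_t def_mask;                  /* its mask, valid when has_def */
    const struct phi_node *nested_phi;  /* non-NULL if the input is a PHI */
};

struct phi_node {
    size_t num_inputs;
    const struct phi_input *inputs;
};

/* OR together the possibly-nonzero masks of all inputs of PHI.  */
uint64_t merge_phi_masks(const struct phi_node *phi, int depth)
{
    uint64_t merged = 0;

    if (depth > MAX_PHI_DEPTH)
        return ALL_BITS;                /* give up conservatively */

    for (size_t i = 0; i < phi->num_inputs; i++) {
        const struct phi_input *in = &phi->inputs[i];

        if (in->nested_phi != NULL)     /* case 3: input is a PHI */
            merged |= merge_phi_masks(in->nested_phi, depth + 1);
        else if (in->has_def)           /* case 1: known definition */
            merged |= in->def_mask;
        else                            /* case 2: no definition */
            merged |= ALL_BITS;
    }
    return merged;
}
```

A single input without a definition makes the whole PHI conservative, since
OR-ing in an all-ones mask leaves no bit known to be zero.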
> >> ---
> >> 
> >> When processing regular instructions, the pass first tackles SET and
> >> PARALLEL patterns with compare instructions.
> >> 
> >> Single SET instructions:
> >> 
> >> If the upper 32 bits of the source are known to be zero, then the
> >> instruction qualifies for narrowing. Instead of just using lowpart_subreg
> >> for the source, we define narrow_dimode_src to attempt further
> >> optimizations:
> >> 
> >> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via simplify_gen_binary
> >> - IF_THEN_ELSE: simplified via simplify_gen_ternary
> >> 
> >> PARALLEL Instructions (Compare + SET):
> >> 
> >> The pass tackles flag-setting operations (ADDS, SUBS, ANDS, etc.) where the
> >> SET source equals the first operand of the COMPARE. Depending on the
> >> condition code for the compare, the pass checks for the required bits to be
> >> zero:
> >> 
> >> - CC_Zmode/CC_NZmode: Upper 32 bits
> >> - CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
> >> 
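The two rules listed above amount to a mask test. The sketch below encodes
them exactly as stated and nothing more: it does not model how the flags are
computed, and the enum and function names are hypothetical stand-ins for the
CC modes named in the description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for CC_Zmode, CC_NZmode and CC_NZVmode.  */
enum cc_kind { CC_KIND_Z, CC_KIND_NZ, CC_KIND_NZV };

/* Bits that must be known zero before the compare may be narrowed.  */
uint64_t required_zero_bits(enum cc_kind cc)
{
    const uint64_t upper32 = UINT64_C(0xffffffff00000000);
    const uint64_t bit31 = UINT64_C(1) << 31;

    switch (cc) {
    case CC_KIND_Z:
    case CC_KIND_NZ:
        return upper32;
    case CC_KIND_NZV:
        return upper32 | bit31;
    }
    return UINT64_MAX;  /* unknown kind: never narrow */
}

/* NONZERO_MASK has a bit set wherever the value may be nonzero.  */
int can_narrow_compare(uint64_t nonzero_mask, enum cc_kind cc)
{
    return (nonzero_mask & required_zero_bits(cc)) == 0;
}
```

For instance, a value whose mask is 0xffffffff passes the first rule but not
the second, because bit 31 may be set.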
> >> If the instruction does not match the above patterns (or matches but
> >> cannot be optimized), the pass still analyzes all its definitions, so that
> >> every definition has an entry in nzero_map.
> >> 
> >> ---
> >> 
> >> When transforming the qualified instructions, the pass uses rtl_ssa::recog
> >> and rtl_ssa::change_is_worthwhile to verify the new pattern and determine
> >> if the transformation is worthwhile.
> >> 
> >> ---
> >> 
> >> As an additional benefit, testing on Neoverse-V2 shows that instances of
> >> 'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
> >> instructions after this pass narrows them.
> >> 
> >> ---
> >> 
> >> The patch was bootstrapped and regtested on aarch64-linux-gnu with no
> >> regressions.
> >> OK for mainline?
> >> 
> >> Co-authored-by: Kyrylo Tkachov <[email protected]>
> >> Signed-off-by: Soumya AR <[email protected]>
> >> 
> >> gcc/ChangeLog:
> >> 
> >>    * config.gcc: Add aarch64-narrow-gp-writes.o.
> >>    * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
> >>    pass_narrow_gp_writes before pass_cleanup_barriers.
> >>    * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
> >>    Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
> >>    * config/aarch64/tuning_models/olympus.h:
> >>    Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
> >>    * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): Declare.
> >>    * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
> >>    * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
> >>    * doc/invoke.texi: Document -mnarrow-gp-writes.
> >>    * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
> >> 
> >> gcc/testsuite/ChangeLog:
> >> 
> >>    * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
> >>    * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
> >>    * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
> >>    * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
> >>    * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
> >>    * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
> >>    * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
> >> 
> >> 
> >> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
> > 
>

0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch.gz
Description: 0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch.gz

0002-AArch64-Update-aarch64-aarch64-sve-aarch64-acle-test.patch.gz
Description: 0002-AArch64-Update-aarch64-aarch64-sve-aarch64-acle-test.patch.gz

0003-AArch64-Update-aarch64-sme2-acle-asm-tests-affected-.patch.gz
Description: 0003-AArch64-Update-aarch64-sme2-acle-asm-tests-affected-.patch.gz

0004-AArch64-Update-aarch64-sve-acle-asm-tests-affected-b.patch.gz
Description: 0004-AArch64-Update-aarch64-sve-acle-asm-tests-affected-b.patch.gz

0005-AArch64-Update-aarch64-sve2-acle-asm-tests-affected-.patch.gz
Description: 0005-AArch64-Update-aarch64-sve2-acle-asm-tests-affected-.patch.gz
