> On 20 Jan 2026, at 10:06, Kyrylo Tkachov <[email protected]> wrote:
> 
> 
> 
>> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]> 
>> wrote:
>> 
>> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
>>> 
>>> Ping.
>>> 
>>> I split the files from the previous mail so it's hopefully easier to review.
>> 
>> I can review this but the approval won't be until stage1. Adding this
>> pass is too risky at this point of the release cycle.
> 
> Thanks for any feedback you can give. FWIW we’ve been testing this internally 
> for a few months without any issues.

One option to reduce the risk, which Soumya's initial patch implemented, was to 
enable this only for -mcpu=olympus, the target we initially developed and tested 
it on.
That way it wouldn't affect most aarch64 targets, and users would still have the 
-mno-* option to disable it as a workaround if it causes trouble.
Would that be okay with you?
Thanks,
Kyrill

> 
>> 
>> Though I also wonder how much of this can/should be done on the gimple
>> level in a generic way.
> 
> GIMPLE does have powerful ranger infrastructure for this, but I was concerned 
> about doing this earlier because it’s very likely that some later pass could 
> introduce extra extend operations, which would likely undo the benefit of the 
> narrowing.
> 
> Thanks,
> Kyrill
> 
>> And I wonder if there is a way to carry the zero-bits information from the
>> gimple level down to the RTL level so we don't need to keep recomputing it
>> (this would be useful for other passes too).
>> 
>> Thanks,
>> Andrew Pinski
>> 
>>> 
>>> Also CC'ing Alex Coplan to this thread.
>>> 
>>> Thanks,
>>> Soumya
>>> 
>>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
>>>> 
>>>> Hi Tamar,
>>>> 
>>>> Attaching an updated version of this patch. It enables the pass at -O2 and 
>>>> above on aarch64, and the pass can be disabled with -mno-narrow-gp-writes.
>>>> 
>>>> Enabling it by default at -O2 touched quite a large number of tests, which I
>>>> have updated in this patch.
>>>> 
>>>> Most of the updates are straightforward and simply relax the expected x 
>>>> registers to (w|x) registers in the scan patterns (e.g., x[0-9]+ -> [wx][0-9]+).
>>>> 
>>>> There are some tests (e.g. aarch64/int_mov_immediate_1.c) where the
>>>> representation of the immediate changes:
>>>> 
>>>>       mov w0, 4294927974 -> mov w0, -39322
>>>> 
>>>> This is because the following RTL is narrowed to SI:
>>>>       (set (reg/i:DI 0 x0)
>>>>               (const_int 4294927974 [0xffff6666]))
>>>> 
>>>> After narrowing, the most significant bit is bit 31, which is set, so the
>>>> immediate is printed as a signed value.
>>>> 
>>>> Thanks,
>>>> Soumya
>>>> 
>>>> 
>>>> 
>>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
>>>>> 
>>>>> Ping.
>>>>> 
>>>>> Thanks,
>>>>> Soumya
>>>>> 
>>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
>>>>>> 
>>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
>>>>>> 
>>>>>> This patch adds a new AArch64 RTL pass that narrows 64-bit general purpose
>>>>>> register operations to use 32-bit W-registers when the upper 32 bits of the
>>>>>> value being written are known to be zero.
>>>>>> 
>>>>>> This is primarily aimed at the Olympus core, which prefers 32-bit W-register
>>>>>> forms over 64-bit X-register forms where possible, as recommended by the
>>>>>> updated Olympus Software Optimization Guide, which will be published soon.
>>>>>> 
>>>>>> The pass can be controlled with -mnarrow-gp-writes. It runs at -O2 and
>>>>>> above, but is only enabled by default for -mcpu=olympus.
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W register
>>>>>> that maps to its lower half, and a write to a W register zeroes the upper
>>>>>> 32 bits of the X register.  When we can guarantee that the upper 32 bits of
>>>>>> the value being written are already zero, we can safely narrow the operation
>>>>>> to use W registers instead.
>>>>>> 
>>>>>> For example, this code:
>>>>>> uint64_t foo(uint64_t a) {
>>>>>>   return (a & 255) + 3;
>>>>>> }
>>>>>> 
>>>>>> Currently compiles to:
>>>>>> and x8, x0, #0xff
>>>>>> add x0, x8, #3
>>>>>> 
>>>>>> But with this pass enabled, it optimizes to:
>>>>>> and x8, x0, #0xff
>>>>>> add w0, w8, #3      // Using W register instead of X
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> The pass operates in two phases:
>>>>>> 
>>>>>> 1) Analysis Phase:
>>>>>> - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>>>>>> - Computes nonzero bit masks for each register definition
>>>>>> - Recursively processes PHI nodes
>>>>>> - Identifies candidates for narrowing
>>>>>> 2) Transformation Phase:
>>>>>> - Applies narrowing to validated candidates
>>>>>> - Converts DImode operations to SImode where safe
>>>>>> 
>>>>>> The pass runs late in the RTL pipeline, after register allocation, to 
>>>>>> ensure
>>>>>> stable def-use chains and avoid interfering with earlier optimizations.
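>>>>>> 
>>>>>> For readers less familiar with RTL-SSA passes, a very rough sketch of the
>>>>>> driver shape is below.  It is illustrative only, not the patch itself:
>>>>>> narrow_gp_writes_execute, analyze_ebb and transform_candidates are
>>>>>> hypothetical names, and the RTL-SSA boilerplate follows the idiom of other
>>>>>> RTL-SSA passes rather than quoting the actual code.
>>>>>> 
>>>>>> unsigned int
>>>>>> narrow_gp_writes_execute ()
>>>>>> {
>>>>>>   /* Build RTL-SSA form, as other RTL-SSA passes do.  */
>>>>>>   df_analyze ();
>>>>>>   crtl->ssa = new rtl_ssa::function_info (cfun);
>>>>>> 
>>>>>>   /* Phase 1: walk each extended basic block, recording for every
>>>>>>      definition the bits that may be nonzero and collecting candidate
>>>>>>      instructions.  */
>>>>>>   for (rtl_ssa::ebb_info *ebb : crtl->ssa->ebbs ())
>>>>>>     analyze_ebb (ebb);                  /* hypothetical helper */
>>>>>> 
>>>>>>   /* Phase 2: narrow the validated candidates from DImode to SImode.  */
>>>>>>   transform_candidates ();              /* hypothetical helper */
>>>>>> 
>>>>>>   crtl->ssa->perform_pending_updates ();
>>>>>>   free_dominance_info (CDI_DOMINATORS);
>>>>>>   delete crtl->ssa;
>>>>>>   crtl->ssa = nullptr;
>>>>>>   return 0;
>>>>>> }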
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> nonzero_bits (src, DImode) is a function defined in rtlanal.cc that
>>>>>> recursively analyzes an RTL expression and returns a mask of the bits that
>>>>>> may be nonzero.  It has one limitation for our purposes: when it encounters
>>>>>> a register, it conservatively returns the mode mask (all bits potentially
>>>>>> set).  Since this pass has already computed a mask for every definition it
>>>>>> has visited, that information can be used to refine the result.  The pass
>>>>>> therefore maintains a hash map (nzero_map) from definitions to their
>>>>>> computed masks and installs a custom RTL hooks callback so that
>>>>>> nonzero_bits consults this map when it encounters a register.
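>>>>>> 
>>>>>> As an illustration of what these masks look like for the foo example above,
>>>>>> here is a tiny stand-alone program (not GCC code) using simplified,
>>>>>> conservative propagation rules; mask_and and mask_plus are made-up names:
>>>>>> 
>>>>>> #include <cstdint>
>>>>>> #include <cstdio>
>>>>>> 
>>>>>> /* Toy, conservative model of nonzero-bits propagation.  */
>>>>>> static uint64_t mask_and (uint64_t a, uint64_t b) { return a & b; }
>>>>>> static uint64_t mask_plus (uint64_t a, uint64_t b)
>>>>>> {
>>>>>>   /* A sum can carry at most one bit past the highest operand bit.  */
>>>>>>   uint64_t m = a | b;
>>>>>>   return m | (m << 1);
>>>>>> }
>>>>>> 
>>>>>> int main ()
>>>>>> {
>>>>>>   uint64_t a = ~UINT64_C (0);          /* nothing known about 'a'     */
>>>>>>   uint64_t t = mask_and (a, 0xff);     /* a & 255  -> mask 0xff       */
>>>>>>   uint64_t r = mask_plus (t, 3);       /* ...  + 3 -> mask 0x1ff      */
>>>>>>   /* Upper 32 bits of the result are known zero, so the add can be
>>>>>>      narrowed to a W-register form.  */
>>>>>>   std::printf ("%#llx\n", (unsigned long long) r);
>>>>>>   return 0;
>>>>>> }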
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> PHI nodes require special handling to merge the masks of all their inputs.
>>>>>> This is done by combine_mask_from_phi, which handles three cases (a small
>>>>>> stand-alone sketch of the merging rule follows below):
>>>>>> 1. The input edge has a definition: this is the simplest case.  The def
>>>>>> information for that edge is retrieved and its mask is looked up.
>>>>>> 2. The input edge has no definition: a conservative mask is assumed for
>>>>>> that input.
>>>>>> 3. The input is itself a PHI: combine_mask_from_phi is called recursively
>>>>>> to merge the masks of all incoming values.
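>>>>>> 
>>>>>> The sketch below shows the merging rule on a toy PHI representation.  It is
>>>>>> not the RTL-SSA API and not the patch's code; the struct names and ALL_BITS
>>>>>> are made up for illustration.
>>>>>> 
>>>>>> #include <cstdint>
>>>>>> #include <vector>
>>>>>> 
>>>>>> struct phi;
>>>>>> struct input
>>>>>> {
>>>>>>   bool has_def;        /* case 1: input comes from a normal definition  */
>>>>>>   uint64_t mask;       /* ...whose mask has already been computed       */
>>>>>>   phi *nested;         /* case 3: input is itself a PHI                 */
>>>>>> };
>>>>>> struct phi { std::vector<input> inputs; };
>>>>>> 
>>>>>> static const uint64_t ALL_BITS = ~UINT64_C (0);
>>>>>> 
>>>>>> /* The merged mask may have a bit set whenever any input may have it set.
>>>>>>    A complete implementation would also need to guard against cycles of
>>>>>>    PHIs (e.g. loop-carried values); that is omitted here for brevity.  */
>>>>>> static uint64_t
>>>>>> combine_mask_from_phi (const phi *p)
>>>>>> {
>>>>>>   uint64_t mask = 0;
>>>>>>   for (const input &in : p->inputs)
>>>>>>     {
>>>>>>       if (in.nested)
>>>>>>         mask |= combine_mask_from_phi (in.nested);   /* case 3 */
>>>>>>       else if (in.has_def)
>>>>>>         mask |= in.mask;                             /* case 1 */
>>>>>>       else
>>>>>>         mask |= ALL_BITS;                            /* case 2 */
>>>>>>     }
>>>>>>   return mask;
>>>>>> }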
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> When processing regular instructions, the pass handles two shapes: single
>>>>>> SET patterns and PARALLEL patterns that combine a COMPARE with a SET.
>>>>>> 
>>>>>> Single SET instructions:
>>>>>> 
>>>>>> If the upper 32 bits of the source are known to be zero, the instruction
>>>>>> qualifies for narrowing.  Instead of just taking lowpart_subreg of the
>>>>>> source, we define narrow_dimode_src to attempt further simplifications
>>>>>> (sketched below):
>>>>>> 
>>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via simplify_gen_binary
>>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
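>>>>>> 
>>>>>> The following is only an illustrative sketch of the idea behind
>>>>>> narrow_dimode_src; the actual implementation in the patch is more careful
>>>>>> and may handle more cases:
>>>>>> 
>>>>>> static rtx
>>>>>> narrow_dimode_src (rtx src)
>>>>>> {
>>>>>>   switch (GET_CODE (src))
>>>>>>     {
>>>>>>     case AND:
>>>>>>     case IOR:
>>>>>>     case XOR:
>>>>>>       {
>>>>>>         /* Narrow both operands and let the simplifier fold the result,
>>>>>>            e.g. (and:DI x (const_int 0xffffffff)) becomes a plain move.  */
>>>>>>         rtx op0 = lowpart_subreg (SImode, XEXP (src, 0), DImode);
>>>>>>         rtx op1 = lowpart_subreg (SImode, XEXP (src, 1), DImode);
>>>>>>         if (op0 && op1)
>>>>>>           return simplify_gen_binary (GET_CODE (src), SImode, op0, op1);
>>>>>>         break;
>>>>>>       }
>>>>>>     case ASHIFT:
>>>>>>       {
>>>>>>         /* The shift amount keeps its own mode; only the shifted value is
>>>>>>            narrowed.  */
>>>>>>         rtx op0 = lowpart_subreg (SImode, XEXP (src, 0), DImode);
>>>>>>         if (op0)
>>>>>>           return simplify_gen_binary (ASHIFT, SImode, op0, XEXP (src, 1));
>>>>>>         break;
>>>>>>       }
>>>>>>     case IF_THEN_ELSE:
>>>>>>       {
>>>>>>         rtx op1 = lowpart_subreg (SImode, XEXP (src, 1), DImode);
>>>>>>         rtx op2 = lowpart_subreg (SImode, XEXP (src, 2), DImode);
>>>>>>         if (op1 && op2)
>>>>>>           return simplify_gen_ternary (IF_THEN_ELSE, SImode,
>>>>>>                                        GET_MODE (XEXP (src, 0)),
>>>>>>                                        XEXP (src, 0), op1, op2);
>>>>>>         break;
>>>>>>       }
>>>>>>     default:
>>>>>>       break;
>>>>>>     }
>>>>>>   /* Fall back to a plain lowpart subreg of the whole source.  */
>>>>>>   return lowpart_subreg (SImode, src, DImode);
>>>>>> }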
>>>>>> 
>>>>>> PARALLEL Instructions (Compare + SET):
>>>>>> 
>>>>>> The pass handles flag-setting operations (ADDS, SUBS, ANDS, etc.) where the
>>>>>> SET source equals the first operand of the COMPARE.  Depending on the
>>>>>> condition-code mode of the compare, the pass requires different bits to be
>>>>>> known zero:
>>>>>> 
>>>>>> - CC_Zmode/CC_NZmode: the upper 32 bits
>>>>>> - CC_NZVmode: the upper 32 bits and bit 31 (because of the overflow flag)
>>>>>> 
>>>>>> If the instruction does not match the above patterns (or matches but cannot
>>>>>> be optimized), the pass still analyzes all of its definitions, so that every
>>>>>> definition ends up with an entry in nzero_map.
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> When transforming the qualifying instructions, the pass uses rtl_ssa::recog
>>>>>> to verify that the new pattern still matches and
>>>>>> rtl_ssa::change_is_worthwhile to check that the transformation is
>>>>>> profitable.
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> As an additional benefit, testing on Neoverse-V2 shows that instances of
>>>>>> 'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
>>>>>> instructions after this pass narrows them.
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu with no
>>>>>> regressions.
>>>>>> OK for mainline?
>>>>>> 
>>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
>>>>>> Signed-off-by: Soumya AR <[email protected]>
>>>>>> 
>>>>>> gcc/ChangeLog:
>>>>>> 
>>>>>>  * config.gcc: Add aarch64-narrow-gp-writes.o.
>>>>>>  * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
>>>>>>  pass_narrow_gp_writes before pass_cleanup_barriers.
>>>>>>  * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
>>>>>>  Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
>>>>>>  * config/aarch64/tuning_models/olympus.h:
>>>>>>  Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
>>>>>>  * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): Declare.
>>>>>>  * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
>>>>>>  * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
>>>>>>  * doc/invoke.texi: Document -mnarrow-gp-writes.
>>>>>>  * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
>>>>>> 
>>>>>> gcc/testsuite/ChangeLog:
>>>>>> 
>>>>>>  * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
>>>>>>  * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
>>>>>>  * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
>>>>>>  * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
>>>>>>  * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
>>>>>>  * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
>>>>>>  * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
>>>>>> 
>>>>>> 
>>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>

