On Fri, 20 Feb 2026 17:13:58 GMT, Andrew Haley <[email protected]> wrote:
>>> That would be really useful! I tinkered with it a bit, but it would be nice
>>> to see what you had in mind.
>>
>> Like this:
>>
>> address generate_intpoly_montgomeryMult_P256() {
>>
>> __ align(CodeEntryAlignment);
>> StubId stub_id = StubId::stubgen_intpoly_montgomeryMult_P256_id;
>> StubCodeMark mark(this, stub_id);
>> address start = __ pc();
>> __ enter();
>>
>> static const int64_t modulus[] = {
>> 0x000fffffffffffffL, 0x00000fffffffffffL,
>> 0x0000001000000000L, 0x0000ffffffff0000L,
>> 0L
>> };
>>
>> int shift1 = 12; // 64 - bits per limb
>> int shift2 = 52; // bits per limb
>>
>> // Registers that are used throughout entire routine
>> const Register a = c_rarg0;
>> const Register b = c_rarg1;
>> const Register result = c_rarg2;
>>
>> RegSet regs = RegSet::range(r0, r28) + rfp + lr - a - b - result;
>> FloatRegSet floatRegs = FloatRegSet::range(v0, v31)
>> - FloatRegSet::range(v8, v15) // Callee-saved vectors
>> - FloatRegSet::range(v16, v31); // Manually-allocated vectors
>>
>> auto common_regs = regs.begin();
>> Register limb_mask = *common_regs++,
>> c_ptr = *common_regs++,
>> mod_0 = *common_regs++,
>> mod_1 = *common_regs++,
>> mod_3 = *common_regs++,
>> mod_4 = *common_regs++,
>> b_0 = *common_regs++,
>> b_1 = *common_regs++,
>> b_2 = *common_regs++,
>> b_3 = *common_regs++,
>> b_4 = *common_regs++;
>> regs = common_regs.remaining();
>>
>> auto common_vectors = floatRegs.begin();
>> FloatRegister limb_mask_vec = *common_vectors++,
>> b_lows = *common_vectors++,
>> b_highs = *common_vectors++,
>> a_vals = *common_vectors++;
>>
>> // Push callee saved registers on to the stack
>> RegSet callee_saved = RegSet::range(r19, r28);
>> __ push(callee_saved, sp);
>>
>> // Allocate space on the stack for carry values
>> __ sub(sp, sp, 48);
>> __ mov(c_ptr, sp);
>>
>> // Calculate limb mask
>> __ mov(limb_mask, -UCONST64(1) >> (64 - shift2));
>> __ dup(limb_mask_vec, __ T2D, limb_mask);
>>
>> // Load input arrays and modulus
>> {
>> auto r = regs.begin();
>> Register a_ptr = *r++, mod_ptr = *r++;
>> __ add(a_ptr, a, 24);
>> __ lea(mod_ptr, ExternalAddress((address)modulus));
>> __ ldr(b_0, Address(b));
>> __ ldr(b_1, Address(b, 8));
>> __ ldr(b_2, Address(b, 16));
>> __ ldr(b_3, Address(b, 24));
>> __ ldr(b_4, Address(b, 32));
>> ...
>
> Note that in a few places I've had to push back dead registers so that they
> can be reused. This is necessary because the live ranges for some registers
> partially overlap.
>
> It's much better if you don't do that: instead, write a structured
> assembly-language program in which registers are allocated in scopes as
> needed, as I've done in the section which begins like this:
>
>
> // Load input arrays and modulus
> {
> auto r = regs.begin();
> Register a_ptr = *r++, mod_ptr = *r++;
>
>
> Here, the registers that contain `a_ptr` and `mod_ptr` are taken from the
> outer block, and are free for reuse when the inner block exits.
>
> I hope the advantages of this style are clear: the program is easier to
> write, to maintain, and much less risky. Also, and most importantly for me,
> it's much easier to review!
Thanks for taking the time to write all this out! I will do a refactor and
integrate these changes shortly.
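
For reference, here is a minimal standalone sketch of the scoped allocation
pattern described above. It assumes a toy bitmask register set: ToyRegSet,
ScopeDemo and run_scope_demo are hypothetical names made up for illustration
and are not HotSpot's RegSet API, and registers are plain ints rather than
Register values. It only shows that temporaries taken from a block-local
iterator become available again once the block exits.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a register set: bit i set means register i is available.
// Hypothetical type for illustration only; this is not HotSpot's RegSet.
struct ToyRegSet {
  uint32_t bits;

  struct Iter {
    uint32_t bits;

    // Lowest-numbered register still available.
    int operator*() const {
      assert(bits != 0 && "ran out of registers");
      int i = 0;
      while (((bits >> i) & 1u) == 0) i++;
      return i;
    }

    // Post-increment: return the old position, then drop its register.
    Iter operator++(int) {
      Iter old = *this;
      bits &= bits - 1;  // clear the lowest set bit
      return old;
    }

    // Everything not yet handed out.
    ToyRegSet remaining() const { return ToyRegSet{bits}; }
  };

  Iter begin() const { return Iter{bits}; }
};

// Register numbers handed out at each step of the demo.
struct ScopeDemo {
  int limb_mask, c_ptr;  // live for the whole routine
  int a_ptr, mod_ptr;    // temporaries of the first inner block
  int reused;            // first temporary of the second inner block
};

ScopeDemo run_scope_demo() {
  ScopeDemo d{};
  ToyRegSet regs{0xffu};  // registers 0..7 are available

  // Long-lived registers: allocate them, then shrink the outer set for good.
  auto common = regs.begin();
  d.limb_mask = *common++;
  d.c_ptr = *common++;
  regs = common.remaining();

  {
    // Block-local temporaries: regs itself is left unchanged.
    auto r = regs.begin();
    d.a_ptr = *r++;
    d.mod_ptr = *r++;
  }

  {
    // A later block starts from the same outer set, so it reuses them.
    auto r = regs.begin();
    d.reused = *r++;
  }
  return d;
}
```

In this sketch the second block gets the same register number that the first
block used for a_ptr, with no explicit push-back of dead registers.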
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27946#discussion_r2855165827