On Thu, Dec 4, 2025 at 7:18 AM Feng Xue OS <[email protected]> wrote:
>
> Although GCC provides the __uint128 extension for integer computation beyond 
> 64 bits, some programmers still rely on portable but less efficient 
> emulation-based code to do 64x64->128 integer multiplication, as in the 
> case given in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122488.
>
>     void multiply_uint64_to_uint128(uint64_t op0, uint64_t op1, uint64_t *res)
>     {
>       uint64_t op0_lo = op0 & 0x00000000FFFFFFFF;
>       uint64_t op1_lo = op1 & 0x00000000FFFFFFFF;
>       uint64_t op0_hi = op0 >> 32;
>       uint64_t op1_hi = op1 >> 32;
>       uint64_t mul_lo = op0_lo * op1_lo;
>       uint64_t mul_mi_01 = op0_hi * op1_lo;
>       uint64_t mul_mi_10 = op1_hi * op0_lo;
>       uint64_t mul_hi = op0_hi * op1_hi;
>       uint64_t sum_mi = mul_mi_01 + mul_mi_10;
>       uint64_t sum_mi_carry = sum_mi < mul_mi_01;
>       uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0x00000000FFFFFFFF);
>
>       res[1] = mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32) + (sum_32 >> 32);
>       res[0] = (sum_32 << 32) | (mul_lo & 0x00000000FFFFFFFF);
>     }
>
> On aarch64 and x86, the above sequence can actually be optimized to a 64-bit 
> multiply-high plus multiply, and we observed a nice benefit in a real-world 
> application when the simplification is applied. LLVM has recently added 
> support for matching this pattern.  https://godbolt.org/z/7dsorWdKa
>
> So I'd like to discuss recognizing this pattern in GCC, and could implement 
> the feature once a proposal is agreed upon. A few particulars make this 
> more complicated than it looks, so simply introducing a new AST-based 
> pattern rule in match.pd might not work.
>
>     o. Besides a linear code sequence, the emulated multiplication could also 
> be implemented with a conditional statement, as in the variant case in 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107090. GCC does not 
> translate the conditional statement to a COND_EXPR in the middle end, and 
> relies only on the backend to map the statement to a conditional-select 
> instruction.
>
>     // Linear code
>     sum = a + b;
>     carry = sum < a;
>     value += carry << 32;
>
>     // Conditional code
>     sum = a + b;
>     if (sum < a)
>         value += 1ull << 32;
>
>     o. The final result is computed by adding more than two parts, whose 
> association order can be arbitrary. The fixed-order matching that naturally 
> arises from the match.pd mechanism cannot handle this gracefully; one 
> possible workaround is to enumerate all combinations and create one pattern 
> for each, which tends to be a lengthy trick.
>
>     result = ((part1 + part2) + part3) + part4;
>     result =  (part4 + part1) + (part3 + part2);
>     ...
>     result =  part2 + ((part4 + part3) + part1);
>
> At the same time, other additions that do not belong to the sequence may 
> break the pattern apart when the reassociation transform takes effect.
>
>          result = part1 + part2 + part3 + part4;
>          final = other1 + result + other2;
>
>          // After reassociation
>          tmp0 = other1 + part1;
>          tmp1 = part2 + part3;
>          tmp2 = part4 + other2;
>          final = tmp0 + tmp1 + tmp2;
>
> o. The pattern comprises two sub-patterns, mult-to-highpart and 
> mult-to-lowpart; the latter is much simpler and is exactly equivalent to a 
> plain 64x64 multiplication. But some intermediate values are shared by the 
> two sub-patterns, so they are not single-use temporaries, which makes 
> recognizing the two sub-patterns independently harder; we have to consider 
> them together.
>
> A possible approach to reference is the detection of table-based CTZ in the 
> forwprop pass, which might be a suitable place for this pattern, since 
> forwprop is invoked more than once, with some invocations happening before 
> the reassociation pass. But simplification of the 64x64->128 pattern should 
> be classified as a mathematics-related optimization, and for logical 
> consistency it would be better placed in tree-ssa-math-opts.cc. In that 
> file, pass_optimize_widening_mul is very close to what we want, but it runs 
> too late to keep the pattern intact against reassociation. So I am 
> considering adding a dedicated pass like pass_cse_reciprocals, placed before 
> the reassociation pass, where the main procedure would be hand-coded 
> matching, with only some leaf patterns defined via match.pd.

As you figured we already have highpart multiplication recognition.
It's the late reassoc pass that is the problem, given it applies
reassoc_width?  Or is the early reassoc pass also problematic?  We do
have some special-casing of fma patterns in reassoc, so
in theory we could handle widening mult candidates similarly, but then
doing widen-mult detection where we do cse_sincos/reciprocals
(those back-to-back passes should ideally share their CFG walk) is a
viable approach.  I would suggest against the ssa-forward pass,
esp. against trying to introduce highpart multiplication before
vectorization or other loop optimizations.

Richard.

> Now I am working on a prototype implementation, and would appreciate your 
> suggestions on this.
>
> Thanks,
> Feng
>