On Thu, Dec 4, 2025 at 10:38 AM Feng Xue OS <[email protected]> wrote:
>
> >>
> >> A possible approach to reference is detection of tabled-based CTZ in
> >> ssa-forward pass, which might be a suitable position for this pattern, in
> >> that ssa-forward pass is invoked more than one time, some happen before
> >> reassociation pass. But simplification of the 64x64->128 pattern should be
> >> classified as mathematics related optimization, and is better to put it in
> >> tree-ssa-mathopt.cc for logical consistency. In the file, there is a
> >> pass_optimize_widening_mul that is very close to what we want, but it is
> >> too late to keep entirety of the pattern against reassociation. So I
> >> consider adding a dedicated pass like pass_cse_reciprocals, meanwhile,
> >> place it prior to reassociation, and the major procedure is manually
> >> coding based matching, and only some leaf patterns are defined via
> >> match.pd.
>
> > As you figured we already have highpart multiplication recognition.
> > It's the late reassoc pass that is the problem, given it applies
> > reassoc_width? Or is the early reassoc pass also problematic? We do
> > have some special-casing of fma patters in reassoc, so
> > in theory we could handle widening mult candidates similarly, but then
> > doing widen-mult detection where we do cse_sincos/reciprocals
> > (those back-to-back passes should ideally share their CFG walk) is a
> > viable approach. I would suggest against the ssa-forward pass,
> > esp. against trying to introduce highpart multiplication before
> > vectorization or other loop optimizations.
>
> Current highpart multiplication and widen-mul recognition only handle cases
> where operations involves truncation from large integer type to small one. An
> example from regression testcases:
>
> unsigned long long foo(unsigned long long x, unsigned long long y)
> {
> return ((__uint128_t)x * (__uint128_t)y) >> 64;
> }
>
> That is, __uint128_t is explicitly used in source code, then it is nothing
> about
> addition, and suffers no impact from reassociation. However, the above
> pattern is completely emulated with uint64_t operations, which
> is unaware of 128-bit integer operation literally. It is synthesized from
> small
> type to large, this is the difference.
>
> I'm afraid we might not specially handle the pattern as fma in reassoc, in
> that
> it is much complex.
I think it's important to identify the parts of your
void multiply_uint64_to_uint128(uint64_t op0, uint64_t op1, uint64_t *res)
{
uint64_t op0_lo = op0 & 0x00000000FFFFFFFF;
uint64_t op1_lo = op1 & 0x00000000FFFFFFFF;
uint64_t op0_hi = op0 >> 32;
uint64_t op1_hi = op1 >> 32;
uint64_t mul_lo = op0_lo * op1_lo;
uint64_t mul_mi_01 = op0_hi * op1_lo;
uint64_t mul_mi_10 = op1_hi * op0_lo;
uint64_t mul_hi = op0_hi * op1_hi;
uint64_t sum_mi = mul_mi_01 + mul_mi_10;
uint64_t sum_mi_carry = sum_mi < mul_mi_01;
uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0x00000000FFFFFFFF);
res[1] = mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32) + (sum_32 >> 32);
res[0] = (sum_32 << 32) | (mul_lo & 0x00000000FFFFFFFF);
}
example that make up the highpart multiply to recognize since for example
changing the sink from a store of the two 64bit parts to something else will
have the chance to disrupt things. If we take away the lowpart of the result,
one key "element" is computation of the carry and that we operate on
the 32bit halves of the two inputs. There is another pass which I think faces
similar issues and that's the CRC (loop) recognition which uses symbolic
execution to recognize the overall effect of a series of operations. Maybe
such could be leveraged when starting from the op0/op1 definition side
and also be less dependent on particular association of the individual
operations?
Richard.
> Regards,
> Feng