Although GCC provides __uint128 extension for beyond 64-bit integer 
computation, some programmers still count on portable but less efficient plain 
emulation-based code to do 64x64->128 integer multiplication, as the case given 
in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122488.

    void multiply_uint64_to_uint128(uint64_t op0, uint64_t op1, uint64_t *res)
    {
      uint64_t op0_lo = op0 & 0x00000000FFFFFFFF;
      uint64_t op1_lo = op1 & 0x00000000FFFFFFFF;
      uint64_t op0_hi = op0 >> 32;
      uint64_t op1_hi = op1 >> 32;
      uint64_t mul_lo = op0_lo * op1_lo;
      uint64_t mul_mi_01 = op0_hi * op1_lo;
      uint64_t mul_mi_10 = op1_hi * op0_lo;
      uint64_t mul_hi = op0_hi * op1_hi;
      uint64_t sum_mi = mul_mi_01 + mul_mi_10;
      uint64_t sum_mi_carry = sum_mi < mul_mi_01;
      uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0x00000000FFFFFFFF);

      res[1] = mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32) + (sum_32 >> 32);
      res[0] = (sum_32 << 32) | (mul_lo & 0x00000000FFFFFFFF);
    }

On aarch64 and x86, the above sequence could actually be optimized to a 64-bit 
mult-high and mult, and we did observe a nice benefit for some reality 
application if the simplification is applied. Recently, LLVM just has already 
supported this pattern matching.  https://godbolt.org/z/7dsorWdKa

So I'd like to discuss recognition of the pattern in GCC, and could add the 
feature following with an agreed proposal. Something special makes the thing a 
little more complicated, and simply introducing a new AST-based pattern rule in 
match.pd might do not work.

    o. Not just linear code sequence, the emulated multiplication could also be 
implemented in a manner using conditional statement, such as the variant case 
in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107090. GCC does not translate 
the conditional statement to COND_EXPR in the middle-end, and only rely on 
backend to map the statement to COND-SELECT instruction.
    
    // Linear code 
    sum = a + b;
    carry = sum < a;
    value += carry << 32;

    // Conditional code 
    sum = a + b;
    if (sum < a)
        value += 1 << 32;

    o. The final result is computed from additions of more than two parts, 
whose associative order could be arbitrary. The fixed order constraint 
naturally arising from match.pd mechanism could not handle this gracefully, one 
possible solution is to enumerate all combinations and create one pattern 
respectively, which tends to be a lengthy trick.

    result = ((part1 + part2) + part3) + part4;
    result =  (part4 + part1) + (part3 + part2);
    ...
    result =  part2 + ((part4 + part3) + part1));

At the same time, other addition that does not belong to the sequence may break 
entirety of the pattern when reassociation transform takes effect.

         result = part1 + part2 + part3 + part4;
         final = other1 + result + other2; 

         // After reassociation
         tmp0 = other1 + part1;
         tmp1 = part2 + part3;
         tmp2 = part4 + other2;
         final = tmp0 + tmp1 + tmp2;

o. The pattern is comprised of two sub-patterns, mult-to-highpart and 
mult-to-lowpart, the latter much simpler and is exactly equivalent to a 64x64 
operation. But some intermediate values are shared by two sub-patterns, and are 
not thought as single-used temporaries, which would cause the idea of 
recognizing those two patterns independently to become somewhat harder, we have 
to take them both into consideration.

A possible approach to reference is detection of tabled-based CTZ in 
ssa-forward pass, which might be a suitable position for this pattern, in that 
ssa-forward pass is invoked more than one time, some happen before 
reassociation pass. But simplification of the 64x64->128 pattern should be 
classified as mathematics related optimization, and is better to put it in 
tree-ssa-mathopt.cc for logical consistency. In the file, there is a 
pass_optimize_widening_mul that is very close to what we want, but it is too 
late to keep entirety of the pattern against reassociation. So I consider 
adding a dedicated pass like pass_cse_reciprocals, meanwhile, place it prior to 
reassociation, and the major procedure is manually coding based matching, and 
only some leaf patterns are defined via match.pd.

Now, I am working on a prototype implementation, and hope your suggestions on 
this.

Thanks,
Feng




Reply via email to