https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106678

            Bug ID: 106678
           Summary: Inefficiency in long integer multiplication
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

The code from PR 103109

#include <stdint.h>

void Long_multiplication( uint64_t multiplicand[],
                          uint64_t multiplier[],
                          uint64_t sum[],
                          uint64_t ilength, uint64_t jlength )
{
  uint64_t acarry, mcarry, product;

  for( uint64_t i = 0;
       i < (ilength + jlength);
       i++ )
    sum[i] = 0;

  acarry = 0;
  for( uint64_t j = 0; j < jlength; j++ )
    {
      mcarry = 0;
      for( uint64_t i = 0; i < ilength; i++ )
        {
          __uint128_t mcarry_prod;
          __uint128_t acarry_sum;
          mcarry_prod = ((__uint128_t) multiplicand[i]) * ((__uint128_t)
multiplier[j])
            + (__uint128_t) mcarry;
          mcarry = mcarry_prod >> 64;
          product = mcarry_prod;
          acarry_sum = ((__uint128_t) sum[i+j]) + ((__uint128_t) acarry) +
product;
          sum[i+j] += acarry_sum;
          acarry = acarry_sum >> 64;
          //      {mcarry, product} = multiplicand[i]*multiplier[j]
          //                            + mcarry;
          //      {acarry,sum[i+j]} = {sum[i+j]+acarry} + product;

        }
    }
}

still shows some inefficiency after r13-2107.

Compiling the function with gcc 13.0.0 20220818, with

$ gcc  -mcpu=power9 -O3 -c loop.c

and disassembling the output (for easier reading) gives (looking only
at the main part)

  7c:   00 00 80 38     li      r4,0
  80:   00 00 80 3b     li      r28,0
  84:   00 00 60 38     li      r3,0
  88:   00 00 00 38     li      r0,0
  8c:   ff ff c0 38     li      r6,-1
  90:   00 00 e0 38     li      r7,0
  94:   20 00 c1 fa     std     r22,32(r1)
  98:   28 00 e1 fa     std     r23,40(r1)
  9c:   60 00 c1 fb     std     r30,96(r1)
  a0:   68 00 e1 fb     std     r31,104(r1)
  a4:   00 00 00 60     nop
  a8:   00 00 00 60     nop
  ac:   00 00 42 60     ori     r2,r2,0
  b0:   a6 03 49 7f     mtctr   r26
  b4:   78 c3 0c 7f     mr      r12,r24
  b8:   14 22 b9 7c     add     r5,r25,r4
  bc:   00 00 00 39     li      r8,0
  c0:   09 00 6c e9     ldu     r11,8(r12)
  c4:   2a 20 5d 7d     ldx     r10,r29,r4
  c8:   09 00 25 e9     ldu     r9,8(r5)
  cc:   33 52 cb 13     maddld  r30,r11,r10,r8
  d0:   31 52 eb 13     maddhdu r31,r11,r10,r8
  d4:   38 30 d6 7f     and     r22,r30,r6
  d8:   38 38 f7 7f     and     r23,r31,r7
  dc:   78 fb e8 7f     mr      r8,r31
  e0:   14 48 56 7d     addc    r10,r22,r9
  e4:   14 01 77 7d     adde    r11,r23,r0
  e8:   14 18 4a 7d     addc    r10,r10,r3
  ec:   14 52 29 7d     add     r9,r9,r10
  f0:   94 01 6b 7c     addze   r3,r11
  f4:   00 00 25 f9     std     r9,0(r5)
  f8:   c8 ff 00 42     bdnz    c0 <Long_multiplication+0xc0>
  fc:   01 00 9c 3b     addi    r28,r28,1
 100:   08 00 84 38     addi    r4,r4,8
 104:   40 e0 3b 7c     cmpld   r27,r28
 108:   a8 ff 82 40     bne     b0 <Long_multiplication+0xb0>

In these two nested loops, r6 is not changed, so it is always -1.

  d4:   38 30 d6 7f     and     r22,r30,r6

just assigns r30 to r22, so r30 could have been used instead of
r22.

Similarly,

  d8:   38 38 f7 7f     and     r23,r31,r7

just sets r23 to zero because r7 is always zero.

Reply via email to