https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106678
Bug ID: 106678 Summary: Inefficiency in long integer multiplication Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: tkoenig at gcc dot gnu.org Target Milestone: --- The code from PR 103109 #include <stdint.h> void Long_multiplication( uint64_t multiplicand[], uint64_t multiplier[], uint64_t sum[], uint64_t ilength, uint64_t jlength ) { uint64_t acarry, mcarry, product; for( uint64_t i = 0; i < (ilength + jlength); i++ ) sum[i] = 0; acarry = 0; for( uint64_t j = 0; j < jlength; j++ ) { mcarry = 0; for( uint64_t i = 0; i < ilength; i++ ) { __uint128_t mcarry_prod; __uint128_t acarry_sum; mcarry_prod = ((__uint128_t) multiplicand[i]) * ((__uint128_t) multiplier[j]) + (__uint128_t) mcarry; mcarry = mcarry_prod >> 64; product = mcarry_prod; acarry_sum = ((__uint128_t) sum[i+j]) + ((__uint128_t) acarry) + product; sum[i+j] += acarry_sum; acarry = acarry_sum >> 64; // {mcarry, product} = multiplicand[i]*multiplier[j] // + mcarry; // {acarry,sum[i+j]} = {sum[i+j]+acarry} + product; } } } still shows some inefficiency after r13-2107. Compiling the function with gcc 13.0.0 20220818, with $ gcc -mcpu=power9 -O3 -c loop.c and disassembling the output (for easier reading) gives (looking only at the main part) 7c: 00 00 80 38 li r4,0 80: 00 00 80 3b li r28,0 84: 00 00 60 38 li r3,0 88: 00 00 00 38 li r0,0 8c: ff ff c0 38 li r6,-1 90: 00 00 e0 38 li r7,0 94: 20 00 c1 fa std r22,32(r1) 98: 28 00 e1 fa std r23,40(r1) 9c: 60 00 c1 fb std r30,96(r1) a0: 68 00 e1 fb std r31,104(r1) a4: 00 00 00 60 nop a8: 00 00 00 60 nop ac: 00 00 42 60 ori r2,r2,0 b0: a6 03 49 7f mtctr r26 b4: 78 c3 0c 7f mr r12,r24 b8: 14 22 b9 7c add r5,r25,r4 bc: 00 00 00 39 li r8,0 c0: 09 00 6c e9 ldu r11,8(r12) c4: 2a 20 5d 7d ldx r10,r29,r4 c8: 09 00 25 e9 ldu r9,8(r5) cc: 33 52 cb 13 maddld r30,r11,r10,r8 d0: 31 52 eb 13 maddhdu r31,r11,r10,r8 d4: 38 30 d6 7f and r22,r30,r6 d8: 38 38 f7 7f and r23,r31,r7 dc: 78 fb e8 7f mr r8,r31 e0: 14 48 56 7d addc r10,r22,r9 e4: 14 01 77 7d adde r11,r23,r0 e8: 14 18 4a 7d addc r10,r10,r3 ec: 14 52 29 7d add r9,r9,r10 f0: 94 01 6b 7c addze r3,r11 f4: 00 00 25 f9 std r9,0(r5) f8: c8 ff 00 42 bdnz c0 <Long_multiplication+0xc0> fc: 01 00 9c 3b addi r28,r28,1 100: 08 00 84 38 addi r4,r4,8 104: 40 e0 3b 7c cmpld r27,r28 108: a8 ff 82 40 bne b0 <Long_multiplication+0xb0> In these two nested loops, r6 is not changed, so it is always -1. d4: 38 30 d6 7f and r22,r30,r6 just assigns r30 to r22, so r30 could have been used instead of r22. Similarly, d8: 38 38 f7 7f and r23,r31,r7 just sets r23 to zero because r7 is always zero.