https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113560
Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
The costs look sane, and I'd expect the synth_mult generated sequence to be
faster, though it would be good to get some microbenchmarking.

A reduced test case is:

    __int128 foo(__int128 x) { return x*100; }

The x86 backend thinks that a 128-bit (TImode) multiplication would take
14 cycles, so instead generates:

    x2   = x+x      2 cycles
    x3   = x2+x     2 cycles
    x24  = x3<<3    2 cycles
    x25  = x24+x    2 cycles
    x100 = x25<<2   2 cycles

which is a total of 10 cycles, and is predicted to be faster than the generic
implementation (requiring 2 IMULQ, 1 MULQ and 2 ADDQ) for

    __int128 bar(__int128 x, int y) { return x*y; }
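
For anyone wanting to sanity-check the arithmetic of that sequence (not its cycle counts), here is a minimal sketch. It assumes GCC or Clang on a 64-bit target, where __int128 is available as an extension. The names mul100_shift_add and check_mul100 are hypothetical and chosen only for illustration; the code works on unsigned values so the shifts are well defined, and the result is the same modulo 2^128.

```c
#include <assert.h>

/* unsigned __int128 is a GCC/Clang extension on 64-bit targets (assumption). */
typedef unsigned __int128 u128;

/* Hypothetical helper: x*100 via the shift-and-add chain, since
   100 = ((2 + 1) * 8 + 1) * 4. */
u128 mul100_shift_add(u128 x)
{
    u128 x2   = x + x;     /*   2*x */
    u128 x3   = x2 + x;    /*   3*x */
    u128 x24  = x3 << 3;   /*  24*x */
    u128 x25  = x24 + x;   /*  25*x */
    u128 x100 = x25 << 2;  /* 100*x */
    return x100;
}

/* Hypothetical self-check: compare against a plain multiply for a few
   inputs, including ones with bits set in the upper 64-bit half. */
void check_mul100(void)
{
    u128 samples[] = {
        0, 1, 7, 12345,
        (u128)1 << 64,
        ((u128)1 << 100) + 98765,
        ~(u128)0               /* wraps modulo 2^128 in both computations */
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(mul100_shift_add(samples[i]) == samples[i] * 100);
}
```

Calling check_mul100() from a test harness exercises the chain. Whether the five-step sequence actually beats the IMULQ/MULQ form on real hardware is a separate question that still needs the microbenchmarking mentioned above.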