On Wed, 9 May 2012, Daniel Marschall wrote:

I could successfully benchmark my code. I found out that the no-typecast version (imull+movslq) needed 47 seconds for 12 work packages, while the typecast version (imulq) needed only 38 seconds for the same 12 work packages. That is incredible!

Maybe you should still consider preferring imulq over imull+movslq?
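
For reference, the difference between the two versions looks roughly like this (a simplified sketch, not the real benchmark code; the types and function names are only for illustration):

  #include <stdint.h>

  /* No-typecast version: a and b are promoted to int, multiplied as
     32-bit values (imull), and the 32-bit result is then sign-extended
     to 64 bits (movslq). */
  int64_t mul_no_cast(uint8_t a, uint8_t b)
  {
      return a * b;
  }

  /* Typecast version: widening the operands to 64 bits first lets the
     compiler do the whole multiplication with a single imulq. */
  int64_t mul_cast(uint8_t a, uint8_t b)
  {
      return (int64_t)a * (int64_t)b;
  }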

I wonder if GCC has an optimization that works on the machine code itself, without knowledge of the underlying C code; e.g. it could eliminate unnecessary mov instructions when a register is not used, or replace operations with ones that have lower latency. I think such an "assembler-only" optimization could still gain additional performance, since the rules of the underlying programming language (e.g. the promotion to signed int) can be ignored as long as the end result is the same. But I fear that this is rather a hard task and maybe not possible.

A lot of optimizations in GCC completely ignore the original code. At the RTL level, you could try matching:

(set (reg:SI 1) (zero_extend:SI (match_operand:QI 4)))
(set (reg:SI 2) (zero_extend:SI (match_operand:QI 3)))
(set (reg:SI 5) (mult:SI (match_dup 1) (match_dup 2)))
(set (reg:DI 6) (sign_extend:DI (match_dup 5)))

and replacing it with your version that zero-extends to DI and does the multiplication there.

--
Marc Glisse
