On Wed, 9 May 2012, Daniel Marschall wrote:
> I could successfully benchmark my code. I found out that the
> no-typecast version (imull+movslq) needed 47 secs for 12 working packages,
> while the typecast version (imulq) needed only 38 secs for 12 working
> packages. That is incredible!
> Maybe you should still consider preferring imulq instead of imull+movslq?
> I wonder if GCC has an optimization which optimizes the machine code itself,
> without knowledge of the underlying C code, e.g. it could eliminate
> unnecessary mov instructions if a register is not used, or use operations
> which have lower latency. I think such an "assembler-only" optimization
> could still gain additional performance, since the rules of the underlying
> programming language (e.g. the promotion to signed int) can be ignored if the
> end result is the same. But I fear that this is rather a hard task and maybe
> not possible.
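The effect described above can be reproduced with a small C sketch (function names here are illustrative, not from the original code; the exact instructions emitted depend on the GCC version and flags, but on x86-64 at -O2 the two variants typically differ as commented):

```c
#include <stdint.h>

/* No typecast: operands are promoted to int, the multiplication is
   done in 32 bits (imull), and the int result is then sign-extended
   to 64 bits (movslq) to match the return type. */
int64_t mul_narrow(uint8_t a, uint8_t b)
{
    return a * b;
}

/* With a typecast: one operand is widened to 64 bits first, so the
   multiplication itself is done in 64 bits (a single imulq). */
int64_t mul_wide(uint8_t a, uint8_t b)
{
    return (int64_t)a * b;
}
```

For uint8_t inputs both functions return the same value (255 * 255 = 65025 fits comfortably in a signed 32-bit int), which is why replacing one sequence with the other is a pure optimization.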
A lot of optimizations in GCC completely ignore the original source code. At
the RTL level, you could try matching:
(set (reg:SI 1) (zero_extend:SI (match_operand:QI 4)))
(set (reg:SI 2) (zero_extend:SI (match_operand:QI 3)))
(set (reg:SI 5) (mult:SI (match_dup 1) (match_dup 2)))
(set (reg:DI 6) (sign_extend:DI (match_dup 5)))
and replacing it with your version that zero-extends to DI and does the
multiplication there.
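Roughly, the replacement would look like this (the pseudo-register numbers are illustrative, and a real pattern would of course need predicates on the operands; this is a sketch, not a tested machine description):

(set (reg:DI 7) (zero_extend:DI (match_operand:QI 4)))
(set (reg:DI 8) (zero_extend:DI (match_operand:QI 3)))
(set (reg:DI 6) (mult:DI (match_dup 7) (match_dup 8)))

which would let the backend emit a single imulq instead of imull + movslq.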
--
Marc Glisse