Hello,

Look for the Intel Optimization Manual on intel.com.  The appendixes
have latency and throughput information for the instruction set on
various Intel processors.

Uh-oh, that's hard. I tried to find the information, but I only found part of what I was looking for.

First, I used -masm=intel to get Intel syntax, which gave me:

- for the no-typecast-variant (imull):

imul    ecx, esi   # imull
movsx   rcx, ecx   # movslq

- for the typecast-variant (imulq):

imul    rcx, rsi   # imulq

From the Intel manual I collected the following information (Appendix C, Table C-16a):

                Latency         Throughput
                0f_3h   0f_2h   0f_3h   0f_2h
imul r32        10      14      1       3
imul imm32      -       14      1       3
imul            -       15-18   -       5
mov             1       0.5     0.5     0.5
movsb/movsw     1       0.5     0.5     0.5


I have 3 problems:
1. I do not know my DisplayFamily_DisplayModel (0f_2h or 0f_3h?).
2. The table does not contain "movsx"
3. Should I compare Latency or Throughput if I want to produce fast code? Or doesn't it matter which value I compare?
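
For problem 1, here is a minimal sketch of how a DisplayFamily_DisplayModel signature could be derived from the EAX value that CPUID leaf 1 returns. It assumes the decoding rule Intel documents for the CPUID instruction; the struct and function names (cpu_signature, decode_signature, check_decode) are made up for illustration, and the two EAX values are just example inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two numbers that form a
 * DisplayFamily_DisplayModel signature such as 0F_3H. */
struct cpu_signature {
    unsigned family;  /* DisplayFamily */
    unsigned model;   /* DisplayModel  */
};

/* Decode the EAX value returned by CPUID leaf 1.
 * Assumed bit layout: model = bits 4-7, family = bits 8-11,
 * extended model = bits 16-19, extended family = bits 20-27. */
struct cpu_signature decode_signature(uint32_t eax)
{
    unsigned base_model  = (eax >> 4)  & 0xF;
    unsigned base_family = (eax >> 8)  & 0xF;
    unsigned ext_model   = (eax >> 16) & 0xF;
    unsigned ext_family  = (eax >> 20) & 0xFF;
    struct cpu_signature sig;

    /* The extended family is only added when the base family is 0xF. */
    sig.family = (base_family == 0xF) ? base_family + ext_family
                                      : base_family;

    /* The extended model only counts for base family 0x6 or 0xF. */
    sig.model = (base_family == 0x6 || base_family == 0xF)
                    ? (ext_model << 4) + base_model
                    : base_model;
    return sig;
}

/* Two example inputs: a family-0xF value and a family-6 value. */
void check_decode(void)
{
    struct cpu_signature a = decode_signature(0x00000F34u);
    struct cpu_signature b = decode_signature(0x000106E5u);

    assert(a.family == 0xF && a.model == 0x3);   /* 0F_3H  */
    assert(b.family == 0x6 && b.model == 0x1E);  /* 06_1EH */
}
```

With GCC, the EAX value can be read through __get_cpuid(1, &eax, &ebx, &ecx, &edx) from <cpuid.h>. On Linux, the "cpu family" and "model" lines in /proc/cpuinfo should show the same two numbers in decimal.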

I assume that movsx has the same latency as movsw (but I am not sure), and I think that "imul" in the table refers to AT&T's "imulq" (Intel's "imul rcx, rsi"), while "imul r32" refers to AT&T's "imull" (Intel's "imul ecx, esi"). Am I right?

Daniel

On 09.05.2012 20:30, Ian Lance Taylor wrote:
Daniel Marschall <daniel-marsch...@viathinksoft.de> writes:

I did understand that the compiler used "signed" multiplication
instead of an unsigned one because char*char needs to be extended.

Maybe I am wrong, but couldn't the compiler "know" that the result
will be at least unsigned because unsigned * unsigned = unsigned ?

Well, but the rules of C say that the unsigned char values are
zero-extended to int, and then they are multiplied using a signed
multiplication.  So the result is not unsigned.  The compiler really
would have to do some sort of type or value based reasoning here to
determine that an unsigned multiplication would work also.
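
The promotion rule described above can be checked with a small sketch. The original source code is not shown in this message, so the two functions below are hypothetical stand-ins for the no-typecast and typecast variants, and the 64-bit result type (unsigned long on x86_64 Linux) is an assumption. IS_INT is a made-up helper macro that relies on C11 _Generic.

```c
#include <assert.h>

/* Hypothetical helper: 1 if the expression has type int, else 0. */
#define IS_INT(expr) _Generic((expr), int: 1, default: 0)

/* No-typecast variant: both operands are promoted to int, the
 * multiplication is done in int, and only then is the int result
 * converted to the wider return type. */
unsigned long mul_no_cast(unsigned char a, unsigned char b)
{
    return a * b;
}

/* Typecast variant: the cast makes the multiplication itself
 * happen in the wider type. */
unsigned long mul_cast(unsigned char a, unsigned char b)
{
    return (unsigned long)a * b;
}

void check_promotion(void)
{
    unsigned char a = 255, b = 255;

    /* unsigned char * unsigned char has type int, not unsigned. */
    assert(IS_INT(a * b));

    /* With 8-bit chars the largest product, 255 * 255 = 65025,
     * fits in int, so both variants return the same value. */
    assert(mul_no_cast(a, b) == 65025UL);
    assert(mul_cast(a, b) == 65025UL);
}
```

The values agree for every unsigned char input, which is the kind of value-based reasoning the compiler would need in order to pick a different multiplication.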

Mh... good point. I do not know much about assembly, so I just thought
the shorter the code, the better.

Sadly, no.


If imull is faster than imulq, then
the question is whether imull+movslq is still faster than a single
imulq. Do you know where I can find this information for my CPU
(Intel Xeon X3440)? I was searching for a table that shows how many
CPU ticks imull, imulq and movslq need, but I have not found one
yet.

My Linux is 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64
GNU/Linux .

And the CPU is "Intel(R) Xeon(R) CPU X3440  @ 2.53GHz". (I hope the
"amd64" version of Debian is the correct one, or should our admin have
installed the "ia64" variant since it is an Intel CPU?)


Ian
