[mpir-devel] Re: Code freeze on K8/K10 assembler code

Bill Hart Fri, 13 Mar 2009 18:46:25 -0700

Oh that's nonsense, there's no such thing.

Bill.


2009/3/14 Bill Hart <goodwillh...@googlemail.com>:
> Presumably those are dependent muls. What about independent?
>
> e.g.
>
> mul rcx
> xchg rbx, rax
> mul rcx
> xchg
>
> etc.
>
> Bill.
>
> 2009/3/14 jason <ja...@njkfrudils.plus.com>:
>>
>> Timing 1000 copies of mul rcx on an i7 I get 1 mul per 3 cycles, so
>> this is the latency to get rax, the lower product.
>> Doing the same with mul rdx I get 1 mul per 8 or 9 cycles. I don't
>> know if I'm doing this right, as it seems a lot slower than the low part
>> of the product. Doing the same for
>> mov $1,%rax
>> mul %rcx
>> I get 1.7 to 2 cycles per mul.
>>
>> Even if it does do 1c/l, it may be impossible to use, as other
>> bottlenecks will cut in first.
>>
>> add and adc seem to be 1 and 2 cycles respectively.
>>
>>
>> On Mar 6, 3:47 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>> Given the support MPIR already has with more on the way, I won't be
>>> surprised if hardware companies soon start giving us access to, or
>>> just outright giving us recent hardware to help us support them.
>>>
>>> 1 mul per cycle is huge, if that is correct. If that is per core, then
>>> it is outstanding. Some of the stats I have seen published for that
>>> chip, specifically for the sort of maths we do, are off the chart.
>>> It's the chip I am most enthusiastic about presently.
>>>
>>> Bill.
>>>
>>> 2009/3/6 Jason Martin <jason.worth.mar...@gmail.com>:
>>>
>>> > On Thu, Mar 5, 2009 at 7:38 PM,  <ja...@njkfrudils.plus.com> wrote:
>>>
>>> >> On Thursday 05 March 2009 22:54:34 Jason Martin wrote:
>>> >>> > I've got a 4.4c/l. Agner Fog says the throughput on 64-bit mul is
>>> >>> > 4c/l, but in another section it says you can issue one every cycle.
>>> >>> > If the latter section means 32-bit, then 4c/l could be right (I say
>>> >>> > could, not would; consider the K8). I expect I can do a bit better
>>> >>> > than 4.4c/l, but who knows. A lot of the speed comes from reducing
>>> >>> > the overhead in mul_basecase.
>>>
>>> >>> > It's quite a wide CPU; I would expect the combined functions (like
>>> >>> > addlsh1) to make a real difference on it.
>>>
>>> >>> I've measured it directly to be 4 cycles per mul for sustained 64-bit
>>> >>> multiplies on core2, which agrees with Agner Fog.  Even though there
>>> >>> are 3 ALU issue ports on the core2, only one of them can issue the
>>> >>> 64-bit mul.
>>>
>>> >>> By the way, I ended up sitting next to an Intel hardware engineer on a
>>> >>> recent trans-atlantic flight.  He assured me that the Core i7 can
>>> >>> perform sustained 64-bit multiplies at 1 cycle each.  Of course, I'll
>>> >>> need to try this myself to believe it, but I'm pretty excited to get
>>> >>> my hands on the hardware.
>>>
>>> >>> --jason
>>>
>>> >> 64bit x 64bit --> 128 bit  ???
>>>
>>> > That's what the man said (and you know who you are, so feel free to
>>> > chime in on this discussion if you're reading it!).  Of course, that's
>>> > a pretty significant improvement over core 2, so I'm going to need
>>> > actual hardware to verify it.
>>>
>>> > --jason

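For anyone reading the archive, here is a rough C sketch of the three instruction sequences jason timed above, with the data dependencies spelled out. It only models which value feeds the next multiply. It does not measure cycles, and the compiler is free to emit something other than a plain mul. The names product128, mul_wide, chain_low, chain_high, reload_each_time and self_check are made up for this illustration, and unsigned __int128 is a GCC/Clang extension, so this assumes one of those compilers on a 64-bit target.

```c
#include <assert.h>
#include <stdint.h>

/* Full 64x64 -> 128-bit product, split the way x86-64 `mul` leaves it:
   high half in rdx, low half in rax. (Illustrative type, not from MPIR.) */
typedef struct { uint64_t hi, lo; } product128;

product128 mul_wide(uint64_t a, uint64_t b)
{
    /* unsigned __int128 is a GCC/Clang extension (assumption). */
    unsigned __int128 p = (unsigned __int128)a * b;
    product128 r = { (uint64_t)(p >> 64), (uint64_t)p };
    return r;
}

/* Models repeated `mul rcx`: the low half (rax) of each product is the
   implicit operand of the next multiply, so the chain waits on the low half. */
uint64_t chain_low(uint64_t rax, uint64_t rcx, unsigned n)
{
    while (n--)
        rax = mul_wide(rax, rcx).lo;
    return rax;
}

/* Models repeated `mul rdx`: the next multiply reads both the new rax and
   the new rdx, so the chain waits on the high half as well. */
product128 chain_high(uint64_t rax, uint64_t rdx, unsigned n)
{
    product128 p = { rdx, rax };
    while (n--)
        p = mul_wide(p.lo, p.hi);
    return p;
}

/* Models repeated `mov $1,%rax ; mul %rcx`: rax is reloaded before every
   multiply, so no multiply depends on the result of the previous one. */
product128 reload_each_time(uint64_t rcx, unsigned n)
{
    product128 p = { 0, 0 };
    while (n--)
        p = mul_wide(1, rcx);
    return p;
}

/* Small sanity checks on the arithmetic (not a timing harness). */
void self_check(void)
{
    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1, so hi = 2^64 - 2 and lo = 1. */
    product128 p = mul_wide(UINT64_MAX, UINT64_MAX);
    assert(p.hi == UINT64_MAX - 1 && p.lo == 1);
    /* 3 * 5^4 = 1875 */
    assert(chain_low(3, 5, 4) == 1875);
}
```

The first two functions are latency chains and the third is the closest thing here to a throughput test, which matches the ordering of the figures quoted in the thread. The numbers themselves came from straight-line assembly, not from anything like this.
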
