Nice find! I guess this is essentially the problem David Harvey was talking about.
Bill.

On 28/12/2008, jason <ja...@njkfrudils.plus.com> wrote:
>
> On Nov 23, 11:16 pm, ja...@njkfrudils.plus.com wrote:
>> On Sunday 23 November 2008 22:49:21 Jason Martin wrote:
>>
>>>> You assume OOO works perfectly.
>>>>
>>>>      mov $0,%r11
>>>>      mul %rcx
>>>>      add %rax,%r10
>>>>      mov 24(%rsi,%rbx,8),%rax
>>>>      adc %rdx,%r11
>>>>      mov %r10,16(%rdi,%rbx,8)
>>>>      mul %rcx
>>>> here mov $0,%r8
>>>>      add %rax,%r11
>>>>      mov 32(%rsi,%rbx,8),%rax
>>>>      adc %rdx,%r8
>>>>      mov %r11,24(%rdi,%rbx,8)
>>>>
>>>> Moving the line at "here" up one, before the mul, slows things down
>>>> from 2.78 to 3.03 c/l, whereas if OOO were perfect it should have no
>>>> effect. This may be due to a CPU scheduler bug, or perhaps the
>>>> scheduler is just not perfect: mul is long latency, two macro-ops,
>>>> two pipes, only pipe 0_1, etc. If it's a bug, then perhaps K10 is
>>>> better?
>>>
>>> I've seen similar wackiness with the Core 2 out-of-order engine. It's
>>> strange enough that sometimes sticking in a nop actually saves a
>>> cycle!
>>
>> Another oddity...
>>
>> loop:
>>     mov (%rdi),%rcx
>>     adc %rcx,%rcx
>>     mov %rcx,(%rdi)
>>     ...
>>     ... (8-way unrolled lshift by 1)
>>     mov 56(%rdi),%r9
>>     adc %r9,%r9
>>     mov %r9,56(%rdi)
>>     lea 64(%rdi),%rdi
>>     dec %rsi
>>     jnz loop
>>
>> runs at 1.11 c/l, whereas the rshift by 1 (i.e. with rcr instead of
>> adc) does not; you have to bunch them up into 4's to get to 1.11 c/l:
>>
>>     mov (%rdi),%rcx
>>     mov -8(%rdi),%r8
>>     mov -16(%rdi),%r9
>>     mov -24(%rdi),%r10
>>     rcr $1,%rcx
>>     rcr $1,%r8
>>     rcr $1,%r9
>>     rcr $1,%r10
>>     mov %rcx,(%rdi)
>>     mov %r8,-8(%rdi)
>>     mov %r9,-16(%rdi)
>>     mov %r10,-24(%rdi)
>>
>>     mov -32(%rdi),%rcx
>>     mov -40(%rdi),%r8
>>     mov -48(%rdi),%r9
>>     mov -56(%rdi),%r10
>>     rcr $1,%rcx
>>     rcr $1,%r8
>>     rcr $1,%r9
>>     rcr $1,%r10
>>     mov %rcx,-32(%rdi)
>>     mov %r8,-40(%rdi)
>>     mov %r9,-48(%rdi)
>>     mov %r10,-56(%rdi)
>>
>>     lea -64(%rdi),%rdi
>>     dec %rsi
>>     jnz loop
>>
>> Again, it looks like the OOO is broken. But if you look at the
>> gmp-4.2.4 mpn_mul_1, which runs at 3 c/l, the OOO has to get work from
>> three separate iterations to fill out the slots.
>>
>> While I'm at it, I have some more complaints :)
>>
>> Timing mpn_add/sub_n with the GMP speed program, the results stay
>> fairly consistent: you may get, say, 24.5 cycles in one run and 24.6
>> in another. OK, occasionally you get 200 cycles, but I assume that's
>> an interrupt or some such thing. But for my mpn_com_n, which is
>> mind-numbingly simple (mov, not, mov), sometimes I get 20 cycles, 40
>> cycles, 30 cycles... What's going on there, I don't know.
>>
>> Confused.
>
> I think I understand what's going on (at least a bit more!). My above
> mpn_com_n has a cache bank conflict.
>
> My old one was
>
>     load not store
>     load not store
>     load not store
>     load not store
>     etc
>
> and it ran at 1.3 c/l; however, for some alignments of src/dst it
> would run at 2.0 c/l. This also appears to make the timings with the
> speed program vary a lot. (Note this suggests that if your timings
> vary a lot, perhaps you're having these kinds of problems.)
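As an aside on the two shift loops quoted above: they both implement a multi-limb shift by one bit via the carry flag. adc %rcx,%rcx computes (rcx << 1) | CF and carries out the old top bit, and rcr is the mirror image in the other direction; the dec/jnz loop control works precisely because dec does not modify CF. Here is a portable C sketch of the same carry chains (hypothetical helper names, not MPIR's real mpn_lshift/mpn_rshift, which take a general shift count):

```c
#include <stdint.h>
#include <stddef.h>

/* Left shift by 1, low limb first (what the adc loop does);
   returns the bit shifted out of the top limb. */
uint64_t lshift1(uint64_t *p, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t next_carry = p[i] >> 63;  /* the bit adc carries out */
        p[i] = (p[i] << 1) | carry;        /* adc %reg,%reg */
        carry = next_carry;
    }
    return carry;
}

/* Right shift by 1, high limb first (what the rcr loop does);
   returns the bit shifted out of the bottom limb. */
uint64_t rshift1(uint64_t *p, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = n; i-- > 0; ) {
        uint64_t next_carry = p[i] & 1;      /* the bit rcr carries out */
        p[i] = (p[i] >> 1) | (carry << 63);  /* rcr $1,%reg */
        carry = next_carry;
    }
    return carry;
}
```

Note the inter-limb dependence through `carry`: in the asm that dependence runs through CF, which is what makes rcr hard for the OOO engine to overlap unless the loads/stores are bunched as above.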
> A new one is
>
>     load load not not store store
>     load load not not store store
>     etc
>
> This runs at 1.3 c/l for all alignments, and the timings don't vary
> (just a little bit of jitter).
>
> So far I have checked some of my existing asm functions and found no
> problems: add/sub/addmul/submul/mul/lshift/rshift/addlsh1/sublsh1/com
> are done. mul_basecase is still to do, out of the ones MPIR has.

--
You received this message because you are subscribed to the Google
Groups "mpir-devel" group. To post to this group, send email to
mpir-devel@googlegroups.com. To unsubscribe from this group, send email
to mpir-devel+unsubscr...@googlegroups.com. For more options, visit
this group at http://groups.google.com/group/mpir-devel?hl=en
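For reference, a minimal C sketch of the reordered mpn_com_n schedule described above (hypothetical name; the actual fix is in the hand-written asm, and a C compiler is of course free to reschedule this, so it only illustrates the intended instruction order):

```c
#include <stdint.h>
#include <stddef.h>

/* Two-way grouped limb complement: pair up the loads, then the nots,
   then the stores, instead of alternating load/not/store every limb.
   Per the post above, the alternating schedule hit a cache bank
   conflict at certain src/dst alignments (1.3 -> 2.0 c/l). */
void com_n(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t a = src[i];      /* load  */
        uint64_t b = src[i + 1];  /* load  */
        a = ~a;                   /* not   */
        b = ~b;                   /* not   */
        dst[i] = a;               /* store */
        dst[i + 1] = b;           /* store */
    }
    if (i < n)                    /* odd tail limb */
        dst[i] = ~src[i];
}
```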