On Tuesday 05 July 2011 12:23:34 jason wrote:
> On Jul 4, 8:48 pm, Jason <ja...@njkfrudils.plus.com> wrote:
> > On Monday 04 July 2011 20:21:46 Cactus wrote:
> > > Looks good!
> > > 
> > > I notice that there is new k8 assembler.  Is this going to be repeated
> > > for the Intel architectures?
> > 
> > Yep, the current code is already an improvement on Intel, but I can do
> > better, e.g. the multiple carry handling.
> > 
> > > Is it stable enough to do conversion for
> > > Windows?
> > 
> > I'd leave it for a week or so. I'll simplify the feed-in code and the
> > wind-down (as rdx is fixed), and merge the small loops. We did lose
> > some speed in the inner loop, so perhaps a longer/more sophisticated
> > search may find something better.
> > One interesting "feature": the K10 version uses popcount :)
> > 
> > >     Brian
> 
> The AMD versions are ready for conversion; note that they are all
> very similar.
> 
> Jason

After running on various CPUs we get these results:

cycles per word

          just   with    kara   ld/st   latency  op
cpu       add    addadd  add    bound   bound    bound
K8/k10    4.5    3.7     2.35   2.0     2.25     2.041666
K102      4.5    3.5     2.3    2.0     2.25     2.041666
core2     6.0    5.5     5.0    3.0     4.125    2.583333
penryn    6.0    5.8     4.9    3.0     4.125    2.583333
nehalem   6.0    5.5     4.7    3.0     4.125    2.583333
westmere  6.0                   3.0     4.125    2.583333
sandybr   4.8    ---     4.0            2.625    2.583333
bobcat    7.5    6.2     3.75           2.25     3.0625
atom      12.3   -----   8.0            3.75
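
One way to read the table is to compare the measured kara add figure with the largest of the three bounds. The sketch below does that for the rows that list all three bounds. It assumes the columns are, in order, just add, with addadd, kara add, ld/st bound, latency bound and op bound, and that the best achievable speed is limited by the largest bound; the names `rows` and `headroom` are made up for this illustration.

```python
# (ld/st bound, latency bound, op bound, measured kara add), in cycles per word.
# Values copied from the table rows that list all three bounds.
rows = {
    "K8/k10":  (2.0, 2.25, 2.041666, 2.35),
    "K102":    (2.0, 2.25, 2.041666, 2.3),
    "core2":   (3.0, 4.125, 2.583333, 5.0),
    "penryn":  (3.0, 4.125, 2.583333, 4.9),
    "nehalem": (3.0, 4.125, 2.583333, 4.7),
}

def headroom(ldst, latency, op, measured):
    """Return (largest bound, measured minus largest bound)."""
    floor = max(ldst, latency, op)  # assumption: the slowest resource limits the loop
    return floor, measured - floor

if __name__ == "__main__":
    for cpu, row in rows.items():
        floor, gap = headroom(*row)
        print(f"{cpu:8s} floor {floor:.3f}  measured {row[3]:.2f}  gap {gap:.3f}")
```

Under these assumptions the latency bound is the largest bound on every listed row, and the gap between bound and measurement is much wider on the Intel rows than on K8/k10.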

Fairly sure core2/nehalem can be improved. The latency bound comes from a 
false dependence; we could improve this by storing one of the carry flags on 
the stack (setc, bt). This very slightly increases the ld/st bound and reduces 
the latency bound, e.g.
on K8/k10/K102 ld/st is 2.125, latency is 1.5, op is 2.041666
on core2..westmere ld/st is 3.125, latency is 2.75, op is 2.583333
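
To make the carry-parking idea concrete, here is a small Python model of a three-operand limb addition with two carry chains. It is a sketch of the data flow only, not the actual assembler and not a timing model: the function names, the 64-bit limb size, the block length and the choice of r = a + b + c as the operation are all assumptions for illustration. The variable `flag` stands in for the single hardware carry flag, and `saved1`/`saved2` stand in for bytes written with setc and reloaded with bt.

```python
MASK = (1 << 64) - 1  # assumed 64-bit limbs, least significant limb first

def adc(x, y, carry):
    """Model of add-with-carry: returns (low 64 bits, carry out)."""
    s = x + y + carry
    return s & MASK, s >> 64

def limbs_to_int(limbs):
    """Helper for checking results: little-endian limbs to a Python int."""
    return sum(v << (64 * i) for i, v in enumerate(limbs))

def addadd(a, b, c, block=4):
    """r = a + b + c over equal-length limb lists, using two carry chains."""
    n = len(a)
    r = [0] * n
    saved1 = 0  # chain 1 carry, parked between blocks
    saved2 = 0  # chain 2 carry, parked between blocks
    for start in range(0, n, block):
        end = min(start + block, n)
        flag = saved1                    # reload chain 1 carry (bt)
        for i in range(start, end):      # chain 1: r = a + b
            r[i], flag = adc(a[i], b[i], flag)
        saved1 = flag                    # park chain 1 carry (setc)
        flag = saved2                    # reload chain 2 carry (bt)
        for i in range(start, end):      # chain 2: r = r + c
            r[i], flag = adc(r[i], c[i], flag)
        saved2 = flag                    # park chain 2 carry (setc)
    return r, saved1 + saved2            # total carry out is 0, 1 or 2

if __name__ == "__main__":
    a, b, c = [MASK, 5], [1, 7], [MASK, MASK]
    r, carry = addadd(a, b, c)
    total = limbs_to_int(a) + limbs_to_int(b) + limbs_to_int(c)
    assert limbs_to_int(r) + (carry << 128) == total
```

Each parked carry costs an extra store and load per block, which matches the direction of the trade described above: slightly more ld/st work in exchange for the two chains no longer waiting on the same flag.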

I'll post some results on the mul speedups later.

Jason

