On Dec 4, 3:20 am, Jason <ja...@njkfrudils.plus.com> wrote:
> On Saturday 04 December 2010 02:01:48 Jason wrote:
>
> > On Saturday 04 December 2010 01:40:13 Bill Hart wrote:
> > > On 4 December 2010 00:52, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > Hi
>
> > > > Here's the first lot of new assembler code for the x64 (in trunk)
>
> > > > popcount/hamdist are not terribly useful for MPIR , but they do offer a
> > > > simple way to practice stuff.
>
> > > > K8 popcount was 5.5c/l with 2way unroll now 4.66c/l with 3way
> > > > K8 hamdist was 5.5c/l with 2way unroll now 5.0c/l with 3way
>
> > > > The above was just practice for the core2 version which uses SSE , if
> > > > I'm going to try to use SSE for anything other than trivial
> > > > copies/logic then I need the practice.
>
> > > > core2/penryn popcount was 6.5c/l with 4way unroll now 2.75c/l with 4way
>
> > > > The hamdist shows similar improvements , just have to write the
> > > > horrible SSE alignment stuff , yuck..
>
> > > > K10 popcount was 1.5c/l with 4way unroll now 1.0c/l with 2way
> > > > K10 hamdist was 1.9c/l with 4way unroll now 1.5c/l with 4way
>
> > > Wow, sounds like a lot of great work Jason.
>
> > > > The above are "optimal" , although for very large unrolls 28way(10way
> > > > is probably the minimum) we could get down to 0.87c/l for popcount
> > > > because we do have a spare ALU slot.
> > > > The above is more interesting than that as it's very similar to the
> > > > limits of addmul
>
> > > Not sure what you mean. Do you mean that the point at which it drops
> > > to the lower time is the same as for addmul?
>
> > No , it's just the way the code is arranged.
>
> > > > Should be able to get the nehalem to run at the same speed as the K10
> > > > but so far a conflict of scheduling with the jcc inst is preventing
> > > > this. Best so far(and current code in trunk) is 1.25 and 1.9 c/l
>
> > > > I'll see if I can come up with the Windows version tomorrow.
>
> > > Sounds good. I'm not sure where these get used, but if it gives you
> > > practice for other things then it's pretty valuable.
>
> > Doing the nehalem for the same reason I did the K10.
>
> > > Bill.
>
> I got the nehalem popcount at 1.0c/l , it's quite interesting
>
> The nehalem, like the core2, has a loopback buffer (like a level 0 cache; see
> Agner Fog's manuals) which is used for small loops. However, on the
> nehalem there is a 1 cycle penalty, so it may appear that we can never
> get 1.0c/l, however...
>
> each limb requires 1 load, 1 pop, 1 add
> plus add and jmp for loop control
> so for an N-way unroll we need 3N+2 micro-ops
> as we can do at most 4 micro-ops per cycle, plus the 1 cycle delay,
> one pass through the loop takes ceil((3N+2)/4)+1 cycles
>
> N   3N+2   ceil((3N+2)/4)   ceil((3N+2)/4)+1   c/l
> 1   5      2                3                  3
> 2   8      2                3                  1.5
> 3   11     3                4                  1.333
> 4   14     4                5                  1.25
> 5   17     5                6                  1.20
> 6   20     5                6                  1.0
>
> In most tables like this the c/l just approaches the asymptote, but in
> this case we reach it :) , and it matches what we already have for 2,3,4-way
> unroll.
>
> So now all I have to do is reduce the number of registers I used(always use as
> many as you can to make the scheduling easier)
>
> Jason
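
The micro-op count behind the table above can be sketched in a few lines of Python. This is for illustration only and is not part of the MPIR sources. The function name cycles_per_limb and its parameter names are hypothetical; the default values (3 micro-ops per limb, 2 for loop control, at most 4 micro-ops per cycle, 1 cycle loop penalty) are the figures quoted in the message, and the model assumes the penalty is paid once per pass through the loop.

```python
def cycles_per_limb(n, uops_per_limb=3, loop_uops=2, issue_width=4, loop_penalty=1):
    """Return (micro-ops, cycles per loop pass, cycles per limb) for an n-way unroll.

    Hypothetical helper: it only models the issue-width limit and the
    1 cycle loop penalty, nothing else about the pipeline.
    """
    uops = uops_per_limb * n + loop_uops
    # Integer ceiling of uops / issue_width, then add the loop penalty.
    cycles = -(-uops // issue_width) + loop_penalty
    return uops, cycles, cycles / n


if __name__ == "__main__":
    print(" N  uops  cycles  c/l")
    for n in range(1, 7):
        uops, cycles, cpl = cycles_per_limb(n)
        print(f"{n:2d}  {uops:4d}  {cycles:6d}  {cpl:.3f}")
```

Under these assumptions the 6-way unroll is the first one where the 20 micro-ops fill all five issue cycles exactly, which is why the table reaches 1.0c/l rather than only approaching it.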

Hi Jason,

Is this stuff that it is worthwhile to translate for Windows?

Writing one version of a new routine and then making a small number of
changes to produce N versions for different architectures makes
translation difficult.

If there are only minor changes, it makes much more sense to translate
a single 'master' version and then make the small number of changes
needed after translation.

On the other hand if the changes are extensive it makes more sense to
translate each of the multiple versions.

If we are in the first situation and translating for Windows makes
sense, can you let me have (or designate) a 'master' file for popcount
and hamdist?

I also need to know when a file has become stable as I don't want to
translate anything that is likely to change.

   Brian

-- 
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-de...@googlegroups.com.
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.
