On Saturday 04 December 2010 10:46:35 Jason wrote:
> On Saturday 04 December 2010 09:35:02 Cactus wrote:
> > On Dec 4, 3:20 am, Jason <ja...@njkfrudils.plus.com> wrote:
> > > On Saturday 04 December 2010 02:01:48 Jason wrote:
> > > > On Saturday 04 December 2010 01:40:13 Bill Hart wrote:
> > > > > On 4 December 2010 00:52, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > > > Hi
> > > > > >
> > > > > > Here's the first lot of new assembler code for the x64 (in trunk)
> > > > > >
> > > > > > popcount/hamdist are not terribly useful for MPIR , but they do
> > > > > > offer a simple way to practice stuff.
> > > > > >
> > > > > > K8 popcount was 5.5c/l with 2way unroll now 4.66c/l with 3way
> > > > > > K8 hamdist was 5.5c/l with 2way unroll now 5.0c/l with 3way
> > > > > >
> > > > > > The above was just practice for the core2 version which uses SSE
> > > > > > , if I'm going to try to use SSE for anything other than trivial
> > > > > > copies/logic then I need the practice.
> > > > > >
> > > > > > core2/penryn popcount was 6.5c/l with 4way unroll now 2.75c/l
> > > > > > with 4way
> > > > > >
> > > > > > The hamdist shows similar improvements , just have to write the
> > > > > > horrible SSE alignment stuff , yuck..
> > > > > >
> > > > > > K10 popcount was 1.5c/l with 4way unroll now 1.0c/l with 2way
> > > > > > K10 hamdist was 1.9c/l with 4way unroll now 1.5c/l with 4way
> > > > >
> > > > > Wow, sounds like a lot of great work Jason.
> > > > >
> > > > > > The above are "optimal" , although for very large unrolls
> > > > > > 28way(10way is probably the minimum) we could get down to 0.87c/l
> > > > > > for popcount because we do have a spare ALU slot.
> > > > > > The above is more interesting than that as it's very similar to
> > > > > > the limits of addmul
> > > > >
> > > > > Not sure what you mean. Do you mean that the point at which it
> > > > > drops to the lower time is the same as for addmul.
> > > >
> > > > No , it's just the way the code is arranged.
> > > >
> > > > > > Should be able to get the nehalem to run at the same speed as the
> > > > > > K10 but so far a conflict of scheduling with the jcc inst is
> > > > > > preventing this. Best so far(and current code in trunk) is 1.25
> > > > > > and 1.9 c/l
> > > > > >
> > > > > > I'll see if I can come up with the Windows version tomorrow.
> > > > >
> > > > > Sounds good. I'm not sure where these get used, but if it gives you
> > > > > practice for other things then it's pretty valuable.
> > > >
> > > > Doing the nehalem for the same reason I did the K10.
> > > >
> > > > > Bill.
> > >
> > > I got the nehalem popcount at 1.0c/l , it's quite interesting
> > >
> > > The nehalem like the core2 has a loopback buffer (like a level 0
> > > cache)(see Agner Fog's manuals) with which to use for small loops ,
> > > however on the nehalem there is a 1 cycle clock penalty , so it may
> > > appear that we can never get 1.0c/l however...
> > >
> > > each limb requires 1load , 1pop 1add
> > > plus add and jmp for loop control
> > > so for an N-way unroll we need 3N+2 micro-ops
> > > as we can do at most 4 micro-ops per cycle plus the 1 cycle delay
> > >
> > > N   3N+2   ceil((3N+2)/4)   ceil((3N+2)/4)+1   c/l
> > > 1    5     2                3                  3
> > > 2    8     2                3                  1.5
> > > 3   11     3                4                  1.333
> > > 4   14     4                5                  1.25
> > > 5   17     5                6                  1.20
> > > 6   20     5                6                  1.0
> > >
> > > in most tables like this the c/l just approaches the asymptote ,
> > > but in this case we reach it :) , and it matches what we already have
> > > for 2,3,4-way unroll.
> > >
> > > So now all I have to do is reduce the number of registers I used(always
> > > use as many as you can to make the scheduling easier)
> > >
> > > Jason
> >
> > Hi Jason,
> >
> > Is this stuff that it is worthwhile to translate for Windows?
> >
> > Writing one version of a new routine and then making a small number of
> > changes to produce N versions for different architectures makes
> > translation difficult.
> >
> > If there are only minor changes, it makes much more sense to translate
> > a single 'master' version and then make the small number of changes
> > needed after translation.
> >
> > On the other hand if the changes are extensive it makes more sense to
> > translate each of the multiple versions.
> >
> > If we are in the first situation and translating for Windows makes
> > sense, can you let me have (or designate) a 'master' file for popcount
> > and hamdist?
> >
> > I also need to know when a file has become stable as I don't want to
> > translate anything that is likely to change.
> >
> > Brian
>
> Yep , I would wait until I have finished , although I was going to do the
> translation myself , it's good practice and these are easy examples.
>
> The new nehalem hamdist is now in trunk , and using the same arguments
> as for popcount we can do it with a 2-way unroll (before was 4-way) to get
> the optimal speed of 2.0c/l (bound by ld/st) . Note: the timings on my
> nehalem are always slightly under when I measure it ie it reads as
> 1.940c/l , it's like there is a negative overhead on all measurements ,
> must be the OOO putting the rdtsc ahead ;)
>
> Using a mixed int/SSE we could beat this ld/st bound of 2.0c/l , I may try
> it as this is an easy example (if it works) . Quite a few other functions
> could benefit from a mixed int/SSE if we can do it efficiently , I have
> tried before but never got anywhere , and with this being such an easy
> function I should be able to figure out where I was going wrong.
>
> The nehalem popcount is interesting , as again it is a simple example of a
> situation I have come across before. When we have a large unroll (given an
> inner loop) , what is the best way to do the feed-in and wind-down.
> I could jump into the middle of the loop but we would need to calculate
> size%6 but could change the unroll to 8-way , or as there are no
> dependencies between limbs just jump into a 6part winddown(like for logic)
>
> Jason
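For anyone wanting to check the unroll arithmetic quoted above, here is a minimal C sketch of the same model: 3N+2 micro-ops per iteration of an N-way unroll, at most 4 micro-ops issued per cycle, plus the 1 cycle loop-buffer penalty. The function names are made up for illustration and are not part of MPIR.

```c
#include <assert.h>

/* Hypothetical helper (not MPIR code): cycles for one pass of an N-way
   unrolled popcount loop under the model in the quoted message. */
static int cycles_per_iteration(int n)
{
    assert(n > 0);
    int uops = 3 * n + 2;        /* load + popcnt + add per limb, plus add and jmp */
    int issue = (uops + 3) / 4;  /* ceil(uops / 4): at most 4 micro-ops per cycle */
    return issue + 1;            /* plus the 1 cycle loop-buffer penalty */
}

/* Cycles per limb (c/l) for an N-way unroll. */
static double cycles_per_limb(int n)
{
    return (double)cycles_per_iteration(n) / n;
}
```

Under these assumptions the 6-way unroll is the first to reach 1.0 c/l: 20 micro-ops issue in 5 cycles, the penalty makes it 6 cycles, and 6 cycles over 6 limbs is exactly 1.0.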
I've done the nehalem popcount (in trunk); it runs at 1.0c/l. In the end I went for a 6-way unroll and just a fall-through to finish the remaining limbs. With the 8-way it was too difficult to space out the inner loop so that the jump-in didn't need a table, and the size difference between them is small (as the fall-through is very small).

I'll start on the Windows conversion nowish.

Jason
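In C terms the shape of the 6-way version is roughly the following. This is only a sketch: popcount_limb and popcount_6way are hypothetical names, the portable bit trick stands in for the popcnt instruction, and the tail is an assumption about how the leftover limbs get finished, not a transcription of the assembly in trunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable per-limb bit count; stands in for the popcnt instruction. */
static unsigned popcount_limb(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Hypothetical name: 6-way unrolled main loop, then a small tail. */
static uint64_t popcount_6way(const uint64_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    uint64_t total = 0;

    /* Main loop: six independent limbs per pass, no dependency between them. */
    while (n >= 6) {
        total += popcount_limb(p[0]);
        total += popcount_limb(p[1]);
        total += popcount_limb(p[2]);
        total += popcount_limb(p[3]);
        total += popcount_limb(p[4]);
        total += popcount_limb(p[5]);
        p += 6;
        n -= 6;
    }

    /* Wind-down: at most 5 limbs remain, handled one at a time
       (assumed structure; no jump table or size%6 dispatch needed). */
    while (n > 0) {
        total += popcount_limb(*p++);
        n--;
    }
    return total;
}
```

The point of this layout is that the loop is never entered in the middle, so nothing has to be computed up front to pick an entry point; the cost is a short tail that runs at most five times.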