Hi

Well, I've just wasted my time trying to do a 2-way unrolled mul_2 for the
Core2. Our current code is 3-way unrolled and runs at the optimal 8.0 c/l
(or 4.0 c/mul, if you prefer to think in terms of muls done). I do have an
existing 2-way unroll which, if we assume src != dst, I can get to run at
8.0 c/l. However, when src == dst we have to do the read before the write,
which severely restricts the flexibility of the schedulers, and the code
runs a little slower. By storing the read in an extra register we can
restore that flexibility, and in my tests I got the optimal speed in all
cases. It even retained the speed when I dropped the inner loop into our
existing code, but when I wrote the correct feed-in/wind-down code we lost
all the speed :(

I've seen this happen many times on Intel chips: the speed of the loop
depends on the state of the pipeline when entering it. This is much less
common on AMD chips, as the (integer) pipelines are (nearly) identical.
After mucking about, I found that a spurious add in the feed-in restored
the speed, but because the 2-way unroll is more deeply pipelined than the
3-way it is slower overall (by ~20 cycles), so I can't see it being any
good :( unless I can come up with a less pipelined version. I don't think
I'll bother. Now on to mul_2 for the Nehalem.
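
For anyone following along, here is a plain-C sketch of what a mul_2 loop
computes: {up, n} times the 2-limb multiplier {vp, 2}, with the n+1 low
limbs stored at rp and the top limb returned (this mirrors the usual
GMP/MPIR mpn_mul_2 convention, though `ref_mul_2` itself is just an
illustration, not our actual code). It uses 32-bit limbs so a portable
64-bit type can hold the products, and it reads each source limb into a
temporary before any store — the same read-before-write discipline the
src == dst case forces on the assembly:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* 32-bit limbs so uint64_t holds a full product */
typedef uint64_t dlimb_t;

/* Reference model of mul_2: rp[0..n] <- low n+1 limbs of {up,n} * {vp,2},
   returning the most significant limb.  Safe when rp == up (src == dst)
   because each up[i] is read into `u` before rp[i] is written, and later
   iterations only read limbs that have not been overwritten yet. */
static limb_t ref_mul_2(limb_t *rp, const limb_t *up, size_t n,
                        const limb_t *vp)
{
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t c0 = 0, c1 = 0;          /* two-limb carry between iterations */

    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];           /* read before write: src may alias dst */
        dlimb_t p0 = (dlimb_t)u * v0 + c0;
        rp[i] = (limb_t)p0;         /* result limb i */
        dlimb_t p1 = (dlimb_t)u * v1 + c1 + (limb_t)(p0 >> 32);
        c0 = (limb_t)p1;            /* neither sum can overflow 2 limbs */
        c1 = (limb_t)(p1 >> 32);
    }
    rp[n] = c0;
    return c1;
}
```

The 8.0 c/l figure corresponds to the two multiplies this loop does per
source limb; the assembly versions differ only in how the unrolled
iterations are scheduled, not in what they compute.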

Jason

-- 
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-de...@googlegroups.com.
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.