Hi,

Well, I've just wasted my time trying to do a 2-way unrolled mul_2 for the Core 2. Our current code is 3-way unrolled and runs at the optimal 8.0 c/l (or 4.0 c/l if you prefer to think in terms of muls done). I do have an existing 2-way unroll which, if we assume src != dst, I can get to run at 8.0 c/l. However, when src == dst we have to do the read before the write, which severely restricts the flexibility of the schedulers, and the code runs a little slower. By storing the read in an extra register we can restore that flexibility, and in my tests I got the optimal speed in all cases; it even retained the speed when I dropped the inner loop into our existing code. But when I wrote the correct feed-in/wind-down code we lost all the speed :(

I've seen this happen many times on Intel chips: the speed of the loop depends on the state of the pipeline when entering it. It's much less common on AMD chips, as their (integer) pipelines are (nearly) identical. After mucking about, I found that a spurious add in the feed-in restored the speed, but because the 2-way unroll is more pipelined than the 3-way it comes out slower overall (by a couple of cycles, ~20), so I can't see it being any good :( unless I can come up with a less pipelined version. I don't think I'll bother.

Now onto mul_2 for the Nehalem.
Jason

--
You received this message because you are subscribed to the Google Groups "mpir-devel" group.