On Tue, Mar 06, 2007 at 10:29:41 -0800, Stephen Hemminger wrote:
> Don't count the existing Newton-Raphson out. It turns out that to get enough
> precision for 32 bits, only 4 iterations are needed. By unrolling those, it
> gets much better timing.
>
> Slightly gross test program (with original cubic wraparound bug fixed).
...
>         {~0, 2097151},
               ^^^^^^^
this should be 2642245.
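For anyone who wants to double-check the expected value, here is a minimal sketch of a reference integer cube root. It uses a plain binary search, not any of the functions from the test program, and the name icbrt_ref is made up for this illustration. It assumes ~0 in the table means the all-ones 64-bit value.

```c
#include <stdint.h>

/* Reference integer cube root: largest r with r*r*r <= x.
 * Hypothetical helper for checking test-table entries only;
 * binary search, not the Newton-Raphson code under discussion. */
uint32_t icbrt_ref(uint64_t x)
{
	uint64_t lo = 0;		/* invariant: lo^3 <= x */
	uint64_t hi = 1ULL << 22;	/* invariant: hi^3 > x, (2^22)^3 = 2^66 */

	while (hi - lo > 1) {
		uint64_t mid = lo + (hi - lo) / 2;

		/* mid >= 1 here and mid*mid < 2^44, so no overflow.
		 * mid^3 <= x  <=>  mid <= x / (mid*mid) (floor division),
		 * which avoids computing the cube itself. */
		if (mid <= x / (mid * mid))
			lo = mid;
		else
			hi = mid;
	}
	return (uint32_t)lo;
}
```

Calling icbrt_ref(~0ULL) returns 2642245, since 2642245^3 is still below 2^64 while 2642246^3 is not. 2097151 is 2^21 - 1, the cube root of a value just under 2^63.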

Without a serializing instruction before rdtsc, and with only one loop,
I do not get very accurate results (104 for ncubic, > 1000 for the others).

#define rdtscll_serialize(val) \
	__asm__ __volatile__("movl $0, %%eax\n\tcpuid\n\trdtsc\n" \
			     : "=A" (val) : : "ebx", "ecx")

Here are Pentium D timings for 1000 loops.

With {~0, 2097151}:

Function     clocks  mean(us)  max(us)  std(us)  total error
ocubic          912     0.306   20.317    0.730       545101
ncubic          777     0.261   14.799    0.486       576263
acbrt          1168     0.392   21.681    0.547       547562
hcbrt           827     0.278   15.244    0.387         2410

With {~0, 2642245}:

Function     clocks  mean(us)  max(us)  std(us)  total error
ocubic          908     0.305   20.210    0.656            7
ncubic          775     0.260   14.792    0.550        31169
acbrt          1176     0.395   22.017    0.970         2468
hcbrt           826     0.278   15.326    0.670       547504

I also found a bug in gcc-4.1.2: it gave 0 for the ncubic results in the
1000-loop test. gcc-4.0.3 works.