On Tue, Mar 06, 2007 at 10:29:41 -0800, Stephen Hemminger wrote:
> Don't count the existing Newton-Raphson out. It turns out that to get enough
> precision for 32 bits, only 4 iterations are needed. By unrolling those, it
> gets much better timing.
> 
> Slightly gross test program (with original cubic wraparound bug fixed).
...
>       {~0, 2097151},
             ^^^^^^^
this should be 2642245.
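
For anyone who wants to check that value without floating point, here is a
minimal sketch. The helper name cbrt_floor_ref() is hypothetical and is not
part of the test program; it is a plain binary search, not one of the
functions being benchmarked.

```c
#include <stdint.h>

/*
 * Reference floor(cbrt(x)) for 64-bit x, by binary search.
 * Hypothetical helper for checking table entries only.
 */
static uint32_t cbrt_floor_ref(uint64_t x)
{
    uint64_t lo = 0;
    uint64_t hi = (uint64_t)1 << 22;   /* (2^22)^3 = 2^66, above any 64-bit x */

    while (lo < hi) {
        /* Round up so that mid >= 1 and the loop always makes progress. */
        uint64_t mid = lo + (hi - lo + 1) / 2;

        /* mid^3 <= x  <=>  mid <= x / mid^2; mid^2 < 2^44 cannot overflow. */
        if (mid <= x / (mid * mid))
            lo = mid;
        else
            hi = mid - 1;
    }
    return (uint32_t)lo;
}
```

cbrt_floor_ref(~0ULL) gives 2642245. The old table value 2097151 is 2^21 - 1,
which is the result for 2^63 - 1, not for ~0.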

Without a serializing instruction before rdtsc, and with only one loop,
I do not get very accurate results (104 for ncubic, > 1000 for the others).

#define rdtscll_serialize(val) \
  __asm__ __volatile__("movl $0, %%eax\n\tcpuid\n\trdtsc\n" \
                       : "=A" (val) : : "ebx", "ecx")
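
A rough sketch of how such a serialized read can be used around the code
under test. The names cycles_serialized(), min_ticks() and sample_work() are
made up for illustration and are not from the test program. Note that the
"=A" constraint names the edx:eax pair only on 32-bit x86, so this version
uses separate "=a" and "=d" outputs to also build on x86-64. It keeps the
minimum over all loops, which is a different choice from the mean/max/std
figures reported in this mail. On non-x86 targets it falls back to clock(),
which counts clock ticks and not cycles.

```c
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
/* cpuid serializes the pipeline, then rdtsc reads the timestamp counter. */
static inline uint64_t cycles_serialized(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("movl $0, %%eax\n\tcpuid\n\trdtsc"
                         : "=a" (lo), "=d" (hi)
                         :
                         : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Fallback so the sketch builds elsewhere: clock() ticks, not cycles. */
static inline uint64_t cycles_serialized(void)
{
    return (uint64_t)clock();
}
#endif

/* Smallest tick count seen over `loops` timed calls of fn. */
static uint64_t min_ticks(void (*fn)(void), int loops)
{
    uint64_t best = UINT64_MAX;
    int i;

    for (i = 0; i < loops; i++) {
        uint64_t t0 = cycles_serialized();
        fn();
        uint64_t t1 = cycles_serialized();

        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Stand-in workload; the volatile sink keeps the loop from being removed. */
static volatile uint32_t sink;

static void sample_work(void)
{
    uint32_t i;

    for (i = 0; i < 100; i++)
        sink += i * i;
}
```

Calling min_ticks(sample_work, 1000) then returns the best-case cost of one
call, including the overhead of the two serialized reads.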

Here are Pentium D timings for 1000 loops.

With the table entry {~0, 2097151}:

Function     clocks mean(us)  max(us)  std(us) total error
ocubic          912    0.306   20.317    0.730      545101
ncubic          777    0.261   14.799    0.486      576263
acbrt          1168    0.392   21.681    0.547      547562
hcbrt           827    0.278   15.244    0.387        2410

With the table entry {~0, 2642245}:

Function     clocks mean(us)  max(us)  std(us) total error
ocubic          908    0.305   20.210    0.656           7
ncubic          775    0.260   14.792    0.550       31169
acbrt          1176    0.395   22.017    0.970        2468
hcbrt           826    0.278   15.326    0.670      547504

I also found a bug in gcc-4.1.2: it gave 0 for the ncubic results
in the 1000-loop test. gcc-4.0.3 works.
