I've been trying to account for each and every cycle in the asm routines, and I
came across this oddity.

Let mpn_fn be unrolled X-way, then run

./speed -cD -s X-X*20 -t X mpn_fn

and you see the time differences per loop as you'd expect, except at size
9*X, where there's an extra 12-14 cycles (a branch misprediction).
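
To see why 9*X is the interesting size: the loop branch in an X-way unrolled routine executes n/X times. Here's a hypothetical C model of a 4-way unrolled mpn_add_n (the real routines are asm, and this function name is made up for illustration) — at n = 36 the loop branch runs exactly 9 times:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long mp_limb_t;

/* Hypothetical C model of a 4-way unrolled mpn_add_n (the real
   routine is asm).  The loop branch is taken n/4 times, so
   n = 36 limbs means 9 trips -- the size where the extra
   12-14 cycles show up. */
static mp_limb_t add_n_4way(mp_limb_t *rp, const mp_limb_t *up,
                            const mp_limb_t *vp, size_t n)
{
    mp_limb_t cy = 0;
    size_t i, j;
    assert(n % 4 == 0);            /* -s/-t keep n a multiple of 4 */
    for (i = 0; i < n; i += 4) {   /* this is the branch that mispredicts */
        for (j = 0; j < 4; j++) {  /* stands in for 4 unrolled copies of the body */
            mp_limb_t t  = up[i + j] + vp[i + j];
            mp_limb_t c1 = t < up[i + j];       /* carry out of u + v */
            mp_limb_t s  = t + cy;
            rp[i + j] = s;
            cy = c1 | (s < t);                  /* carry out of (u + v) + cy */
        }
    }
    return cy;
}
```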

For example, my mpn_add_n, which is 4-way unrolled:

~/gmpextra-1.0.2/tune# ./speed  -cD -s 4-80  -t 4 mpn_add_n
overhead 6.04 cycles, precision 10000 units of 5.53e-10 secs, CPU freq 1808.23 
MHz
            mpn_add_n
4             (16.11)
8                5.04
12               5.06
16               6.05
20               6.01
24               6.04
28               6.12
32               6.06
36              19.97
40               9.18
44               5.98
48               6.08
52               6.07
56               6.01
60               6.04
64               6.04
68               6.05
72               6.04
76               6.05
80               6.47

This is quite significant, as it amounts to a 21% slowdown at 36 limbs, and
even at 100 limbs we're still losing 10% speed. It seems to be a general
problem, as it affects every mpn function I've tested. The problem appears to
be that when a loop has a trip count >= 9, the cpu always takes one branch
misprediction. It's possible that this is a problem only with my exact
cpu (stepping), below:

/gmp-4.2.4/tune# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 44
model name      : AMD Sempron(tm) Processor 3000+
stepping        : 2
cpu MHz         : 1808.227
cache size      : 128 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 
3dnowext 3dnow up rep_good pni lahf_lm
bogomips        : 3620.38
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

The only fix I can think of is to unroll the function further, pushing the
problem out to larger sizes, where the percentage slowdown is smaller.


