On Wednesday 29 June 2011 20:10:50 jason wrote:
> On Jun 19, 11:24 pm, Jason <ja...@njkfrudils.plus.com> wrote:
> > > 24) New Toom22 code , the new code is smaller if we let the high part
> > > 
> > > >= low part which is the opposite of the current code , so it's
> > > 
> > > probably easier just to rewrite the whole thing.
> > 
> > Hi
> > 
> > Here is a outline of the new toom22_n code , there are obvious O(1)
> > speedups to do , but I'll leave them until I've tested the new assembler
> > code as the linear part O(n) is what has improved . I rewrote all the
> > code as that was the easiest way as there are other slight minor
> > differences(and I do so hate reading other's code). The original code
> > has the differences between high an low parts and this has not changed ,
> > what has changed is the last section where we add/sub the sub-products
> > together to form the desired full product. Originally this consisted of
> > three add's which on the K8 run at 4.5 cycles per word , this was
> > improved with the new mpn_addadd_n function which ran at 3.5 cycles per
> > word and now I have a new mpn_karaadd(ie mpn_addaddadd) function which
> > runs at 2.5 cycles a word. The addadd function gave us 2-7% speedup and
> > I pretty much expect the same again.The lower bound is actually 2.0
> > cycles a word and I think I may be able to get it without too much pipe
> > lining. Similar improvements are possible on the Intel cpu's and many
> > others( the RISC cpu's are probably easier). I've only writen the inner
> > loop of addaddadd function so far but I dont for-see any difficulties ,
> > should be able to finish it it next week.In case you are wondering the
> > new asm code wont be general code like the mpn_addadd_n was but will be
> > specific for toom22 multiplication as it has to cope with operand
> > overlap and the "odd" cases.
> > 
> > Jason
> 
> Hi
> 
> I've saved about 5 instructions from the inner loop which is good and
> I still searching for something faster , so far I have got it down to
> 2.350 cycles per word without any pipelining. There are a number of
> possible cravats , as we are starting to hit the load/store bound for
> the K8 , it could get very sensitve to the relative alignment(mod 8)
> of the pointers ( ie rp , rp+n , tp) , my standard testing will be to
> try over all combinations of rp and tp , but in this case we also need
> to consider rp+n . I think at this stage I'll ignore it and hope for
> the best. The K10 can schedule load/stores better than the K8 , so
> maybe they may be a better inner loop in that case and as long as I
> dont use pipelining , writing both is a trivial exercise.
> Incidentally the new K103 (or K10.5 or lano is nearly out) , this is a
> K102 with a pipelined hardware divider , not terribly useful for us ,
> but I've heard that the multiplier has been improved , dont know in
> what way :( , I doubt it's the floating point multiplier as the chip
> is intergrated with a GPU (like bobcat) , the schedulers are deeper
> which may be useful , but the clock speeds don't look great , but
> perhaps if you turn off the gpu?
> 
> Jason
> 
> 
> Jason

Here's some timings comparing our existing code with the new toom22 code.
                new                     old
8             #187.03        1.0053
9              252.04       #0.9999
10             307.04       #1.0000
11             359.05       #1.0000
12             407.05       #1.0000
13             471.04       #1.0000
14             551.06       #1.0000
15             609.06       #1.0000
16             680.08       #1.0000
17             766.06       #1.0000
18             869.09       #1.0000
19             962.10       #0.9844
20            1046.11       #0.9837
21            1137.10       #1.0000
22            1283.14       #0.9844
23            1362.14       #0.9875
24           #1358.16        1.0346
25           #1510.17        1.0284
26           #1585.15        1.0183
27           #1726.20        1.0301
28           #1821.23        1.0105
29           #1934.28        1.0201
30           #1994.21        1.0241
31           #2161.31        1.0166
32           #2240.22        1.0014
33           #2408.29        1.0166
34           #2508.22        1.0124
35           #2696.32        1.0330
36           #2804.38        1.0178
37           #2992.43        1.0083
38           #3078.35        1.0052
39           #3244.34        1.0108
40           #3267.44        1.0263
41           #3515.42        1.0244
42           #3635.32        1.0265
43           #3861.45        1.0355
44           #4021.43        1.0130
45           #4206.61        1.0118
46           #4315.47        1.0105
47           #4339.58        1.0712
48           #4507.58        1.0477
49           #4783.50        1.0316
50           #4940.34        1.0163
51           #5063.67        1.0379
52           #5177.43        1.0128
53           #5398.64        1.0530
54           #5484.74        1.0471
55           #5662.66        1.0729
56           #5760.76        1.0696
57           #6075.70        1.0362
58           #6246.49        1.0300
59           #6347.90        1.0430
60           #6401.80        1.0302
61           #6784.80        1.0348
62           #6892.88        1.0315
63           #6992.82        1.0555
64           #7110.83        1.0619
65           #7562.79        1.0303
66           #7616.81        1.0375
67           #7873.86        1.0321
68           #7988.04        1.0274
69           #8379.88        1.0378
70           #8452.11        1.0497
71           #8702.11        1.0656
72           #8812.26        1.0541
73           #9270.19        1.0237
74           #9427.58        1.0209
75           #9675.37        1.0256
76           #9830.18        1.0132
77          #10090.27        1.0259
78          #10104.33        1.0266
79          #10262.67        1.0515
80          #10443.37        1.0479
81          #11042.32        1.0233
82          #11237.29        1.0163
83          #11571.62        1.0184
84          #11716.38        1.0211
85          #12230.69        1.0077
86          #12294.33        1.0186
87          #12410.64        1.0105
88          #12990.13        1.0093
89          #13153.21        1.0080
90          #13098.65        1.0133
91          #13487.69        1.0180
92          #13453.87        1.0264
93          #13431.33        1.0240
94          #14273.08        1.0085
95          #14447.49        1.0049
96          #14458.82        1.0011
97          #14969.55        1.0030
98           15141.60       #0.9946
99          #14964.62        1.0104
100         #15698.93        1.0140
101         #15915.17        1.0063
102         #15947.75        1.0063
103         #16316.65        1.0186
104         #16446.49        1.0195
105         #16485.55        1.0330
106         #17266.78        1.0147
107         #17392.72        1.0223
108         #17508.21        1.0165
109         #17939.94        1.0176
110         #18114.49        1.0136
111         #18001.94        1.0167
112         #18811.92        1.0120
113         #18948.66        1.0050
114         #18986.09        1.0045
115         #19370.65        1.0076
116         #19432.51        1.0140
117         #19320.56        1.0318
118         #20331.91        1.0146
119         #20453.21        1.0174
120         #20520.11        1.0164
121         #21070.15        1.0126
122         #21319.24        1.0051
123         #21457.58        1.0071
124         #22266.78        1.0042
125         #22377.65        1.0106
126         #22431.58        1.0059
127         #22875.10        1.0154
128         #22958.98        1.0179
129         #23055.60        1.0252
130         #23723.02        1.0130
131         #23974.14        1.0150
132         #24023.57        1.0091
133         #24544.49        1.0063
134         #24566.16        1.0136
135         #24848.13        1.0038
136         #25379.66        1.0157
137         #25356.86        1.0175
138         #25389.61        1.0218
139         #25637.65        1.0409
140         #25805.51        1.0359
141         #25782.60        1.0610
142         #27390.00        1.0129
143         #27617.72        1.0173
144         #27392.64        1.0291
145         #28011.73        1.0238
146         #27848.45        1.0381
147         #27982.61        1.0279
148         #28782.28        1.0295
149         #28948.68        1.0297
150         #29031.27        1.0261

A nice little speedup of 0-7% , there are still many O(1) optimizations to do 
, and tuning.

Jason


-- 
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.

Reply via email to