On Wednesday 29 June 2011 20:10:50 jason wrote:
> On Jun 19, 11:24 pm, Jason <ja...@njkfrudils.plus.com> wrote:
> > > 24) New Toom22 code. The new code is smaller if we let the high
> > > part >= low part, which is the opposite of the current code, so
> > > it's probably easier just to rewrite the whole thing.
> >
> > Hi
> >
> > Here is an outline of the new toom22_n code. There are obvious O(1)
> > speedups to do, but I'll leave them until I've tested the new
> > assembler code, as the linear O(n) part is what has improved. I
> > rewrote all the code, as that was the easiest way, since there are
> > other slight minor differences (and I do so hate reading other
> > people's code). The original code takes the differences between the
> > high and low parts, and this has not changed; what has changed is
> > the last section, where we add/subtract the sub-products together to
> > form the desired full product. Originally this consisted of three
> > adds, which on the K8 run at 4.5 cycles per word. This was improved
> > with the new mpn_addadd_n function, which ran at 3.5 cycles per
> > word, and now I have a new mpn_karaadd (i.e. mpn_addaddadd) function
> > which runs at 2.5 cycles per word. The addadd function gave us a
> > 2-7% speedup, and I pretty much expect the same again. The lower
> > bound is actually 2.0 cycles per word, and I think I may be able to
> > get there without too much pipelining. Similar improvements are
> > possible on the Intel CPUs and many others (the RISC CPUs are
> > probably easier). I've only written the inner loop of the addaddadd
> > function so far, but I don't foresee any difficulties; I should be
> > able to finish it next week. In case you are wondering, the new asm
> > code won't be general code like mpn_addadd_n was, but will be
> > specific to toom22 multiplication, as it has to cope with operand
> > overlap and the "odd" cases.
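[For readers following along, the fused triple add being discussed can be sketched in portable C. This is only an illustration of the operation {r, n} = {a, n} + {b, n} + {c, n} in a single pass; the function name and signature are made up for this sketch, and the real MPIR routine is hand-written x86-64 assembler that additionally copes with overlapping operands.]

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;   /* 64-bit limb, as on K8/x86-64 */

/* Portable sketch of the fused triple add (mpn_karaadd / mpn_addaddadd
   in the post): r[] = a[] + b[] + c[] in one pass over the operands,
   returning the carry out (0, 1 or 2).  In toom22 this replaces the
   three separate additions that recombine the sub-products. */
mp_limb_t addaddadd_n(mp_limb_t *r, const mp_limb_t *a,
                      const mp_limb_t *b, const mp_limb_t *c, size_t n)
{
    mp_limb_t cy = 0;                       /* running carry, 0..2 */
    for (size_t i = 0; i < n; i++) {
        mp_limb_t s1 = a[i] + b[i];
        mp_limb_t c1 = s1 < a[i];           /* carry from a + b      */
        mp_limb_t s2 = s1 + c[i];
        mp_limb_t c2 = s2 < s1;             /* carry from + c        */
        mp_limb_t s3 = s2 + cy;
        mp_limb_t c3 = s3 < s2;             /* carry from + carry-in */
        r[i] = s3;
        cy = c1 + c2 + c3;                  /* provably never exceeds 2 */
    }
    return cy;
}
```

[A plain C loop like this compiles to far more than 2.5 cycles per word; the point of the assembler version is to keep the carry chains in flags and registers and do one pass over memory, since a single pass is what gets the routine near the K8's load/store bound.]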
> >
> > Jason

> Hi
>
> I've saved about 5 instructions from the inner loop, which is good,
> and I'm still searching for something faster; so far I have it down to
> 2.350 cycles per word without any pipelining. There are a number of
> possible caveats: as we are starting to hit the load/store bound for
> the K8, it could get very sensitive to the relative alignment (mod 8)
> of the pointers (i.e. rp, rp+n, tp). My standard testing will be to
> try all combinations of rp and tp, but in this case we also need to
> consider rp+n. I think at this stage I'll ignore it and hope for the
> best. The K10 can schedule loads/stores better than the K8, so there
> may be a better inner loop for it, and as long as I don't use
> pipelining, writing both is a trivial exercise.
>
> Incidentally, the new K103 (or K10.5, aka Llano) is nearly out. This
> is a K102 with a pipelined hardware divider, not terribly useful for
> us, but I've heard that the multiplier has been improved; I don't know
> in what way :( . I doubt it's the floating-point multiplier, as the
> chip is integrated with a GPU (like Bobcat). The schedulers are
> deeper, which may be useful, but the clock speeds don't look great;
> perhaps if you turn off the GPU?
>
> Jason
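[The alignment testing described above can be sketched as a sweep over limb offsets. Everything here is a hypothetical harness: triple_add is a stand-in for the assembler routine under test, and in the real test each (roff, toff) pair would be timed, since it is speed rather than correctness that varies with alignment.]

```c
#include <stdint.h>
#include <string.h>

enum { N = 64, MAXOFF = 8 };   /* operand size and offsets to sweep */

/* Stand-in for the routine under test (cf. the addaddadd inner loop);
   per-limb sums only, carries ignored for brevity. */
static void triple_add(uint64_t *r, const uint64_t *a,
                       const uint64_t *b, const uint64_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i] = a[i] + b[i] + c[i];
}

/* Run the routine with every combination of rp and tp limb offsets mod
   MAXOFF (shifting each pointer's position within a 64-byte cache
   line) and compare against the offset-(0,0) result.  Returns the
   number of mismatches; the real harness would also time each pair. */
int alignment_sweep(void)
{
    static uint64_t rbuf[N + MAXOFF], tbuf[N + MAXOFF];
    uint64_t src[N], ref[N];
    for (size_t i = 0; i < N; i++)
        src[i] = 0x0123456789abcdefULL * (i + 1);

    triple_add(ref, src, src, src, N);      /* reference result */

    int mismatches = 0;
    for (int roff = 0; roff < MAXOFF; roff++)
        for (int toff = 0; toff < MAXOFF; toff++) {
            memcpy(tbuf + toff, src, sizeof src);
            triple_add(rbuf + roff, tbuf + toff, tbuf + toff,
                       tbuf + toff, N);
            if (memcmp(rbuf + roff, ref, sizeof ref) != 0)
                mismatches++;
        }
    return mismatches;
}
```

[Sweeping rp+n as well would mean varying n mod MAXOFF too, multiplying the search space again, which is presumably why the post opts to "ignore it and hope for the best" at this stage.]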
Here are some timings comparing our existing code with the new toom22
code (# marks the faster of the two):

  n         new        old
  8      #187.03     1.0053
  9       252.04    #0.9999
 10       307.04    #1.0000
 11       359.05    #1.0000
 12       407.05    #1.0000
 13       471.04    #1.0000
 14       551.06    #1.0000
 15       609.06    #1.0000
 16       680.08    #1.0000
 17       766.06    #1.0000
 18       869.09    #1.0000
 19       962.10    #0.9844
 20      1046.11    #0.9837
 21      1137.10    #1.0000
 22      1283.14    #0.9844
 23      1362.14    #0.9875
 24     #1358.16     1.0346
 25     #1510.17     1.0284
 26     #1585.15     1.0183
 27     #1726.20     1.0301
 28     #1821.23     1.0105
 29     #1934.28     1.0201
 30     #1994.21     1.0241
 31     #2161.31     1.0166
 32     #2240.22     1.0014
 33     #2408.29     1.0166
 34     #2508.22     1.0124
 35     #2696.32     1.0330
 36     #2804.38     1.0178
 37     #2992.43     1.0083
 38     #3078.35     1.0052
 39     #3244.34     1.0108
 40     #3267.44     1.0263
 41     #3515.42     1.0244
 42     #3635.32     1.0265
 43     #3861.45     1.0355
 44     #4021.43     1.0130
 45     #4206.61     1.0118
 46     #4315.47     1.0105
 47     #4339.58     1.0712
 48     #4507.58     1.0477
 49     #4783.50     1.0316
 50     #4940.34     1.0163
 51     #5063.67     1.0379
 52     #5177.43     1.0128
 53     #5398.64     1.0530
 54     #5484.74     1.0471
 55     #5662.66     1.0729
 56     #5760.76     1.0696
 57     #6075.70     1.0362
 58     #6246.49     1.0300
 59     #6347.90     1.0430
 60     #6401.80     1.0302
 61     #6784.80     1.0348
 62     #6892.88     1.0315
 63     #6992.82     1.0555
 64     #7110.83     1.0619
 65     #7562.79     1.0303
 66     #7616.81     1.0375
 67     #7873.86     1.0321
 68     #7988.04     1.0274
 69     #8379.88     1.0378
 70     #8452.11     1.0497
 71     #8702.11     1.0656
 72     #8812.26     1.0541
 73     #9270.19     1.0237
 74     #9427.58     1.0209
 75     #9675.37     1.0256
 76     #9830.18     1.0132
 77    #10090.27     1.0259
 78    #10104.33     1.0266
 79    #10262.67     1.0515
 80    #10443.37     1.0479
 81    #11042.32     1.0233
 82    #11237.29     1.0163
 83    #11571.62     1.0184
 84    #11716.38     1.0211
 85    #12230.69     1.0077
 86    #12294.33     1.0186
 87    #12410.64     1.0105
 88    #12990.13     1.0093
 89    #13153.21     1.0080
 90    #13098.65     1.0133
 91    #13487.69     1.0180
 92    #13453.87     1.0264
 93    #13431.33     1.0240
 94    #14273.08     1.0085
 95    #14447.49     1.0049
 96    #14458.82     1.0011
 97    #14969.55     1.0030
 98     15141.60    #0.9946
 99    #14964.62     1.0104
100    #15698.93     1.0140
101    #15915.17     1.0063
102    #15947.75     1.0063
103    #16316.65     1.0186
104    #16446.49     1.0195
105    #16485.55     1.0330
106    #17266.78     1.0147
107    #17392.72     1.0223
108    #17508.21     1.0165
109    #17939.94     1.0176
110    #18114.49     1.0136
111    #18001.94     1.0167
112    #18811.92     1.0120
113    #18948.66     1.0050
114    #18986.09     1.0045
115    #19370.65     1.0076
116    #19432.51     1.0140
117    #19320.56     1.0318
118    #20331.91     1.0146
119    #20453.21     1.0174
120    #20520.11     1.0164
121    #21070.15     1.0126
122    #21319.24     1.0051
123    #21457.58     1.0071
124    #22266.78     1.0042
125    #22377.65     1.0106
126    #22431.58     1.0059
127    #22875.10     1.0154
128    #22958.98     1.0179
129    #23055.60     1.0252
130    #23723.02     1.0130
131    #23974.14     1.0150
132    #24023.57     1.0091
133    #24544.49     1.0063
134    #24566.16     1.0136
135    #24848.13     1.0038
136    #25379.66     1.0157
137    #25356.86     1.0175
138    #25389.61     1.0218
139    #25637.65     1.0409
140    #25805.51     1.0359
141    #25782.60     1.0610
142    #27390.00     1.0129
143    #27617.72     1.0173
144    #27392.64     1.0291
145    #28011.73     1.0238
146    #27848.45     1.0381
147    #27982.61     1.0279
148    #28782.28     1.0295
149    #28948.68     1.0297
150    #29031.27     1.0261

A nice little speedup of 0-7%; there are still many O(1) optimizations
to do, and tuning.

Jason

--
You received this message because you are subscribed to the Google Groups "mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en.