A few days ago, David Harvey gave a really nice talk at Warwick about some new number theoretic transform code he has been working on. It uses some nice tricks to get *really* fast butterflies. Even a C implementation turned out to be quite fast, although he is also working on assembly-optimised versions. Two really nice features are very low memory usage and the ability to parallelise it easily (which he has done).
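For context, a radix-2 NTT butterfly over a word-sized prime looks roughly like the sketch below. This is generic illustration, not David's code: the prime and the modular reduction used here are assumptions of mine, and his tricks for fast butterflies will certainly differ.

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* 2^64 - 2^32 + 1, a well-known NTT-friendly prime (an illustrative
   choice; David's moduli and reduction tricks may differ). */
static const uint64_t P = 0xffffffff00000001ULL;

static uint64_t addmod(uint64_t a, uint64_t b) {
    uint64_t s = a + b;          /* a, b < P, so the true sum is < 2P */
    if (s < a || s >= P) s -= P; /* fold back into [0, P), incl. wraparound */
    return s;
}

static uint64_t submod(uint64_t a, uint64_t b) {
    return a >= b ? a - b : a + P - b;
}

static uint64_t mulmod(uint64_t a, uint64_t b) {
    return (uint64_t)(((u128)a * b) % P); /* slow generic reduction */
}

/* One radix-2 decimation-in-frequency butterfly:
   (a, b) -> (a + b, (a - b) * w) mod P, where w is a root of unity. */
static void butterfly(uint64_t *a, uint64_t *b, uint64_t w) {
    uint64_t t = submod(*a, *b);
    *a = addmod(*a, *b);
    *b = mulmod(t, w);
}
```

The point of the fused add/sub form is that both outputs are produced from one pass over the pair, which is what makes butterflies the hot spot worth optimising.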
The slides are available here:

http://wiki.sagemath.org/SageFlintDays/slides?action=AttachFile&do=view&target=fastntt.pdf

On the final slide you will find some timings of his new code versus mpir-2.4.0 and gmp-5.0.2. On that machine (an Intel Xeon) it appears that GMP is way faster than MPIR (and his code is slightly faster again, and much faster when parallelised).

Anyhow, I decided to do all my timings for gmp-5.0.2 to see how it compares with mpir-2.4.0 and my new FFT on my machine (an AMD K10-2). I don't have a copy of David's code, so I can't compare with that at present.

       bits  iters     mpir  new_fft      gmp
     195840   1000   1.149s   1.105s   0.997s
     261120   1000   1.483s   1.415s   1.396s
     391296    100   0.261s   0.248s   0.282s
     521728    100   0.344s   0.315s   0.411s
     782592    100   0.577s   0.539s   0.628s
    1043456    100   0.706s   0.688s   0.848s
    1569024    100   1.229s   1.153s   1.317s
    2092032    100   1.543s   1.462s   2.765s
    3127296     10   0.283s   0.286s   0.408s
    4169728     10   0.357s   0.350s   0.543s
    6273024     10   0.621s   0.615s   0.843s
    8364032     10   0.831s   0.799s   1.156s
   12539904     10   1.441s   1.471s   1.798s
   16719872      1   0.230s   0.209s   0.288s
   25122816      1   0.379s   0.357s   0.434s
   33497088      1   0.524s   0.462s   0.646s
   50245632      1   0.833s   0.782s   1.035s
   66994176      1   1.596s   0.967s   1.358s
  100577280      1   1.906s   1.704s   2.177s
  134103040      1   2.784s   2.162s   2.984s
  201129984      1   3.971s   3.524s   4.536s
  268173312      1   5.146s   4.572s   5.781s
  402456576      1   7.548s   7.041s   9.867s
  536608768      1   9.841s   9.428s   12.71s
  804913152      1   15.48s   14.18s   20.06s
 1073217536      1   21.17s   19.33s   27.19s
 1610219520      1   31.64s   30.83s   43.37s
 2146959360      1   43.25s   39.17s   57.66s
 3220340736      1   70.14s   61.39s   92.94s
 4293787648      1   96.00s   80.15s   146.1s
 6441566208      1   150.2s   136.9s   217.5s
 8588754944      1   208.4s   183.3s   312.8s
12883132416      1   327.4s   290.2s   447.7s
17177509888      1   485.0s   382.2s   614.2s

I've checked multiple ways that I am really compiling against the right GMP. It also appears that GMP is configuring correctly on my machine.
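For reference, the timings above come from repeat loops of the shape sketched below. The real runs time mpz_mul from GMP/MPIR over the stated number of iterations; the operation here is a stand-in of mine so the sketch compiles on its own.

```c
#include <time.h>

/* Repeat-timing harness in the spirit of the table above: run `iters`
   iterations of an operation and return the total CPU time in seconds.
   The real measurements time mpz_mul from GMP/MPIR; `op` is a stand-in
   here so the sketch stays self-contained. */
static double time_op(void (*op)(void *), void *arg, int iters) {
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++)
        op(arg);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Stand-in operation: just bump a counter. */
static long g_count = 0;
static void dummy_op(void *arg) {
    (void)arg;
    g_count++;
}
```

Note that at the larger sizes only one iteration is used, so those rows carry a bit more noise than the small-operand rows.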
So it looks very much like either MPIR is misconfigured on David's machine, or the GMP FFT is much, much better on Intel machines, or the MPIR FFT is way, way worse on Intel machines. I'll try to get some configuration details from David, and eventually I'll do some timings on an Intel machine to see how everything compares.

Last night I tried combining the pointwise multiplications with the inverse transform to improve memory locality. However, this failed: calling MPIR's Nussbaumer convolution rapidly in succession results in an "Unable to allocate memory" error. I have no idea what causes this (it is certainly not actually out of memory). The code was exhibiting some very odd behaviour which I was unable to account for. My best guess is that it has to do with too many allocations on the stack or something similar; I am doubtless calling the function in a way it was not intended to be called. Anyhow, this just underscores the need to write a fast Nussbaumer convolution, which I currently do not have time to do (I don't need it for my application).

Bill.

-- 
You received this message because you are subscribed to the Google Groups "mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en.
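[The memory-locality idea Bill describes, fusing the pointwise products into the first inverse-transform pass so each block is touched while still in cache, can be sketched as below. Everything here (the prime, the data layout, the function name) is hypothetical illustration, not MPIR's or Bill's actual code, and twiddle factors are omitted for brevity.]

```c
#include <stddef.h>
#include <stdint.h>

static const uint64_t P = 0xffffffff00000001ULL; /* illustrative NTT prime */

static uint64_t mulmod(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) % P);
}

/* Fused pass over one n-element block: pointwise product of a and b,
   immediately followed by the first radix-2 inverse butterfly layer on
   the same data, while it is still hot in cache.  A straightforward
   implementation would instead make two full sweeps over all blocks. */
static void pointwise_then_first_layer(uint64_t *a, const uint64_t *b,
                                       size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = mulmod(a[i], b[i]);
    for (size_t i = 0; i < n / 2; i++) {
        uint64_t u = a[i], v = a[i + n / 2];
        uint64_t s = u + v;
        if (s < u || s >= P) s -= P;               /* (u + v) mod P */
        a[i] = s;
        a[i + n / 2] = u >= v ? u - v : u + P - v; /* (u - v) mod P */
    }
}
```

The win, when it works, is that each block of coefficients is read and written once instead of twice; the failure Bill hit was in the Nussbaumer pointwise step, not in this reordering itself.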