A few days ago, David Harvey gave a really nice talk at Warwick about
some new number theoretic transform (NTT) code he has been working on.
It uses some nice tricks to get *really* fast butterflies. Even the C
implementation turned out to be quite fast, and he is also working on
some assembly-optimised versions. Two really nice features are very
low memory usage and the ability to parallelise it easily (which he
has done).

The slides are available here:

http://wiki.sagemath.org/SageFlintDays/slides?action=AttachFile&do=view&target=fastntt.pdf

On the final slide you will find some timings of his new code versus
mpir-2.4.0 and gmp-5.0.2. On that machine (an Intel Xeon) it appears
that GMP is way faster than MPIR (and his code is slightly faster
again, and much faster when parallelised).

Anyhow, I decided to redo all my timings with gmp-5.0.2 to see how it
compares with mpir-2.4.0 and my new FFT on my machine (a K10-2). I
don't have a copy of David's code, so I can't compare with that at
present.

bits          iters  mpir     new_fft  gmp
195840        1000   1.149s   1.105s   0.997s
261120        1000   1.483s   1.415s   1.396s
391296        100    0.261s   0.248s   0.282s
521728        100    0.344s   0.315s   0.411s
782592        100    0.577s   0.539s   0.628s
1043456       100    0.706s   0.688s   0.848s
1569024       100    1.229s   1.153s   1.317s
2092032       100    1.543s   1.462s   2.765s
3127296       10     0.283s   0.286s   0.408s
4169728       10     0.357s   0.350s   0.543s
6273024       10     0.621s   0.615s   0.843s
8364032       10     0.831s   0.799s   1.156s
12539904      10     1.441s   1.471s   1.798s
16719872      1      0.230s   0.209s   0.288s
25122816      1      0.379s   0.357s   0.434s
33497088      1      0.524s   0.462s   0.646s
50245632      1      0.833s   0.782s   1.035s
66994176      1      1.596s   0.967s   1.358s
100577280     1      1.906s   1.704s   2.177s
134103040     1      2.784s   2.162s   2.984s
201129984     1      3.971s   3.524s   4.536s
268173312     1      5.146s   4.572s   5.781s
402456576     1      7.548s   7.041s   9.867s
536608768     1      9.841s   9.428s   12.71s
804913152     1      15.48s   14.18s   20.06s
1073217536    1      21.17s   19.33s   27.19s
1610219520    1      31.64s   30.83s   43.37s
2146959360    1      43.25s   39.17s   57.66s
3220340736    1      70.14s   61.39s   92.94s
4293787648    1      96.00s   80.15s   146.1s
6441566208    1      150.2s   136.9s   217.5s
8588754944    1      208.4s   183.3s   312.8s
12883132416   1      327.4s   290.2s   447.7s
17177509888   1      485.0s   382.2s   614.2s

I've checked multiple ways that I am really compiling against the
right GMP. It also appears that GMP is configuring correctly on my
machine. So it looks very much like either MPIR is misconfigured on
David's machine, or the GMP FFT is much, much better on Intel
machines, or the MPIR FFT is way, way worse on Intel machines.

I'll try to get some configuration details from David and eventually
I'll do some timings on an Intel machine to see how everything
compares.

Last night I tried combining the pointwise multiplications with the
inverse transform to improve memory locality. However, this failed
because calling MPIR's Nussbaumer convolution rapidly in succession
results in an "Unable to allocate memory" error. I have no idea what
causes this (the machine is certainly not out of memory). The code was
exhibiting some very odd behaviour which I was unable to account for.
My best guess is that it has to do with too many allocations on the
stack or something. I am no doubt calling the function in a way it
was not intended to be called. Anyhow, this just underscores the need
to write a fast Nussbaumer convolution, which I currently do not have
time to do (I don't need it for my application).

Bill.

-- 
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.
