Hi

It is probably not useful to work much on the current FFT code. We have
long planned, and worked quite a bit on, replacement FFT code with a
much better memory access pattern.

The replacement code works with multiple small coefficient fields
instead of one large ring.  It also implements Bailey's multidimensional
FFT trick to improve memory locality.
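
For concreteness, here is a minimal sketch of Bailey's four-step
structure, in plain C over complex doubles with a naive O(n^2) DFT for
the small transforms.  This is not the GMP code: the real code does the
same reorganisation over small prime fields, and adds transposes so
that every pass runs over contiguous memory; the sketch only shows the
access pattern.

/* Bailey's four-step FFT sketch: complex doubles and a naive DFT,
   purely illustrative. */
#include <complex.h>
#include <math.h>
#include <stdlib.h>

/* Naive DFT of length n with stride s (strided, so it can walk a
   matrix column as easily as a row). */
static void
dft (complex double *x, size_t n, size_t s)
{
  complex double *t = malloc (n * sizeof *t);
  for (size_t k = 0; k < n; k++)
    {
      complex double acc = 0;
      for (size_t j = 0; j < n; j++)
        acc += x[j*s] * cexp (-2 * M_PI * I * (double) (j*k % n) / n);
      t[k] = acc;
    }
  for (size_t k = 0; k < n; k++)
    x[k*s] = t[k];
  free (t);
}

/* Four-step FFT of length N = r*c, viewing x as an r-by-c row-major
   matrix: (1) length-r DFT down each column, (2) multiply entry (i,j)
   by the twiddle factor w_N^(i*j), (3) length-c DFT along each row.
   Output element X[i + r*j] then sits at matrix entry (i,j), i.e. the
   transform comes out in transposed order.  The point is that steps
   1 and 3 each consist of many short, cache-sized transforms. */
void
fft_four_step (complex double *x, size_t r, size_t c)
{
  size_t n = r * c;
  for (size_t j = 0; j < c; j++)        /* step 1: column DFTs */
    dft (x + j, r, c);
  for (size_t i = 0; i < r; i++)        /* step 2: twiddles */
    for (size_t j = 0; j < c; j++)
      x[i*c + j] *= cexp (-2 * M_PI * I * (double) (i*j) / n);
  for (size_t i = 0; i < r; i++)        /* step 3: row DFTs */
    dft (x + i*c, c, 1);
}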

I have forgotten the exact state of the code, but I am sure Nisse can
fill me in.

Any parallel work should go in that code, but only after it is
well-tuned for a single thread, I think.

Making GMP scale well for multiple threads is highly non-trivial.  Only
superlinear algorithms should even be attempted.  And then, one will
want to make it as coarse-grained as possible, perhaps even with some
cache topology understanding built into the code.

Generating data in thread A which is then processed in thread B will
typically mean that data needs to be transferred from one cache to
another.  That is super expensive, and will offset any benefit unless
the computation before or after the data transfer is huge.
The basic idea I've been looking at is not to run the FFT itself in
parallel, but to parallelise the inner loop. The algorithm recurses
until the operand sizes are below a suitable threshold, then loops K
times; each iteration multiplies 2 numbers of n limbs. Naively, if you
have m CPUs, then you should get a speedup if each CPU multiplies 2
numbers of n/m limbs. There will be a total of mK multiplications, but
since each is smaller, it should go faster.

As Torbjörn says, cache usage is key; but, within some uncertainty, the
cache requirement should be roughly the same. 2m operands of n/m limbs
should need roughly the same total space as 2 operands of n limbs, and
the code should take up roughly the same space: at this point in the
algorithm all the operands are the same size, so GMP will pick the same
method (schoolbook, Toom, etc.) for every iteration, and there will
only be one copy of the code in cache - assuming it's re-entrant, as
all modern code should be. There is an overhead of one stack per
thread, which must be checked out.

The other assumption is that the forward FFT leaves the operands in a
suitable state, i.e. each operand in contiguous memory. This is an
advantage anyway, even if you're not running in parallel, so I think
it's a reasonable assumption. Also, since each of the smaller operands
is a piece of the original, cache locality for the smaller
multiplications is no worse than before.

Finally, the multiplications must leave their results in a suitable
state for the inverse FFT. Again, I think this is a reasonable
assumption, as the non-parallel version of the loop does the same thing
as the parallel version, just with larger operands.
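
To make the idea concrete, here is a rough sketch of the pointwise loop
spread over threads with OpenMP. mpn_mul_n is the real GMP primitive;
the layout (K pairs of equal-size operands, contiguous in memory, as
assumed above) and the function name are illustrative.

/* Multiply K independent pairs of n-limb numbers, r[k] = a[k] * b[k],
   with the K products spread across threads.  Each product needs 2n
   limbs of result space.  Compile with -fopenmp; assumes GMP was
   built re-entrant (the default, using alloca for scratch), so each
   thread's temporary space lives on its own stack. */
#include <gmp.h>

void
pointwise_products (mp_limb_t *r, const mp_limb_t *a, const mp_limb_t *b,
                    size_t K, mp_size_t n)
{
  /* All K multiplications are independent and the same size, so a
     static schedule splits them evenly, and each thread stays inside
     its own contiguous slice of a, b and r: no data migrates between
     caches once a thread has touched its slice. */
#pragma omp parallel for schedule(static)
  for (size_t k = 0; k < K; k++)
    mpn_mul_n (r + 2 * (size_t) n * k,
               a + (size_t) n * k,
               b + (size_t) n * k, n);
}

In the real FFT the pointwise products are residues, reduced modulo the
ring modulus, but the threading structure would be the same.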

Does the reasoning appear OK, or am I totally on the wrong track?

Even if it looks OK in today's code, does Nisse's new FFT code change anything?

Regards
Richard Foster


_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
