Re: Any interest in multi-threaded code?
"Marco Bodrato"writes: > I merged the current head with that 10 years old repository, try again. Thanks for taking care of that! Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Any interest in multi-threaded code?
Ciao,

On Mon, 7 May 2018 9:02 pm, Victor Shoup wrote:

> I finally have time to study this.
> Then I found out about .bootstrap... that doesn't work either
> (no aclocal-1.8). Tried editing .bootstrap to remove "-1.8"... errors.

I merged the current head with that 10-year-old repository; try again.

>> On Apr 4, 2018, at 4:53 PM, Niels Möller wrote:
>> Victor Shoup writes:
>>> So I guess I'm just asking where the FFT code is at and if there would
>> Here: https://gmplib.org/repo/gcd-nisse/
>>
>> The relation to gcd is that subquadratic gcd includes multiplication of
>> 2x2 matrices,

But this is currently disabled, because the current hgcd does not
allocate the correct temp size. tune/tuneup will find the threshold,
though.

Ĝis,
m

--
http://bodrato.it/papers/
Re: Any interest in multi-threaded code?
Hi,

I finally have time to study this. One problem is, I don't seem to be
able to build it. There is no configure file in the repo, so I tried
autoconf... lots of errors. Then I found out about .bootstrap... that
doesn't work either (no aclocal-1.8). Tried editing .bootstrap to remove
"-1.8"... errors. Any advice?

And while I'm at it, is there a brief statement of what exactly the
function mpn_fourier_transform_ct actually does? I mean, questions like:
are the inputs/outputs in standard order or bit-reversed order?

Thanks!

> On Apr 4, 2018, at 4:53 PM, Niels Möller wrote:
>
> Victor Shoup writes:
>
>> So I guess I'm just asking where the FFT code is at and if there would be
>> any objections to my using it in that way.
>
> Here: https://gmplib.org/repo/gcd-nisse/
>
> The relation to gcd is that subquadratic gcd includes multiplication of
> 2x2 matrices, and I wrote some code to do that multiplication with
> shared fft transforms (and possibly also doing some of the additions
> in the transform domain; I don't quite remember), which was a nice
> exercise of the new fft primitives.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.
Re: Any interest in multi-threaded code?
> Presumably, that idea will work. But based on my experience with
> multi-threading, because of various overheads and latencies involved,
> you have to make sure that there is enough work to do per thread to
> make it worthwhile. Parallelizing the inner loops like this may or may
> not yield the expected speedup. One would have to try it and see... my
> intuition and experience make me somewhat skeptical, but I'd be happy
> to be proved wrong.

It looks as though the majority vote is for parallelising at the outer
level rather than the inner loop. I'll leave that to more experienced
hands, although I might have a private attempt at the inner version,
just for my own interest.

Richard
Re: Any interest in multi-threaded code?
Thanks!

> On Apr 4, 2018, at 4:53 PM, Niels Möller wrote:
>
> Victor Shoup writes:
>
>> So I guess I'm just asking where the FFT code is at and if there would be
>> any objections to my using it in that way.
>
> Here: https://gmplib.org/repo/gcd-nisse/
>
> The relation to gcd is that subquadratic gcd includes multiplication of
> 2x2 matrices, and I wrote some code to do that multiplication with
> shared fft transforms (and possibly also doing some of the additions
> in the transform domain; I don't quite remember), which was a nice
> exercise of the new fft primitives.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.
Re: Any interest in multi-threaded code?
Victor Shoup writes:

> So I guess I'm just asking where the FFT code is at and if there would
> be any objections to my using it in that way.

Here: https://gmplib.org/repo/gcd-nisse/

The relation to gcd is that subquadratic gcd includes multiplication of
2x2 matrices, and I wrote some code to do that multiplication with
shared fft transforms (and possibly also doing some of the additions
in the transform domain; I don't quite remember), which was a nice
exercise of the new fft primitives.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
Re: Any interest in multi-threaded code?
Presumably, that idea will work. But based on my experience with
multi-threading, because of various overheads and latencies involved,
you have to make sure that there is enough work to do per thread to make
it worthwhile. Parallelizing the inner loops like this may or may not
yield the expected speedup. One would have to try it and see... my
intuition and experience make me somewhat skeptical, but I'd be happy to
be proved wrong.

> On Apr 4, 2018, at 2:56 PM, Richard Foster wrote:
>
> Hi
>
>> It is probably not useful to work much on the current FFT code. We have
>> long planned--and worked quite a bit on--replacement FFT code with much
>> better memory access patterns.
>>
>> The replacement code works with multiple small coefficient fields
>> instead of a large ring. It also implements Bailey's multidimensional
>> FFT trick for memory locality improvements.
>>
>> I have forgotten the exact state of the code, but I am sure Nisse can
>> fill me in.
>>
>> Any parallel work should go in that code, but only after it is
>> well-tuned for a single thread, I think.
>>
>> Making GMP scale well for multiple threads is highly non-trivial. Only
>> superlinear algorithms should even be attempted. And then, one will
>> want to make it as coarse-grained as possible, perhaps even with some
>> cache topology understanding built in to the code.
>>
>> Generating data in thread A which is then processed in thread B will
>> typically mean that data needs to be transferred from one cache to
>> another. That is super expensive, and will offset any benefit unless
>> the computation before or after the data transfer is huge.
>
> The basic idea I've been looking at is not to run the FFT itself in
> parallel, but to parallelise the inner loop. The algorithm recurses
> until the operand sizes are below a suitable size, then loops K times;
> each iteration multiplies 2 numbers of n limbs.
>
> Naively, if you have m CPUs, then you should get a speedup if each CPU
> multiplies 2 numbers of n/m limbs. There will be a total of mK
> multiplications, but since each is smaller, it should go faster.
>
> As Torbjörn says, cache usage is key: but within some uncertainty, the
> cache requirement should be roughly the same. 2m operands of n/m limbs
> should need roughly the same total space as 2 operands of n limbs, and
> the code should take up roughly the same space; at this point in the
> algorithm all the operands are the same size, so the GMP code will pick
> the same method for all the iterations (schoolbook, Toom, etc), so
> there will only be 1 copy of the code in cache - assuming it's
> re-entrant, as all modern code should be. There is an overhead of a
> stack per thread, which must be checked out.
>
> The other assumption is that the forward FFT leaves the operands in a
> suitable state, i.e., each operand is in contiguous memory. This is an
> advantage anyway, even if you're not running in parallel, so I think
> it's a reasonable assumption. Also, since each of the smaller operands
> is a piece of the original, cache locality for the smaller
> multiplications is no worse than before.
>
> Finally, the multiplications must leave their results in a suitable
> state for the inverse FFT. Again, I think this is a reasonable
> assumption, as the non-parallel version of the loop does the same thing
> as the parallel version, just with larger operands.
>
> Does the reasoning appear OK, or am I totally on the wrong track?
>
> Even if it looks OK in today's code, does Nisse's new FFT code change
> anything?
>
> Regards
> Richard Foster
Re: Any interest in multi-threaded code?
Hi

> It is probably not useful to work much on the current FFT code. We have
> long planned--and worked quite a bit on--replacement FFT code with much
> better memory access patterns.
>
> The replacement code works with multiple small coefficient fields
> instead of a large ring. It also implements Bailey's multidimensional
> FFT trick for memory locality improvements.
>
> I have forgotten the exact state of the code, but I am sure Nisse can
> fill me in.
>
> Any parallel work should go in that code, but only after it is
> well-tuned for a single thread, I think.
>
> Making GMP scale well for multiple threads is highly non-trivial. Only
> superlinear algorithms should even be attempted. And then, one will
> want to make it as coarse-grained as possible, perhaps even with some
> cache topology understanding built in to the code.
>
> Generating data in thread A which is then processed in thread B will
> typically mean that data needs to be transferred from one cache to
> another. That is super expensive, and will offset any benefit unless
> the computation before or after the data transfer is huge.

The basic idea I've been looking at is not to run the FFT itself in
parallel, but to parallelise the inner loop. The algorithm recurses
until the operand sizes are below a suitable size, then loops K times;
each iteration multiplies 2 numbers of n limbs.

Naively, if you have m CPUs, then you should get a speedup if each CPU
multiplies 2 numbers of n/m limbs. There will be a total of mK
multiplications, but since each is smaller, it should go faster.

As Torbjörn says, cache usage is key: but within some uncertainty, the
cache requirement should be roughly the same. 2m operands of n/m limbs
should need roughly the same total space as 2 operands of n limbs, and
the code should take up roughly the same space; at this point in the
algorithm all the operands are the same size, so the GMP code will pick
the same method for all the iterations (schoolbook, Toom, etc), so there
will only be 1 copy of the code in cache - assuming it's re-entrant, as
all modern code should be. There is an overhead of a stack per thread,
which must be checked out.

The other assumption is that the forward FFT leaves the operands in a
suitable state, i.e., each operand is in contiguous memory. This is an
advantage anyway, even if you're not running in parallel, so I think
it's a reasonable assumption. Also, since each of the smaller operands
is a piece of the original, cache locality for the smaller
multiplications is no worse than before.

Finally, the multiplications must leave their results in a suitable
state for the inverse FFT. Again, I think this is a reasonable
assumption, as the non-parallel version of the loop does the same thing
as the parallel version, just with larger operands.

Does the reasoning appear OK, or am I totally on the wrong track?

Even if it looks OK in today's code, does Nisse's new FFT code change
anything?

Regards
Richard Foster
Re: Any interest in multi-threaded code?
Hi all,

Since this FFT code is sitting around, I wonder if there's a chance I
could adapt it to NTL. I actually implemented David Harvey's ideas a few
years ago (and his work is actually based on older NTL logic which I
never really publicized). NTL currently only uses special types like
__uint128_t (or whatever it's called) or a few inline asm code
sequences. What's missing is logic to keep things in the L1 cache, so
there is definitely much room for improvement.

I've planned on doing these improvements myself for some time, but
haven't yet found the time. If I could start from an existing code base
and adapt it, that would probably speed things up. Hopefully, as long as
everything is LGPL-compatible and nobody minds me adapting their code,
this would work out.

So I guess I'm just asking where the FFT code is at and if there would
be any objections to my using it in that way.

BTW, I've done the multithreaded implementation of this type of thing.
It works great, although in my setting, the number of primes is usually
larger (like a dozen or more).

-Victor

> On Apr 4, 2018, at 3:41 PM, Niels Möller wrote:
>
> "Richard Foster" writes:
>
>> Even if it looks OK in today's code, does Nisse's new FFT code change
>> anything?
>
> The small-prime FFT code uses fft-based multiplication of polynomials
> over Z_p for a couple of single-limb primes p. Usually in the range
> from 3 to 5 primes, depending on how the input size fits constraints
> on transform size, and it's probably beneficial to go a bit beyond 5.
>
> The FFTs are organized to spend most of the time doing small FFTs
> fitting the L1 cache and implemented as tight assembly loops, and on
> top of that somewhat hairy logic doing twiddle factors, data
> transpositions, truncated fft logic, and the like.
>
> Regarding threading, the best strategy will likely differ depending on
> the number of cores you target. For a small number of cores, it seems
> reasonable to let one core handle each p, doing all of forward
> transforms, pointwise multiply, and inverse transforms. They would
> then work independently, with only an initial pass to distribute the
> mod p work to threads, and a final pass to combine mod p results using
> CRT and assemble the final product. I'm not that familiar with
> threading, but I'd expect a reasonable speedup for large operands,
> aiming to have each core mainly exercise its local L1 cache. If cores
> end up competing for available cache or memory bandwidth, the gain
> will be smaller.
>
> It's also possible to try to spawn worker threads to do the smallest
> mod p FFTs, but as Torbjörn says, there's a risk that lots of data has
> to be passed from one core to another. So I'd first try to split
> computation work between threads in much larger pieces than a single
> small FFT at a time.
>
> As for performance, the assembly loops when I worked on this (which is
> close to 10(!) years ago) ran at some 8-10 cycles per butterfly, on
> the AMD Opteron which was high end at the time.
>
> I'm aware of a trick by David Harvey, who got running time down to
> only 6 cycles per butterfly (totally saturating the multiplication
> units) on the same hardware, by using primes close to 2^63 rather than
> close to 2^64.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.
Re: Any interest in multi-threaded code?
"Richard Foster"writes: > Even it looks OK in today's code, does Nisse's new FFT code change > anything? The small-prime FFT code uses fft-based multiplication of polynomials over Z_p for a couple of single-limb primes p. Usually in the range from 3 to 5 primes, depending on how the input size fits constraints on transform size, and it's probably beneficial to go a bit beyond 5. The FFTs are organized to spend most of the time doing small FFTs fitting the L1 cache and implemented as tight assembly loops, and on top of that somewhat hairy logic doing twiddle factors, data transpositions, truncated fft logic, and the like. Regarding threading, the best strategy will likely differ depending on the number of cores you target. For a small number of cores, it seems reasonable to let one core handle each p, doing all of forward transforms, pointwise multiply, and inverse transforms. They would then work independently, with only an initial pass to distrivute the mod p work to threads, and a final pass to combine mod p results using crt and assemble the final product. I'm not that familiar with threading, but I'd expect a reasonable speedup for large operands, aiming to have each core mainly exercise its local L1 cache. If cores end up competing for available cache or memory bandwidth, gain will be smaller. It's also possible to try to spawn worker threads to do the smallest mod p FFTs, but as Torbjörn says, there's a risk that lots of data has to be passed from one core to another. So I'd first try to split computation work between threads in much larger pieces than a single small FFT at a time. As for performance, the assembly loops when I worked on this (which is close to 10(!) years ago) ran at some 8-10 cycles per butterfly, on the AMD Opteron which was high end at the time. 
I'm aware of a trick by David Harvey, who got running time down to only 6 cycles per butterfly (totally saturating the multiplication units) on the same hardware, by using primes close to 2^63 rather than close to 2^64. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel