Re: Any interest in multi-threaded code?

2018-05-15 Thread Niels Möller
"Marco Bodrato"  writes:

> I merged the current head with that 10-year-old repository; try again.

Thanks for taking care of that!

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-05-13 Thread Marco Bodrato
Ciao,

On Mon, 7 May 2018, 9:02 pm, Victor Shoup wrote:
> I finally have time to study this.

> Then I found out about .bootstrap...that doesn't work either
> (no aclocal-1.8).  Tried editing .bootstrap to remove "-1.8"...errors.

I merged the current head with that 10-year-old repository; try again.

>> On Apr 4, 2018, at 4:53 PM, Niels Möller  wrote:
>> Victor Shoup  writes:
>>> So I guess I'm just asking where the FFT code is at and if there would

>> Here: https://gmplib.org/repo/gcd-nisse/
>>
>> The relation to gcd is that subquadratic gcd includes multiplication of
>> 2x2 matrices,

But this is currently disabled, because the current hgcd code does not
allocate the correct temp size. tune/tuneup will find the threshold,
though.

Ĝis,
m

-- 
http://bodrato.it/papers/

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-05-07 Thread Victor Shoup
Hi,

I finally have time to study this.
One problem is, I don't seem to be able to build it.
No configure file in the repo, so I tried autoconf... lots of errors.
Then I found out about .bootstrap...that doesn't work either
(no aclocal-1.8).  Tried editing .bootstrap to remove "-1.8"...errors.

Any advice?

And while I'm at it, is there a brief statement about 
what exactly the function mpn_fourier_transform_ct actually does?
I mean, questions like: are the inputs/outputs in standard order
or bit-reverse order?
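
For concreteness, here is the kind of thing I mean by bit-reverse order;
a radix-2 decimation-in-time transform typically wants its input permuted
this way (and decimation-in-frequency leaves its output this way). Purely
illustrative C on my part, nothing from the repo:

  #include <stdint.h>
  #include <stddef.h>

  /* Index i of a length-2^k transform, with its k low bits mirrored. */
  static size_t
  bit_reverse (size_t i, unsigned k)
  {
    size_t r = 0;
    for (unsigned b = 0; b < k; b++)
      r |= ((i >> b) & 1) << (k - 1 - b);
    return r;
  }

  /* Swap a[i] and a[bit_reverse(i)], once per pair. */
  static void
  bit_reverse_permute (uint64_t *a, unsigned k)
  {
    size_t n = (size_t) 1 << k;
    for (size_t i = 0; i < n; i++)
      {
        size_t j = bit_reverse (i, k);
        if (i < j)
          {
            uint64_t t = a[i];
            a[i] = a[j];
            a[j] = t;
          }
      }
  }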

Thanks!
 
> On Apr 4, 2018, at 4:53 PM, Niels Möller  wrote:
> 
> Victor Shoup  writes:
> 
>> So I guess I'm just asking where the FFT code is at and if there would be 
>> any objections to my using it in that way. 
> 
> Here: https://gmplib.org/repo/gcd-nisse/
> 
> The relation to gcd is that subquadratic gcd includes multiplication of
> 2x2 matrices, and I wrote some code to do that multiplication with
> shared fft transforms (and possibly also doing some of the additions
> in the transform domain; I don't quite remember), which was a nice
> exercise of the new fft primitives.
> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-04-05 Thread Richard Foster

> Presumably, that idea will work.
> But based on my experience with multi-threading,
> because of various overheads and latencies involved,
> you have to make sure that there is enough work to do per
> thread to make it worthwhile.  Parallelizing the inner loops
> like this may or may not yield the expected speedup.
> One would have to try it and see... my intuition and experience
> make me somewhat skeptical, but I'd be happy to be proved wrong.
It looks as though the majority vote is for parallelising at the outer 
level rather than the inner loop.


I'll leave that to more experienced hands, although I might have a 
private attempt at the inner version, just for my own interest.


Richard

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-04-04 Thread Victor Shoup
Thanks!

> On Apr 4, 2018, at 4:53 PM, Niels Möller  wrote:
> 
> Victor Shoup  writes:
> 
>> So I guess I'm just asking where the FFT code is at and if there would be 
>> any objections to my using it in that way. 
> 
> Here: https://gmplib.org/repo/gcd-nisse/
> 
> The relation to gcd is that subquadratic gcd includes multiplication of
> 2x2 matrices, and I wrote some code to do that multiplication with
> shared fft transforms (and possibly also doing some of the additions
> in the transform domain; I don't quite remember), which was a nice
> exercise of the new fft primitives.
> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-04-04 Thread Niels Möller
Victor Shoup  writes:

> So I guess I'm just asking where the FFT code is at and if there would be any 
> objections to my using it in that way. 

Here: https://gmplib.org/repo/gcd-nisse/

The relation to gcd is that subquadratic gcd includes multiplication of
2x2 matrices, and I wrote some code to do that multiplication with
shared fft transforms (and possibly also doing some of the additions
in the transform domain; I don't quite remember), which was a nice
exercise of the new fft primitives.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-04-04 Thread Victor Shoup
Presumably, that idea will work.
But based on my experience with multi-threading,
because of various overheads and latencies involved,
you have to make sure that there is enough work to do per
thread to make it worthwhile.  Parallelizing the inner loops
like this may or may not yield the expected speedup.
One would have to try it and see... my intuition and experience
make me somewhat skeptical, but I'd be happy to be proved wrong.


> On Apr 4, 2018, at 2:56 PM, Richard Foster  wrote:
> 
> Hi
> 
>> It is probably not useful to work much on the current FFT code.  We have
>> long planned--and worked quite a bit on--replacement FFT code with much
>> better memory access pattern.
>> 
>> The replacement code works with multiple small coefficient fields
>> instead of a large ring.  It also implements Bailey's multidimensional
>> FFT trick for memory locality improvements.
>> 
>> I have forgotten the exact state of the code, but I am sure Nisse can
>> fill me in.
>> 
>> Any parallel work should go in that code, but only after it is
>> well-tuned for a single thread, I think.
>> 
>> Making GMP scale well for multiple threads is highly non-trivial.  Only
>> superlinear algorithms should even be attempted.  And then, one will
>> want to make it as coarse-grained as possible, perhaps even with some
>> cache topology understanding built into the code.
>> 
>> Generating data in thread A which is then processed in thread B will
>> typically mean that data needs to be transferred from one cache to
>> another.  That is super expensive, and will offset any benefit unless
>> the computation before or after the data transfer is huge.
> The basic idea I've been looking at is not to run the FFT itself in parallel, 
> but to parallelise the inner loop. The algorithm recurses until the operand 
> sizes are below a suitable size, then loops K times; each iteration 
> multiplies 2 numbers of n limbs.
> Naively, if you have m CPUs, then you should get a speedup if each CPU 
> multiplies 2 numbers of n/m limbs. There will be a total of mK 
> multiplications, but since each is smaller, it should go faster.
> As Torbjörn says, cache usage is key: but within some uncertainty, the cache 
> requirement should be roughly the same. 2m operands of n/m limbs should need 
> roughly the same total space as 2 operands of n limbs, and the code should 
> take up roughly the same space; at this point in the algorithm all the 
> operands are the same size, so the GMP code will pick the same method for all 
> the iterations (schoolbook, Toom, etc), so there will only be 1 copy of the 
> code in cache - assuming it's re-entrant, as all modern code should be. There 
> is an overhead of a stack per thread, which must be checked out.
> The other assumption is that the forward FFT leaves the operands in a 
> suitable state, ie, each operand is in contiguous memory. This is an 
> advantage anyway, even if you're not running in parallel, so I think it's a 
> reasonable assumption. Also, since each of the smaller operands is a piece of 
> the original, cache locality for the smaller multiplications is no worse than 
> before.
> Finally, the multiplications must leave their results in a suitable state for 
> the inverse FFT. Again, I think this is a reasonable assumption, as the 
> non-parallel version of the loop does the same thing as the parallel version, 
> just with larger operands.
> 
> Does the reasoning appear OK, or am I totally on the wrong track?
> 
> Even if it looks OK in today's code, does Nisse's new FFT code change anything?
> 
> Regards
> Richard Foster
> 
> 

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-04-04 Thread Richard Foster

Hi

> It is probably not useful to work much on the current FFT code.  We have
> long planned--and worked quite a bit on--replacement FFT code with much
> better memory access pattern.
>
> The replacement code works with multiple small coefficient fields
> instead of a large ring.  It also implements Bailey's multidimensional
> FFT trick for memory locality improvements.
>
> I have forgotten the exact state of the code, but I am sure Nisse can
> fill me in.
>
> Any parallel work should go in that code, but only after it is
> well-tuned for a single thread, I think.
>
> Making GMP scale well for multiple threads is highly non-trivial.  Only
> superlinear algorithms should even be attempted.  And then, one will
> want to make it as coarse-grained as possible, perhaps even with some
> cache topology understanding built into the code.
>
> Generating data in thread A which is then processed in thread B will
> typically mean that data needs to be transferred from one cache to
> another.  That is super expensive, and will offset any benefit unless
> the computation before or after the data transfer is huge.
The basic idea I've been looking at is not to run the FFT itself in 
parallel, but to parallelise the inner loop. The algorithm recurses 
until the operand sizes are below a suitable size, then loops K times; 
each iteration multiplies 2 numbers of n limbs.
Naively, if you have m CPUs, then you should get a speedup if each CPU 
multiplies 2 numbers of n/m limbs. There will be a total of mK 
multiplications, but since each is smaller, it should go faster.
As Torbjörn says, cache usage is key: but within some uncertainty, the 
cache requirement should be roughly the same. 2m operands of n/m limbs 
should need roughly the same total space as 2 operands of n limbs, and 
the code should take up roughly the same space; at this point in the 
algorithm all the operands are the same size, so the GMP code will pick 
the same method for all the iterations (schoolbook, Toom, etc), so there 
will only be 1 copy of the code in cache - assuming it's re-entrant, as 
all modern code should be. There is an overhead of a stack per thread, 
which must be checked out.
The other assumption is that the forward FFT leaves the operands in a 
suitable state, ie, each operand is in contiguous memory. This is an 
advantage anyway, even if you're not running in parallel, so I think 
it's a reasonable assumption. Also, since each of the smaller operands 
is a piece of the original, cache locality for the smaller 
multiplications is no worse than before.
Finally, the multiplications must leave their results in a suitable 
state for the inverse FFT. Again, I think this is a reasonable 
assumption, as the non-parallel version of the loop does the same thing 
as the parallel version, just with larger operands.
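
To make that concrete, here is a rough sketch of the parallel inner loop
I have in mind, using OpenMP on top of the public mpn layer. mpn_mul_n is
the real GMP primitive; the flat layout of the K operands is my own
assumption about what the forward transform could leave behind (compile
with -fopenmp):

  #include <stddef.h>
  #include <gmp.h>

  /* u and v each hold K operands of n limbs back to back; prods
     receives K products of 2n limbs each.  The K multiplications are
     independent, so the loop needs no synchronisation beyond the
     implicit barrier at its end. */
  void
  pointwise_products (mp_limb_t *prods, const mp_limb_t *u,
                      const mp_limb_t *v, size_t K, mp_size_t n)
  {
  #pragma omp parallel for schedule(static)
    for (size_t k = 0; k < K; k++)
      mpn_mul_n (prods + 2 * (size_t) n * k, u + (size_t) n * k,
                 v + (size_t) n * k, n);
  }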


Does the reasoning appear OK, or am I totally on the wrong track?

Even if it looks OK in today's code, does Nisse's new FFT code change 
anything?


Regards
Richard Foster


___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-04-04 Thread Victor Shoup
Hi all,

Since this FFT code is sitting around, I wonder if there's a chance I could 
adapt it to NTL. I actually implemented David Harvey's ideas a few years ago 
(and his work is actually based on older NTL logic which I never really 
publicized).  NTL currently only uses special types like __uint128_t (or 
whatever it's called) or a few inline asm code sequences. What's missing is 
logic to keep things in the L1 cache, so there is definitely much room for 
improvement. 

I've planned on doing these improvements myself for some time, but haven't yet 
found the time.  If I could start from an existing code base and adapt it, that 
would probably speed things up. Hopefully, as long as everything is LGPL 
compatible and nobody minds me adapting their code, this would work out. 

So I guess I'm just asking where the FFT code is at and if there would be any 
objections to my using it in that way. 

BTW, I've done the multithreaded implementation of this type of thing. It works 
great, although in my setting, the number of primes is usually larger (like a 
dozen or more).

-Victor 

> On Apr 4, 2018, at 3:41 PM, Niels Möller  wrote:
> 
> "Richard Foster"  writes:
> 
>> Even if it looks OK in today's code, does Nisse's new FFT code change
>> anything?
> 
> The small-prime FFT code uses fft-based multiplication of polynomials
> over Z_p for a few single-limb primes p, usually in the range from 3 to
> 5, depending on how the input size fits constraints on transform size;
> it's probably beneficial to go a bit beyond 5.
> 
> The FFTs are organized to spend most of the time doing small FFTs
> fitting the L1 cache and implemented as tight assembly loops, and on top
> of that somewhat hairy logic doing twiddle factors, data transpositions,
> truncated fft logic, and the like.
> 
> Regarding threading, the best strategy will likely differ depending on
> the number of cores you target. For a small number of cores, it seems
> reasonable to let one core handle each p, doing all of forward
> transforms, pointwise multiply, and inverse transforms. They would then
> work independently, with only an initial pass to distribute the mod p
> work to threads, and a final pass to combine mod p results using CRT and
> assemble the final product. I'm not that familiar with threading, but
> I'd expect a reasonable speedup for large operands, aiming to have each
> core mainly exercise its local L1 cache. If cores end up competing for
> available cache or memory bandwidth, gain will be smaller.
> 
> It's also possible to try to spawn worker threads to do the smallest mod
> p FFTs, but as Torbjörn says, there's a risk that lots of data has to be
> passed from one core to another. So I'd first try to split computation
> work between threads in much larger pieces than a single small FFT at a
> time.
> 
> As for performance, the assembly loops when I worked on this (which is
> close to 10(!) years ago) ran at some 8-10 cycles per butterfly, on the
> AMD Opteron which was high end at the time.
> 
> I'm aware of a trick by David Harvey, who got running time down to only
> 6 cycles per butterfly (totally saturating the multiplication units) on
> the same hardware, by using primes close to 2^63 rather than close to
> 2^64.
> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Any interest in multi-threaded code?

2018-04-04 Thread Niels Möller
"Richard Foster"  writes:

> Even if it looks OK in today's code, does Nisse's new FFT code change
> anything?

The small-prime FFT code uses fft-based multiplication of polynomials
over Z_p for a few single-limb primes p, usually in the range from 3 to
5, depending on how the input size fits constraints on transform size;
it's probably beneficial to go a bit beyond 5.
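
(The mod p products are glued back together with CRT, as described below.
For two primes that reconstruction step is just the following; this is
illustrative code on my part, with p1inv = p1^{-1} mod p2 assumed
precomputed, and the real code of course works with 3-5 primes and
vectors of limbs rather than single residues:)

  #include <stdint.h>

  static uint64_t
  mulmod (uint64_t a, uint64_t b, uint64_t m)
  {
    return (uint64_t) (((unsigned __int128) a * b) % m);
  }

  /* Given x1 = x mod p1 and x2 = x mod p2, recover x mod p1*p2 as
     x = x1 + p1 * ((x2 - x1) * p1^{-1} mod p2).  */
  static unsigned __int128
  crt2 (uint64_t x1, uint64_t x2, uint64_t p1, uint64_t p2, uint64_t p1inv)
  {
    uint64_t r1 = x1 % p2;
    uint64_t d = x2 >= r1 ? x2 - r1 : x2 + (p2 - r1); /* (x2-x1) mod p2 */
    uint64_t t = mulmod (d, p1inv, p2);
    return (unsigned __int128) t * p1 + x1;
  }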

The FFTs are organized to spend most of the time doing small FFTs
fitting the L1 cache and implemented as tight assembly loops, and on top
of that somewhat hairy logic doing twiddle factors, data transpositions,
truncated fft logic, and the like.
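
(Schematically, this is the usual matrix decomposition: view a length
N = R*C transform as C transforms of length R, a twiddle pass, then R
transforms of length C, with transpositions so that each small transform
runs on contiguous, L1-sized data. A very rough outline, with made-up
helper names; the real code is hairier:)

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical helpers, not from the repo. */
  void fft_small (uint64_t *a, size_t len, uint64_t p);
  void twiddle_mul (uint64_t *a, size_t r, size_t c, uint64_t p);
  void transpose (uint64_t *a, size_t rows, size_t cols);

  void
  fft_4step (uint64_t *a, size_t R, size_t C, uint64_t p)
  {
    transpose (a, R, C);            /* columns become contiguous rows  */
    for (size_t i = 0; i < C; i++)
      fft_small (a + i * R, R, p);  /* C transforms of length R, in L1 */
    twiddle_mul (a, R, C, p);       /* multiply in the twiddle factors */
    transpose (a, C, R);
    for (size_t i = 0; i < R; i++)
      fft_small (a + i * C, C, p);  /* R transforms of length C        */
    /* Output is left in a permuted order, which is fine as long as
       the inverse transform undoes the same permutation. */
  }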

Regarding threading, the best strategy will likely differ depending on
the number of cores you target. For a small number of cores, it seems
reasonable to let one core handle each p, doing all of forward
transforms, pointwise multiply, and inverse transforms. They would then
work independently, with only an initial pass to distribute the mod p
work to threads, and a final pass to combine mod p results using CRT and
assemble the final product. I'm not that familiar with threading, but
I'd expect a reasonable speedup for large operands, aiming to have each
core mainly exercise its local L1 cache. If cores end up competing for
available cache or memory bandwidth, gain will be smaller.
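
A minimal sketch of that strategy, with one POSIX thread per prime;
fft_mul_mod_p and crt_combine are made-up names for this illustration,
not functions from the repo:

  #include <pthread.h>
  #include <stdint.h>
  #include <stdlib.h>

  #define NPRIMES 5

  /* Hypothetical helpers: fft_mul_mod_p does forward transforms,
     pointwise products and the inverse transform mod a single prime;
     crt_combine is the final CRT-and-assemble pass. */
  void fft_mul_mod_p (uint64_t *rp, const uint64_t *a, const uint64_t *b,
                      size_t n, uint64_t p);
  void crt_combine (uint64_t *rp, uint64_t *res[], int k, size_t n,
                    const uint64_t *p);

  struct job { const uint64_t *a, *b; uint64_t *rp; size_t n; uint64_t p; };

  static void *
  worker (void *arg)
  {
    struct job *j = arg;
    /* This thread works entirely mod its own prime, so it mostly
       exercises its local cache. */
    fft_mul_mod_p (j->rp, j->a, j->b, j->n, j->p);
    return NULL;
  }

  void
  mul_fft_threaded (uint64_t *rp, const uint64_t *a, const uint64_t *b,
                    size_t n, const uint64_t p[NPRIMES])
  {
    pthread_t tid[NPRIMES];
    struct job jobs[NPRIMES];
    uint64_t *res[NPRIMES];

    for (int i = 0; i < NPRIMES; i++)
      {
        res[i] = malloc (2 * n * sizeof (uint64_t)); /* 2n coeffs mod p[i] */
        jobs[i] = (struct job) { a, b, res[i], n, p[i] };
        pthread_create (&tid[i], NULL, worker, &jobs[i]);
      }
    for (int i = 0; i < NPRIMES; i++)
      pthread_join (tid[i], NULL);

    crt_combine (rp, res, NPRIMES, n, p); /* final pass: CRT + assemble */
    for (int i = 0; i < NPRIMES; i++)
      free (res[i]);
  }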

It's also possible to try to spawn worker threads to do the smallest mod
p FFTs, but as Torbjörn says, there's a risk that lots of data has to be
passed from one core to another. So I'd first try to split computation
work between threads in much larger pieces than a single small FFT at a
time.
 
As for performance, the assembly loops when I worked on this (which is
close to 10(!) years ago) ran at some 8-10 cycles per butterfly, on the
AMD Opteron which was high end at the time.

I'm aware of a trick by David Harvey, who got running time down to only
6 cycles per butterfly (totally saturating the multiplication units) on
the same hardware, by using primes close to 2^63 rather than close to
2^64.
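
The reason p < 2^63 helps: with residues kept in [0, p), the sum in a
butterfly stays below 2^64, so there is no carry to deal with, and the
twiddle multiplication can use a precomputed quotient (Shoup's trick,
which I believe is what NTL does). A sketch under those assumptions;
the names are mine, not from the repo:

  #include <stdint.h>

  /* wp = floor(w * 2^64 / p), precomputed for the twiddle factor w.
     Returns w*y mod p; the intermediate r lands in [0, 2p). */
  static inline uint64_t
  mulmod_shoup (uint64_t y, uint64_t w, uint64_t wp, uint64_t p)
  {
    uint64_t q = (uint64_t) (((unsigned __int128) y * wp) >> 64);
    uint64_t r = y * w - q * p;   /* exact mod 2^64 */
    return r >= p ? r - p : r;
  }

  /* One in-place DIT butterfly mod p, p < 2^63, *x and *y in [0, p). */
  static inline void
  butterfly (uint64_t *x, uint64_t *y, uint64_t w, uint64_t wp, uint64_t p)
  {
    uint64_t t = mulmod_shoup (*y, w, wp, p);
    uint64_t s = *x + t;          /* < 2p < 2^64: no carry logic */
    uint64_t d = *x - t + p;      /* in (0, 2p) */
    *x = s >= p ? s - p : s;
    *y = d >= p ? d - p : d;
  }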

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel