2009/5/3 Marc <marc.gli...@gmail.com>:
>
> On Apr 30, 5:34 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>
>> >> * Support for parallel processing (e.g. multithreaded code) and GPU's
>>
>> > This is the part I am curious about: what are your plans there? In
>> > particular, at what level do you place the parallelism? The code is
>> > already thread-safe. Do you want to port it so it runs on the GPU, and
>> > each GPU core can do a separate operation (the parallelism is still at
>> > the level of the application)?
>>
>> No, that's not the plan at this stage. I doubt it would be very useful.
>
> Many people use GMP for some kind of portable (and sometimes slightly
> longer) long long, so I don't think it is completely unrealistic,
> although there would be many restrictions (with allocations, in
> particular). For instance, there are RSA implementations on the GPU,
> but they use only some parallelism inside each RSA op and rely on
> executing a batch of many RSA ops in parallel for performance. Now I
> can understand concentrating on very large numbers for MPIR. I see
> there are implementations of the FFT available on the GPU (with
> floats), which already looks more relevant to your goals.

Probably we'll parallelise more than the FFT. It's hard to say where
parallel code will stop being useful, but I would think we'd get down
to parallel Karatsuba.
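
Just to give a rough idea of what I mean, a single level of Karatsuba
with the three half-sized products computed concurrently might look
something like the sketch below. This is only at the mpz level and is
not MPIR code; a real version would work at the mpn level and manage
scratch space and carries itself, and it assumes the default
(thread-safe) memory functions:

#include <gmp.h>   /* or mpir.h */
#include <omp.h>

/* Sketch only: one level of Karatsuba, with the three half-sized
   products computed concurrently. */
void
karatsuba_mt (mpz_t rop, const mpz_t a, const mpz_t b)
{
  unsigned long k = mpz_sizeinbase (a, 2) / 2;  /* split point, in bits */
  mpz_t a0, a1, b0, b1, z0, z1, z2, s, t;

  mpz_init (a0); mpz_init (a1); mpz_init (b0); mpz_init (b1);
  mpz_init (z0); mpz_init (z1); mpz_init (z2); mpz_init (s); mpz_init (t);

  mpz_tdiv_q_2exp (a1, a, k);  mpz_tdiv_r_2exp (a0, a, k);  /* a = a1*2^k + a0 */
  mpz_tdiv_q_2exp (b1, b, k);  mpz_tdiv_r_2exp (b0, b, k);  /* b = b1*2^k + b0 */

  mpz_add (s, a0, a1);
  mpz_add (t, b0, b1);

  /* The three products are independent, so they can run concurrently. */
#pragma omp parallel sections
  {
#pragma omp section
    mpz_mul (z0, a0, b0);   /* low product    */
#pragma omp section
    mpz_mul (z2, a1, b1);   /* high product   */
#pragma omp section
    mpz_mul (z1, s, t);     /* middle product */
  }

  mpz_sub (z1, z1, z0);     /* z1 = (a0+a1)*(b0+b1) - z0 - z2 */
  mpz_sub (z1, z1, z2);

  mpz_mul_2exp (rop, z2, 2*k);   /* rop = z2*2^(2k) + z1*2^k + z0 */
  mpz_mul_2exp (z1, z1, k);
  mpz_add (rop, rop, z1);
  mpz_add (rop, rop, z0);

  mpz_clear (a0); mpz_clear (a1); mpz_clear (b0); mpz_clear (b1);
  mpz_clear (z0); mpz_clear (z1); mpz_clear (z2); mpz_clear (s); mpz_clear (t);
}

Of course one level like this only keeps three threads busy; recursing,
or parallelising inside the sub-products, would be needed to use more
cores, and at some size the overhead will outweigh the gain.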

>
>> > For multi-threading, there are some easy possibilities
>> > in mpq (not so much in mpz, except for things like the Miller-Rabin
>> > test) and inside mpn, for instance in the fft and toom codes (a couple
>> > openmp pragmas and a second thread are enough to reduce the runtime of
>> > a large multiplication by 25%), assuming applications are unable to
>> > get enough parallelism on their own.
>>
>> That's pretty much what we have in mind.
>
> Is there a branch somewhere with people working on it? Have you
> decided on a framework (not sure that's the right term), like openmp
> for instance?

Hopefully that will sort itself out in the next two weeks. There's no
branch as such at this point, so if you are thinking you'd like to be
involved, you turned up at just the right moment. I personally won't
be starting on that for two more weeks though, as I have some other
things to finish off first.

I believe we will start by using OpenMP. However, there will be three
fronts on which we will be working on parallelism:

1) OpenMP C code for "ordinary" multi-CPU machines - we'll start with
multiplication, as it is the most important, but possibly not the
FFT, as the FFT we ultimately want to use in MPIR is not in the
library yet

My intention is to introduce new functions, mpir_mt_blah(),
mpn_mt_blah(), mpz_mt_blah(), where "mt" stands for multithreaded.

2) Use of the SPEs on the Cell architecture (we won't start work on
this for four weeks, as I won't have access to the hardware again
until then)

3) GPUs, specifically via NVIDIA GPU assembly code (we have access to
the hardware now)

>
>> I suspect that (as you indicate), nails would have to be modified to
>> make it useful. I personally haven't thought about its implications
>> for parallel processing.
>> If you have some ideas, it would be interesting to hear them.
>
> Oh, nothing special. If you are in a model of computation where you
> can do as many operations in parallel as you want as long as there is
> no dependency or interference between them (that is not really a good
> approximation of a GPU or anything else though), without nails, an
> operation like sub(2^n,1) or add(2^n-1,1) must still take linear time
> to propagate the carry. However, if you allow some non-determinism
> (including nail bits), you can make add_n a constant time operation
> (for instance 1: add limb by limb; 2: canonicalize even limbs, adding
> the carry to the next limb; 3: same thing for odd limbs; in the end
> some even limbs are not canonical, but that's ok), and similarly for
> sub_n (which shows that the nail representation required is not as
> simple as it looked from add_n), mul_1, etc. Although it might be
> better to group limbs into blocks of 10 limbs and have nails for each
> block instead of each limb. Now that may be completely irrelevant for
> any realistic implementation... It places the parallelism at the
> lowest level, which is why I was mentioning vector architectures, but
> I don't think any vector architecture provides the right instructions
> to make it useful. And I don't think you are trying to create some
> bignum hardware and minimize the instruction depth either :-)

That's a nice idea. I wonder whether it makes sense to have a NAILS
mode which is supported by all functions, or whether it is more useful
to implement a nails layer in the low-level parallel code which
supports a higher-level, ordinary, non-nailified interface. What I
mean is, suppose someone wants to multiply two n-limb integers. First
these get split up into nailified limbs, the data is then operated on
using nail-enabled functions, and finally the data is converted back
to ordinary integers. This means that maintenance of a nails-enabled
interface can be managed per architecture and only a handful of
functions need to support it. It also means that nails can be used
only where they are certain to be faster. And it means that libraries
such as NTL, Pari, etc., which cannot currently use nails at the mpn
level, will still operate with nails turned on, even on parallel
architectures.
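
Just to check I've understood the constant-depth addition you
describe, in rough C it would be something like the following (a
sketch only, assuming a nails-enabled build with GMP_NAIL_BITS > 0 and
enough nail headroom that the limb-wise sums cannot overflow, and
ignoring any carry out of the top limb):

#include <gmp.h>

/* Sketch of the constant-depth nail addition: inputs and output are
   n-limb numbers in nail form.  Each of the three passes is purely
   element-wise, so each could be run in parallel. */
void
nail_add_n (mp_limb_t *rp, const mp_limb_t *ap, const mp_limb_t *bp,
            mp_size_t n)
{
  mp_size_t i;

  /* Pass 1: limb-by-limb adds, no carry propagation. */
  for (i = 0; i < n; i++)
    rp[i] = ap[i] + bp[i];

  /* Pass 2: canonicalize even limbs, pushing each nail part into the
     next (odd) limb; iterations touch disjoint limbs. */
  for (i = 0; i + 1 < n; i += 2)
    {
      rp[i + 1] += rp[i] >> GMP_NUMB_BITS;
      rp[i] &= GMP_NUMB_MASK;
    }

  /* Pass 3: the same for odd limbs.  Some even limbs may become
     non-canonical again, which the representation tolerates. */
  for (i = 1; i + 1 < n; i += 2)
    {
      rp[i + 1] += rp[i] >> GMP_NUMB_BITS;
      rp[i] &= GMP_NUMB_MASK;
    }
}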

Another thing we want to support is scalar operations. One will pass
an array of integers to such a function, and the same operation will
be performed on all of them. We leave it to the user to figure out
what to do with such scalar functions. They can be trivially
parallelised, if the interface is clever enough.
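
For example (the name and signature below are made up purely to
illustrate the kind of interface I have in mind), since the array
elements are independent a single OpenMP pragma is enough:

#include <gmp.h>
#include <omp.h>

/* Hypothetical batched operation (name and signature are illustrative
   only): add corresponding entries of two arrays of mpz_t.  The
   elements are independent, so the loop parallelises trivially. */
void
mpz_array_add (mpz_t *rp, mpz_t *ap, mpz_t *bp, long n)
{
  long i;

#pragma omp parallel for
  for (i = 0; i < n; i++)
    mpz_add (rp[i], ap[i], bp[i]);
}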

It might end up that we don't parallelise the FFT, but instead
implement a small-prime FFT or multimodular multiplication for large
integers, which can be parallelised more easily. I'm not too keen on
trying to push integer data through parallel floating-point FFTs, as
we need to be sure no data is lost, and the current provable bounds
for floating-point FFTs are pretty pathetic.
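
The attraction of the multimodular approach is that the parallelism is
easy to expose: reduce both operands modulo a set of word-sized
primes, multiply the residues independently, then recombine by CRT.
Roughly, at the mpz level (a sketch only; it assumes non-negative
operands, the caller must supply enough primes that their product
exceeds a*b, and a real version would work at the mpn level):

#include <stdlib.h>
#include <gmp.h>
#include <omp.h>

/* Sketch of multimodular multiplication.  The k residue
   multiplications are independent, hence trivially parallel; the CRT
   recombination is done serially here. */
void
mul_multimod (mpz_t rop, const mpz_t a, const mpz_t b,
              const unsigned long *p, int k)
{
  unsigned long *r = malloc (k * sizeof (unsigned long));
  mpz_t M, t, inv, term, pz;
  int i;

  /* Independent residue multiplications, one per prime.  An mpz
     temporary is used so the product of residues cannot overflow. */
#pragma omp parallel for
  for (i = 0; i < k; i++)
    {
      mpz_t u;
      mpz_init_set_ui (u, mpz_fdiv_ui (a, p[i]));
      mpz_mul_ui (u, u, mpz_fdiv_ui (b, p[i]));
      r[i] = mpz_fdiv_ui (u, p[i]);             /* (a*b) mod p[i] */
      mpz_clear (u);
    }

  /* Recombine with the Chinese Remainder Theorem. */
  mpz_init_set_ui (M, 1);
  mpz_init (t); mpz_init (inv); mpz_init (term); mpz_init (pz);
  for (i = 0; i < k; i++)
    mpz_mul_ui (M, M, p[i]);
  mpz_set_ui (rop, 0);
  for (i = 0; i < k; i++)
    {
      mpz_divexact_ui (t, M, p[i]);     /* t = M / p[i]          */
      mpz_set_ui (pz, p[i]);
      mpz_invert (inv, t, pz);          /* inv = t^(-1) mod p[i] */
      mpz_mul_ui (term, t, r[i]);       /* term = r[i]*t*inv     */
      mpz_mul (term, term, inv);
      mpz_add (rop, rop, term);
    }
  mpz_mod (rop, rop, M);

  mpz_clear (M); mpz_clear (t); mpz_clear (inv);
  mpz_clear (term); mpz_clear (pz);
  free (r);
}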

Bill.
