Regarding Forth, check out Jonesforth. Truly one of the most beautiful
pieces of code on the web.

Bill.

On 22 June 2010 23:45, Antony Vennard <antony.venn...@gmail.com> wrote:
> Just a quick check in from me.
>
> I am indeed still working on this. Work has slowed slightly because I
> set up a company and am just finishing with a client project (gotta
> fund myself somehow). However, the upside is I've written a python
> script to automate quite a bit of the process of writing a django app
> (because nothing's worse than writing the same thing over and over
> again inserting different variable names). So fingers crossed it
> shouldn't take me too much longer. Next few weeks, I promise, and
> something more will be working.
>
> Now, I'm a programmer, not a website designer, which means whenever I go
> near CSS I produce ugly websites no matter how hard I try, so if anyone
> has a taste for design or wants to explore their creative side, say hi
> now. Or, if you're a django whiz, say hi now also. I'm not fussy.
>
> Disclaimer: the following will be bsd-licensed and/or relate to bsdnt
>
> In other news - I've been trying to work out how to parallelise
> operations in bsdnt. It seems that parallelising the core functions
> themselves is a very hard problem: they are already very efficient
> when implemented from scratch in assembly, performing at a level
> comparable to existing bignum libraries. Bill suggested that what
> might be better is to write vector functions, where one passes an
> array of arbitrary precision numbers to an operation function and it
> does all of them at once.
>
> This is actually relatively trivial to do. Based on mpir (any
> contribution I make is bsd-licensed or gplv2, take your pick) one could
> do this:
>
> /* w[i] = u[i] * v[i] for each of the dim entries, with the
>  * iterations shared out across openmp threads */
> int mpz_mul_vec(mpz_ptr* w, mpz_srcptr* u, mpz_srcptr* v, const int dim)
> {
>        int i;
>        #pragma omp parallel for shared(w, u, v)
>        for ( i = 0; i < dim; i++ )
>        {
>                mpz_mul(w[i], u[i], v[i]);
>        }
>
>        return 0;
> }
>
> Basically openmp would then break this down into pthread_create()
> calls (run ldd on any openmp program that doesn't explicitly link
> pthreads and you'll find libpthread linked in anyway...) or
> CreateThread() on Windows. Compile with -fopenmp on gcc/clang, or
> /openmp with msvc.
>
> Disclaimer - not tested (yet).
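>
> For illustration, calling it might look something like this (also
> untested, and main() here is just for the example):
>
> #include <stdio.h>
> #include "mpir.h"
>
> int main(void)
> {
>        enum { DIM = 4 };
>        mpz_t w[DIM], u[DIM], v[DIM];
>        mpz_ptr wp[DIM];
>        mpz_srcptr up[DIM], vp[DIM];
>        int i;
>
>        for ( i = 0; i < DIM; i++ )
>        {
>                mpz_init(w[i]);
>                mpz_init_set_ui(u[i], i + 1);
>                mpz_init_set_ui(v[i], i + 2);
>                /* mpz_t decays to a pointer, so this builds the
>                 * argument arrays mpz_mul_vec expects */
>                wp[i] = w[i]; up[i] = u[i]; vp[i] = v[i];
>        }
>
>        mpz_mul_vec(wp, up, vp, DIM);
>
>        for ( i = 0; i < DIM; i++ )
>                gmp_printf("%Zd * %Zd = %Zd\n", u[i], v[i], w[i]);
>
>        for ( i = 0; i < DIM; i++ )
>        {
>                mpz_clear(w[i]); mpz_clear(u[i]); mpz_clear(v[i]);
>        }
>
>        return 0;
> }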
>
> The same pattern works for any function you want to parallelise in
> "vector" format - add, sub, whatever.
>
> You could go further: detect the number of cpus present / number of
> threads requested and work through dim in chunks of that size using
> two loops, so that you stick to the "optimum" number of threads per
> cpu rather than going all-out parallel and getting nowhere because
> the cpu is constantly switching between threads. Here's a rough
> sketch of how it might be done (disclaimer - I wrote this in my email
> client - it can probably be done better):
>
> int mpz_mul_vec(mpz_ptr* w, mpz_srcptr* u, mpz_srcptr* v, const int dim)
> {
>        /* omp_get_max_threads() needs <omp.h>; note that
>         * omp_get_num_threads() returns 1 outside a parallel
>         * region, which is not what we want here */
>        int nt = omp_get_max_threads();
>        int i, x;
>
>        /* work through the vector in chunks of nt entries, so we
>         * never run more threads than cpus at any one time */
>        for ( x = 0; x < dim; x += nt )
>        {
>                int end = ( x + nt < dim ) ? x + nt : dim;
>
>                #pragma omp parallel for shared(w, u, v)
>                for ( i = x; i < end; i++ )
>                {
>                        mpz_mul(w[i], u[i], v[i]);
>                }
>        }
>
>        return 0;
> }
>
> Disclaimer: definitely not tested yet.
>
> Now, the next stage is to get things going on the gpu. For that,
> Forth was mentioned somewhere, so whenever I have a spare moment I
> can't face spending on web dev, I'm slowly writing one in x86-64
> linux assembly, so I can then port the ideas to ARB or to PTX,
> nvidia's asm language. Either way, I'm learning a heck of a lot,
> so...
>
> ...that's me for now.
>
> Antony
>
> On 06/22/2010 12:30 AM, Bill Hart wrote:
>> Hi all,
>>
>> this is the first of three posts, primary audience sage-devel, esp.
>> those attending Sage Days in Leiden.
>>
>> As many of you know, I recently stepped down from the MPIR project.
>> The main reason was that I need to focus on FLINT until about 1 Oct
>> 2010, then need to transition to 80% research, 20% coding (my
>> colleagues at Warwick will be aghast that this is not already the
>> ratio - but they probably guessed as much). Now that I have stepped
>> down from MPIR, I can divulge all the "secret" development plans I
>> had, in the hope that some other people might get inspired. That is
>> what this email is about - potential MPIR projects that *anyone* could
>> work on (assuming they code in C), even research level projects!!
>>
>> Firstly, in my mind, MPIR is not going away just because I stopped
>> working on it regularly. Since the last release I managed, a major
>> MPIR release, 2.1.1, has gone out, which includes support for MSVC
>> 2010 from Dr. Brian Gladman, proper xgcd normalisation (actually done
>> by me) and a number of other fixes by Jason Moxham. Minh Nguyen
>> managed the release, and did a stellar job!
>>
>> Brian Gladman has always said, MPIR *is* him, me and Jason, so without
>> me it's just him and Jason. But Jason recently posted to mpir-devel
>> with plans for the next release. So the project is not stopping.
>>
>> As everyone is aware, we switched to LGPL v3+ so that code could be
>> exchanged with GMP devels and indeed much code from GMP 5.x.y went
>> into MPIR 2.0.0. Even if MPIR turns into GMP with proper MSVC Windows
>> support, it'll still be a success in my book. But a *lot* more is
>> possible.  Here are some of the "secret" plans I had for the future:
>>
>> Multiplication:
>>
>> * We currently don't have Toom 6 (GMP does). Someone could pull this
>> code from GMP and put it in MPIR. It is largely written by Marco
>> Bodrato, using a new "mixing" idea, which we had not thought of in
>> MPIR. So it is very fast and well coded. Marco Bodrato and Alberto
>> Zanoni are the leading experts on Toom related stuff.
>>
>> * We do have Toom 8 (also largely by Marco), but what should come
>> next? Perhaps Toom 12? Perhaps a general Toom 2n? This is research
>> mathematics. Currently such a thing doesn't exist. You could write a
>> paper on it. Do please talk to Marco, who is very friendly and
>> approachable and probably has lots of good ideas about what is
>> possible. There's an open question as to what the actual asymptotics
>> of a general Toom 2n are. The best known result is due to Knuth in
>> TAOCP.
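>>
>> (For concreteness: Toom-k splits each operand into k pieces and does
>> 2k-1 pointwise multiplications, so it runs in O(n^(log(2k-1)/log k))
>> word operations - roughly n^1.465 for Toom 3 and n^1.404 for Toom 4 -
>> but the evaluation/interpolation overhead grows with k, which is why
>> the right next step beyond Toom 8 is a genuine research question.)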
>>
>> * FFT - I have written a completely new FFT (I did this last
>> Christmas). Apart from not having the sqrt2 trick implemented, this is
>> actually almost ready to be put into MPIR. It should be *much* better
>> than the FFT currently there, and I even have some new ideas which I
>> could discuss with anyone who would like to do even better. I will be
>> giving a talk at the upcoming Sage Days in Leiden on the FFT. If you
>> get inspired, talk to me and I'll explain my new ideas. I will *not*
>> have time to complete this FFT, but wrote extensive notes on what
>> needs to be done to finish it, and there's a thousand lines of code or
>> so ready for someone to work on if they so pleased. This new code is
>> currently not used by anything! This is a project absolutely *anyone*
>> could work on.
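>>
>> (For those who haven't seen it, the sqrt2 trick is the standard
>> Schoenhage-Strassen observation that in Z/(2^n + 1), with 4 | n, the
>> element 2^(3n/4) - 2^(n/4) squares to 2, so it serves as a root of
>> unity of twice the order and doubles the usable transform length.)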
>>
>> Division
>>
>> * Anyone who's checked will notice MPIR's division code is actually
>> slower than GMP's. This is fully intentional. It is because I decided
>> we should use David Harvey's divide and conquer code (just a quick
>> note - he disclaims responsibility for bugs in our code - we used his
>> pre-implementation and adapted it for MPIR and he bears no
>> responsibility for the ways in which we massacred it). Now
>> theoretically divide and conquer code based on middle product should
>> be faster than that based on straight multiplication (as currently
>> used by GMP say). But this boils down to as much work being put into
>> middle product code as into ordinary multiplication code, which is
>> currently not the case. Specifically David Harvey invented this
>> fantastic integer middle product algorithm (which was completely new).
>> We've got an implementation largely by him (some by some of the GMP
>> devels) which has assembly code for a basecase, then there is the
>> equivalent of karatsuba multiplication for middle product (it's called
>> toom42_mulmid). Actually you can "invert" any ordinary multiplication
>> algorithm and turn it into an algorithm for middle product. This uses
>> a type of inversion which is called Tellegen's principle (I can
>> provide references, though they are also contained in David Harvey's
>> paper on his website). Now, the next logical step is to implement
>> Toom63_mulmid. This is an inverted Toom3 algorithm. In fact I have
>> derived the correct identity for doing this (it took me ages as I did
>> it by hand ;-)). Warning: David said he once gave this a go and it
>> wasn't faster. However, he still left the door open for it to be
>> successful because without proper assembly code for some of the
>> components, it simply *can't* compete with the Toom3 multiplication in
>> MPIR. But done right, it could in theory speed up division in MPIR by
>> a *factor of 2*!!! Again, you might even get a small paper out of this
>> work!
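>>
>> (For anyone who hasn't met middle products: the middle product of a
>> (2n-1)-limb a by an n-limb b consists of the middle n coefficients of
>> the full product, i.e. those of index n-1 to 2n-2. A naive sketch on
>> coefficients, ignoring carries - the function name is just for the
>> example:
>>
>> /* naive O(n^2) middle product: r[k] = sum of a[k+n-1-j] * b[j]
>>    over j = 0..n-1, for k = 0..n-1 */
>> void mulmid_naive(unsigned long* r, const unsigned long* a,
>>                   const unsigned long* b, int n)
>> {
>>        int j, k;
>>        for ( k = 0; k < n; k++ )
>>        {
>>                r[k] = 0;
>>                for ( j = 0; j < n; j++ )
>>                        r[k] += a[k + n - 1 - j] * b[j];
>>        }
>> }
>>
>> The point of David's algorithm is that this can be done for roughly
>> the cost of a single n-limb multiplication, i.e. about half the cost
>> of computing the full (2n-1) x n product and throwing most of it
>> away.)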
>>
>> * For asymptotically fast division, we use a precomputed inverse, i.e.
>> to get f/g we first compute 1/g using Newton iteration, then we
>> essentially multiply by f (that's a slight simplification, but not
>> much). Now this can be made *much* faster too. Basically, Newton
>> iteration can be done in something like 5/3*M(n) where M(n) is the
>> time for multiplication (it's unknown whether 4/3 is possible for
>> integer inversion). With FFT precaching you can combine this with the
>> multiplication and get the whole division down to probably something
>> like 2M(n) (might be slightly higher or lower - didn't look it up for
>> a while). We are probably currently around 3M(n) or something of that
>> order. Obviously this depends on the new FFT being put in, but in the
>> end, we can speed up MPIR's asymptotically fast division (the concept
>> for which I implemented - making reuse of some of the divide and
>> conquer code from GMP for "boilerplate" stuff). In fact, it can be
>> sped up a *lot*. Really a lot!
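>>
>> (The iteration in question is the usual Newton one for a reciprocal:
>> to compute 1/g, repeat x <- x + x*(1 - g*x). Each pass doubles the
>> number of correct limbs, so the total cost telescopes to a constant
>> multiple of M(n), and the game is entirely in shrinking that
>> constant.)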
>>
>> GCD/XGCD
>>
>> * The current GCD/XGCD code in MPIR could do with a *lot* of cleaning
>> up and general optimisation. GMP's is currently faster. It's all based
>> on Niels Moller's implementation of his ngcd algorithm, which is the
>> best known. But I did it messily in MPIR.
>>
>> * I also have an idea for a new GCD algorithm for very small integers
>> (a few limbs and less). I've implemented it for single limbs and it is
>> way faster than what GMP uses. But it can be extended. I'd be happy to
>> explain this to anyone who'd like to work on it. Again there *might*
>> be some kind of paper in it. Maybe. Not sure though. (Just cause it's
>> faster doesn't always get you a paper.)
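>>
>> (To be clear, the new algorithm isn't described here - ask me. For
>> orientation, the classic single-limb binary gcd, the kind of baseline
>> any such code has to beat, looks like this:
>>
>> /* standard binary gcd on single limbs - shown for orientation
>>    only, this is not the new algorithm */
>> unsigned long gcd_1(unsigned long u, unsigned long v)
>> {
>>        int shift;
>>        unsigned long t;
>>        if ( u == 0 ) return v;
>>        if ( v == 0 ) return u;
>>        /* strip common factors of 2 */
>>        for ( shift = 0; ((u | v) & 1) == 0; shift++ )
>>        {
>>                u >>= 1; v >>= 1;
>>        }
>>        while ( (u & 1) == 0 ) u >>= 1;
>>        do
>>        {
>>                while ( (v & 1) == 0 ) v >>= 1;
>>                if ( u > v ) { t = u; u = v; v = t; }
>>                v = v - u;
>>        } while ( v != 0 );
>>        return u << shift;
>> }
>>
>> Nothing fancy, but surprisingly hard to beat without new ideas.)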
>>
>> Jacobi Symbol
>>
>> * Richard Brent and Paul Zimmermann recently developed a new algorithm
>> for computing Jacobi symbol asymptotically fast. They've implemented
>> it and I think a future GMP will have an implementation. It's
>> subquadratic with the same complexity as Schoenhage's continued
>> fraction algorithm. It is not too hard to find a copy of the slides
>> Paul and Richard will use to announce it at an upcoming meeting.
>> There's also a paper floating around somewhere.... But this algorithm
>> is based on Stehle and Zimmermann's half gcd algorithm. It is not at
>> all clear to me why one couldn't derive a similar algorithm based on
>> Niels Moller's ngcd algorithm (if Paul is around, perhaps he can
>> comment) - at least his slides didn't rule this out. *If* such a thing
>> would work, I believe it should be faster. Again, you *might* get a
>> paper out of this. Actually, I don't see why not, so long as a
>> comparable implementation beats Paul and Richard's implementation of
>> their algorithm. As GMP will have the Brent-Zimmermann algorithm, it
>> seems logical that MPIR could have the other.
>>
>> By the way, I want to draw attention to a new algorithm of Niels
>> Moller and Torbjorn Granlund for "Improved division by invariant
>> integers". MPIR already uses this algorithm as it was implemented in
>> GMP and we now use that code for basecase division in MPIR. It's an
>> interesting improvement and amazingly, if I understand correctly, they
>> actually beat the chip for 2 by 1 limb division!! (flint2 uses this
>> new algorithm extensively in the new nmod_poly module).
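>>
>> From my reading of their paper, the 2/1 division step looks roughly
>> like the sketch below (untested; 64-bit limbs; d normalised so its
>> top bit is set; u1 < d; v = floor((B^2 - 1)/d) - B precomputed, with
>> B = 2^64; the names are mine, not GMP's):
>>
>> typedef unsigned long limb_t;
>> typedef unsigned __int128 dlimb_t;
>>
>> /* divide <u1, u0> by d; returns the quotient and stores the
>>    remainder in *r */
>> limb_t div_2by1_preinv(limb_t* r, limb_t u1, limb_t u0,
>>                        limb_t d, limb_t v)
>> {
>>        dlimb_t t = (dlimb_t) v * u1;
>>        limb_t q1 = (limb_t) (t >> 64);
>>        limb_t q0 = (limb_t) t;
>>
>>        /* (q1, q0) += (u1, u0), then bump the quotient estimate */
>>        q0 += u0;
>>        q1 += u1 + ( q0 < u0 ) + 1;
>>
>>        *r = u0 - q1 * d;          /* wraps mod B, that's intended */
>>        if ( *r > q0 ) { q1--; *r += d; }
>>        if ( *r >= d ) { q1++; *r -= d; }
>>        return q1;
>> }
>>
>> The trick is that both adjustment branches are almost never taken, so
>> the common path is two multiplications and a handful of additions,
>> with no hardware divide at all - which is how they beat the chip.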
>>
>> Powm
>>
>> * MPIR is woefully slow at this. I vaguely recall some new REDD
>> algorithm mentioned on the GMP website for this. I have no idea what
>> this is about, or if it is even implemented, but a *pure guess* is it
>> might be some generalised REDC algorithm, perhaps based on a similar
>> trick to the divide by 3 algorithm that David Harvey and I
>> (re)invented some years ago when working on FLINT. I reckon the
>> likelihood is high (based on the name) that this is what it is, in
>> which case, let's put it in MPIR. If not, well then there's something
>> to study....
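>>
>> For reference, plain one-limb REDC (Montgomery reduction) goes like
>> this - sketch only, with B = 2^64, ninv = -n^(-1) mod B precomputed,
>> and n < B/2 assumed so the sum below cannot overflow:
>>
>> typedef unsigned long limb_t;
>> typedef unsigned __int128 dlimb_t;
>>
>> /* given t < n*B, compute t/B mod n with no division */
>> limb_t redc(dlimb_t t, limb_t n, limb_t ninv)
>> {
>>        limb_t m = (limb_t) t * ninv;      /* t + m*n == 0 mod B */
>>        t = ( t + (dlimb_t) m * n ) >> 64; /* exact shift, t < 2n */
>>        return ( t >= n ) ? (limb_t) (t - n) : (limb_t) t;
>> }
>>
>> A Montgomery-form modular multiplication is then just
>> redc((dlimb_t) a * b, n, ninv), which is why powm implementations
>> love it: the per-step division disappears.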
>>
>> BSDNT
>>
>> * Many, many people have supported the idea of writing a BSD licensed
>> Bignum library. Most recently Ondrej Certik wrote to me *strongly*
>> supporting the idea. This is complex at best, because it must be so
>> different from GMP as to cause *absolutely no question whatsoever*
>> that it is independent, and thus must be implemented completely from
>> scratch. One idea I had was to write all the low level routines in
>> LLVM instead of assembly, then just build suitable optimisations into
>> LLVM so that nice assembly is automatically produced. This is entirely
>> feasible and would mean only one lot of low level routines would ever
>> need to be written, rather than a different set for each platform
>> (LLVM supports *lots* of platforms, even the playstation SPU!!).
>> However, LLVM doesn't support carry propagation well (as it is written
>> with the C front end in mind and C doesn't need to support carry
>> propagation). So I'm still umming and ahhing about whether this might
>> be more work than just writing lots of assembly code from scratch. At
>> this point I don't know. And yes it matters, it makes at least a
>> factor of 2 difference in times, which we can't just "live with" for a
>> bignum library.
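>>
>> To make the carry problem concrete, here is the most basic primitive,
>> an n-limb add, in portable C - a sketch with a stand-in limb type,
>> not actual bsdnt code:
>>
>> typedef unsigned long limb_t;
>>
>> /* r = a + b over n limbs, returning the outgoing carry; the
>>    compare-and-add pairs are what a single x86 adc does per limb,
>>    and what compilers rarely manage to recover */
>> limb_t add_n(limb_t* r, const limb_t* a, const limb_t* b, long n)
>> {
>>        limb_t cy = 0;
>>        long i;
>>        for ( i = 0; i < n; i++ )
>>        {
>>                limb_t s = a[i] + cy;
>>                cy = ( s < cy );
>>                s += b[i];
>>                cy += ( s < b[i] );
>>                r[i] = s;
>>        }
>>        return cy;
>> }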
>>
>> * Antony Vennard was working hard on a website for a bsdnt project.
>> Bits and pieces of it function already (it's high tech). If anyone
>> knows Django and would love to help Antony, let us know - lots is done
>> but more needs to be done. I don't see a project as going anywhere
>> without a website....
>>
>> * I wrote some code for multiplication which is relatively quick up to
>> a few hundred limbs. I also wrote some basecase division, but want to
>> change how it works because I have a better way of doing it now. All
>> this code is in very few lines, but still relies on assembly code in
>> MPIR to be efficient. We obviously can't have a BSD licensed library
>> relying on an LGPL library. Doesn't make sense...
>>
>> * I just want to emphasise, bsdnt has *nothing* to do with MPIR. It's
>> a totally separate project and it came about because lots of people
>> kept telling us they really wanted a BSD licensed bignum library. Its
>> main goal is clean, simple code which anyone can read. Each function
>> will be only a few lines. Performance will come second, (although
>> there is a goal to at least be reasonable in comparison with MPIR, say
>> within a factor of 2 in the first year and within 20% in the second -
>> clock didn't start yet ;-)). Performance is *very* difficult to
>> achieve if you specifically do *not* want to replicate any portion of
>> GMP/MPIR, so there's no intention to ever beat MPIR. Actually, I don't
>> think that is even possible. They represent the state of the art by
>> some orders of magnitude.
>>
>> Don't forget, I'll be around at Sage Days Leiden for anyone who is
>> going to be there. But also feel free to contact me on list here about
>> anything I've discussed.
>>
>> Two more posts to go....
>>
>> Bill.
>>
>
