Here's some things to consider:

* I timed MPIR and GMP 4.3.0 at 700 limbs, just below our Toom 4/7
cutoff. There is a 6 % difference. Adjusting for the speed of the
assembly code which will be in our next release, there is zero
difference.

* Our Toom 3 is optimised in a completely different way to GMP 4.3.0,
perhaps pushing our cutoffs higher. Whether Toom 4 can be optimised as
effectively as Toom 3 remains to be seen.

* At 700 limbs Toom 4 gives GMP 4.3.0 3% over its Toom 3. The
optimisations we make to our Toom 3 give us about 2% over our original
Toom 3. I didn't look to see if any improvements have been made to the
GMP Toom 3 since the last version, or whether they make any measurable
difference. I recall reading of a saving of half an operation on
Bodrato's website. I'm pretty sure our original Toom 3 doesn't have
this.

There is no question our Toom 4 can be optimised beyond its current
level but it is unclear if this will make more than 1% difference. In
the case of Toom 7, things seem hopeless. Optimising any single
operation, except one of the mp multiplies makes no measurable
difference. I've tried optimising up to three scalar operations at a
time and still not been able to measure the effect.

It is true that switching to a "fixed" precision solution like GMP
instead of the current sign/size solution, would make it easier to
optimise. But this affects other things negatively.

At present there are factors of 2 or 4 to be had elsewhere in the
library or even optimisations which will give a uniform speedup for
almost everything of a few percent. Those seem far more important at
present.

Bill.

2009/4/19 Bill Hart <goodwillh...@googlemail.com>:
> Both of the algorithms are similarly optimised. Actually Toom 4 is
> slightly more optimised than Toom 7.
>
> You may be right, and no "premature decision" has been made. However
> it occurs to me that Toom 7 breaks into smaller chunks, so the
> multiplications are cheaper.
>
> The crossover is currently 700. Toom 7 breaks into chunks of 100 and
> 13 multiplies are done. Toom 4 breaks into chunks of 175 and 5
> multiplies are done. It seems to me to be quite possible for Toom 7 to
> always beat Toom 4, or nearly so purely because of this sort of thing.
> I'm not saying this is the case. I'm just saying it may be the case.
>
> We are working on lots of assembly code to optimise Toom 4 and Toom 7.
> As for managing the sign/size stuff, I haven't decided whether I want
> to do it this way in the long run or not. It has some advantages which
> the other method does not. One thing I can say for sure is that the
> difference in speed in the generic case is negligible. The other
> optimisations we are going to be doing will make a much greater
> difference.
>
> Bill.
>
> 2009/4/19 David Harvey <dmhar...@cims.nyu.edu>:
>>
>>
>>
>> On Apr 15, 7:11 pm, Bill Hart <goodwillh...@googlemail.com> wrote:
>>
>>> Clicking on the graph does make it bigger, but yeah, the critical
>>> region we are talking about is a bit compressed.
>>
>> I just took a look at the toom4 code. It looks like there is a lot of
>> unnecessary sign/size management and data copying. I would suggest
>> some serious optimisation work before making a premature decision to
>> dump toom4 in favour of toom7. I can't think of any earthly reason why
>> a toom7 would beat toom4/5/6 near the toom3-toom4 threshold. It's like
>> claiming that a strassen matrix multiply written in C is fast for a
>> 10x10 matrix because you're comparing against an n^3 algorithm written
>> in python. Maybe the C-python analogy is unfair, but you get my drift.
>>
>> david
>>
>> >>
>>
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to