Guillermo Ballester Valor <[EMAIL PROTECTED]> writes:

> The increase in performance is moderate for most platforms (0% to 5%). For
> Itanium (the STAR of the release) this version is almost twice as fast as
> 2.8b.

Nice work on the IA64 optimizations, Guillermo! Actually, the Alpha
is another platform where preload rather than true prefetch seems
the best way to go, although the performance boost you're getting
on the Itanium is much larger than one sees on the Alpha.

> Here are the timings for an Itanium @ 800 MHz (Compaq Blazer Itanium):

{snip}
> 1024 K  0.134/0.130
{snip}
> 4096 K  0.588/0.574

The timings you sent me some time ago for Glucas 2.7b on a 667MHz Alpha 21264
with a similar 4MB L2 cache size at these runlengths are as follows:

1024 K  0.180/0.171

4096 K  0.831/0.787

which indicates about 10-15% better per-cycle performance for the IA64
relative to the 21264. This is good, but (being greedy :) I think the
IA64 may be able to achieve even better performance with further tuning
(of both source code and compiler), because the IA64 has such great
FPU capabilities. If the 21264 could do just 2 adds per cycle (to say
nothing of multiplies) I estimate the performance on the instruction
mix typical of this kind of large-FFT code would increase by 20-30%.
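The per-cycle comparison above is just time multiplied by clock rate. Here is a minimal C sketch of that arithmetic; the helper name cycles_ratio is made up for illustration, and it assumes both timings are in the same units, so that time x MHz is proportional to cycles per iteration.

```c
/* Hypothetical helper (not part of Glucas or Mlucas): ratio of cycles per
   iteration, reference machine over candidate machine. A result of 1.12
   means the candidate needs about 12% fewer cycles than the reference.
   Assumes both timings use the same unit, so time * MHz is proportional
   to cycle count. */
double cycles_ratio(double t_ref, double mhz_ref, double t_new, double mhz_new)
{
    return (t_ref * mhz_ref) / (t_new * mhz_new);
}
```

For example, cycles_ratio(0.180, 667.0, 0.134, 800.0) compares the first 1024 K timing of each pair. Across all four quoted pairs the ratio falls between roughly 1.10 and 1.18.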

> IA64 architecture has a very nice feature: predication. In the DWT used in
> most GIMPS clients, the normalization and carry phase has a significant cost in
> terms of performance. There are some hard-to-predict branches there, and
> predication replaces these branches with great success.

Of course it is possible (and in many cases desirable) to do the normalize
and carry sans branches. Example: a typical construct in this part of the
algorithm is an integer-arithmetic sequence like

x = a + b
if(x > c) x = x - c

On machines with a conditional move instruction one can use that
(i.e. calculate both a + b and a + b - c and pick one, based on the
 result of the conditional), but more portable and often faster is
to use the properties of twos-complement arithmetic (here I assume
signed 32-bit ints) like so:

x = a + b
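x = x - ((-(int)((unsigned)(c - x) >> 31)) & c)

where c - x is negative exactly when x > c (assuming neither a + b nor
c - x overflows), the mask is parenthesized because binary minus binds
tighter than & in C, and the cast to unsigned prior to shifting ensures
one gets a logical (not an arithmetic) right-shift. As you know, one can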
do similar tricks to choose which of the two possible DWT weights
multipliers to multiply by, and which of the two inverse bases of
the Crandall-Fagin variable-base representation to divide by, thus
eliminating all branches from this part of the code. Did you ever
try such a branchless version on the IA64?
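Here is a minimal, self-contained C sketch of the branchless conditional subtract, next to the branching version it is meant to replace. The function names are made up for illustration, and it assumes that neither a + b nor c - x overflows a signed 32-bit int. It tests the sign of c - x, which is set exactly when x > c, and parenthesizes the mask before the subtraction.

```c
#include <stdint.h>

/* Reference version with a branch: subtract c when a + b exceeds c. */
int32_t add_reduce_branch(int32_t a, int32_t b, int32_t c)
{
    int32_t x = a + b;
    if (x > c) x = x - c;
    return x;
}

/* Branchless version (illustrative name; assumes no signed overflow). */
int32_t add_reduce_branchless(int32_t a, int32_t b, int32_t c)
{
    int32_t x = a + b;
    /* c - x is negative exactly when x > c. The unsigned cast makes the
       shift logical, so sign is 1 when x > c and 0 otherwise. */
    uint32_t sign = (uint32_t)(c - x) >> 31;
    /* Negating turns 1 into an all-ones mask (-1) and leaves 0 as 0. */
    int32_t mask = -(int32_t)sign;
    /* Subtract c when the mask is all ones, otherwise subtract 0. */
    return x - (mask & c);
}
```

With a = 5, b = 7, c = 10, both versions return 2. With a = 3, b = 4, c = 10, both return 7. Whether this beats predication or a conditional move on a given machine is something only timing will tell.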

> We still have not made a good timing page; we will send it to E. Mayer and to
> sourceforge when possible.

Yes, I've been tardy in updating the timings page - been too busy with
work and Mlucas 2.7c to spend as much time as I should on it. I should
have it somewhat up-to-date at the same time I release the new version
of Mlucas, perhaps in a couple of weeks. In any event, it's not like
Itanium users have a plethora of codes amongst which they must decide. :)

Cheers,
-Ernst


