Guillermo Ballester Valor <[EMAIL PROTECTED]> writes:

> The increase in performance is moderate for most of platforms (0% to %5). For
> Itanium (the STAR of the release) this version is almost twice faster than
> 2.8b.

Nice work on the IA64 optimizations, Guillermo! Actually, the Alpha is
another platform where preload rather than true prefetch seems the best way
to go, although the performance boost you're getting on the Itanium is much
larger than one sees on the Alpha.

> Here is the timings for an Itanium @ 800 Mhz (Compaq Blazer Itanium):
{snip}
> 1024 K 0.134/0.130
{snip}
> 4096 K 0.588/0.574

The timings you sent me some time ago for Glucas 2.7b on a 667 MHz Alpha
21264 with a similar 4 MB L2 cache size at these runlengths are as follows:

  1024 K 0.180/0.171
  4096 K 0.831/0.787

which indicates about 10-15% better per-cycle performance for the IA64
relative to the 21264. This is good, but (being greedy :) I think the IA64
may be able to achieve even better performance with further tuning (of both
source code and compiler), because the IA64 has such great FPU capabilities.
If the 21264 could do just 2 adds per cycle (to say nothing of multiplies)
I estimate the performance on the instruction mix typical of this kind of
large-FFT code would increase by 20-30%.

> IA64 architecture has a very nice feature: predication. In the DWT used in
> most GIMPS clients, the normalization and carry phase has a relevant cost in
> terms of performance. There some branches hard to predict and here the
> predication substitutes this branches with great success.

Of course it is possible (and in many cases desirable) to do the normalize
and carry sans branches. Example: a typical construct in this part of the
algorithm is an integer-arithmetic sequence like

  x = a + b
  if(x > c) x = x - c

On machines with a conditional move instruction one can use that (i.e.
calculate both a + b and a + b - c and pick one, based on the result of the
conditional), but more portable and often faster is to use the properties
of two's-complement arithmetic (here I assume signed 32-bit ints) like so:

  x = a + b
  x = x - ((-(int)((unsigned)(c - x) >> 31)) & c)

where the cast of c - x to unsigned prior to shifting is to ensure one gets
a logical (not an arithmetic) right-shift. The shift extracts the sign bit
of c - x, which is 1 exactly when x > c; negating it gives an all-ones mask
in that case and zero otherwise, so c is subtracted only when x > c. Note
the parentheses around the masked term: in C, binary minus binds more
tightly than &.

As you know, one can do similar tricks to choose which of the two possible
DWT weight multipliers to multiply by, and which of the two inverse bases of
the Crandall-Fagin variable-base representation to divide by, thus
eliminating all branches from this part of the code. Did you ever try such
a branchless version on the IA64?

> We still have no made a good timing page, we will send it to E.Mayer and to
> sourceforge when possible.

Yes, I've been tardy in updating the timings page - been too busy with work
and Mlucas 2.7c to spend as much time as I should on it. I should have it
somewhat up-to-date at the same time I release the new version of Mlucas,
perhaps in a couple of weeks. In any event, it's not like Itanium users
have a plethora of codes amongst which they must decide. :)

Cheers,
-Ernst

_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
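
For readers who want to try the branch-free conditional subtract discussed in the message, here is a minimal C sketch. It is not taken from Glucas or Mlucas; the function names (reduce_branch, reduce_branchless, check_small_range) are hypothetical, and it assumes 32-bit two's-complement ints with operands small enough that neither a + b nor c - x overflows. The mask is taken from the sign bit of c - x so that the result matches the if (x > c) test exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version with a data-dependent branch. */
static int32_t reduce_branch(int32_t a, int32_t b, int32_t c)
{
    int32_t x = a + b;
    if (x > c)
        x = x - c;
    return x;
}

/* Branch-free version (illustrative sketch, names are hypothetical).
 * Assumes a + b and c - x do not overflow a signed 32-bit int. */
static int32_t reduce_branchless(int32_t a, int32_t b, int32_t c)
{
    int32_t x = a + b;
    /* Sign bit of (c - x) is 1 exactly when x > c. The cast to unsigned
     * makes the right shift logical, so the result is 0 or 1. */
    uint32_t sign = (uint32_t)(c - x) >> 31;
    /* Negating 0 or 1 gives a mask of all zeros or all ones. */
    int32_t mask = -(int32_t)sign;
    /* Subtract c only when the mask is all ones. */
    return x - (mask & c);
}

/* Compare both versions over a small exhaustive range of inputs and
 * return the number of cases checked. */
static int check_small_range(void)
{
    int checked = 0;
    for (int32_t c = 1; c <= 16; c++) {
        for (int32_t a = 0; a <= c; a++) {
            for (int32_t b = 0; b <= c; b++) {
                assert(reduce_branchless(a, b, c) == reduce_branch(a, b, c));
                checked++;
            }
        }
    }
    return checked;
}
```

Calling check_small_range() from a test harness compares the two versions on small inputs. Whether the branchless form is actually faster depends on the compiler and the target, so it is worth timing both on the machine in question.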