Back from Christmas break to a lot of responses - thanks to all. Summarizing 
the responses thus far:

* Select and/or tune your algorithm first.

Check.

* Tuning for a specific machine is a bad idea because you'll have to retune for 
every new machine.

Check. This code is sufficiently critical that it's worth doing 
platform-specific tuning. Otherwise I wouldn't be coding in assembler anyway. 
And it's small enough to be manageable.

* The compilers are very good and can probably do better than you can.

Usually, I agree. On this small kernel, though, I'm already beating xlc and (by 
a lesser margin) gcc. Looking at the code they generate has been enlightening; 
gcc is especially aggressive on loop unrolling, and that's on my list of  
things to try.
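
For concreteness, here's the flavor of unrolling I have in mind - purely a 
sketch, using placeholder register names in the style of my examples below, 
and assuming the element count is a nonzero multiple of four doublewords:

unroll   ds 0h
         lg rT0,0(rI,rSrc)       load four elements into distinct
         lg rT1,8(rI,rSrc)       registers so the loads and stores
         lg rT2,16(rI,rSrc)      can overlap
         lg rT3,24(rI,rSrc)
         stg rT0,0(rI,rDst)
         stg rT1,8(rI,rDst)
         stg rT2,16(rI,rDst)
         stg rT3,24(rI,rDst)
         la rI,32(,rI)           advance four doublewords
         brctg rCnt,unroll       rCnt preset to count/4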

* Modern machines are too complex for detailed timing formulae to be 
published.

I'm not really after detailed timing. I'm looking for implementation details of 
the same sort used by compiler writers to guide selection of instruction 
sequences, where the guiding principle is, "How can I avoid pipeline stalls?" 
As I noted, several SHARE presentations contain SOME of this information, which 
I've already benefited from, but I'm looking for more.

* Just write some timing loops and figure it out yourself.

I've been doing that, using mostly the algorithm itself plus small timing 
experiments (which can be deceiving since they aren't in the context of the 
rest of the algorithm). I've already managed to squeeze out about a 15% 
improvement over my initial code.
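
The experiments themselves are nothing fancy - roughly this shape, with 
STORE CLOCK FAST bracketing the sequence under test (the label, register 
names, and iteration count are placeholders):

         stckf tod0              TOD clock before
         lghi rCnt,100000        iteration count - pick to taste
timeloop ds 0h
*        ... sequence under test ...
         brctg rCnt,timeloop     decrement rCnt, loop while nonzero
         stckf tod1              TOD clock after
         lg r1,tod1
         slg r1,tod0             r1 = elapsed TOD-clock units
tod0     ds d                    doubleword TOD values
tod1     ds d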

An example: which of these code sequences do you suppose runs faster?

         la rX,0(rI,rBase)       rX -> array[i]
         lg rY,0(,rX)            rY = array[i]
         agsi 0(rX),1            ++array[i]
* Now do something with rY

vs:

         lg rY,0(rI,rBase)       rY = array[i]
         la rX,1(,rY)            rX = rY + 1
         stg rX,0(rI,rBase)      effect is ++array[i]
* Now do something with rY

The first is substantially faster. I would have GUESSED that the second would 
be faster, since I need the value in rY anyway. (I'm in 64-bit mode, so using 
"LOAD ADDRESS" for the increment is safe...)

* Suggestions regarding branch prediction.

Spot on, and I already did that, too. Luckily branch prediction is one of the 
few performance characteristics that's actually semi-architectural (see the 
description of "BRANCH PREDICTION PRELOAD" in the Principles of Operation). 
Consider these two code sequences:

loop     ds 0h
*        Compute a value in r1
         cgrjne r2,r1,loop   continue if not done
vs:

loop     ds 0h
*        Compute a value in r1
         cgrje r2,r1,loopexit    exit if done
         j loop
loopexit ds 0h

The second sequence is MUCH faster - though I haven't yet tried using "BRANCH 
PREDICTION PRELOAD" to alter the default prediction.
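
When I do try it, I expect the experiment to look roughly like this - 
untested, and as I read the POP the second operand names the branch 
instruction and the third its predicted target; the M1 attribute code of 0 
is just a placeholder (the POP defines the encodings):

         bpp 0,loopbr,loop       preload: predict loopbr goes to loop
loop     ds 0h
*        Compute a value in r1
loopbr   cgrjne r2,r1,loop       continue if not done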

* Cache issues

This kernel unfortunately has little (data) cache locality; it's inherent in 
the problem. What I HAVE found is that using "PREFETCH DATA" and specifying 
store intent before the first access DOES help a bit. I haven't fiddled with 
"NEXT INSTRUCTION ACCESS INTENT" yet; that seems like a potentially enormous 
rathole. Using MVCL isn't an option; my data elements are small, aligned 
multiples of doublewords, and I determined very early on that a load/store loop 
way outperforms both MVCL and MVC.
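
The prefetch itself is nothing exotic - something along these lines, where 
mask code 2 requests the cache line with store intent (register names as in 
the earlier example):

         pfd 2,0(rI,rBase)       grab array[i]'s line with store intent
*        ... intervening work helps hide the miss ...
         lg rY,0(rI,rBase)       first architected access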

On the I-cache side: the kernel is very small (a few dozen instructions) and 
chugs along for a good long time once it's called, so it's undoubtedly in 
cache. Following David Bond's suggestion to align the tops of loops on 16-byte 
boundaries helped a bit.
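
For the alignment itself, a sketch of the idiom: DS 0LQ pads to a quadword 
boundary (my understanding is this also needs the section aligned at least 
that strictly, e.g. assembling with SECTALGN(16)):

         ds 0lq                  force a 16-byte boundary
loop     ds 0h                   loop top now starts quadword aligned
*        ... loop body ...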

* The usual IBM-MAIN drift, reminiscences of old machines/code/operating 
systems/management practices/etc.

Check. :-)

I'm still hoping that someone will reveal the existence of a document intended 
for compiler writers...

-- Jerry
