What if you measured the total CPU time consumed by code such as the following, which executes a truly huge number of nothing but XR instructions, and then divided by the number of XR instructions executed? I would think this would give the smallest possible time for one XR, i.e., the maximum possible pipelining with zero stalls.

         LAY   R0,1000000
         LA    R1,LOOP1
* force alignment here to a 256-byte boundary, i.e., the length of a cache line
LOOP1    XR    R2,R2            first of 127 such XR instructions
         XR    R2,R2            second of 127 such XR instructions
         ...
         XR    R2,R2            127th and last of 127 such XR instructions
         BCTR  R0,R1            execute the previous 127 XR instructions one million times
* at this point, we have filled one cache line with 127 consecutive XR
* instructions followed by the BCTR, and all 128 of these instructions fit
* exactly within one cache line
         ...                    end of loop

When finished performing the loop, we will have executed 127,000,000 XR instructions and 1,000,000 BCTR instructions. Ignore the time used by the BCTR instructions. Divide the total CPU time delta by 127,000,000 to compute the approximate minimum time possible for one XR instruction. Then do the same thing for an SR, an SLR, and an LR that loads a register from another register that has previously been zeroed. This technique could also be done with 63 consecutive LA Rx,0 instructions.

Bill Fairchild
Nolensville, TN
----- Original Message -----
From: "Christopher Y. Blaicher" <[email protected]>
To: [email protected]
Sent: Tuesday, June 3, 2014 9:50:34 AM
Subject: Re: Out of Order and Superscalar - small experiment

IBM stopped publishing instruction timings quite a while ago. With the advent of multiple-stage processors and the interdependencies among instructions, timings for single instructions became meaningless. Even for an XR or SLR, the cost depends heavily on when and how the register was last used, and on when it will be used next. Put another way, individual instructions don't matter as much as they used to; the sequence of instructions is much more important.
