[cctalk] Re: Delay slots, was: Re: Re: early microprocessor limited pipelining [was: Intel 8086 - 46 yrs. ago]

Chuck Guzis via cctalk Sat, 15 Jun 2024 14:22:31 -0700

On 6/15/24 12:39, Paul Koning wrote:
> 
> 
> I learned it from OS code reading and adopted some of it for my own work, but 
> not much because I actually only worked on the 6500 -- which doesn't have 
> multiple functional units.
> 
> Writing good code for those machines was further complicated by the fact that 
> instructions were either 1/4 or 1/2 word long, could not split across word 
> boundaries, and branches would only go to the start of the word.  So there 
> tended to be NOPs to pad out the word, which the assembler would supply.  
> Avoiding them would make the code go faster and of course make it smaller.
> 
> The other complication was a fairly limited set of registers, and the fact 
> that loads would go only to X1..X5 while stores could only come from X6 or 
> X7.  So a memcpy would involve a register to register transfer.  That takes 3 
> cycles on a 6600, so a skillful memcpy implementation would use two load 
> registers, both store registers, and two separate functional units for the 
> R-R move (one via the "boolean" unit and one via the "shift" unit).  I 
> remember my bafflement the first time I saw a shift (by zero) used to do just 
> a register to register move; on a 6500 you wouldn't have any reason to write 
> that.
> 
> I once crashed the PLATO system in mid-day, when the load hit peak (600 users 
> logged on) because I had slowed down a critical terminal output processing 
> step and the machinery didn't have flow control there.  My bosses were NOT 
> happy.  I solved the issue by cleaning up that block of code to avoid all 
> NOPs; the result was that it was both shorter and faster than the previous 
> version while still delivering the new feature.  :-)
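
For anyone who hasn't met the packing rules Paul describes, here is a
minimal sketch in Python (obviously not period code).  It assumes a
60-bit word made of four 15-bit parcels and instructions that are 1 or
2 parcels long.  The name pack_words and the (parcels, is_branch_target)
tuples are invented for illustration, and counting the fill at the end
of the last word as NOPs is a simplification.

```python
PARCELS_PER_WORD = 4  # assumption: 60-bit word = four 15-bit parcels


def pack_words(instrs):
    """Pack instructions into words and count the padding NOPs.

    instrs is a list of (parcels, is_branch_target) tuples, where
    parcels is 1 (quarter word) or 2 (half word).
    Returns (words_used, nop_parcels).
    """
    pos = 0    # next free parcel in the current word, 0..3
    words = 0  # completed words so far
    nops = 0   # parcels filled with NOPs

    for size, is_target in instrs:
        # A branch target has to sit at the start of a word.
        if is_target and pos != 0:
            nops += PARCELS_PER_WORD - pos
            words += 1
            pos = 0
        # An instruction may not straddle a word boundary.
        if pos + size > PARCELS_PER_WORD:
            nops += PARCELS_PER_WORD - pos
            words += 1
            pos = 0
        pos += size
        if pos == PARCELS_PER_WORD:
            words += 1
            pos = 0

    # Fill out a partially used final word.
    if pos != 0:
        nops += PARCELS_PER_WORD - pos
        words += 1
    return words, nops


# Same five instructions, two orderings.
naive = [(1, False), (2, False), (2, False), (1, False), (2, False)]
tidy = [(2, False), (2, False), (1, False), (1, False), (2, False)]
print(pack_words(naive))
print(pack_words(tidy))
```

Reordering so the half-word instructions pair up removes the padding in
this toy case, which is the kind of cleanup Paul describes: the code
gets shorter and there are fewer NOPs to issue.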


At CDC SSD SVLOPS, it was all big gummint stuff, so we had
clusters of Cyber 74s and 73s (6600/6400) linked with a few million
words of ECS (we had a QSE that expanded it to 4M words).  6600/Cyber 74
programming was the rule.  A short loop was considered optimal if it
kept the instruction issue to 1/cycle and kept the whole thing "in
stack" (basically an 8-word buffer, not really a cache) to avoid
accessing CM for instructions.  Lots of bit-twiddling fun!

The 6600 had an interesting feature we called "shortstop" where the
result of an operation was available for use by a subsequent instruction
1 cycle before it materialized in a register.

On early 6600s, there was a so-called "store out of order" problem where
two closely-timed stores to the same location would result in the
earlier result overwriting the later one.  An ECO fixed that--it was
pretty fundamental.

STAR initially mapped the user's low-memory to the 256-word register
file, such that one could have vectors occupying several registers
addressed by memory location, while referring to the registers by
register number.  That apparently resulted in some serious issues,
solved eventually by simply locking out access to the first 16Kbits
(recall that the STAR is bit-addressed) of memory.  The so-called "Rev
R" ECO, if my mind isn't playing tricks on me.
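
The 16Kbit figure falls straight out of the arithmetic: 256 registers
of 64 bits each is 16384 bits.  A minimal sketch of the overlap in
Python, assuming 64-bit words (which is what 16 Kbits / 256 words
implies) and a simple linear mapping from bit address to register
number; the function name is hypothetical and the real hardware details
may well have differed.

```python
WORD_BITS = 64        # assumption: 64-bit words
NUM_REGISTERS = 256   # register file size given in the post
REGISTER_FILE_BITS = WORD_BITS * NUM_REGISTERS  # 16384 bits = 16 Kbits


def register_for_bit_address(addr):
    """Return the register number that a bit address falls in,
    or None if the address lies above the register-file region."""
    if addr < 0:
        raise ValueError("bit address must be non-negative")
    if addr >= REGISTER_FILE_BITS:
        return None
    return addr // WORD_BITS


print(REGISTER_FILE_BITS)
print(register_for_bit_address(64))
print(register_for_bit_address(REGISTER_FILE_BITS))
```

Under these assumptions, locking out the first 16Kbits of memory is the
same thing as locking out every address that would alias a register.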

CDC had a pretty close relationship with Fairchild during this time;
initially for the silicon transistors in the 6600 and later the register
file for the STAR.

Fun times!
--Chuck
