> On Jun 11, 2024, at 11:52 PM, ben via cctalk <cctalk@classiccmp.org> wrote:
> 
> On 2024-06-10 10:18 a.m., Joshua Rice via cctalk wrote:
>> On 10/06/2024 05:54, dwight via cctalk wrote:
>>> No one is mentioning multiple processors on a single die, or cache that is 
>>> bigger than the complete RAM of most systems of that time.
>>> Clock speed was dealt with through clever register renaming, pipelining and 
>>> prediction.
>>> Dwight
>> Pipelining has always been a double-edged sword. Splitting the instruction 
>> cycle into smaller, faster stages that can run simultaneously is a great 
>> idea, but if the end-to-end instruction latency gets longer, failed branch 
>> predictions and the pipeline flushes that follow can truly bog down the 
>> real-life IPS. This is ultimately what led the NetBurst architecture to be 
>> the dead end it became.
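The flush cost described above can be put into rough numbers. A minimal sketch; the 20% branch fraction, 5% misprediction rate, and the 30- vs. 14-stage flush penalties are illustrative assumptions, not measurements of any real processor:

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Estimate sustained IPC when mispredicted branches flush the pipeline.

    base_ipc: IPC with perfect branch prediction
    branch_fraction: fraction of instructions that are branches
    mispredict_rate: fraction of branches predicted wrongly
    flush_penalty_cycles: cycles lost per misprediction (roughly pipeline depth)
    """
    # Base cycles per instruction, plus the average flush cost per instruction.
    cpi = 1.0 / base_ipc + branch_fraction * mispredict_rate * flush_penalty_cycles
    return 1.0 / cpi

# Hypothetical deep (NetBurst-like, ~30 stage) vs. shorter (~14 stage) pipe:
deep = effective_ipc(1.0, 0.2, 0.05, 30)   # cpi = 1.3, so ~0.77 IPC sustained
short = effective_ipc(1.0, 0.2, 0.05, 14)  # cpi = 1.14, so ~0.88 IPC sustained
```

The same misprediction rate hurts the deeper pipe more, because each flush throws away more in-flight work.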
> 
> The other gotcha with pipelining is that you have to have equal-size chunks.
> A 16 word register file seems to be the right size for a 16 bit ALU,
> 64 words for a 32 bit ALU, 256 words for a 64 bit ALU,
> as a guess.

Huh?  There is no direct connection between word length, register count, and 
pipeline length.  

The natural pipeline length (for a given functional unit) is the number of 
steps needed to do the work, given a step that can be completed in a single 
clock cycle.  That assumes a pipe that long is affordable; if not it gets 
shorter.  Not all functional units will have the same pipeline length.
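The point can be sketched numerically: once a pipeline of a given depth is full, n independent operations finish in depth + (n - 1) × interval cycles. A toy model, not any particular machine:

```python
def cycles_for_ops(depth, n_ops, initiation_interval=1):
    """Cycles to complete n_ops independent operations through one pipelined
    functional unit of the given depth, issuing a new operation every
    initiation_interval cycles.  A sketch: assumes no hazards or stalls."""
    if n_ops == 0:
        return 0
    # First result appears after 'depth' cycles; each further op adds one
    # initiation interval.
    return depth + (n_ops - 1) * initiation_interval

# A 5-stage unit takes 5 cycles for 1 op but only 104 cycles for 100 ops:
assert cycles_for_ops(5, 1) == 5
assert cycles_for_ops(5, 100) == 104
```

That is the whole appeal: latency per operation goes up slightly, but throughput approaches one result per cycle.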

The register count is a function of cost -- for the registers themselves and 
for the scoreboard logic to sort out register conflicts.  In modern designs 
that would be die area; in older machines it would be cost in modules or 
transistors.  For example, in the CDC 6600, the registers (8 x 60 bits, 8 x 18 
bit address, 8 x 18 bit index/count) and their associated data path controls 
to/from all the functional units take up an entire chassis, 750-ish logic 
modules.
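A quick tally from those figures makes the point: the raw storage is tiny, so the 750-ish modules went mostly to the data path controls and conflict logic, not the flip-flops themselves.

```python
# CDC 6600 register complement, using the sizes given above.
x_bits = 8 * 60   # X (operand) registers
a_bits = 8 * 18   # A (address) registers
b_bits = 8 * 18   # B (index/count) registers

total_bits = x_bits + a_bits + b_bits  # 768 bits in all
```

Under a kilobit of actual register storage, yet an entire chassis of logic: the cost is in getting operands to and from all the functional units without conflicts.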
  
> You never see gate level delays on a spec sheet.
> Our pipeline is X delays + N delays for a latch.

Gate level delays are not interesting for the machine user to know.  What is 
interesting is the detailed properties of the pipelines: whether they can 
accept a new operation every cycle or only every N cycles (say, a multiplier 
that accepts operands every 2 cycles); how many cycles the delay is from input 
to output; and whether there are "bypass" data paths to reduce the delays from 
input or output conflicts.  Often these details are hard to pry out of the 
manufacturer, and often they are not documented in the standard data sheets or 
processor user manuals.  But they are critical if you want to do work such as 
building pipeline models to drive compiler optimizers.
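A toy version of such a pipeline model might look like the following. The latencies and initiation intervals here are invented for illustration, and full bypass is assumed (a result is usable the cycle it appears):

```python
from dataclasses import dataclass

@dataclass
class FunctionalUnit:
    latency: int              # cycles from input to output
    initiation_interval: int  # accepts a new operation every N cycles
    next_free: int = 0        # earliest cycle the unit can accept input

def schedule(unit, operand_ready_cycles):
    """Return the cycle an operation's result is available.

    The op issues as soon as the unit is free and all operands are ready;
    with bypass paths the result can feed a consumer that same cycle.
    """
    start = max([unit.next_free] + list(operand_ready_cycles))
    unit.next_free = start + unit.initiation_interval
    return start + unit.latency

# Hypothetical units: a 10-cycle multiplier accepting operands every 2
# cycles, and a fully pipelined 3-cycle adder.
mul = FunctionalUnit(latency=10, initiation_interval=2)
add = FunctionalUnit(latency=3, initiation_interval=1)

t1 = schedule(mul, [0])       # issues at 0, result at cycle 10
t2 = schedule(mul, [0])       # must wait until cycle 2 to issue; result at 12
t3 = schedule(add, [t1, t2])  # depends on both products; result at 15
```

A compiler scheduler runs exactly this kind of bookkeeping over a basic block to order instructions so that results arrive just as their consumers need them.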

        paul
