charl...@mcn.org (Charles Mills) writes:
> Not so simple anymore.
>
> "How long does a store halfword take?" used to be a question that had an
> answer. It no longer does.
>
> My working rule of thumb (admittedly grossly oversimplified) is
> "instructions take no time, storage references take forever." I have heard
> it said that storage is the new DASD. This is true so much that the z13
> processors implement a kind of "internal multiprogramming" so that one CPU
> internal thread can do something useful while another thread is waiting for
> a storage reference.
>
> Here is an example of how complex it is. I am responsible for an "event" or
> transaction driven program. I of course have test programs that will run
> events through the subject software. How many microseconds does each event
> consume? One surprising factor is how fast do you push the events through.
> If I max out the speed of event generation (as opposed to say, one event
> tenth of a second) then on a real-world shared Z the microseconds of CPU per
> event falls in HALF! Same exact sequence of instructions -- half the CPU
> time! Why? My presumption is that because if the program is running flat out
> it "owns" the caches and there is much less processor "wait" (for
> instruction and data fetch, not ECB type wait) time.

so such accounting, measuring CPU time (elapsed instruction time), is
analogous to early accounting that measured elapsed wall clock time.

cache miss/memory access latency ... when measured in count of processor
cycles, is comparable to 60s disk access when measured in count of 60s
processor cycles.
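
to put rough numbers on it (a sketch with assumed round figures for
illustration, not measurements):

  # assumed round numbers, for illustration only
  mem_ns, cpu_ghz = 100, 5          # memory latency, current clock rate
  print(mem_ns * cpu_ghz, "cycles stalled per cache miss")               # 500
  disk_ms, ips_60s = 25, 100000     # 60s disk access, ~0.1MIPS 60s processor
  print(int(disk_ms / 1000 * ips_60s), "instruction times per disk access")  # 2500

either way, hundreds to thousands of instruction times spent waiting.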

There is a lot of analogy between page thrashing when overcommitting real
memory and cache misses. This is an old account of the motivation behind
moving 370 to all virtual memory. The issue was that as processors got
faster, they spent more and more time waiting for disk. Keeping the
processors busy required increasing levels of multiprogramming to overlap
execution with waiting on disk. At the time, MVT storage allocation was so
bad that region sizes needed to be four times larger than actually
used. As a result, a typical 1mbyte 370/165 would only have four
regions. Going to virtual memory, it would be possible to run 16 regions
on a typical 1mbyte 370/165 with little or no paging ... significantly
increasing aggregate throughput.
http://www.garlic.com/~lynn/2011d.html#73 Multiple Virtual Memory
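
the arithmetic of that account, as a sketch (the 256kbyte region size is
inferred from four regions in 1mbyte; the rest are the figures above):

  # MVT: region sizes four times larger than actually used
  real_kb, region_kb = 1024, 256    # 1mbyte 370/165, 256kbyte MVT regions
  print(real_kb // region_kb, "MVT regions")                         # 4
  # virtual memory: only the ~64kbyte actually touched stays resident
  print(real_kb // (region_kb // 4), "regions with virtual memory")  # 16

four times the regions to overlap execution with disk waits, with little
or no paging.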

risc has been doing cache miss compensation for decades; out-of-order
execution, branch prediction, speculative execution, hyperthreading, etc.
can be viewed as the hardware analogy to 60s multitasking ... giving the
processor something else to do while waiting on a cache miss. A decade or
more ago, some of the other non-risc chips started moving to a hardware
layer that translated instructions into risc micro-ops for scheduling
and execution ... largely mitigating the performance difference between
those CISC architectures and RISC.

IBM documentation claimed that half the per-processor improvement from
z10->z196 was from the introduction of many of the features that have been
common in risc implementations for decades ... with further refinement in
EC12 and z13.

z10, 64 processors, aggregate 30BIPS or 469MIPS/proc
z196, 80 processors, aggregate 50BIPS or 625MIPS/proc
EC12, 101 processors, aggregate 75BIPS or 743MIPS/proc

however, z13 claims 30% more throughput than EC12 with 40% more
processors ... which would make it about 700MIPS/processor

by comparison, a z10-era E5-2600v1 blade was about 500BIPS, 16 processors
or 31BIPS/proc. An E5-2600v4 blade is pushing 2000BIPS, 36 processors or
50BIPS/proc.
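
a back-of-envelope check of the per-processor figures (a sketch; the
aggregate BIPS numbers are the round figures above, and the z13 processor
count is taken as 101 plus 40%):

  # per-processor MIPS from the aggregate BIPS figures above
  for name, bips, procs in [("z10", 30, 64), ("z196", 50, 80), ("EC12", 75, 101)]:
      print(name, round(bips * 1000 / procs), "MIPS/proc")   # 469, 625, 743
  # z13: 30% more aggregate throughput than EC12, 40% more processors
  print("z13", round(75 * 1.3 * 1000 / (101 * 1.4)), "MIPS/proc")  # ~690, i.e. roughly 700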

as an aside, the 370/195 pipeline was doing out-of-order execution ... but
didn't do branch prediction or speculative execution ... and a conditional
branch would drain the pipeline. Careful coding could keep the execution
units busy, getting 10MIPS ... but normal code typically ran around 5MIPS
(because of conditional branches). I got sucked into helping with
hyperthreading the 370/195 (which never shipped); it would simulate two
processors with two instruction streams, two sets of registers, etc ...
the assumption being that two instruction streams, each running at 5MIPS,
would keep all the execution units running at 10MIPS.

from an account of the shutdown of ACS-360
http://people.cs.clemson.edu/~mark/acs_end.html

Sidebar: Multithreading

In summer 1968, Ed Sussenguth investigated making the ACS-360 into a
multithreaded design by adding a second instruction counter and a second
set of registers to the simulator. Instructions were tagged with an
additional "red/blue" bit to designate the instruction stream and
register set; and, as was expected, the utilization of the functional
units increased since more independent instructions were available.

IBM patents and disclosures on multithreading include:

US Patent 3,728,692, J.W. Fennel, Jr., Instruction selection in a
two-program counter instruction unit, filed August 1971, and issued
April 1973.

US Patent 3,771,138, J.O. Celtruda, et al., Apparatus and method for
serializing instructions from two independent instruction streams, filed
August 1971, and issued November 1973. Note: John Earle is one of the
inventors listed on the '138.  "Multiple instruction stream
uniprocessor," IBM Technical Disclosure Bulletin, January 1976,
2pp. [for S/370]

... snip ...

Note the next sidebar is ES/9000 ... containing many features from
ACS-360 more than 20yrs later (Amdahl's account is that executives ended
ACS-360 because it would advance the state-of-the-art too fast and they
would lose control of the market).

other trivia ... starting in the middle to late 70s, I started
pontificating that the relative system performance of disks was declining,
and by the early 80s, disk relative system performance had declined by a
factor of 10 times (order of magnitude) over a period of 15 years (disks
had gotten 3-5 times faster, but processors had gotten 50 times
faster). Disk division executives took exception to my statements and
assigned the division performance group to refute what I was
saying. After several weeks they came back and effectively said that I
had understated the problem. Their analysis was then respun into a SHARE
presentation (B874) on optimizing disk configurations for system
throughput ... old reference
http://www.garlic.com/~lynn/2006f.html#3
and
http://www.garlic.com/~lynn/2006o.html#68
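
the "factor of 10" arithmetic, using the figures from the paragraph above:

  disk_speedup = 5      # disks got 3-5 times faster over ~15 years
  cpu_speedup = 50      # processors got ~50 times faster
  print(cpu_speedup // disk_speedup, "x relative decline")   # 10x

i.e., the rest of the system got an order of magnitude faster than disk.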

a piece of my 15-year comparison ... 360/67 cp67 to 3081 vm/370
http://www.garlic.com/~lynn/93.html#31

-- 
virtualization experience starting Jan1968, online at home since Mar1970
