charl...@mcn.org (Charles Mills) writes:
> Not so simple anymore.
>
> "How long does a store halfword take?" used to be a question that had
> an answer. It no longer does.
>
> My working rule of thumb (admittedly grossly oversimplified) is
> "instructions take no time, storage references take forever." I have
> heard it said that storage is the new DASD. This is true so much that
> the z13 processors implement a kind of "internal multiprogramming" so
> that one CPU internal thread can do something useful while another
> thread is waiting for a storage reference.
>
> Here is an example of how complex it is. I am responsible for an
> "event" or transaction driven program. I of course have test programs
> that will run events through the subject software. How many
> microseconds does each event consume? One surprising factor is how
> fast you push the events through. If I max out the speed of event
> generation (as opposed to, say, one event a tenth of a second) then on
> a real-world shared Z the microseconds of CPU per event falls in HALF!
> Same exact sequence of instructions -- half the CPU time! Why? My
> presumption is that if the program is running flat out it "owns" the
> caches and there is much less processor "wait" (for instruction and
> data fetch, not ECB type wait) time.
So accounting that measures CPU time as elapsed instruction time is analogous to early accounting that measured elapsed wall-clock time. Cache miss/memory access latency, when measured in count of current processor cycles, is comparable to 60s disk access latency when measured in count of 60s processor cycles. There is a lot of analogy between page thrashing when overcommitting real memory and cache misses.

This is an old account of the motivation behind moving 370 to all virtual memory. The issue was that as processors got faster, they spent more and more time waiting for disk. Keeping the processors busy required increasing levels of multiprogramming, to overlap execution with waiting on disk. At the time, MVT storage allocation was so bad that region sizes needed to be four times larger than actually used. As a result, a typical 1mbyte 370/165 would only have four regions. Going to virtual memory, it would be possible to run 16 regions in a typical 1mbyte 370/165 with little or no paging (the region-arithmetic sketch below works the numbers) ... significantly increasing aggregate throughput.
http://www.garlic.com/~lynn/2011d.html#73 Multiple Virtual Memory

risc implementations have been doing cache-miss compensation for decades; out-of-order execution, branch prediction, speculative execution, hyperthreading ... all can be viewed as hardware analogies to 60s multitasking, giving the processor something else to do while waiting for a cache miss (the pointer-chase sketch below illustrates the effect). A decade or more ago, some of the non-risc chips started adding a hardware layer that translates instructions into risc micro-ops for scheduling and execution ... largely mitigating the performance difference between those CISC architectures and RISC.

IBM documentation claimed that half the per-processor improvement from z10->z196 came from introducing many of the features that have been common in risc implementations for decades ... with further refinement in EC12 and z13:

z10, 64 processors, aggregate 30BIPS or 469MIPS/proc
z196, 80 processors, aggregate 50BIPS or 625MIPS/proc
EC12, 101 processors, aggregate 75BIPS or 743MIPS/proc

However, z13 claims 30% more throughput than EC12 with 40% more processors ... which works out to about 700MIPS/processor. By comparison, a z10-era E5-2600v1 blade was about 500BIPS, 16 processors or 31BIPS/proc; an E5-2600v4 blade is pushing 2000BIPS, 36 processors or 50BIPS/proc.

As an aside, the 370/195 pipeline was doing out-of-order execution ... but didn't do branch prediction or speculative execution, and a conditional branch would drain the pipeline. Careful coding could keep the execution units busy, getting 10MIPS ... but normal codes typically ran around 5MIPS (because of conditional branches); the branch-free sketch below shows the coding style. I got sucked into helping with hyperthreading the 370/195 (which never shipped); it would simulate two processors, with two instruction streams, two sets of registers, etc ... two instruction streams, each running at 5MIPS, would then keep all the execution units busy at 10MIPS.
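As a back-of-envelope check on the MVT region arithmetic above, a minimal C sketch; the 256kbyte allocated-region size and the 4x over-allocation factor are assumptions reverse-engineered from the post's "four regions in 1mbyte" figure, not documented values.

    /* Region-arithmetic sketch: MVT real-storage regions vs. virtual memory.
       All numbers are illustrative, taken or inferred from the post. */
    #include <stdio.h>

    int main(void) {
        int real_kb   = 1024;                   /* typical 370/165: 1mbyte real storage */
        int region_kb = 256;                    /* assumed MVT region size as allocated */
        int overalloc = 4;                      /* allocation ran ~4x what was touched */
        int used_kb   = region_kb / overalloc;  /* ~64kbytes actually referenced */

        /* real storage partitioned into full-size regions */
        printf("MVT regions:            %d\n", real_kb / region_kb);  /* 4 */
        /* paging only needs the referenced pages resident */
        printf("virtual-memory regions: %d\n", real_kb / used_kb);    /* 16 */
        return 0;
    }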
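A hypothetical C sketch of the latency-hiding idea: one dependent pointer chase stalls on every cache miss, while interleaving two independent chases in a single loop lets the misses overlap ... the same effect a second hardware instruction stream buys. The chain size and the measured ratio are machine-dependent; this is an illustration, not a benchmark.

    /* Pointer-chase sketch: two dependent chases run back-to-back vs.
       interleaved. Interleaving gives the core independent work to
       issue while each miss is outstanding. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 22)   /* ~32MB of pointers per chain; assumed big enough to defeat the caches */

    /* Build a random single-cycle permutation (Sattolo's algorithm) so the
       chase visits every node and hardware prefetch can't predict it. */
    static size_t *make_chain(void) {
        size_t *next = malloc(N * sizeof *next);
        size_t i, j, t;
        if (!next) exit(1);
        for (i = 0; i < N; i++) next[i] = i;
        for (i = N - 1; i > 0; i--) {    /* any j < i keeps it one cycle */
            j = (size_t)rand() % i;
            t = next[i]; next[i] = next[j]; next[j] = t;
        }
        return next;
    }

    int main(void) {
        size_t *a = make_chain(), *b = make_chain();
        size_t pa = 0, pb = 0, i;
        clock_t t0;

        t0 = clock();                    /* serial: every load waits alone */
        for (i = 0; i < N; i++) pa = a[pa];
        for (i = 0; i < N; i++) pb = b[pb];
        printf("serial:      %ld ticks\n", (long)(clock() - t0));

        pa = pb = 0;
        t0 = clock();                    /* interleaved: misses overlap */
        for (i = 0; i < N; i++) { pa = a[pa]; pb = b[pb]; }
        printf("interleaved: %ld ticks\n", (long)(clock() - t0));

        printf("(%zu %zu)\n", pa, pb);   /* keep the chases live */
        return 0;
    }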
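And a minimal sketch of the "careful coding" style for a pipeline with no branch prediction: recoding a data-dependent select so there is no conditional branch to drain the pipe. This is illustrative C rather than 370/195 assembler, and modern compilers often perform the rewrite themselves (e.g. via conditional-move instructions).

    /* Branch-free sketch: a data-dependent select, branchy vs. branch-free. */
    #include <stdio.h>
    #include <stdint.h>

    /* Branchy: an unpredicted (or mispredicted) branch drains the pipeline. */
    static int32_t max_branchy(int32_t a, int32_t b) {
        if (a > b) return a;
        return b;
    }

    /* Branch-free: build an all-ones/all-zeros mask from the sign of a - b
       and select with straight-line logic. Assumes a - b doesn't overflow
       and that >> on a negative value is an arithmetic shift (true on
       mainstream compilers, but implementation-defined in C). */
    static int32_t max_branchless(int32_t a, int32_t b) {
        int32_t mask = (a - b) >> 31;    /* 0 if a >= b, -1 if a < b */
        return (a & ~mask) | (b & mask);
    }

    int main(void) {
        printf("%d %d\n", max_branchy(3, 5), max_branchless(3, 5));  /* 5 5 */
        printf("%d %d\n", max_branchy(7, 2), max_branchless(7, 2));  /* 7 7 */
        return 0;
    }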
The two-instruction-stream idea goes back to ACS-360. From the account of the shutdown of ACS-360:
http://people.cs.clemson.edu/~mark/acs_end.html

Sidebar: Multithreading

In summer 1968, Ed Sussenguth investigated making the ACS-360 into a multithreaded design by adding a second instruction counter and a second set of registers to the simulator. Instructions were tagged with an additional "red/blue" bit to designate the instruction stream and register set; and, as was expected, the utilization of the functional units increased since more independent instructions were available. IBM patents and disclosures on multithreading include:

US Patent 3,728,692, J.W. Fennel, Jr., "Instruction selection in a two-program counter instruction unit," filed August 1971, issued April 1973.

US Patent 3,771,138, J.O. Celtruda, et al., "Apparatus and method for serializing instructions from two independent instruction streams," filed August 1971, issued November 1973. Note: John Earle is one of the inventors listed on the '138.

"Multiple instruction stream uniprocessor," IBM Technical Disclosure Bulletin, January 1976, 2 pp. [for S/370]

... snip ...

Note the next sidebar is ES/9000 ... containing many features from ACS-360 more than 20 years later (Amdahl's account is that executives ended ACS-360 because it would advance the state of the art too fast and they would lose control of the market).

Other trivia ... starting in the middle to late 70s, I started pontificating that the relative system performance of disks was declining, and by the early 80s it had declined by a factor of 10 (an order of magnitude) over a period of 15 years: disks had gotten 3-5 times faster, but processors had gotten 50 times faster, so disk throughput relative to processor throughput fell by roughly 50/5 = 10 times. Disk division executives took exception to my statements and assigned the division performance group to refute what I was saying. After several weeks they came back and effectively said that I had understated the problem. Their analysis was then respun into a SHARE presentation (B874) on optimizing disk configurations for system throughput ... old references:
http://www.garlic.com/~lynn/2006f.html#3
http://www.garlic.com/~lynn/2006o.html#68

Piece of my 15-year comparison ... 360/67 cp67 to 3081 vm/370
http://www.garlic.com/~lynn/93.html#31

--
virtualization experience starting Jan1968, online at home since Mar1970