mw...@ssfcu.org (Ward, Mike S) writes:
> This is one area where I really have a problem. It used to be back in
> the 370 days that if a machine was rated at 50 mips and you moved up
> to 100 mips you really noticed the difference in execution time. Today
> if you have a 100 mip machine (I know they're rated at msu's not mips)
> and you moved up to a dual with 160 mips you might be cutting your own
> throat. They may give you 2 processors each rated at 80 mips for a
> total of 160 mips. If your workload is such that it can't take
> advantage of dual processors then you have just dropped down to an 80
> mip machine when you used to have a 100 mip machine. I know I'm on a
> rant, but it happened to us and we were being pressured by the vendor
> to go to the dual processor and that we would be very happy. We
> weren't. (end of rant)

370s & for a few generations ... going from uniprocessor to
dual-processor started off by slowing the machine cycle of each
processor down by 10% ... basically giving the caches a little headroom
to handle cross-cache invalidations from the other cache (store-through
processor caches: every store operation also involved sending an
invalidation signal for that cache line to the other cache). So basic
two-processor hardware ran at 1.8 times a single processor. Then
operating system multiprocessor overhead would increase (back when
single-processor MVS "capture ratio" could be 50%) ... leaving even
fewer cycles for application execution ... aka the same exact 10mip
uniprocessor would start out as only a 9mip processor in two-processor
mode. Note that the actual handling of cross-cache invalidations was
over and above the 10% processor cycle slowdown (in real live
operation, a 10mip processor running at 9mips would effectively have
less than 9mips, further reduced by multiprocessor operating-system
overhead and the cache overhead of handling cross-cache invalidation
signals).

strategy with the 3081 was to never again offer a uniprocessor at the
high-end. this ran into a couple problems ... clone processor vendors
were offering uniprocessors and ACP/TPF didn't have multiprocessor
support. All sorts of unnatural acts were done to try and make a 3081
acceptable to ACP/TPF (and head off that customer base all moving to
clone processors). this is besides the issues outlined here about
comparison between the 3081 and clone processors:
http://www.jfsowa.com/computer/memo125.htm

eventually there was the 3083 (in large part for the ACP/TPF market),
created by removing a processor from a 3081 (which is not as simple as
you might think; processor 0 was at the top of the frame, so processor
1 in the middle of the frame would be the one removed ... but that made
the frame dangerously top-heavy). Being only a single processor,
turning off the 10% cross-cache slowdown made the processor nearly 15%
faster (than a processor in a 3081).

combining two 3081s together for a four-processor 3084 was a big
challenge ... since it meant that each processor cache would be getting
cross-cache invalidation signals from three other caches (not just
one). kernel storage use became significant ... so operating systems
running on the 3084 were made cache-line sensitive ... all kernel
storage was changed to align on cache-line boundaries and be multiples
of cache lines. The problem was that if the end of one storage area and
the start of a different storage area fell in the same cache line, the
two areas could be in use by different processors simultaneously.
However, they represent only a single storage block for cache
management ... which could result in cache "thrashing". The
cache-sensitivity change was claimed to improve 3084 throughput by 5-6%
(minimizing cache-line thrashing).

However, higher-end 370 processor throughput was quite sensitive to
cache hit ratios ... which would be seriously affected by a high rate
of asynchronous i/o interrupts. For my "resource manager" ... I did
some hacks (at high i/o rates): turning off enablement for I/O
interrupts for periods of time and then draining all pending I/O
interrupts at once. I could demonstrate higher aggregate throughput
(even I/O throughput) ... since batching the I/O interrupts gave much
higher processor throughput (because of better cache hit ratio) ...
offsetting any delay in taking the interrupts (note part of 370/xa was
attempting to address the same issue with various kinds of i/o queuing
in the hardware).

When I first did two-processor 370 support ... I was able to deploy in
a production environment ... two processors running at more than twice
the MIP rate of a single processor ... even with each processor cycle
running at .9 that of a single processor. Some games with cache
affinity improved the cache hit ratio enough ... to more than offset
the 10% slowdown in processor cycle.

-- 
virtualization experience starting Jan1968, online at home since Mar1970

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
