Hi Gary,
First, I apologize for the length of this posting. You had a lengthy
posting at http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1
and my response must be similarly lengthy. I was going to respond directly
to you, offline. After writing my response, I decided that some of the
comments might be interesting to subscribers who run or are considering IRD.
Gary, I never received your posting that was a response to my posting to
you on 6/9/05 (Subj: Re: IRD & Short Engine Effect at 100% CEC
utilization)! I had figured that you simply ignored my response and I was
a bit miffed. I have no idea why I did not receive the above posting,
since I run my own email/web/ftp server here in my office. Makes me wonder
what other email I don't receive...
To my chagrin, I now see that you not only did NOT ignore my posting, but
you replied with a very thoughtful and detailed posting. THANKS for the
detail. That doesn't normally happen when someone is discussing a problem,
and it is very refreshing.
For the past year, I've been working on PR/SM analysis, and for the past
several months I've been working exclusively on IRD analysis. Just this
week, I've finished the basic "this is what you should do and what you
should not do with IRD" per the Redbook. This thread is extraordinarily
timely from my R&D perspective, since I'm now moving into the "how does IRD
really work in user sites" phase.
That aside, I have a few comments on your posting at
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1
<snip>
First, recommendations said to put a Minimum Weight of 1, and leave Maximum
Weight blank for Weight Management - which upon rollout we had not done:
we'd been conservative and kept the minimum weight equal to at least two
engines of capacity because we'd heard of people that had complained of
systems getting varied down to just one LCPU and it killed CICS
multiprocessing. So we picked the two CECs that had the most PCPUs, and
set their min to 1, max to blank.
</snip>
Gary, I know that the PR/SM Planning Guides say that specifying Minimum = 1
and Maximum = 999 is the "optimal" specification. However, I believe that
the wording is unfortunate because it explicitly says that "1" is an
optimal specification. This specification is neither optimal nor
recommended. There is a POTENTIAL serious problem with specifying a
Minimum Weight of 1. Essentially, you are saying that WLM LPAR Weight
Management can adjust the weight down to "1" which would leave a very low
share (potentially causing sysplex disruption). With the default of "0"
for Minimum weight, WLM LPAR Weight Management will ensure that the LPAR
never goes below 5% share of a central processor. However, if you specify
a Minimum Weight, WLM will honor your specification (see Section 7.1 of the
Redbook, and I've confirmed this Redbook paragraph with WLM developers).
While I don't think that it is too likely that WLM would actually reduce
the weight to "1", such a reduction is possible. Why take a chance? Leave
the Minimum Weight blank, and WLM will use the default of zero, never
reducing a logical processor's share to less than about 5% of a physical
processor. (BTW: The Redbook says 5% of a CPC in several locations. This
is a typo; it should be 5% of a physical processor.)
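To make the arithmetic concrete, here is a small sketch of the two floors. This is purely my own illustration (not IBM's code), and the CEC size and weights below are hypothetical:

```python
# Illustrative sketch only -- not IBM's algorithm; all numbers hypothetical.

def share_in_cps(weight, total_weight, shared_physical_cps):
    """An LPAR's guaranteed share, expressed in physical CPs."""
    return weight * shared_physical_cps / total_weight

# Hypothetical CEC: 10 shared physical CPs, sum of all LPAR weights = 1000.
TOTAL_WEIGHT = 1000
PHYSICAL_CPS = 10

# Minimum Weight = 1: WLM honors your specification, so it could drive the
# LPAR down to a weight of 1 -- only 0.01 of a physical processor.
floor_min_weight_1 = share_in_cps(1, TOTAL_WEIGHT, PHYSICAL_CPS)

# Minimum Weight blank (default 0): WLM never reduces the share below
# roughly 5% of one physical processor.
floor_default = 0.05

print(floor_min_weight_1)  # 0.01 -- one fifth of the default floor
print(floor_default)       # 0.05
```

On this hypothetical box, the "optimal" Minimum Weight of 1 permits a share five times smaller than the share you get by leaving the field blank.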
<snip>
1. When both LPARs are idle (50% MVS busy or less), both get all 11 LCPUs
online. The book states this is "so the workload can take advantage of
increased multiprocessing". This doesn't seem to be an issue. I don't know
if it really helps, but it doesn't seem to hurt.
</snip>
This is goodness - it means that you can have more concurrent processes
active. The only downside is potential queuing on the PR/SM Logical
Processor Ready Queue, but this should not cause performance problems at
low utilization.
<snip>
2. When one LPAR is trying to "take over" the CEC and the other LPAR is
idle, CPU Vary _ALWAYS_ puts all 11 LCPUs online to the busy LPAR and cuts
back the idle one to no less than 5 LCPUs. Why 5? I don't know, and this
was one of the questions I asked to the list -- EXACTLY how does IRD
determine how many LCPUs to leave online (i.e. give me the calculation,
please)?
</snip>
See the WLM CPU Management algorithm described in Section 3.9.1 of the
Redbook (and particularly Section 3.9.2 - WLM Vary CPU Management
logic). Focus on the concept that the weight of an LPAR implies share, and
appreciate that IRD will not go below "guaranteed" share when it calculates
the Equivalent Physical Central Processors. Note that unless WLM LPAR
Weight Management had lowered the weight of the "idle" LPAR because of
goals being missed in the "busy" LPAR, the "idle" LPAR would be "entitled"
to its share as a function of the weight it had before the other LPAR tried
to "take over" the CEC. That share might have implied 4 equivalent
physical processors, plus the buffer of 1, for a total of 5 physical
processors that should remain online. Without seeing actual data, I cannot
be certain.
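As a rough sketch of that share arithmetic (my own reconstruction of the idea in Section 3.9.2, using hypothetical weights; not IBM's actual code):

```python
# Sketch of the Vary CPU Management idea: keep enough LCPUs online to
# cover the LPAR's guaranteed share in equivalent physical CPs, plus a
# buffer of one. My reconstruction; the numbers are hypothetical.
import math

def lcpus_to_stay_online(weight, total_weight, shared_physical_cps):
    equivalent_cps = weight * shared_physical_cps / total_weight
    return math.ceil(equivalent_cps) + 1

# Hypothetical: 11 shared physical CPs; the "idle" LPAR still holds
# 4/11 of the total weight, so its share implies 4 equivalent CPs.
print(lcpus_to_stay_online(400, 1100, 11))  # 4 equivalent CPs + 1 = 5
```

Under these assumed weights, the "idle" LPAR keeps 5 LCPUs online, which matches the behavior you observed.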
I wrote some code to replicate the IRD algorithms (partially so I can
recommend to folks with just plain old PR/SM that they might consider
altering the number of logical processors, and partially because I find
that writing such code helps me better understand what is going on). I
compared my algorithm results with data from some sites running IRD.
Sometimes my results matched SMF70BDA, and sometimes my
results were radically different from SMF70BDA. Page 85 of the Redbook
explained the differences (but it drove me crazy for a while).
According to the IRD Redbook (page 85), IRD will bring (or leave) logical
processors online if overall CPU utilization is low. That explains why I
can see SMF70BDA showing that some LPARs have a lot of logical processors
online for many hours (15 logical processors, for example) when my code
showed that the number should be much lower (9 logical processors, for
example). This does not, of course, explain your situation since you say
that the overall CEC was near 100%. My point is that there is lots of
subtle logic in IRD that might not be immediately apparent.
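The page-85 behavior could be caricatured like this; note that the 50% utilization threshold is my assumption for illustration, not a documented IBM value:

```python
# Caricature of the page-85 behavior: at low overall CEC utilization,
# IRD leaves LCPUs online rather than trimming to the share-based
# number. The 50% threshold is my assumption, not a documented value.

def lcpus_after_low_util_check(share_based_lcpus, configured_lcpus,
                               cec_utilization, threshold=0.50):
    if cec_utilization < threshold:
        return configured_lcpus   # leave them all online
    return share_based_lcpus      # trim to the share-based number

# E.g. share arithmetic says 9 LCPUs, 15 are configured, CEC 30% busy:
print(lcpus_after_low_util_check(9, 15, 0.30))  # 15
```

That would produce exactly the SMF70BDA pattern I described: 15 LCPUs online for hours while the share-based number is 9.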
<snip>
3. When both LPARs are trying to take over the CEC, with low importance-5
work (a test we contrived), we were frustrated and disappointed by IRD's
behavior.
</snip>
I'm not surprised. Very likely, LPAR Weight management was never invoked.
<snip>
4. During the usual production load on the two LPARs, IRD seemed to prefer
to keep the LPAR versus MVS busy within 20% of each other, which is a good
thing.
</snip>
I don't think that the 20% was a design goal, but perhaps more an
unintentional effect in your specific tests. To the best of my knowledge,
WLM does not have access to PR/SM Logical Processor Ready Queue information
directly. For that matter, even if WLM were to compute the PR/SM Logical
Processor Ready Queue from intercept information (using MVS Wait and
SMF70PDT), there are circumstances when SMF70EDT can be higher than
SMF70PDT. For example, delayed synchronous requests which PR/SM does not
intercept (see APAR II10549, which was thought to be hardware-related, but
is common when coupling facilities are shared). I have IRD data from sites
in which the LPAR versus MVS busy is WAY more than 20%, so 20% certainly is
not a design objective!
<snip>
It was getting a good solid 70% of the CEC, with 9 LCPUs online. Granted,
it needed 7.7 PCPUs to get that much, but why 9 LCPUs when LPA1 was tanking
so bad?
</snip>
I think that the algorithms in Section 3.9.1 of the Redbook explain this.
<snip>
Doesn't IRD look at the queue length distribution, see that nearly all
the work is in bucket 14+ and go "uh oh, I better see if I can help that one"?
</snip>
No. See Section 3.7 of the Redbook for a good discussion of the WLM logic
involved.
<snip>
My take on IRD, at the moment, is that "it's great, but watch out when you
want to push the CEC to its limits, IRD doesn't like to do that. You're
better off capping LPARs yourself by taking LCPUs offline that you never
want the LPAR to use, regardless of IRD's pie-in-the-sky multiprocessing
recommendations."
</snip>
Gary, I don't as yet have an opinion about the operational aspects of IRD
when you push the CEC to its limits. The data that I've seen has mostly
been from sites where there almost always is some white space, and it
appears that IRD is "wonderful" in such situations.
FYI: Just today, I received an email from a site which has been providing
me with IRD data for my R&D purposes. This site does have 100% utilization
for many hours, and I plan to analyze this data to appreciate what happens
when the CEC is 100% busy. The email said that management has decided to
move off IRD because of perceived performance issues. The site is
graciously sending me post-IRD data so I can analyze the differences from a
basic PR/SM and IRD view. :-)
Best regards,
Don
******
Don Deese, Computer Management Sciences, Inc.
Voice: (703) 922-7027 Fax: (703) 922-7305
http://www.cpexpert.org
******
At 11:18 AM 9/28/2005, you wrote:
Sam,
Here are the messages I posted with questions about what I was seeing
and why. The last one has the most detail.
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R18496&I=1
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R26748&I=1
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1
We assumed that the problem is with our WLM Goals and IRD setup, and
we're still reviewing them to make sure we can rule that out.
Once we're pretty sure it's not our fault, we'll get a PMR opened.
Nothing worse than to waste their time with something we should have
figured out in the first place!
Best regards,
Gary
-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[EMAIL PROTECTED]
Behalf Of Knutson, Sam
Sent: Wednesday, September 28, 2005 10:01 AM
To: [email protected]
Subject: Re: IRD (+)ve R (-)ve
Hi Gary,
Just curious.... Did you open a PMR on this with IBM? What was the
CEC
size and configuration like?
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html