Hi Gary,

First, I apologize for the length of this posting. You had a lengthy posting at http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1 and my response must be similarly lengthy. I was going to respond directly to you, offline. After writing my response, however, I decided that some of the comments might be interesting to subscribers who run, or are considering, IRD.

Gary, I never received your posting that was a response to my posting to you on 6/9/05 (Subj: Re: IRD & Short Engine Effect at 100% CEC utilization)! I had figured that you simply ignored my response and I was a bit miffed. I have no idea why I did not receive the above posting, since I run my own email/web/ftp server here in my office. Makes me wonder what other email I don't receive...

To my chagrin, I now see that you not only did NOT ignore my posting, but you replied with a very thoughtful and detailed posting. THANKS for the detail. That doesn't normally happen when someone is discussing a problem, and it is very refreshing.

For the past year, I've been working on PR/SM analysis, and for the past several months I've been working exclusively on IRD analysis. Just this week, I've finished the basic "this is what you should do and what you should not do with IRD" per the Redbook. This thread is extraordinarily timely from my R&D perspective, since I'm now moving into the "how does IRD really work in user sites" phase.


That aside, I have a few comments on your posting at
 http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1


<snip>
First, recommendations said to put a Minimum Weight of 1, and leave Maximum Weight blank for Weight Management - which upon roll out we had not done : we'd been conservative and kept the minimum weight equal to at least two engines of capacity because we'd heard of people that had complained of systems getting varied down to just one LCPU and it killed CICS multiprocessing. So we picked two CECs that had the most PCPUs, and set their min to 1, max to blank.
</snip>

Gary, I know that the PR/SM Planning Guides say that specifying Minimum = 1 and Maximum = 999 is the "optimal" specification. However, I believe that the wording is unfortunate because it explicitly says that "1" is an optimal specification. This specification is neither optimal nor recommended. There is a POTENTIALLY serious problem with specifying a Minimum Weight of 1. Essentially, you are saying that WLM LPAR Weight Management can adjust the weight down to "1", which would leave a very low share (potentially causing sysplex disruption). With the default of "0" for Minimum Weight, WLM LPAR Weight Management will ensure that the LPAR never goes below a 5% share of a central processor. However, if you specify a Minimum Weight, WLM will honor your specification (see Section 7.1 of the Redbook; I've confirmed this Redbook paragraph with WLM developers).

While I don't think that it is too likely that WLM would actually reduce the weight to "1", such a reduction is possible. Why take a chance? Leave the Minimum Weight blank and WLM will use the default of zero, and will never reduce a logical processor share to less than about 5% of a physical processor. (BTW: The Redbook says 5% of a CPC in several locations. This is a typo; it should be 5% of a physical processor.)
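To put numbers on that, here is a tiny sketch (the function name and the sample weights are mine, purely for illustration) of how a weight translates into guaranteed share:

```python
def guaranteed_share(weight, total_weight, physical_cps):
    """Equivalent physical central processors implied by an LPAR's
    weight: its fraction of the total weight, times the number of
    physical processors on the CEC."""
    return (weight / total_weight) * physical_cps

# With Minimum Weight = 1 on a 10-way CEC whose weights total 500,
# WLM could legally drive the LPAR down to 0.02 of a physical
# processor - well below the ~5% floor it maintains when the
# minimum is left at the default of zero.
print(guaranteed_share(1, 500, 10))   # 0.02
```

The exact weights don't matter; the point is that a minimum of 1 removes the floor entirely.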


<snip>
1. When both LPARs are idle (50% MVS busy or less), both get all 11 LCPUs online. The book states this is "so the workload can take advantage of increased multiprocessing". This doesn't seem to be an issue. I don't know if it really helps, but it doesn't seem to hurt.
</snip>

This is goodness - it means that you can have more concurrent processes active. The only downside is potential queuing on the PR/SM Logical Processor Ready Queue, but this should not cause performance problems at low utilization.


<snip>
2. When one LPAR is trying to "take over" the CEC and the other LPAR is idle, CPU Vary _ALWAYS_ put all 11 LCPUs online to the busy LPAR and cut back the idle one to no less than 5 LCPUs. Why 5? I don't know, and this was one of the questions I asked to the list -- EXACTLY how does IRD determine how many LCPUs to leave online (i.e. give me the calculation, please)?
</snip>

See the WLM CPU Management algorithm described in Section 3.9.1 of the Redbook (and particularly Section 3.9.2 - WLM Vary CPU Management logic). Focus on the concept that the weight of an LPAR implies share, and appreciate that IRD will not go below "guaranteed" share when it calculates the Equivalent Physical Central Processors. Note that unless WLM LPAR Weight Management had lowered the weight of the "idle" LPAR because of goals being missed in the "busy" LPAR, the "idle" LPAR would be "entitled" to its share as a function of the weight it had before the other LPAR tried to "take over" the CEC. That share might have implied 4 equivalent physical processors, plus the buffer of 1, for a total of 5 physical processors that should remain online. Without seeing actual data, I cannot be certain.
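As a rough sketch of the shape of that calculation (this is my reading of the Section 3.9.2 logic, not actual WLM code, and the real algorithm has more conditions):

```python
import math

def target_online_lcpus(weight, total_weight, physical_cps, defined_lcpus):
    """Approximate WLM Vary CPU Management target: the equivalent
    physical processors implied by the LPAR's weight, rounded up,
    plus one buffer processor, capped at the LPAR's defined LCPUs."""
    equivalent_cps = (weight / total_weight) * physical_cps
    return min(math.ceil(equivalent_cps) + 1, defined_lcpus)

# An "idle" LPAR whose weight implies 4 equivalent physical processors
# keeps 4 + 1 = 5 LCPUs online - matching the 5 LCPUs described above.
print(target_online_lcpus(40, 110, 11, 11))   # 5
```

Again, the specific weights here are invented; only the share-plus-buffer shape of the calculation comes from the Redbook.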

I wrote some code to replicate the IRD algorithms (partially so I can recommend to folks with just plain old PR/SM that they might consider altering the number of logical processors, and partially because I find that writing such code helps me better understand what is going on). I compared my algorithm results with data from some sites running IRD. Sometimes there was a match between my results and SMF70BDA, and sometimes my results were radically different from SMF70BDA. Page 85 of the Redbook explained the differences (but it drove me crazy for awhile).

According to the IRD Redbook (page 85), IRD will bring (or leave) logical processors online if overall CPU utilization is low. That explains why I can see SMF70BDA showing that some LPARs have a lot of logical processors online for many hours (15 logical processors, for example) when my code showed that the number should be much lower (9 logical processors, for example). This does not, of course, explain your situation, since you say that the overall CEC was near 100%. My point is that there is lots of subtle logic in IRD that might not be immediately apparent.
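If I were to fold that page-85 behavior into my replication code, it would look something like this (the 30% threshold is purely a placeholder of mine; the Redbook does not publish the actual value):

```python
def online_lcpus_low_util(base_target, defined_lcpus, cec_busy_pct,
                          low_util_pct=30.0):
    """Sketch of the page-85 override: when overall CEC utilization is
    low, IRD brings (or leaves) extra logical processors online instead
    of varying them off.  The threshold value is an assumption."""
    if cec_busy_pct < low_util_pct:
        return defined_lcpus   # leave all defined LCPUs online
    return base_target

print(online_lcpus_low_util(9, 15, 20.0))   # 15 (low utilization)
print(online_lcpus_low_util(9, 15, 95.0))   # 9  (busy CEC)
```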


<snip>
3. When both LPARs are trying to take over the CEC, with low importance-5 work (a test we contrived), we were frustrated and disappointed by IRD's behavior.
</snip>

I'm not surprised. Very likely, WLM LPAR Weight Management was never invoked.


<snip>
4. During the usual production load on the two LPARs, IRD seemed to prefer to keep the LPAR versus MVS busy within 20% of each other, which is a good thing.
</snip>

I don't think that the 20% was a design goal; more likely it was an unintentional effect in your specific tests. To the best of my knowledge, WLM does not have direct access to PR/SM Logical Processor Ready Queue information. For that matter, even if WLM were to compute the PR/SM Logical Processor Ready Queue from Intercept information (using MVS Wait and SMF70PDT), there are circumstances when SMF70EDT can be higher than SMF70PDT: for example, delayed synchronous requests which PR/SM does not intercept (see APAR II10549, which was thought to be hardware-related, but is common when coupling facilities are shared). I have IRD data from sites in which the LPAR versus MVS busy differ by WAY more than 20%, so 20% certainly is not a design objective!


<snip>
It was getting a good solid 70% of the CEC, with 9 LCPUs online. Granted, it needed 7.7 PCPUs to get that much, but why 9 LCPUs when LPA1 was tanking so bad?
</snip>

I think that the algorithms in Section 3.9.1 of the Redbook explain this.


<snip>
Doesn't IRD look at the queue length distribution, see that nearly all the work is in bucket 14+ and go "uh oh, I better see if I can help that one"?
</snip>

No. See Section 3.7 of the Redbook for a good discussion of the WLM logic involved.


<snip>
My take on IRD, at the moment, is that "it's great, but watch out when you want to push the CEC to its limits, IRD doesn't like to do that. You're better off capping LPARs yourself by taking LCPUs offline that you never want the LPAR to use, regardless of IRD's pie-in-the-sky multiprocessing recommendations."
</snip>

Gary, I don't as yet have an opinion about the operational aspects of IRD when you push the CEC to its limits. The data that I've seen has mostly been from sites where there almost always is some white space, and it appears that IRD is "wonderful" in such situations.

FYI: Just today, I received an email from a site which has been providing me with IRD data for my R&D purposes. This site does have 100% utilization for many hours, and I plan to analyze this data to appreciate what happens when the CEC is 100% busy. The email said that management has decided to move off IRD because of perceived performance issues. The site is graciously sending me post-IRD data so I can analyze the differences from a basic PR/SM and IRD view. :-)

Best regards,

Don

******
Don Deese, Computer Management Sciences, Inc.
Voice: (703) 922-7027  Fax: (703) 922-7305
http://www.cpexpert.org
******



At 11:18 AM 9/28/2005, you wrote:
Sam,

Here are the messages I posted with questions about what I was seeing
and why.  The last one has the most detail.

http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R18496&I=1

http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R26748&I=1

http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1

We assumed that the problem is with our WLM Goals and IRD setup, and
we're still reviewing them to make sure we can rule that out.

Once we're pretty sure it's not our fault, we'll get a PMR opened.
Nothing worse than to waste their time with something we should have
figured out in the first place!

Best regards,

Gary

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[EMAIL PROTECTED]
Behalf Of Knutson, Sam
Sent: Wednesday, September 28, 2005 10:01 AM
To: [email protected]
Subject: Re: IRD (+)ve R (-)ve


Hi Gary,

Just curious....  Did you open a PMR on this with IBM?  What was the CEC size and configuration like?






