Hi Gary,
First, I apologize for the length of this posting. You had a lengthy
posting at http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1
and my response must be similarly lengthy. I was going to respond directly
to you, offline. After writing my response, I decided that some of the
comments might be interesting to subscribers who run or are considering IRD.
Gary, I never received your posting that was a response to my posting to
you on 6/9/05 (Subj: Re: IRD & Short Engine Effect at 100% CEC
utilization)! I had figured that you simply ignored my response and I was
a bit miffed. I have no idea why I did not receive the above posting,
since I run my own email/web/ftp server here in my office. Makes me wonder
what other email I don't receive...
To my chagrin, I now see that you not only did NOT ignore my posting, but
you replied with a very thoughtful and detailed posting. THANKS for the
detail. That doesn't normally happen when someone is discussing a problem,
and it is very refreshing.
For the past year, I've been working on PR/SM analysis, and for the past
several months I've been working exclusively on IRD analysis. Just this
week, I've finished the basic "this is what you should do and what you
should not do with IRD" per the Redbook. This thread is extraordinarily
timely from my R&D perspective, since I'm now moving into the "how does IRD
really work in user sites" phase.
That aside, I have a few comments on your posting at
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1
<snip>
First, recommendations said to put a Minimum Weight of 1, and leave Maximum
Weight blank for Weight Management - which upon rollout we had not done:
we'd been conservative and kept the minimum weight equal to at least two
engines of capacity because we'd heard of people that had complained of
systems getting varied down to just one LCPU and it killed CICS
multiprocessing. So we picked the two CECs that had the most PCPUs, and
set their min to 1, max to blank.
</snip>
Gary, I know that the PR/SM Planning Guides say that specifying Minimum = 1
and Maximum = 999 is the "optimal" specification. However, I believe that
the wording is unfortunate because it explicitly says that "1" is an
optimal specification. This specification is neither optimal nor
recommended. There is a POTENTIAL serious problem with specifying a
Minimum Weight of 1. Essentially, you are saying that WLM LPAR Weight
Management can adjust the weight down to "1" which would leave a very low
share (potentially causing sysplex disruption). With the default of "0"
for Minimum weight, WLM LPAR Weight Management will ensure that the LPAR
never goes below 5% share of a central processor. However, if you specify
a Minimum Weight, WLM will honor your specification (see Section 7.1 of the
Redbook, and I've confirmed this Redbook paragraph with WLM developers).
While I don't think that it is too likely that WLM would actually reduce
the weight to "1", such a reduction is possible. Why take a chance? Leave
the Minimum Weight blank, and WLM will use the default of zero, never
reducing a logical processor's share to less than about 5% of a physical
processor. (BTW: The Redbook says 5% of a CPC in several locations. This
is a typo; it should be 5% of a physical processor.)
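To make the arithmetic concrete, here is a small sketch of the two floors. This is purely my own illustration (not IBM's code), and the CEC size and weights below are hypothetical:

```python
# Illustrative sketch only -- not IBM's algorithm; all numbers hypothetical.

def share_in_cps(weight, total_weight, shared_physical_cps):
    """An LPAR's guaranteed share, expressed in physical CPs."""
    return weight * shared_physical_cps / total_weight

# Hypothetical CEC: 10 shared physical CPs, sum of all LPAR weights = 1000.
TOTAL_WEIGHT = 1000
PHYSICAL_CPS = 10

# Minimum Weight = 1: WLM honors your specification, so it could drive the
# LPAR down to a weight of 1 -- only 0.01 of a physical processor.
floor_min_weight_1 = share_in_cps(1, TOTAL_WEIGHT, PHYSICAL_CPS)

# Minimum Weight blank (default 0): WLM never reduces the share below
# roughly 5% of one physical processor.
floor_default = 0.05

print(floor_min_weight_1)  # 0.01 -- one fifth of the default floor
print(floor_default)       # 0.05
```

On this hypothetical box, the "optimal" Minimum Weight of 1 permits a share five times smaller than the share you get by leaving the field blank.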
<snip>
1. When both LPARs are idle (50% MVS busy or less), both get all 11 LCPUs
online. The book states this is "so the workload can take advantage of
increased multiprocessing". This doesn't seem to be an issue. I don't know
if it really helps, but it doesn't seem to hurt.
</snip>
This is goodness - it means that you can have more concurrent processes
active. The only downside is potential queuing on the PR/SM Logical
Processor Ready Queue, but this should not cause performance problems at
low utilization.
<snip>
2. When one LPAR is trying to "take over" the CEC and the other LPAR is
idle, CPU Vary _ALWAYS_ puts all 11 LCPUs online to the busy LPAR and cuts
back the idle one to no less than 5 LCPUs. Why 5? I don't know, and this
was one of the questions I asked to the list -- EXACTLY how does IRD
determine how many LCPUs to leave online (i.e. give me the calculation,
please)?
</snip>
See the WLM CPU Management algorithm described in Section 3.9.1 of the
Redbook (and particularly Section 3.9.2 - WLM Vary CPU Management
logic). Focus on the concept that the weight of an LPAR implies share, and
appreciate that IRD will not go below "guaranteed" share when it calculates
the Equivalent Physical Central Processors. Note that unless WLM LPAR
Weight Management had lowered the weight of the "idle" LPAR because of
goals being missed in the "busy" LPAR, the "idle" LPAR would be "entitled"
to its share as a function of the weight it had before the other LPAR tried
to "take over" the CEC. That share might have implied 4 equivalent
physical processors, plus the buffer of 1, for a total of 5 physical
processors that should remain online. Without seeing actual data, I cannot
be certain.
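As a rough sketch of that share arithmetic (my own reconstruction of the idea in Section 3.9.2, using hypothetical weights; not IBM's actual code):

```python
# Sketch of the Vary CPU Management idea: keep enough LCPUs online to
# cover the LPAR's guaranteed share in equivalent physical CPs, plus a
# buffer of one. My reconstruction; the numbers are hypothetical.
import math

def lcpus_to_stay_online(weight, total_weight, shared_physical_cps):
    equivalent_cps = weight * shared_physical_cps / total_weight
    return math.ceil(equivalent_cps) + 1

# Hypothetical: 11 shared physical CPs; the "idle" LPAR still holds
# 4/11 of the total weight, so its share implies 4 equivalent CPs.
print(lcpus_to_stay_online(400, 1100, 11))  # 4 equivalent CPs + 1 = 5
```

Under these assumed weights, the "idle" LPAR keeps 5 LCPUs online, which matches the behavior you observed.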
I wrote some code to replicate the IRD algorithms (partially so I can
recommend to folks with just plain old PR/SM that they might consider
altering the number of logical processors, and partially because I find
that writing such code helps me better understand what is going on). I
compared my algorithm results with data from some sites running IRD.
Sometimes my results matched SMF70BDA, and sometimes my
results were radically different from SMF70BDA. Page 85 of the Redbook
explained the differences (but it drove me crazy for a while).
According to the IRD Redbook (page 85), IRD will bring (or leave) logical
processors online if overall CPU utilization is low. That explains why I
can see SMF70BDA showing that some LPARs have a lot of logical processors
online for many hours (15 logical processors, for example) when my code
showed that the number should be much lower (9 logical processors, for
example). This does not, of course, explain your situation since you say
that the overall CEC was near 100%. My point is that there is lots of
subtle logic in IRD that might not be immediately apparent.
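The page-85 behavior could be caricatured like this; note that the 50% utilization threshold is my assumption for illustration, not a documented IBM value:

```python
# Caricature of the page-85 behavior: at low overall CEC utilization,
# IRD leaves LCPUs online rather than trimming to the share-based
# number. The 50% threshold is my assumption, not a documented value.

def lcpus_after_low_util_check(share_based_lcpus, configured_lcpus,
                               cec_utilization, threshold=0.50):
    if cec_utilization < threshold:
        return configured_lcpus   # leave them all online
    return share_based_lcpus      # trim to the share-based number

# E.g. share arithmetic says 9 LCPUs, 15 are configured, CEC 30% busy:
print(lcpus_after_low_util_check(9, 15, 0.30))  # 15
```

That would produce exactly the SMF70BDA pattern I described: 15 LCPUs online for hours while the share-based number is 9.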
<snip>
3. When both LPARs are trying to take over the CEC, with low importance-5
work (a test we contrived), we were frustrated and disappointed by IRD's
behavior.
</snip>
I'm not surprised. Very likely, LPAR Weight management was never invoked.
<snip>
4. During the usual production load on the two LPARs, IRD seemed to prefer
to keep the LPAR versus MVS busy within 20% of each other, which is a good
thing.
</snip>
I don't think that the 20% was a design goal, but perhaps more an
unintentional effect in your specific tests. To the best of my knowledge,
WLM does not have access to PR/SM Logical Processor Ready Queue information
directly. For that matter, even if WLM were to compute the PR/SM Logical
Processor Ready Queue from intercept information (using MVS Wait and
SMF70PDT), there are circumstances when SMF70EDT can be higher than
SMF70PDT. For example, delayed synchronous requests which PR/SM does not
intercept (see APAR II10549, which was thought to be hardware-related, but
is common when coupling facilities are shared). I have IRD data from sites
in which the LPAR versus MVS busy is WAY more than 20%, so 20% certainly is
not a design objective!
<snip>
It was getting a good solid 70% of the CEC, with 9 LCPUs online. Granted,
it needed 7.7 PCPUs to get that much, but why 9 LCPUs when LPA1 was tanking
so bad?
</snip>
I think that the algorithms in Section 3.9.1 of the Redbook explain this.
<snip>
Doesn't IRD look at the queue length distribution, see that nearly all
the work is in bucket 14+ and go "uh oh, I better see if I can help that one"?
</snip>
No. See Section 3.7 of the Redbook for a good discussion of the WLM logic
involved.
<snip>
My take on IRD, at the moment, is that "it's great, but watch out when you
want to push the CEC to its limits, IRD doesn't like to do that. You're
better off capping LPARs yourself by taking LCPUs offline that you never
want the LPAR to use, regardless of IRD's pie-in-the-sky multiprocessing
recommendations."
</snip>
Gary, I don't as yet have an opinion about the operational aspects of IRD
when you push the CEC to its limits. The data that I've seen has mostly
been from sites where there almost always is some white space, and it
appears that IRD is "wonderful" in such situations.
FYI: Just today, I received an email from a site which has been providing
me with IRD data for my R&D purposes. This site does have 100% utilization
for many hours, and I plan to analyze this data to appreciate what happens
when the CEC is 100% busy. The email said that management has decided to
move off IRD because of perceived performance issues. The site is
graciously sending me post-IRD data so I can analyze the differences from a
basic PR/SM and IRD view. :-)
Best regards,
Don
******
Don Deese, Computer Management Sciences, Inc.
Voice: (703) 922-7027 Fax: (703) 922-7305
http://www.cpexpert.org
******
At 11:18 AM 9/28/2005, you wrote:
Sam,
Here are the messages I posted with questions about what I was seeing
and why. The last one has the most detail.
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R18496&I=1
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R26748&I=1
http://bama.ua.edu/cgi-bin/wa?A2=ind0506&L=ibm-main&P=R39973&I=1
We assumed that the problem is with our WLM Goals and IRD setup, and
we're still reviewing them to make sure we can rule that out.
Once we're pretty sure it's not our fault, we'll get a PMR opened.
Nothing worse than to waste their time with something we should have
figured out in the first place!
Best regards,
Gary
-----Original Message-----
From: IBM Mainframe Discussion List [mailto:[EMAIL PROTECTED]
Behalf Of Knutson, Sam
Sent: Wednesday, September 28, 2005 10:01 AM
To: [email protected]
Subject: Re: IRD (+)ve R (-)ve
Hi Gary,
Just curious.... Did you open a PMR on this with IBM? What was the
CEC
size and configuration like?
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html