When we migrated from a z9 to a z196 - a big-bang swap - we had a lot of trouble getting one sysplex up. The first system came up fine, but the second or third system could not be IPLed into that plex. It ended up in a GRS (!) wait state 0A3, reason 9C. The accompanying ISG message would only have been readable in a standalone dump, never on a real console, and my colleagues had not taken an SADUMP. This was on z/OS 1.10.

It turned out that there was nothing whatsoever wrong with the GRS lock structure. There was nothing wrong with GRS *at all*. The cause was an IBM design change in z/OS 1.9, where IBM unilaterally decided to give up on the concept of MAXSYSTEM determining how many systems can join a sysplex.

Our sysplex CDS was formatted with MAXSYSTEM(5), because there used to be 5 systems in that sysplex. Two of them had been gone for more than a year, and those two still occupied the first two slots in the sysplex CDS. The CFRM CDS had to be reformatted for the big-bang replacement, and it was formatted with MAXSYSTEM(3), which reflected the true capacity of the sysplex. Well, the capacity of *that* sysplex was exactly one, because every other system would get a 'CFRM CDS unusable', due to the fact that the sysplex CDS had a higher MAXSYSTEM value.

In addition, it was clearly visible that the incoming system *had* established signalling connectivity with the system already in the sysplex, which it could only do by *successfully* reading the CFRM CDS to get at the names of the signalling structures. Also, the reply I to 're-initialize the sysplex' when the first system was IPLed, and the accompanying explanation in the docs that everything in the sysplex will be treated as residual, are both wrong. *Nothing* is treated as residual.

The 1.12 documentation for the MAXSYSTEM parameter (right, *everybody* looks at that book *every* time a canned job that has existed since the dawn of sysplex is submitted) says this:

"When formatting the couple data set to contain the CFRM policy data, ensure that the value you specify for MAXSYSTEM matches the value for MAXSYSTEM that was used when the sysplex couple data set was formatted. When coupling facility structures are used for XCF signaling, if the MAXSYSTEM value specified for the CFRM couple data set is less than that of the sysplex couple data set, systems might not be able to join the sysplex.
For example, if MAXSYSTEM=16 is specified for the sysplex couple data set and MAXSYSTEM=8 is specified for the CFRM couple data set, then only eight systems will be allowed in the sysplex."

This clearly implies that the lowest of all the MAXSYSTEM values is the capacity of the sysplex. IT IS NOT. The capacity is unpredictable, especially if your sysplex CDS is so old that it still has systems in it that are long gone. Those will be preserved, along with all the junk that might once have been in the sysplex.

IBM told me (and I give a big thanks to the lady who actually went beyond the canned answer I first got and *looked* into this, despite the fact that all I had in terms of docs was a syslog) that all of this is broken as designed:

"Cluster MR support introduced the requirement to preserve information about manageable resources (which includes sysplexes, systems, CF's, structures and connectors) across a sysplex-wide IPL. To XCF, a sysplex, system, etc. that terminates and is reIPLed is an entirely new entity. To the MR infrastructure, however, it is the same entity transitioning between different states. Therefore, XCF as of V1R9 needs to preserve and reuse system slots whenever possible, regardless of the reply "I" to IXC405D, as we see."

and

"We agree that the design change in R9 does make the actual system capacity unpredictable. Also, the MAXSYSTEM write up in Setting up a Sysplex needs to be cleaned up. Thus, a documentation update is in order."

The unpredictability of MAXSYSTEM is apparently addressed in 1.12 by 'fixing' SUG APAR OA27634, which makes the GRS wait state message visible by spitting out a readable message instead of wait stating the system.

So, if you have old and long gone systems in your sysplex CDS, happen to get a lower MAXSYSTEM value in your CFRM CDS, and end up in a wait state 0A3 - delete and reformat your sysplex CDS. That will 'fix' the problem that has nothing whatsoever to do with GRS.
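
One way to catch the mismatch before it bites is to compare the MAXSYSTEM values in the format utility control statements before the job is submitted. The following is only a rough sketch in Python, not anything IBM ships: the function names, the sysplex name, the data set names and the volsers are all made up for illustration, and it assumes the control statements use the DEFINEDS, MAXSYSTEM(n) and DATA TYPE(x) keywords. It also only sees what is written in the statements you feed it, so the value of the sysplex CDS that is actually in use has to come from its original format job.

```python
import re

# One DEFINEDS block runs up to the next DEFINEDS keyword or the end of input.
_BLOCK = re.compile(r"DEFINEDS\b(.*?)(?=DEFINEDS\b|\Z)", re.S | re.I)
_MAXSYS = re.compile(r"MAXSYSTEM\((\d+)\)", re.I)
_TYPE = re.compile(r"DATA\s+TYPE\((\w+)\)", re.I)


def maxsystem_by_type(sysin):
    """Map each couple data set type to the MAXSYSTEM value coded for it.

    Blocks without an explicit MAXSYSTEM or DATA TYPE are ignored.
    """
    values = {}
    for block in _BLOCK.findall(sysin):
        cds_type = _TYPE.search(block)
        maxsys = _MAXSYS.search(block)
        if cds_type and maxsys:
            values[cds_type.group(1).upper()] = int(maxsys.group(1))
    return values


def mismatches(sysin):
    """Return the types whose MAXSYSTEM differs from the sysplex CDS value."""
    values = maxsystem_by_type(sysin)
    base = values.get("SYSPLEX")
    if base is None:
        return {}
    return {t: v for t, v in values.items() if v != base}


# Hypothetical control statements mirroring the values in this story.
SAMPLE = """
  DEFINEDS SYSPLEX(PLEX1)
           DSN(SYS1.PLEX1.CDS01) VOLSER(VOL001)
           MAXSYSTEM(5)
     DATA TYPE(SYSPLEX)
           ITEM NAME(GROUP) NUMBER(50)
  DEFINEDS SYSPLEX(PLEX1)
           DSN(SYS1.PLEX1.CFRM01) VOLSER(VOL002)
           MAXSYSTEM(3)
     DATA TYPE(CFRM)
           ITEM NAME(POLICY) NUMBER(4)
"""

if __name__ == "__main__":
    for cds_type, value in mismatches(SAMPLE).items():
        print(f"{cds_type} CDS has MAXSYSTEM({value}), sysplex CDS differs")
```

With the values from our case, 5 for the sysplex CDS and 3 for the CFRM CDS, the check flags the CFRM definition. It says nothing about stale system slots in an old sysplex CDS, which only a fresh format gets rid of.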

Do NOT be fooled by the fact that signalling is established and the CFRM CDS was usable for *that* - we are not supposed to care about such inconsistencies.

In any case, we will never run into this again. A permanent part of our DR setup is now to always delete both the sysplex CDS and the CFRM CDS and redefine them from scratch, in order to avoid this unpredictability.

Barbara Nitz