Re: how to lose a sysplex in 30 seconds

Gil Peleg Mon, 28 Nov 2005 06:14:28 -0800

Bill,
Thanks a lot for the explanation.

Gil.



On 11/28/05, Bill Neiman <[EMAIL PROTECTED]> wrote:
>
> Gil,
>
>     When any system detects a permanent I/O error during an attempt to
> access a couple data set, it initiates removal of that CDS from service.
> The removal protocol involves notifying all other systems of the error by
> XCF signal, which causes each of the other systems to remove the CDS from
> service as well.  Although you say you lost connectivity between your
> sites, it must have been the case that signalling connectivity still
> existed between them.  Otherwise, MVSA could not have reacted to the loss
> of the primary sysplex CDS detected by MVSB.  The existence of signalling
> connectivity created a race condition, in which MVSA and MVSB were
> competing to detect and report the loss of access to the CDS at their
> respective sites.  MVSB won the race, detecting and signalling the loss of
> the primary CDS before MVSA detected loss of the alternate.  MVSA got
> MVSB's signal, initiated removal of the primary, and then detected the
> inaccessibility of the alternate.  In that situation, with only one CDS
> remaining, MVSA enters a wait state but does not signal the loss of the
> remaining CDS, in the hope that its access problem is only a local issue
> (which it was).
> MVSB therefore remained alive, because it was still able to use the
> alternate CDS.
>
>     The CDS removal protocol requires that each system acknowledge the
> removal signals sent by each other system.  MVSA apparently died before
> acknowledging one of MVSB's signals, so MVSB was unable to complete
> removal of the primary CDS.  Hence the IXC256A message.  I'm not sure why
> a D R,R failed to display the outstanding message, since IXC256A is issued
> with descriptor code 11.  Our usual recommendation is that the
> installation (1) maintain a console defined with DEL(RD) and with
> routecode and level attributes that collect action and eventual action
> messages, and/or (2) automate IXC256A.
>
>     In the 7-1 case, the same race condition would exist.  If the "1"
> system detected and signalled the loss of one CDS before any of the "7"
> systems detected and signalled the loss of the other, you'd wind up with 7
> systems down and 1 up but hung waiting for the resolution of IXC256A.
>
>     To resolve IXC256A in this situation, it is necessary to partition
> the (wait-stated) systems named in it out of the sysplex.  Since a
> permanent error involving the sysplex CDS is in progress, this would
> require the FORCE form of the V XCF command (V XCF,sysname,OFF,FORCE).
> This response is documented with IXC256A.
>
>     Bill Neiman
>     z/OS Development
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
> Search the archives at http://bama.ua.edu/archives/ibm-main.html
>
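
To make the race concrete, here is a minimal Python sketch of the sequence
Bill describes. It is a toy model under stated assumptions, not XCF's actual
logic: the System class, simulate(), force_commands() and the "PRIMARY" /
"ALTERNATE" labels are names invented for illustration, each system is assumed
to lose access to exactly one CDS, and signal acknowledgement is not modeled
beyond recording which systems end up in a wait state.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    name: str
    lost_cds: str  # the one CDS this system can no longer reach (assumption)
    in_use: set = field(default_factory=lambda: {"PRIMARY", "ALTERNATE"})
    wait_stated: bool = False


def simulate(systems, detection_order):
    """Replay I/O-error detections in the given order.

    Returns (up, down): names of systems still running and names of
    systems that entered a wait state.
    """
    by_name = {s.name: s for s in systems}
    for name in detection_order:
        s = by_name[name]
        # Nothing to detect if the system is already down, or if the CDS it
        # cannot reach has already been removed from service.
        if s.wait_stated or s.lost_cds not in s.in_use:
            continue
        if len(s.in_use) == 1:
            # Last remaining CDS is unreachable: wait state, no signal sent,
            # in case the problem is only local.
            s.wait_stated = True
            continue
        # First detector: remove the CDS and signal every running system,
        # each of which removes it from service too.
        for peer in systems:
            if not peer.wait_stated:
                peer.in_use.discard(s.lost_cds)
    up = [s.name for s in systems if not s.wait_stated]
    down = [s.name for s in systems if s.wait_stated]
    return up, down


def force_commands(down):
    """Build the command form quoted above for each wait-stated system."""
    return [f"V XCF,{name},OFF,FORCE" for name in down]


if __name__ == "__main__":
    two = [System("MVSA", "ALTERNATE"), System("MVSB", "PRIMARY")]
    up, down = simulate(two, ["MVSB", "MVSA"])  # MVSB wins the race
    print(up, down)              # ['MVSB'] ['MVSA']
    print(force_commands(down))  # ['V XCF,MVSA,OFF,FORCE']
```

With seven systems losing the alternate and one losing the primary, letting
the single system detect first leaves it as the only survivor in this model,
which matches the 7-1 outcome described above.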

