Re: IXC102A automated by SFM ?

2006-04-27 Thread Bill Neiman
Sam Knutson wrote:

>The SFM Policy did not partition the dead systems out of the Sysplex 
>without operator intervention.  The remaining two systems kept running 
>but hung up till operators manually replied.

 We have found that there is no right answer for this situation.  
Several years ago, APAR OW30926 introduced the support to consider the 
existence of signalling traffic as an indicator of "aliveness" of a 
system, because I/O delays can prevent a system from updating its 
heartbeat status in the sysplex couple data set even though it is in other 
respects healthy.  However, other installations felt that systems in this 
indeterminate degraded condition should indeed be removed from the 
sysplex.  So the APAR introduced messages IXC427A / IXC426D, which prompt 
for operator guidance.

 To try to eliminate the need for operator intervention in this 
situation, APAR OA11591 has been accepted.  It is still in the design 
phases, while we try to figure out a set of externals that would provide 
the right function and not further muddy an already confusing interface.

 I'm not clear, though, on why IXC427A would have been issued in the 
scenario Sam describes.  In a CEC failure, I would not expect to see 
signals from the systems residing on the affected CEC.  And once the 
signals dried up and the affected systems were really and truly in a 
status update missing (SUM) condition, I would expect SFM to transition 
into its normal processing for isolating those systems in accordance with 
Sam's policy.  I would have to guess that the configuration remaining 
after the CEC failure did not provide the necessary CF connectivity 
between the surviving systems and the dead systems.  (Isolation requires a 
CF with connectivity to both the SUM system and the system that is trying 
to effect its removal.)
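
The connectivity requirement in the parenthetical above can be pictured as a simple set check. The following is only an illustrative Python model, not how XCF implements isolation. The system names come from this thread, but the CF names and the link table are hypothetical.

```python
def cfs_usable_for_isolation(cf_links, sum_system, isolating_system):
    """Return the CFs connected to BOTH the SUM system and the isolating system.

    cf_links maps a CF name to the set of systems that still have
    connectivity to it. An empty result means no CF can carry the
    isolation request, so operator intervention would be needed.
    """
    return sorted(
        cf
        for cf, systems in cf_links.items()
        if sum_system in systems and isolating_system in systems
    )


# Hypothetical connectivity table (assumption, not taken from the thread).
cf_links = {
    "CF_A": {"ASYS", "CSYS"},   # no link to PT02
    "CF_B": {"ASYS", "PT02"},   # links ASYS and PT02
}

print(cfs_usable_for_isolation(cf_links, "PT02", "ASYS"))  # CF_B qualifies
print(cfs_usable_for_isolation(cf_links, "PT02", "CSYS"))  # nothing qualifies
```

In this model, a surviving system with no CF in common with the SUM system gets an empty list back, which corresponds to the case where isolation cannot be attempted.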

 Bill Neiman
 z/OS Development

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html


Re: IXC102A automated by SFM ?

2006-04-26 Thread Barbara Nitz
Knut,

Looking at the messages you provided, why did you expect IXC256A? Were there
problems with the sysplex CDS? Judging from the descriptor code (as I don't
know which module would issue this message, and it is probably OCO anyway),
this was a branch-entered WTO that is only shown for 10 seconds on an MCS
console before 'normal' message traffic takes over again. At that point it
should be seen in the hardcopy log, though, provided you can get to it.

IXC427A was introduced a few years back when a customer had a definitely
running system producing XCF traffic but there was a hardware problem
accessing the sysplex CDS device (I believe some sort of ESCON director
failure). So that system was partitioned out even though it shouldn't have
been. As a safeguard, SFM will only consider a system 'dead' when it is
status update missing *and* no longer producing XCF traffic. I believe that
this applies to your case: XCF traffic but no status update. 
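
The rule described above can be written down as a small decision function. This is a simplified Python sketch of the behavior as described in this thread, not the actual XCF logic. The enum and function names are made up for illustration.

```python
from enum import Enum


class Verdict(Enum):
    ACTIVE = "status is being updated"
    PROMPT_OPERATOR = "no status update but XCF signals seen (IXC427A / IXC426D)"
    STATUS_UPDATE_MISSING = "no status update and no XCF signals (SFM may act)"


def classify(status_updated: bool, sending_signals: bool) -> Verdict:
    """Classify a system from its heartbeat and signalling state."""
    if status_updated:
        return Verdict.ACTIVE
    if sending_signals:
        # Indeterminate case: the system looks alive on the signalling paths,
        # so the operator is asked instead of the system being removed.
        return Verdict.PROMPT_OPERATOR
    return Verdict.STATUS_UPDATE_MISSING


print(classify(status_updated=False, sending_signals=True).value)
```

Only the last verdict corresponds to the state in which SFM treats the system as dead and moves on to its policy processing.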

When reply 466 got DOM'd, apparently the system had stopped its XCF traffic
(entered a wait state itself?), so now SFM would have attempted to failure
isolate the system. Bill would know why SFM cannot do it (the case the book
alludes to, but does not elaborate on). 

So unless there is a way to automate (as in system automation/message trap)
IXC102A (not recommended because of the possibly missing system reset), the
operator will have to reply manually (after system-resetting). 

SA/390 has a part called proc/ops that would allow you to automate the
system reset, I believe. What I don't know is how that would be done and if
it could be done when failure isolation from SFM fails. Presumably they use
the same interface.

>Any ideas? Anything you have done in this area to help speed resolution
>of multi-system outages? Is an outage this wide something that SFM
>should be able to handle?

Our SFM policy has one statement in it:
DEFINE POLICY NAME(SFM01) REPLACE(YES) CONNFAIL(NO)
  SYSTEM NAME(*) WEIGHT(100) PROMPT
We don't even allow automatic removal. And all systems are equal with
respect to weight. 

I think, to preserve data integrity, SFM has done what it could and (without
further logs - you didn't take an sadump, did you?) cannot do more.

Regards, Barbara 



Re: IXC102A automated by SFM ?

2006-04-26 Thread Knutson, Sam
Hi Dave,

We are still working on that, but if you have a z9-109 2094 CPU there is
a recent level of microcode (MCL) 082.J99673, marked HIPER, that
you will probably want to have on.  We had what may or may not have
been a real hardware failure in an MBA fanout card.  The processor
attempted to spare CPs it believed had failed and check stopped.  The
CEC dump, CEC logs, and the replaced MBA card are now with IBM, and we
hope to find out more, but getting current on MCL was clearly the thing
everyone agreed to do first.  We are going to be more aggressive about
getting MCL bundles loaded on the 2094, and even more so with HIPERs.

If you need more details about the MCL you need to go into IBM
ResourceLink and look at bundle 17 074. through and including 082. for
J99673 or talk to your CE.

Best Regards, 

Sam Knutson, GEICO 
Performance and Availability Management 
mailto:[EMAIL PROTECTED] 
(office)  301.986.3574 

Frank Abagnale Sr.: Two little mice fell in a bucket of cream. The first
mouse quickly gave up and drowned. The second mouse, wouldn't quit. He
struggled so hard that eventually he churned that cream into butter and
crawled out. Gentlemen, as of this moment, I am that second mouse.




Re: IXC102A automated by SFM ?

2006-04-26 Thread Jousma, David
Sam,

Can you give us more details on what caused the outage?

Dave 



Dave Jousma
Principal Systems Programmer
[EMAIL PROTECTED]
616.653.8429




Re: IXC102A automated by SFM ?

2006-04-26 Thread Knutson, Sam
Hi,

This thread was very timely.  I have an SFM policy and it has worked
fine since I had it set up a couple of years ago.
Last Thursday we had a CEC outage and lost 6 of the 8 systems in this
Parallel Sysplex at the same time, plus an ICF on the same CEC as those 6
systems.  Yeah, that hurt :-(  Another CEC, including an ICF, and an
external CF survived.  We are still working with IBM to address all
the aspects of the CEC outage, but we think it would have been avoided, or
reduced in impact, if we had a recent HIPER MCL 082 (J99673 stream)
installed on this 2094.  The SFM Policy did not partition the dead
systems out of the Sysplex without operator intervention.  The remaining
two systems kept running but hung up till operators manually replied.

I looked for IXC256A and did not find that it was issued, but I am tracking
APAR OA14593 (MSGIXC256A NOT RESPONDED TO BECAUSE IT IS NOT READILY
AVAILABLE).  This was just an interesting APAR I turned up in IBMLink; it
seems unrelated to this situation.

The recovery hung up until operators replied to IXC102A.

Failed 17:31

17:31:57.17  0090 *IXC427A SYSTEM BTST HAS NOT UPDATED STATUS SINCE 17:31:05 679
 679 0090  BUT IS SENDING XCF SIGNALS. XCF SYSPLEX FAILURE MANAGEMENT WILL
 679 0090  REMOVE SYSTEM BTST IF NO SIGNALS ARE RECEIVED WITHIN A 45
 679 0090  SECOND INTERVAL.

17:31:57.17  0090 *466 IXC426D SYSTEM BTST IS SENDING XCF SIGNALS BUT NOT UPDATING
   STATUS. REPLY SYSNAME=BTST TO REMOVE THE SYSTEM.
17:31:57.54 STC32489 0090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BSYS
17:31:57.54 STC32489 0090  PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
17:31:58.71 STC32489 0090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
17:31:58.71 STC32489 0090  PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0C
17:31:59.63 STC32489 0090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
17:31:59.63 STC32489 0090  PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
17:31:59.79  0094  IEE400I THESE MESSAGES CANCELLED - 466.

17:32:00.08 STC32489 0090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BTST

17:32:33.12  0090 *467 IXC102A XCF IS WAITING FOR SYSTEM PT02 DEACTIVATION. REPLY DOW
   WHEN MVS ON PT02 HAS BEEN SYSTEM RESET

The WTORs remained outstanding.

486 R 17.32.39 ASYS  *486 IXC102A XCF IS WAITING   
 FOR SYSTEM BEND DEACTIVATION. 
 REPLY DOWN WHEN MVS ON BEND   
 HAS BEEN SYSTEM RESET 
485 R 17.32.33 ASYS  *485 IXC102A XCF IS WAITING   
 FOR SYSTEM BTST DEACTIVATION. 
 REPLY DOWN WHEN MVS ON BTST   
 HAS BEEN SYSTEM RESET 
484 R 17.32.33 ASYS  *484 IXC102A XCF IS WAITING   
 FOR SYSTEM PT01 DEACTIVATION. 
 REPLY DOWN WHEN MVS ON PT01   
 HAS BEEN SYSTEM RESET 
482 R 17.32.30 ASYS  *482 IXC102A XCF IS WAITING   
 FOR SYSTEM HSYS DEACTIVATION. 
 REPLY DOWN WHEN MVS ON HSYS   
 HAS BEEN SYSTEM RESET 
483 R 17.32.30 ASYS  *483 IXC102A XCF IS WAITING   
 FOR SYSTEM BSYS DEACTIVATION. 
 REPLY DOWN WHEN MVS ON BSYS   
 HAS BEEN SYSTEM RESET


17:39:19.44 CSYS0050 0290  R 467,DOWN
17:43:04.60 CSYS0050 0290  R 486,DOWN
Etc.  

17:39:22.30  0090  IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR PT02 324
 324 0090  - PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGEMENT BECAUSE
 324 0090  ITS STATUS UPDATE WAS MISSING
 324 0090  - REASON FLAGS: 000100
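
To put a number on how long the recovery sat waiting, the timestamps in the log excerpts above can be subtracted directly. The Python below is just a quick sketch for PT02 using the three times shown in the log (IXC102A issued, R 467,DOWN entered, IXC105I issued). The helper name is made up.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two HH:MM:SS.ff console timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Timestamps copied from the log excerpts for system PT02.
wtor_issued = "17:32:33.12"      # *467 IXC102A
operator_reply = "17:39:19.44"   # R 467,DOWN
partition_done = "17:39:22.30"   # IXC105I

wait_for_reply = elapsed_seconds(wtor_issued, operator_reply)
reply_to_done = elapsed_seconds(operator_reply, partition_done)

print(f"waiting on operator: {wait_for_reply:.2f} s")
print(f"reply to IXC105I:    {reply_to_done:.2f} s")
```

For PT02 this works out to a little under seven minutes waiting on the reply, and under three seconds from the reply to partitioning complete.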

 
 
We specify ISOLATETIME in our SFM policy.  I have been reading the
Setting up Sysplex manual and IBMLink but still don't see exactly why
SFM was not able to isolate the failed systems and partition them out of
the Sysplex.  We had full connectivity with XCF and 3 CFs for all systems
in the Sysplex.  I expect there are some circumstances SFM cannot
handle, but this is exactly the kind of crash we want cleaned up
automatically so the remaining systems can process work with minimal
interruption.

  
DATA TYPE(SFM) REPORT(YES)