Hi,

This thread was very timely. I have an SFM policy, and it has worked fine since I had it set up a couple of years ago. Last Thursday we had a CEC outage and lost 6 of the 8 systems in this Parallel Sysplex at the same time, along with an ICF on the same CEC as those 6 systems. Yeah, that hurt :-( Another CEC, including an ICF, and an external CF survived. We are still working with IBM to address all the aspects of the CEC outage, but we think its impact would have been reduced or avoided if we had had a recent HIPER MCL 082 (J99673 stream) installed on this 2094.

The SFM policy did not partition the dead systems out of the Sysplex without operator intervention. The remaining two systems kept running but hung until operators manually replied.

I looked for IXC256A and did not find that it was issued, but I am tracking APAR OA14593, "MSGIXC256A NOT RESPONDED TO BECAUSE IT IS NOT READILY AVAILABLE". This was just an interesting APAR I turned up in IBMLink; it seems unrelated to this situation. The recovery hung until operators replied to IXC102A.

Failed 17:31

  17:31:57.17 00000090 *IXC427A SYSTEM BTST HAS NOT UPDATED STATUS SINCE 17:31:05 679
  679 00000090 BUT IS SENDING XCF SIGNALS. XCF SYSPLEX FAILURE MANAGEMENT WILL
  679 00000090 REMOVE SYSTEM BTST IF NO SIGNALS ARE RECEIVED WITHIN A 45
  679 00000090 SECOND INTERVAL.
  17:31:57.17 00000090 *466 IXC426D SYSTEM BTST IS SENDING XCF SIGNALS BUT NOT UPDATING STATUS. REPLY SYSNAME=BTST TO REMOVE THE SYSTEM.
  17:31:57.54 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BSYS
  17:31:57.54 STC32489 00000090 PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
  17:31:58.71 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
  17:31:58.71 STC32489 00000090 PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0C
  17:31:59.63 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
  17:31:59.63 STC32489 00000090 PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
  17:31:59.79 00000094 IEE400I THESE MESSAGES CANCELLED - 466.
  17:32:00.08 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BTST
  17:32:33.12 00000090 *467 IXC102A XCF IS WAITING FOR SYSTEM PT02 DEACTIVATION. REPLY DOWN WHEN MVS ON PT02 HAS BEEN SYSTEM RESET

The WTORs remained outstanding.

  486 R 17.32.39 ASYS *486 IXC102A XCF IS WAITING FOR SYSTEM BEND DEACTIVATION. REPLY DOWN WHEN MVS ON BEND HAS BEEN SYSTEM RESET
  485 R 17.32.33 ASYS *485 IXC102A XCF IS WAITING FOR SYSTEM BTST DEACTIVATION. REPLY DOWN WHEN MVS ON BTST HAS BEEN SYSTEM RESET
  484 R 17.32.33 ASYS *484 IXC102A XCF IS WAITING FOR SYSTEM PT01 DEACTIVATION. REPLY DOWN WHEN MVS ON PT01 HAS BEEN SYSTEM RESET
  482 R 17.32.30 ASYS *482 IXC102A XCF IS WAITING FOR SYSTEM HSYS DEACTIVATION. REPLY DOWN WHEN MVS ON HSYS HAS BEEN SYSTEM RESET
  483 R 17.32.30 ASYS *483 IXC102A XCF IS WAITING FOR SYSTEM BSYS DEACTIVATION. REPLY DOWN WHEN MVS ON BSYS HAS BEEN SYSTEM RESET

  17:39:19.44 CSYS0050 00000290 R 467,DOWN
  17:43:04.60 CSYS0050 00000290 R 486,DOWN
  Etc.

  17:39:22.30 00000090 IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR PT02 324
  324 00000090 - PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGEMENT BECAUSE
  324 00000090 ITS STATUS UPDATE WAS MISSING
  324 00000090 - REASON FLAGS: 000100

We specify ISOLATETIME in our SFM policy. I have been reading the Setting Up a Sysplex manual and IBMLink, but I still don't see exactly why SFM was not able to isolate the failed systems and partition them out of the Sysplex. We had full connectivity with XCF and 3 CFs for all systems in the Sysplex. I expect there are some circumstances SFM cannot handle, but this is exactly the kind of crash we want cleaned up automatically so the remaining systems can process work with minimal interruption.

  DATA TYPE(SFM) REPORT(YES)
  DEFINE POLICY NAME(POLICYS1) CONNFAIL(YES) REPLACE(YES)
    SYSTEM NAME(*) ISOLATETIME(0) WEIGHT(1)
    SYSTEM NAME(ASYS) ISOLATETIME(15) WEIGHT(100)
    SYSTEM NAME(CSYS) ISOLATETIME(15) WEIGHT(75)
    SYSTEM NAME(BSYS) WEIGHT(5)
    SYSTEM NAME(BEND) WEIGHT(5)
    SYSTEM NAME(PT01) WEIGHT(5)

I will probably assemble all the documentation and open an ETR, but so far I don't see anything wrong with the SFM policy. Operators just don't/can't sort all this out and respond fast enough when there is a multi-system failure. If we take an unexpected failure and SFM doesn't handle it without operator intervention, it hurts.

Any ideas? Anything you have done in this area to help speed resolution of multi-system outages? Is an outage this wide something that SFM should be able to handle?

Best Regards,

Sam Knutson, GEICO
Performance and Availability Management
mailto:[EMAIL PROTECTED]
(office) 301.986.3574

Quantized Revision of Murphy's Law: Everything goes wrong all at once.