Hi,

This thread was very timely.  I have an SFM policy and it has worked
fine since I had it set up a couple of years ago.
Last Thursday we had a CEC outage and lost 6 of the 8 systems in this
Parallel Sysplex at the same time, along with an ICF on the same CEC as
those 6 systems.  Yeah, that hurt :-(  Another CEC, including an ICF,
and an external CF survived.  We are still working with IBM to address
all the aspects of the CEC outage, but we think its impact would have
been reduced or avoided if we had a recent HIPER MCL 082 (J99673 stream)
installed on this 2094.  The SFM policy did not partition the dead
systems out of the Sysplex without operator intervention.  The remaining
two systems kept running but hung until operators manually replied.

I looked for IXC256A and did not find that it was issued, but I am
tracking APAR OA14593, MSGIXC256A NOT RESPONDED TO BECAUSE IT IS NOT
READILY AVAILABLE.  This was just an interesting APAR I turned up in
IBMLink; it seems unrelated to this situation.

The recovery hung up till operators replied to IXC102A.

The failure occurred at 17:31:

17:31:57.17          00000090 *IXC427A SYSTEM BTST HAS NOT UPDATED STATUS SINCE 17:31:05 679
                 679 00000090  BUT IS SENDING XCF SIGNALS. XCF SYSPLEX FAILURE MANAGEMENT WILL
                 679 00000090  REMOVE SYSTEM BTST IF NO SIGNALS ARE RECEIVED WITHIN A 45
                 679 00000090  SECOND INTERVAL.

17:31:57.17          00000090 *466 IXC426D SYSTEM BTST IS SENDING XCF SIGNALS BUT NOT UPDATING
                               STATUS. REPLY SYSNAME=BTST TO REMOVE THE SYSTEM.
17:31:57.54 STC32489 00000090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BSYS
17:31:57.54 STC32489 00000090  PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
17:31:58.71 STC32489 00000090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
17:31:58.71 STC32489 00000090  PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0C
17:31:59.63 STC32489 00000090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
17:31:59.63 STC32489 00000090  PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
17:31:59.79          00000094  IEE400I THESE MESSAGES CANCELLED - 466.

17:32:00.08 STC32489 00000090  PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BTST

17:32:33.12          00000090 *467 IXC102A XCF IS WAITING FOR SYSTEM PT02 DEACTIVATION. REPLY DOW
                               WHEN MVS ON PT02 HAS BEEN SYSTEM RESET

The WTORs remained outstanding:

486 R 17.32.39 ASYS              *486 IXC102A XCF IS WAITING   
                                 FOR SYSTEM BEND DEACTIVATION. 
                                 REPLY DOWN WHEN MVS ON BEND   
                                 HAS BEEN SYSTEM RESET         
485 R 17.32.33 ASYS              *485 IXC102A XCF IS WAITING   
                                 FOR SYSTEM BTST DEACTIVATION. 
                                 REPLY DOWN WHEN MVS ON BTST   
                                 HAS BEEN SYSTEM RESET         
484 R 17.32.33 ASYS              *484 IXC102A XCF IS WAITING   
                                 FOR SYSTEM PT01 DEACTIVATION. 
                                 REPLY DOWN WHEN MVS ON PT01   
                                 HAS BEEN SYSTEM RESET         
482 R 17.32.30 ASYS              *482 IXC102A XCF IS WAITING   
                                 FOR SYSTEM HSYS DEACTIVATION. 
                                 REPLY DOWN WHEN MVS ON HSYS   
                                 HAS BEEN SYSTEM RESET         
483 R 17.32.30 ASYS              *483 IXC102A XCF IS WAITING   
                                 FOR SYSTEM BSYS DEACTIVATION. 
                                 REPLY DOWN WHEN MVS ON BSYS   
                                 HAS BEEN SYSTEM RESET


17:39:19.44 CSYS0050 00000290  R 467,DOWN    
17:43:04.60 CSYS0050 00000290  R 486,DOWN
Etc.      
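
To put a number on the operator delay, here is a minimal Python sketch that
computes how long two of the IXC102A WTORs stayed outstanding, using only the
timestamps quoted above.  The issue time for reply 486 comes from the
outstanding-reply display, which shows whole seconds only, so the hundredths
are assumed to be .00.  The function and variable names are my own.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# (reply id, system, WTOR issued, operator reply), copied from the log excerpts.
# The 486 issue time is shown only as 17.32.39, so ".00" is an assumption.
EVENTS = [
    ("467", "PT02", "17:32:33.12", "17:39:19.44"),
    ("486", "BEND", "17:32:39.00", "17:43:04.60"),
]


def wait_seconds(issued: str, replied: str) -> float:
    """Seconds between a WTOR being issued and the operator reply (same day)."""
    delta = datetime.strptime(replied, FMT) - datetime.strptime(issued, FMT)
    return delta.total_seconds()


for reply_id, system, issued, replied in EVENTS:
    secs = wait_seconds(issued, replied)
    print(f"IXC102A {reply_id} ({system}): outstanding "
          f"{secs:.2f}s ({secs / 60:.1f} min)")
```

That works out to roughly 6.8 minutes for PT02 and 10.4 minutes for BEND,
against ISOLATETIME values of 0 to 15 seconds in the policy.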

17:39:22.30          00000090  IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR PT02 324
                 324 00000090  - PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGEMENT BECAUSE
                 324 00000090  ITS STATUS UPDATE WAS MISSING
                 324 00000090  - REASON FLAGS: 000100

We specify ISOLATETIME in our SFM policy.  I have been reading the
Setting Up a Sysplex manual and IBMLink but still don't see exactly why
SFM was not able to isolate the failed systems and partition them out of
the Sysplex.  We had full connectivity with XCF and 3 CFs for all
systems in the Sysplex.  I expect there are some circumstances SFM
cannot handle, but this is exactly the kind of crash we want cleaned up
automatically so the remaining systems can process work with minimal
interruption.

                                                              
DATA TYPE(SFM) REPORT(YES)                                    
                                                              
DEFINE POLICY NAME(POLICYS1) CONNFAIL(YES) REPLACE(YES)       
                                                              
   SYSTEM NAME(*)                                             
      ISOLATETIME(0)                                          
      WEIGHT(1)                                               
                                                              
   SYSTEM NAME(ASYS)                                          
      ISOLATETIME(15)                                         
      WEIGHT(100)                                             
                                                              
   SYSTEM NAME(CSYS)                                          
      ISOLATETIME(15)                                         
      WEIGHT(75)                                              
                                                              
   SYSTEM NAME(BSYS)                                          
      WEIGHT(5)                                               
                                                              
   SYSTEM NAME(BEND)                                          
      WEIGHT(5)                                               
                                                              
   SYSTEM NAME(PT01)                                          
      WEIGHT(5)                                               
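
As a cross-check on what each system ends up with, here is a small Python
sketch that parses the SYSTEM statements above and resolves the effective
ISOLATETIME and WEIGHT per system.  It assumes that NAME(*) supplies the
values for systems that are not named (PT02, HSYS, BTST) and also for any
keyword omitted on a named statement (ISOLATETIME on BSYS, BEND, PT01).
That fallback rule is an assumption to verify against Setting Up a Sysplex,
and the parser is a hypothetical helper, not part of IXCMIAPU.

```python
import re

# SYSTEM statements copied from the policy above.
POLICY = """
SYSTEM NAME(*)     ISOLATETIME(0)  WEIGHT(1)
SYSTEM NAME(ASYS)  ISOLATETIME(15) WEIGHT(100)
SYSTEM NAME(CSYS)  ISOLATETIME(15) WEIGHT(75)
SYSTEM NAME(BSYS)  WEIGHT(5)
SYSTEM NAME(BEND)  WEIGHT(5)
SYSTEM NAME(PT01)  WEIGHT(5)
"""

TOKEN = re.compile(r"SYSTEM\s+NAME\(([^)]+)\)|(ISOLATETIME|WEIGHT)\((\d+)\)")


def parse_policy(text: str) -> dict:
    """Return {system name: {keyword: value}} for each SYSTEM statement."""
    systems = {}
    current = None
    for match in TOKEN.finditer(text):
        if match.group(1):                      # a new SYSTEM NAME(...) statement
            current = match.group(1)
            systems[current] = {}
        elif current is not None:               # keyword belonging to that system
            systems[current][match.group(2)] = int(match.group(3))
    return systems


def effective(systems: dict, name: str) -> dict:
    """Explicit values win; NAME(*) fills the gaps (assumed rule, see above)."""
    return {**systems.get("*", {}), **systems.get(name, {})}


systems = parse_policy(POLICY)
for name in ["ASYS", "CSYS", "BSYS", "BEND", "PT01", "PT02", "HSYS", "BTST"]:
    print(name, effective(systems, name))
```

Under that assumption, all six failed systems resolve to ISOLATETIME(0), so
the policy itself asks for immediate isolation of each of them.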
                 

I will probably assemble all the documentation and open an ETR, but so
far I don't see anything wrong with the SFM policy.  Operators just
don't/can't sort all this out and respond fast enough when there is a
multi-system failure.  If we take an unexpected failure and SFM doesn't
handle it without operator intervention, it hurts.

Any ideas?  Anything you have done in this area to help speed resolution
of multi-system outages?  Is an outage this wide something that SFM
should be able to handle?

        Best Regards, 

                Sam Knutson, GEICO 
                Performance and Availability Management 
                mailto:[EMAIL PROTECTED] 
                (office)  301.986.3574 

Quantized Revision of Murphy's Law: Everything goes wrong all at once.

====================
This email/fax message is for the sole use of the intended
recipient(s) and may contain confidential and privileged information.
Any unauthorized review, use, disclosure or distribution of this
email/fax is prohibited. If you are not the intended recipient, please
destroy all paper and electronic copies of the original message.
