NTAC:3NS-20

Our SAN people said they saw no errors on the switch, which is dedicated to the mainframe (do I believe this? IDK). The problem was a bad cable between the switch and the storage, which both mainframes share, so there was an issue on one CHPID on each system.

In this forum with so many people who have so many opinions, I can't believe that nobody is offering suggestions for the RECOVERY parameter when asked.

The default is:

RECOVERY,PATH_SCOPE=DEVICE,PATH_INTERVAL=10,PATH_THRESHOLD=100

This is what caused our pain. z/OS has to see 100 errors per minute for 10 minutes before taking the path away from a single device. This was happening to every device.

On our test system I started with:

RECOVERY,PATH_SCOPE=CU,PATH_INTERVAL=5,PATH_THRESHOLD=50

With this setting, z/OS has to see 50 errors per minute for 5 minutes before taking the path away from every device in the LCU.

I was hoping to see some real-world examples that work for others.

Jim

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Alan (GMAIL) Watthey
Sent: Wednesday, December 28, 2016 11:55 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Recommendations for RECOVERY options

Jim,

As the network guy who looks after the SAN, I would not be expecting my z/OS guys to do anything in this situation. In fact, z/OS cannot see our whole SAN, as I have other things on it (e.g. ISLs, backend tapes). Fortunately, the Brocade switches are dedicated to the mainframes and devices used by the mainframes here.

The Brocade will see issues that it recovers from well before z/OS sees anything. Of course, I don't know exactly what your problem was and what your Brocade would have seen, but I'd suggest the bottleneckmon command to detect latency issues. Running the porterrshow command from time to time would also give you a good idea as to whether your SFPs and fibres are performing as well as they should.

If you have the Fabric Vision/Watch licenses then you can do fancy stuff like fencing errant ports before they impact anything.

Regards,
Alan Watthey

-----Original Message-----
From: James Peddycord [mailto:j...@ntrs.com]
Sent: 28 December 2016 4:56 pm
Subject: Recommendations for RECOVERY options

NTAC:3NS-20

We had a situation with a bad cable that resulted in a huge performance impact due to the default way that z/OS (we are at 1.13) handles error recovery on FICON paths. The symptoms were many (thousands of) IOS050I messages in the task's joblog, followed by an IOS450E message, which took the path offline to a single device. This was happening for every device (around 3000) that the affected path was attached to. As soon as I saw the messages I configured the CHPID offline and the problem stopped.

We have put in automation that will immediately configure a CHPID offline as soon as a single IOS450E message is detected, and now I am experimenting with RECOVERY options. IBM recommended that we set RECOVERY,PATH_SCOPE=CU, set the PATH_INTERVAL to 1 and leave PATH_THRESHOLD=10, and adjust from there. Due to the paperwork involved with making any change in our environment, I would like to implement this with a minimum of 'adjustment'.

Does anyone have any recommendations? We are running on z13s, 16G FICON through Brocade switches to IBM DS88xx DASD.

Thanks,
Jim

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
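
For anyone comparing the settings mentioned in this thread, here is a rough Python sketch of the arithmetic behind them. It takes Jim's reading at face value (PATH_THRESHOLD errors per minute, sustained for PATH_INTERVAL minutes, per device or per LCU depending on PATH_SCOPE); that reading is an assumption and has not been checked against the IBM documentation. The class and function names are made up for illustration and are not part of any z/OS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoverySetting:
    """One RECOVERY setting as quoted in the thread (hypothetical helper)."""
    path_scope: str      # "DEVICE" or "CU"
    path_interval: int   # minutes, per the thread's reading
    path_threshold: int  # errors per minute, per the thread's reading

    def errors_to_trip(self) -> int:
        # Minimum errors one unit (a device or an LCU) must see before
        # the path is taken away, assuming the thread's interpretation.
        return self.path_interval * self.path_threshold

    def statement(self) -> str:
        # Text of the setting in the form quoted in the thread.
        return (f"RECOVERY,PATH_SCOPE={self.path_scope},"
                f"PATH_INTERVAL={self.path_interval},"
                f"PATH_THRESHOLD={self.path_threshold}")


def errors_across_units(setting: RecoverySetting, units: int) -> int:
    """Minimum total errors if `units` devices (or LCUs) each trip separately."""
    return setting.errors_to_trip() * units


# The three settings mentioned in the thread.
DEFAULT = RecoverySetting("DEVICE", 10, 100)
TEST_SYSTEM = RecoverySetting("CU", 5, 50)
IBM_SUGGESTED = RecoverySetting("CU", 1, 10)

if __name__ == "__main__":
    for name, s in [("default", DEFAULT),
                    ("test system", TEST_SYSTEM),
                    ("IBM suggestion", IBM_SUGGESTED)]:
        unit = "device" if s.path_scope == "DEVICE" else "LCU"
        print(f"{name}: {s.statement()}")
        print(f"  at least {s.errors_to_trip()} errors over "
              f"{s.path_interval} min per {unit}")
    # Roughly 3000 devices were on the bad path in the original report.
    print("default, 3000 devices:", errors_across_units(DEFAULT, 3000))
```

Under these assumptions the default needs at least 1000 errors over 10 minutes for each device separately, which is consistent with the flood of IOS050I messages described above, while the CU-scoped settings act once per LCU after far fewer errors.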