NTAC:3NS-20

Our SAN people said they saw no errors on the switch (do I believe this?
IDK), which is dedicated to the mainframe. The problem was a bad cable
between the switch and the storage, which both mainframes share, so
there was an issue on one CHPID on each system.

In this forum with so many people who have so many opinions, I can't
believe that nobody is offering suggestions for the RECOVERY parameter
when asked.

The default is:
RECOVERY,PATH_SCOPE=DEVICE,PATH_INTERVAL=10,PATH_THRESHOLD=100
This is what caused our pain.
z/OS has to see 100 errors per minute for 10 minutes before taking the
path off of a single device. This was happening to every device.

On our test system I started with:
RECOVERY,PATH_SCOPE=CU,PATH_INTERVAL=5,PATH_THRESHOLD=50
z/OS has to see 50 errors per minute for 5 minutes before taking the
path off of every device in the LCU.
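
To make the difference between the two settings concrete, here is a rough
Python sketch of the rule as I described it above. It is only a model of my
reading of the parameters, not IBM's actual algorithm: it assumes the path is
removed once PATH_INTERVAL consecutive minutes have each seen at least
PATH_THRESHOLD errors. The function name and the error stream are made up
for illustration.

```python
from typing import Optional, Sequence


def minute_path_removed(errors_per_minute: Sequence[int],
                        path_interval: int,
                        path_threshold: int) -> Optional[int]:
    """Return the 1-based minute at which the path would be removed, or None.

    Assumed rule (taken from the description in this post, not from IBM
    documentation): removal happens once `path_interval` consecutive minutes
    have each seen at least `path_threshold` errors.
    """
    consecutive = 0
    for minute, errors in enumerate(errors_per_minute, start=1):
        # A minute below the threshold resets the run of bad minutes.
        consecutive = consecutive + 1 if errors >= path_threshold else 0
        if consecutive >= path_interval:
            return minute
    return None


# Hypothetical error stream: a steady 120 path errors per minute for 15 minutes.
steady = [120] * 15

default_minute = minute_path_removed(steady, path_interval=10, path_threshold=100)
test_minute = minute_path_removed(steady, path_interval=5, path_threshold=50)
print(default_minute, test_minute)
```

Under this model the scope matters as much as the timing: with
PATH_SCOPE=DEVICE the wait applies to each device separately, while with
PATH_SCOPE=CU one decision covers every device in the LCU.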

I was hoping to see some real world examples that work for others.

Jim

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On
Behalf Of Alan (GMAIL) Watthey
Sent: Wednesday, December 28, 2016 11:55 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Recommendations for RECOVERY options

Jim,

As the network guy who looks after the SAN I would not be expecting my
z/OS guys to do anything in this situation.  In fact, z/OS cannot see our
whole SAN as I have other things on it (e.g. ISLs, backend tapes).
Fortunately, the Brocade switches are dedicated to the mainframes and
devices used by the mainframes here.  The Brocade will see issues that
it recovers from well before z/OS sees anything.

Of course, I don't know exactly what your problem was and what your
Brocade would have seen but I'd suggest the bottleneckmon command to
detect latency issues.  Running the porterrshow command from time to
time would also give you a good idea as to whether your SFPs and fibres
are performing as well as they should.  If you have the Fabric
Vision/Watch licenses then you can do fancy stuff like fencing errant
ports before they impact anything.


Regards,
Alan Watthey
-----Original Message-----
From: James Peddycord [mailto:j...@ntrs.com]
Sent: 28 December 2016 4:56 pm
Subject: Recommendations for RECOVERY options

NTAC:3NS-20
We had a situation with a bad cable that resulted in a huge performance
impact due to the default way that z/OS (we are at 1.13) handles error
recovery on FICON paths.
The symptoms were many (thousands of) IOS050I messages in the task's
joblog, followed by an IOS450E message, which took the path offline to a
single device.
This was happening for every device (around 3000) that the affected path
was attached to.
As soon as I saw the messages I configured the CHPID offline and the
problem stopped.
We have put in automation that will immediately configure a CHPID
offline as soon as a single IOS450E message is detected, and now I am
experimenting with RECOVERY options.
IBM recommended setting RECOVERY,PATH_SCOPE=CU, setting PATH_INTERVAL to
1, leaving PATH_THRESHOLD=10, and adjusting from there.
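
The automation mentioned above can be sketched roughly as follows in Python.
This is not our actual automation product or rule. The message layout
("IOS450E dddd,cc,text" with the device number and CHPID in hex) is a
simplified assumption, the sample line is made up, and the function names
are hypothetical. The CF CHP(xx),OFFLINE command is the standard z/OS CONFIG
command.

```python
import re
from typing import Iterable, List, Optional

# Assumed, simplified message layout: "IOS450E dddd,cc,<text>", where dddd is
# the device number and cc the CHPID, both in hex. Check the real message
# layout on your system before relying on this pattern.
IOS450E_PATTERN = re.compile(r"\bIOS450E\s+([0-9A-Fa-f]{4,5}),([0-9A-Fa-f]{2}),")


def chpid_offline_command(message: str) -> Optional[str]:
    """Build a CONFIG command for the CHPID named in an IOS450E message."""
    match = IOS450E_PATTERN.search(message)
    if match is None:
        return None  # Not an IOS450E message in the assumed layout.
    chpid = match.group(2).upper()
    return f"CF CHP({chpid}),OFFLINE"


def commands_for_log(lines: Iterable[str]) -> List[str]:
    """Return one command per affected CHPID, however many devices report it."""
    seen = set()
    commands = []
    for line in lines:
        command = chpid_offline_command(line)
        if command is not None and command not in seen:
            seen.add(command)
            commands.append(command)
    return commands


# Hypothetical sample line, for illustration only.
sample = "IOS450E 1A2B,5C,NOT OPERATIONAL PATH TAKEN OFFLINE"
print(chpid_offline_command(sample))
```

The de-duplication matters here because the same CHPID showed up in a
message for each of roughly 3000 devices, and the command only needs to be
issued once per CHPID.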

Due to the paperwork involved with making any change in our environment,
I would like to implement this with a minimum of 'adjustment'.

Does anyone have any recommendations?
We are running on z13s, 16G FICON through Brocade switches to IBM DS88xx
DASD.

Thanks,
Jim


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions, send
email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
