Re: Long-running batch job (high CONN time)
We see this 'knee of the curve' issue in FICON environments all the time. I would investigate the level of concurrency on the channel paths during this period, and perhaps the overall disk subsystem throughput as well. FICON elongation is also known as connect elongation. Bear in mind that you need to look at this from the DSS perspective: the bottleneck is very likely on that side rather than in the zSeries I/O subsystem.

Best regards,
John Baker
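To put a rough number on the 'knee of the curve' effect: the sketch below is a minimal, idealized M/M/1-style illustration only (FICON connect elongation is not literally an M/M/1 queue, and the 0.3 ms service time and utilization levels are assumed values, not measurements from this job). It just shows how sharply response time climbs once a shared channel path or DSS port runs hot.

# Idealized M/M/1 illustration of the "knee of the curve":
# response time = service_time / (1 - utilization).
# Service time and utilization values are made-up examples,
# not measurements from the job discussed in this thread.

def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """Average response time for an M/M/1 queue at the given utilization."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

service_time_ms = 0.3  # assumed unloaded per-I/O service time in ms

for util in (0.50, 0.70, 0.85, 0.90, 0.95):
    rt = mm1_response_time(service_time_ms, util)
    print(f"utilization {util:4.0%}: response ~{rt:.2f} ms "
          f"({rt / service_time_ms:.1f}x the unloaded service time)")

Between 50% and 95% busy, the modeled response time goes from 2x to 20x the unloaded service time, which is why a modest increase in concurrent load on a shared path can multiply per-I/O connect time without anything looking obviously 'broken'.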
Re: Long-running batch job (high CONN time)
Not a lot to go on. For example, we don't even know how many files are involved. Assuming only one, how is it being accessed? Sequential? VSAM sequential? VSAM random? Reading? Writing? Some of both? Each of those would suggest a different attack vector.

The numbers suggest a 'knee of the curve' phenomenon, where only a slight increase in load results in a huge jump in clock time. This can also be described as a resource saturation event.

I'd shy away from system issues at first and focus on the most common application issues. If this is plain old QSAM, I'd make sure the block size is maxed out (half track) and ample buffers (hundreds) are specified. Tuning VSAM is more complex, but the attack is normally buffers and buffer management strategy (LSR, for example).

Lastly, a tiny (perhaps hard to observe) increase in I/O times can and will make a huge difference in clock time.
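To make those last two points concrete, here is a back-of-the-envelope sketch. All the inputs except the 982K EXCP count are assumptions for illustration (the per-I/O delta, the data set size, and the current block size are not known from this thread); 27,998 bytes is the usual half-track block size on a 3390.

# Back-of-the-envelope arithmetic: small per-I/O deltas times ~1M EXCPs.
# All inputs below except the EXCP count are assumptions, not measurements.

excps = 982_000            # EXCP count reported for the production step
delta_per_io_ms = 0.6      # assumed extra time per I/O (queue/connect growth)

extra_minutes = excps * delta_per_io_ms / 1000 / 60
print(f"{delta_per_io_ms} ms more per I/O over {excps:,} EXCPs "
      f"adds ~{extra_minutes:.1f} minutes of elapsed time")

# Effect of half-track blocking on a plain QSAM sequential file.
dataset_bytes = 4 * 1024**3   # assumed ~4 GB of sequential data
small_block = 4_096           # assumed current (poor) block size
half_track = 27_998           # half-track block size on a 3390

for blksize in (small_block, half_track):
    blocks = -(-dataset_bytes // blksize)   # ceiling division
    print(f"BLKSIZE {blksize:>6}: ~{blocks:,} blocks to read")

A 0.6 ms increase per I/O is invisible in most reports, yet over roughly a million EXCPs it adds about ten minutes of clock time; and half-track blocking cuts the block count by nearly 7x for the assumed 4 KB case. With ample buffers, QSAM can also chain several blocks per EXCP, so the EXCP count typically drops even further than the raw block count suggests.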
Re: Long-running batch job (high CONN time)
Hi,

It's difficult to advise from here, but in the past we have seen a more or less similar effect for some jobs, and found that VTOC indexing had been disabled on some volumes.

--
Miklos Szigetvari
Development Team
ISIS Information Systems GmbH
Long-running batch job (high CONN time)
Hi,

I'm dealing with one production job whose elapsed time has increased dramatically in the past month. Since I'm doing this remotely and am unable to collect the relevant data myself, I must rely on the customer to do that for me. That's not very convenient, so I have to do more 'theoretical' analysis, and I don't have the luxury of tools like STROBE.

Simply put, the volume of input data to the job has not changed much according to the customer, yet the elapsed time has been increasing over the month. The customer even ran the same job with a similar volume of input data on a sandbox system. The result is as follows (I chose only one step):

Production - Clock: 58.8 (minutes)  TCB: 2.75  SRB: .21  EXCP: 982K  CONN: 899K
Test       - Clock: 10.3            TCB: 1.98  SRB: .08  EXCP: 910K  CONN: 282K

Obviously the processor should not be the main factor, because the step is not CPU-intensive. For the same step, I believe the EXCP count is meaningful: the program did a similar amount of I/O on both the production system and the test system. Then why does CONN differ (899K vs 282K)?

From what I know, for the same amount of I/O, CONN can differ if FICON is used. FICON uses a switched transfer mechanism, and if too many users are sharing the channel path, CONN time will increase to transfer the same amount of data. (Another possibility is that too much I/O causes the storage subsystem to send back the data more slowly, which also increases CONN.)

So at first glance, my conclusion is that the job is spending most of its time doing I/O (high CONN). The amount of I/O is the same, but the system needs more time to process it. That's the cause of the elongation. As for why the system needs more time to process the same amount of I/O, I believe the most likely reason is that other I/O-heavy jobs are running on the system at that time.

Before digging deeply into the problem, I want to make sure the above conclusion is not wrong. I also tried RMF III, and it shows device delay as the primary delay for the job most of the time. However, WFL for the job is good: above 80%. So I don't think device delay alone would cause the job to run this slowly. Yes, sometimes it has delay, but most of the time it gets what it wants. High CONN does not mean high delay.

Johnny
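As a quick sanity check on the figures above (treating CONN simply as abstract connect-time units, since only the ratio between the two runs matters for this comparison), the connect time per I/O roughly triples between the test and production runs:

# Connect time per I/O from the figures posted above.
# The absolute unit of CONN is left abstract; only the ratio matters here.

runs = {
    "Production": {"conn": 899_000, "excp": 982_000, "clock_min": 58.8},
    "Test":       {"conn": 282_000, "excp": 910_000, "clock_min": 10.3},
}

for name, r in runs.items():
    per_io = r["conn"] / r["excp"]
    print(f"{name:<10} CONN/EXCP = {per_io:.3f} units per I/O "
          f"(clock {r['clock_min']} min)")

ratio = ((runs["Production"]["conn"] / runs["Production"]["excp"])
         / (runs["Test"]["conn"] / runs["Test"]["excp"]))
print(f"Production spends ~{ratio:.1f}x more connect time per I/O than Test")

Roughly the same number of I/Os, but each one holds the channel about three times as long on production, which is consistent with contention on the shared channel paths or in the disk subsystem rather than with the job itself doing more work.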