Re: Long-running batch job (high CONN time)

2008-11-05 Thread John Baker
We see this 'knee of the curve' issue in FICON environments all the time.

I would investigate the level of concurrency on the channel paths during this
period, and perhaps the overall disk subsystem throughput as well.  FICON
elongation is also known as connect elongation.  Bear in mind that you need to
look at this from the disk subsystem (DSS) perspective: the bottleneck is very
likely on that side rather than in the zSeries I/O subsystem.

Best regards,

John Baker

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html



Re: Long-running batch job (high CONN time)

2008-10-01 Thread Hal Merritt
Not a lot to go on. For example, we don't even know how many files are
involved. Assuming only one, how is it being accessed? Sequential?
VSAM sequential? VSAM random? Reading? Writing? Some of both? Each of
those would suggest a different line of attack.

The numbers suggest a 'knee of the curve' phenomenon, where only a slight
increase in load results in a huge jump in clock time. This can also
be described as a resource saturation event.
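
To make the 'knee of the curve' concrete, here is a minimal sketch using the
textbook single-server queueing formula R = S / (1 - rho). This M/M/1 model is
an assumption chosen for illustration; a real channel or disk subsystem is more
complicated, and the 1 ms service time is a hypothetical value.

```python
def response_time(service_time_ms: float, utilization: float) -> float:
    """Return the M/M/1 response time R = S / (1 - rho).

    service_time_ms: time to serve one request with no queueing.
    utilization: fraction of time the resource is busy, 0 <= rho < 1.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_ms / (1.0 - utilization)


if __name__ == "__main__":
    service_ms = 1.0  # hypothetical service time, for illustration only
    for rho in (0.50, 0.70, 0.80, 0.90, 0.95):
        r = response_time(service_ms, rho)
        print(f"{rho:.0%} busy -> {r:5.1f} ms per request")
```

In this model, going from 90% to 95% busy doubles the response time, which is
the kind of jump a small increase in load can cause near saturation.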

I'd shy away from system issues at first and focus on the most common
application issues. 

If this is plain old QSAM, then I'd make sure the block size is maxed
(half track) and ample buffers (hundreds) are specified.
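
As a small illustration of 'half track' blocking, the sketch below computes the
largest block size that is a whole multiple of the record length and still
fits twice on a track. It assumes 3390 geometry (27998 bytes per half-track
block) and fixed-length blocked records; the function name is hypothetical.

```python
HALF_TRACK_3390 = 27998  # largest block size that fits twice on a 3390 track


def half_track_blksize(lrecl: int, limit: int = HALF_TRACK_3390) -> int:
    """Largest multiple of lrecl that does not exceed limit (RECFM=FB)."""
    if lrecl <= 0 or lrecl > limit:
        raise ValueError("lrecl must be between 1 and the half-track limit")
    return (limit // lrecl) * lrecl


if __name__ == "__main__":
    for lrecl in (80, 133, 1000):
        print(f"LRECL={lrecl:5d} -> BLKSIZE={half_track_blksize(lrecl)}")
```

For 80-byte records this gives 27920, the value commonly seen on FB/80 data
sets.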

Tuning VSAM is more complex, but the attack is normally buffers and
buffer management strategy (LSR, for example).  

Lastly, a tiny (perhaps hard to observe) increase in I/O times can and
will make huge differences in clock time.   
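
A rough back-of-the-envelope sketch of this point, using the production EXCP
count reported in this thread (982K). It assumes each EXCP is one I/O and that
the job waits for every I/O with no overlap; both are simplifications, and the
function names are hypothetical.

```python
def added_minutes(io_count: int, extra_ms_per_io: float) -> float:
    """Extra elapsed minutes if every I/O takes extra_ms_per_io longer."""
    return io_count * extra_ms_per_io / 1000.0 / 60.0


def extra_ms_per_io(gap_minutes: float, io_count: int) -> float:
    """Per-I/O slowdown (ms) that would account for gap_minutes of elapsed time."""
    if io_count <= 0:
        raise ValueError("io_count must be positive")
    return gap_minutes * 60.0 * 1000.0 / io_count


if __name__ == "__main__":
    excp = 982_000  # production EXCP count from the original post
    print(f"+1 ms per I/O adds {added_minutes(excp, 1.0):.1f} minutes")
    gap = 58.8 - 10.3  # production clock minus test clock, in minutes
    print(f"{gap:.1f} min gap needs about {extra_ms_per_io(gap, excp):.2f} ms per I/O")
```

Under these assumptions, about 3 ms of additional time per I/O would be enough
to account for the whole difference between the two runs.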







Re: Long-running batch job (high CONN time)

2008-10-01 Thread Miklos Szigetvari

Hi

It is difficult to advise from here. In the past we saw more or less the
same effect for some jobs, and found that VTOC indexing was disabled on
some of the volumes.


--
Miklos Szigetvari

Development Team
ISIS Information Systems Gmbh 



Long-running batch job (high CONN time)

2008-10-01 Thread Johnny Luo
Hi,

I'm dealing with a production job whose elapsed time has increased
dramatically over the past month. Since I'm working remotely and cannot
collect the relevant data myself, I must rely on the customer to do that for
me. That is not very convenient, so I have to do more 'theoretical' analysis.
I also don't have the luxury of tools like STROBE.

Simply put, the volume of input data to the job has not changed much,
according to the customer. However, the elapsed time has been increasing
over the month. The customer even ran the same job with a similar volume
of input data on a sandbox system. The results are as follows (I have
picked only one step):

              Production   Test system
Clock (min)   58.8         10.3
TCB           2.75         1.98
SRB           .21          .08
EXCP          982K         910K
CONN          899K         282K
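
To compare the two runs on an equal footing, the figures above can be
normalized per EXCP. The sketch below is only arithmetic on the posted numbers.
It assumes TCB and SRB are in minutes like the clock time (the post does not
say), and it treats CONN as a unit-free ratio because its unit is not stated.
The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepStats:
    clock_min: float  # elapsed time in minutes
    tcb: float        # TCB CPU time (assumed minutes)
    srb: float        # SRB CPU time (assumed minutes)
    excp_k: float     # EXCP count, in thousands
    conn_k: float     # connect time, in thousands of an unstated unit

    @property
    def conn_per_excp(self) -> float:
        """Connect time per EXCP, in the (unstated) CONN unit."""
        return self.conn_k / self.excp_k

    @property
    def non_cpu_min(self) -> float:
        """Elapsed minutes not accounted for by TCB + SRB time."""
        return self.clock_min - (self.tcb + self.srb)


PROD = StepStats(clock_min=58.8, tcb=2.75, srb=0.21, excp_k=982, conn_k=899)
TEST = StepStats(clock_min=10.3, tcb=1.98, srb=0.08, excp_k=910, conn_k=282)

if __name__ == "__main__":
    conn_ratio = PROD.conn_per_excp / TEST.conn_per_excp
    clock_ratio = PROD.clock_min / TEST.clock_min
    print(f"CONN per EXCP, production vs test: {conn_ratio:.2f}x")
    print(f"Elapsed time, production vs test:  {clock_ratio:.2f}x")
    print(f"Non-CPU minutes: prod {PROD.non_cpu_min:.2f}, test {TEST.non_cpu_min:.2f}")
```

On these figures, connect time per EXCP is roughly three times higher in
production, while elapsed time is about 5.7 times higher.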


Obviously the processor should not be the main factor, because the step is
not CPU-intensive. For the same step, I believe the EXCP count is meaningful:
the program did a similar amount of I/O on both the production system and the
test system.

Then why does CONN differ? (899K vs 282K)

From what I know, CONN can differ for the same amount of I/O if FICON is
used. FICON uses a 'switched transfer mechanism', and if too many users
are sharing the channel path, CONN time increases for the same amount of
data transferred. (Another possibility is that too much I/O causes the
storage subsystem to send the data packets back slowly, which also
increases CONN.)

So at first glance, my conclusion is that the job is spending most of its
time doing I/O (high CONN). The amount of I/O is the same, but the system
needs more time to process it. That is the cause of the elongation.

As for why the system needs more time to process the same amount of I/O, I
believe the most likely reason is that other I/O-heavy jobs are running on
the system at that time.

Before digging deeper into the problem, I want to make sure that the above
conclusion is not wrong.

I also tried RMF III, and it shows device delay as the primary delay for the
job most of the time. However, WFL for the job is good: above 80%. So I don't
think device delay is what makes the job run so slowly. Yes, it is sometimes
delayed, but most of the time it gets what it wants. High CONN does not
mean high delay.
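
For reference, a minimal sketch of how a workflow percentage of this kind is
usually derived: using samples divided by using plus delay samples. This
reflects my understanding of the RMF Monitor III definition and should be
checked against the RMF documentation; the sample counts and the function name
are hypothetical.

```python
def workflow_pct(using_samples: int, delay_samples: int) -> float:
    """Workflow % = using / (using + delay) * 100."""
    total = using_samples + delay_samples
    if total <= 0:
        raise ValueError("need at least one using or delay sample")
    return 100.0 * using_samples / total


if __name__ == "__main__":
    # Hypothetical counts: the job was found using a resource in 85 samples
    # and delayed in 15 samples.
    print(f"WFL = {workflow_pct(85, 15):.0f}%")
```

If time spent connected to a device is counted as 'using' rather than 'delay',
a job can show a high WFL and a long elapsed time at once, which fits the
observation that high CONN does not mean high delay.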

 Johnny
