Bill, I found out in my PMR why things are worse after the MQ upgrade.
In MQ 5.3 CSD 11 and earlier, the looksAlive/isAlive check that the
cluster uses to determine the status of the QM would wait indefinitely
for a response. At MQ 5.3 CSD12 and later, it would wait only 10
seconds.
 
APAR IC52378 lets us set a system variable that tells the health check
how long to wait before timing out, and it cuts an FDC with a new,
specific Probe ID if the failover occurs because of a timeout. They
still say the underlying problem is whatever makes the health check
hang, and they suspect high CPU, but that's not the case for us. We are
going to look for heavy I/O as a possible culprit, based on your
experience.
 
Anyway, we are going to roll this APAR out and set the timeout to 300000
(5 minutes). If it waited forever in 5.3 CSD11 I can't see how 5 minutes
is a bad thing. We tested in the lab by setting the timeout variable and
then killing amqzmuc0.exe. The QM failed over immediately; it did not
wait 5 minutes. They say that variable only comes into play if the
health check (which checks every 5 seconds) is not responding.
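
The lab result above can be pictured with a small toy model. This is
only an illustrative sketch of the behavior described in this message,
not IBM's implementation: the function name, the Verdict values and the
sample response times are all made up for the example. It shows why
killing the process causes an immediate failover while the timeout only
matters when the check is slow or hung.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    FAIL_DEAD = "fail over: process gone"
    FAIL_TIMEOUT = "fail over: check timed out"


def health_verdict(process_alive, response_ms, timeout_ms):
    """Toy model of a single health-check poll (hypothetical, for illustration).

    process_alive: whether the monitored process still exists.
    response_ms:   how long the check took to answer, or None if it hung.
    timeout_ms:    the configured wait, e.g. 300000 for 5 minutes.
    """
    if not process_alive:
        # A dead process is noticed at once; the timeout is never consulted.
        return Verdict.FAIL_DEAD
    if response_ms is None or response_ms > timeout_ms:
        # Only a slow or hung check runs into the timeout.
        return Verdict.FAIL_TIMEOUT
    return Verdict.HEALTHY


if __name__ == "__main__":
    five_minutes_ms = 5 * 60 * 1000  # 300000, the value we plan to set
    # Killed process: fails regardless of the timeout setting.
    print(health_verdict(False, None, five_minutes_ms))
    # A 45-second stall survives a 5-minute timeout...
    print(health_verdict(True, 45_000, five_minutes_ms))
    # ...but not a 10-second one.
    print(health_verdict(True, 45_000, 10_000))
```

In this model a longer timeout never delays detection of a process that
is really gone; it only tolerates longer stalls in the check itself.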
 

Peter 

 

________________________________

From: MQSeries List [mailto:[EMAIL PROTECTED] On
Behalf Of Conklin, William
Sent: Friday, October 12, 2007 9:20 AM
To: [email protected]
Subject: Re: Spontaneous Qm failovers in MSCS



Peter,

I used the trace command "strmqtrc -l 5", which wraps the trace logs
once they reach 5 MB. I kicked this off via a Windows scheduled task
from 9:55 to 10:15; I could do that because my failures were consistent
during that time frame. I have also used the following link to kick off
a trace that looks for a particular string, in this case a Probe ID.
You may be able to modify it to search for whatever you want and then
stop the trace.
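
As a rough illustration of the "watch for a string, then stop the trace"
idea, here is a minimal sketch. It is not the IBM batch job from the
link below. The errors directory, the Probe ID and the stop command are
all assumptions that the caller has to supply; the only thing taken for
granted is that FDC files end in ".FDC".

```python
import glob
import os
import subprocess
import time


def fdc_files_with_probe(errors_dir, probe_id):
    """Return the *.FDC files in errors_dir whose text contains probe_id."""
    matches = []
    for path in sorted(glob.glob(os.path.join(errors_dir, "*.FDC"))):
        # errors="replace" keeps the scan going if a file has odd bytes.
        with open(path, "r", errors="replace") as fdc:
            if probe_id in fdc.read():
                matches.append(path)
    return matches


def watch_and_stop(errors_dir, probe_id, stop_cmd, poll_seconds=30):
    """Poll errors_dir until probe_id shows up in an FDC, then run stop_cmd.

    stop_cmd is whatever command ends tracing on your system, given as an
    argument list. It is an assumption supplied by the caller, not
    something this sketch knows about.
    """
    while True:
        matches = fdc_files_with_probe(errors_dir, probe_id)
        if matches:
            subprocess.run(stop_cmd, check=False)
            return matches
        time.sleep(poll_seconds)


# Hypothetical usage (the path and Probe ID are placeholders, not real values):
# watch_and_stop(r"C:\path\to\mq\errors", "ZZ999999", ["endmqtrc"])
```

Polling keeps the watcher simple, and because the trace is already
wrapping, stopping it within one poll interval of the FDC should still
leave the interesting records in the file.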

 

http://www-1.ibm.com/support/docview.wss?rs=171&context=SSFKSJ&context=SSWHKB&q1=Batch+job+to+stop+websphere+mq+trace+when+an+FDC+occurs&uid=swg21193147&loc=en_US&cs=utf-8&lang=en

 

I hope this helps...

Bill C.

 

________________________________

From: MQSeries List [mailto:[EMAIL PROTECTED] On
Behalf Of Potkay, Peter M (ISD, IT)
Sent: Wednesday, October 10, 2007 11:12 AM
To: [email protected]
Subject: Re: Spontaneous Qm failovers in MSCS

 

Bill,

Glad I'm not alone! It's very sporadic for me. Some clusters have never
popped. One has had 2 failures 2 days apart. Makes it very tough to
trace. What did you trace? strmqtrc -t all -t detail? Or did you just
trace something very specific?

 

 

I pushed back at the PMR with the following questions:

 

1.      Is the resource monitor process that invokes the routine
IsAlive/LooksAlive a lot more sensitive in MQ 6.0? 
2.      Was the internal timeout value in 5.3 a lot higher than the 10
seconds MQ 6.0 uses by default? 
3.      By applying this APAR will the FDC provide you information on
how long IsAlive/LooksAlive took if another failure occurs? 
4.      Is there any sort of looping trace we can put on that traces just
the resource monitor process? Something that will not affect performance
that we can leave running for a few weeks. 

 

Peter Potkay 
MQSeries Team Leader 
IBM Global Services - The Hartford Account 
Hartford email: [EMAIL PROTECTED]
IBM email: [EMAIL PROTECTED] 
Office Phone: 1-860-547-7906 
Cell Phone: 1-860-202-1375 
Pager: 1-800-203-3375 

 

 

________________________________

From: MQSeries List [mailto:[EMAIL PROTECTED] On
Behalf Of Conklin, William
Sent: Wednesday, October 10, 2007 9:28 AM
To: [email protected]
Subject: Re: Spontaneous Qm failovers in MSCS


Hi Peter,

We've seen similar instances of this with our Windows 2003 and
WebSphere 6.0 server running 6.0.2.1. We've been dealing with it for
about 5 months; we notice it when the backups run at ~10:00 PM, almost
like clockwork. The CPU is certainly not pegged, and according to the
several traces we sent IBM, it looks like there is an I/O performance
issue when we back up the C: drive with Legato. When we back up D and E,
which are SAN drives, we don't experience the intensive I/O spike. In
response to IBM's request I increased the AMQ_MSCS_TIMEOUT variable
from the default of 3000 ms to 120000 ms, and that didn't resolve the
problem. We could have increased it further, but we did not think it
would have any impact on the situation.

 

To prove our theory, we stopped the backups for a week and all the
failovers and timeouts stopped immediately. As soon as we started them
up again, the failures came back at 10:00 PM. Our plan is to move to a
64-bit Red Hat Linux environment and have all the disks SAN aware. Part
of the performance issue is that we also use the Broker, and it is
pegging the limits of the Windows OS, since it can only address ~3.5 gig
of memory.

 

Since this application is so critical, we simply stopped backing up the
system nightly and scheduled the backup once a week on Sunday morning,
and all is well. 

 

The strange thing about this is that the system was up for 2.5 years
without any of these problems.  Good luck!

 

Thanks 

Bill C.

 

 

 

________________________________

From: MQSeries List [mailto:[EMAIL PROTECTED] On
Behalf Of Potkay, Peter M (ISD, IT)
Sent: Monday, October 08, 2007 10:59 AM
To: [email protected]
Subject: Spontaneous Qm failovers in MSCS

 

Windows 2003 SP1 
MQ 6.0.2.1 + IC51904 + IC53266 
Microsoft hardware cluster. 

These 2 node clusters (a dozen of them) have been set up and running OK
for ~ 3 years. Since July I have upgraded them all to MQ 6.0.2.1. In that
time I have had 7 occurrences where the QM fails over to the other node
or just plain comes offline and doesn't failover. In some cases it was
weeks after the MQ upgrade. Some clusters have never had this problem.
One of them had it happen twice in 3 days 2 weeks after I upgraded, but
not again in the 5 weeks since. 

Looking in the QM error logs the first "error" is the message saying
the Repository Manager is ending, the same message you get when you ask
the QM to end.

We opened a case with MSFT and they said the MQ resource is reporting
the QM is not healthy (or not reporting at all), so the cluster
initiates a failover. They referred us to this "fix":

http://www-1.ibm.com/support/docview.wss?rs=171&context=SSFKSJ&context=SSWHKB&q1=ic52378&uid=swg1IC52378&loc=en_US&cs=utf-8&lang=en

"3) The timeout value for the looksAlive check has been made
customizable by setting the AMQ_MSCS_TIMEOUT environment variable to the
amount in milliseconds that the timeout should be. The cluster will need
to be stopped and restarted for this to take effect. Note changing this
value will impact how responsive the WMQ resource is to a 'real'
failure, and it should only be set as an interim measure whilst the real
problem is resolved. However, these changes only help reduce the
symptoms, and the real problem is the source of why the check was taking
so long. For example, in this instance an external process hogging the
CPU and this needs to be remedied in tandem with applying this APAR."

 

But in all cases to the best of our knowledge the servers were barely
breathing as far as CPU is concerned. 

Has anyone dealt with this yet? 
If I apply IC52378, what would I set "AMQ_MSCS_TIMEOUT" to? And if there
is no indication that CPU was a problem, do I even go down this route?

Is there some sort of tracing I can put on specific to the MQ dll that
talks with MSCS that might indicate why these 2 components are not
conversing properly? I would need to leave the trace on for potentially
weeks so I'm looking for a trace specific to the dll that will have a
minimal effect on performance and whose log files can be set to wrap
every day or so.

Peter Potkay 
MQSeries Team Leader 
IBM Global Services - The Hartford Account 
Hartford email: [EMAIL PROTECTED]
IBM email: [EMAIL PROTECTED] 
Office Phone: 1-860-547-7906 
Cell Phone: 1-860-202-1375 
Pager: 1-800-203-3375 




 

________________________________

List Archive <http://listserv.meduniwien.ac.at/archives/mqser-l.html>  -
Manage Your List Settings
<http://listserv.meduniwien.ac.at/cgi-bin/wa?SUBED1=mqser-l&A=1>  -
Unsubscribe
<mailto:[EMAIL PROTECTED]&BODY=sign
off%20mqseries>  

Instructions for managing your mailing list subscription are provided in
the Listserv General Users Guide available at http://www.lsoft.com
<http://www.lsoft.com/resources/manuals.asp>  

 



