Bill,

I found out in my PMR why things are worse after the MQ upgrade. In MQ 5.3 CSD11 and earlier, the looksAlive/isAlive check that the cluster uses to determine the status of the QM would wait indefinitely for a response. At MQ 5.3 CSD12 and later, it waits only 10 seconds. APAR IC53278 allows us to set a system variable telling that health check how long to wait before timing out, and it also cuts an FDC with a new, specific Probe ID if the failover occurs because of a timeout.

They still say the underlying problem is why the health check hangs, and they suspect high CPU, but that's not the case for us. We are going to look for heavy I/O as a possible culprit, based on your experience.

Anyway, we are going to roll this APAR out and set the timeout to 300000 (5 minutes). If it waited forever in 5.3 CSD11, I can't see how 5 minutes is a bad thing. We tested in the lab by setting the timeout variable and then killing amqzmuc0.exe. The QM failed over immediately; it did not wait 5 minutes. They say that variable only comes into play if the health check (which checks every 5 seconds) is not responding.

Peter

________________________________
From: MQSeries List [mailto:[EMAIL PROTECTED] On Behalf Of Conklin, William
Sent: Friday, October 12, 2007 9:20 AM
To: [email protected]
Subject: Re: Spontaneous Qm failovers in MSCS

Peter,

I used the following trace command, "strmqtrc -l 5", which wraps the trace logs once they reach 5 MB. I kicked this off via a Windows scheduled task from 9:55 to 10:15; I could do that because my failures were consistent during this time frame.

I have also used the following link to kick off a trace that looks for a particular string, in this case a probe ID. You may be able to modify it to search for whatever you want and then stop the trace.

http://www-1.ibm.com/support/docview.wss?rs=171&context=SSFKSJ&context=SSWHKB&q1=Batch+job+to+stop+websphere+mq+trace+when+an+FDC+occurs&uid=swg21193147&loc=en_US&cs=utf-8&lang=en

I hope this helps...

Bill C.

________________________________
From: MQSeries List [mailto:[EMAIL PROTECTED] On Behalf Of Potkay, Peter M (ISD, IT)
Sent: Wednesday, October 10, 2007 11:12 AM
To: [email protected]
Subject: Re: Spontaneous Qm failovers in MSCS

Bill,

Glad I'm not alone! It's very sporadic for me. Some clusters have never popped. One has had 2 failures 2 days apart. That makes it very tough to trace.

What did you trace? strmqtrc -t all -t detail? Or did you just trace something very specific?

I pushed back at the PMR with the following questions:

1. Is the resource monitor process that invokes the routine IsAlive/LooksAlive a lot more sensitive in MQ 6.0?
2. Was the internal timeout value in 5.3 a lot higher than the 10 seconds MQ 6.0 uses by default?
3. By applying this APAR, will the FDC provide you information on how long IsAlive/LooksAlive took if another failure occurs?
4. Is there any sort of looping trace we can put on that traces just the resource monitor process? Something that will not affect performance and that we can leave running for a few weeks.

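A rough Python sketch of the idea behind the batch job Bill links to earlier in the thread: poll the MQ errors directory and stop the trace as soon as an FDC containing a given probe ID appears. This is not IBM's script. The directory, the probe ID, and every function name here are hypothetical, and it assumes `endmqtrc -a` is the appropriate way to end all tracing on your installation.

```python
import subprocess
import time
from pathlib import Path


def find_fdc_with_probe(errors_dir, probe_id):
    """Return the first *.FDC file in errors_dir whose text contains probe_id, else None."""
    for fdc in sorted(Path(errors_dir).glob("*.FDC")):
        try:
            text = fdc.read_text(errors="replace")
        except OSError:
            continue  # file may be locked by its writer; it is retried on the next poll
        if probe_id in text:
            return fdc
    return None


def stop_trace():
    # Assumption: ending all MQ tracing is what you want once the FDC is cut.
    subprocess.run(["endmqtrc", "-a"], check=True)


def watch_and_stop(errors_dir, probe_id, poll_seconds=5.0,
                   stop=stop_trace, sleep=time.sleep, max_polls=None):
    """Poll errors_dir until an FDC mentions probe_id, then call stop() once.

    Returns the matching file, or None if max_polls is exhausted first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        hit = find_fdc_with_probe(errors_dir, probe_id)
        if hit is not None:
            stop()
            return hit
        polls += 1
        sleep(poll_seconds)
    return None
```

Start the wrapping trace first (for example with "strmqtrc -l 5"), then call watch_and_stop() with your own errors directory and the probe ID you are waiting for. Because the trace wraps, it can be left running until the watcher fires.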
Peter Potkay
MQSeries Team Leader
IBM Global Services - The Hartford Account
Hartford email: [EMAIL PROTECTED]
IBM email: [EMAIL PROTECTED]
Office Phone: 1-860-547-7906
Cell Phone: 1-860-202-1375
Pager: 1-800-203-3375

________________________________
From: MQSeries List [mailto:[EMAIL PROTECTED] On Behalf Of Conklin, William
Sent: Wednesday, October 10, 2007 9:28 AM
To: [email protected]
Subject: Re: Spontaneous Qm failovers in MSCS

Hi Peter,

We've seen similar instances of this with our Windows 2003 and WebSphere 6.0 server running 6.0.2.1. We've been dealing with this for about 5 months. We notice it when the backups run at ~10:00 PM, almost like clockwork. The CPU is certainly not pegged, and according to the several traces we sent IBM, it looked like there is an I/O performance issue when we back up the C: drive with Legato. When we back up D and E, which are SAN drives, we don't experience the intensive I/O spike.

In response to IBM's request I increased the AMQ_MSCS_TIMEOUT variable from the default of 3000 ms to 120000 ms, and that didn't resolve the problem. We could have increased it further, but we did not think it would have any impact on the situation.

To prove our theory, we stopped the backups for a week and all the failovers and timeouts stopped immediately. As soon as we started the backups up again, the failures came back at 10:00 PM.

Our plan is to move to a 64-bit Red Hat Linux environment and have all the disks SAN aware. Part of the performance issue is that we also use the Broker, and it is pegging the limits of the Windows OS, since it can only address ~3.5 GB of memory.

Since this application is so critical, we simply stopped backing up the system on a nightly basis and scheduled the backup once a week on Sunday morning, and all is well. The strange thing about this is that the system was up for 2.5 years without any of these problems.

Good luck!

Thanks
Bill C.

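The timeout behaviour described in this thread (a dead queue manager process is reported at once, while the timeout only matters when the check gets no answer) can be pictured with a toy model. This is a minimal sketch and not the code in the MQ MSCS resource DLL: the `probe` callback, the `Health` enum, and the function name are all hypothetical, and only the millisecond timeout mirrors AMQ_MSCS_TIMEOUT.

```python
import time
from enum import Enum


class Health(Enum):
    ALIVE = "alive"
    FAILED = "failed"        # process is gone: reported immediately
    TIMED_OUT = "timed_out"  # no answer at all within the timeout


def check_queue_manager(probe, timeout_ms, poll_interval_s=0.05,
                        clock=time.monotonic, sleep=time.sleep):
    """Hypothetical model of one health check.

    probe() returns True (healthy), False (process dead) or None (no reply yet).
    """
    deadline = clock() + timeout_ms / 1000.0
    while True:
        answer = probe()
        if answer is True:
            return Health.ALIVE
        if answer is False:
            return Health.FAILED      # the timeout never comes into play
        if clock() >= deadline:
            return Health.TIMED_OUT   # only a hung check waits this long
        sleep(poll_interval_s)


if __name__ == "__main__":
    # A killed process fails the check at once, even with a 5 minute timeout.
    print(check_queue_manager(lambda: False, timeout_ms=300000))
```

In this model a larger timeout delays failover only when the check hangs, which matches Peter's lab test where killing amqzmuc0.exe failed over immediately.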
________________________________
From: MQSeries List [mailto:[EMAIL PROTECTED] On Behalf Of Potkay, Peter M (ISD, IT)
Sent: Monday, October 08, 2007 10:59 AM
To: [email protected]
Subject: Spontaneous Qm failovers in MSCS

Windows 2003 SP1
MQ 6.0.2.1 + IC51904 + IC53266
Microsoft hardware cluster

These 2-node clusters (a dozen of them) have been set up and running OK for ~3 years. Since July I have upgraded them all to MQ 6.0.2.1. In that time I have had 7 occurrences where the QM fails over to the other node, or just plain comes offline and doesn't fail over. In some cases it was weeks after the MQ upgrade. Some clusters have never had this problem. One of them had it happen twice in 3 days, 2 weeks after I upgraded, but not again in the 5 weeks since.

Looking in the QM error logs, the first "error" is the message saying the Repository Manager is ending, the same message you get when you ask the QM to end.

We opened a case with MSFT and they said the MQ resource is reporting that the QM is not healthy (or is not reporting at all), so the cluster initiates a failover. They referred us to this "fix":

http://www-1.ibm.com/support/docview.wss?rs=171&context=SSFKSJ&context=SSWHKB&q1=ic52378&uid=swg1IC52378&loc=en_US&cs=utf-8&lang=en

"3) The timeout value for the looksAlive check has been made customizable by setting the AMQ_MSCS_TIMEOUT environment variable to the amount in milliseconds that the timeout should be. The cluster will need to be stopped and restarted for this to take effect. Note changing this value will impact how responsive the WMQ resource is to a 'real' failure, and it should only be set as an interim measure whilst the real problem is resolved. However, these changes only help reduce the symptoms, and the real problem is the source of why the check was taking so long.
For example, in this instance an external process hogging the CPU and this needs to be remedied in tandem with applying this APAR."

But in all cases, to the best of our knowledge, the servers were barely breathing as far as CPU is concerned.

Has anyone dealt with this yet? If I apply IC52378, what would I set "AMQ_MSCS_TIMEOUT" to? And if there is no indication that CPU was a problem, do I even go down this route?

Is there some sort of tracing I can put on, specific to the MQ dll that talks with MSCS, that might indicate why these 2 components are not conversing properly? I would need to leave the trace on for potentially weeks, so I'm looking for a trace specific to the dll that will have a minimal effect on performance and whose log files can be set to wrap every day or so.

Peter Potkay

************************************************************************
This communication, including attachments, is for the exclusive use of addressee and may contain proprietary, confidential and/or privileged information. If you are not the intended recipient, any use, copying, disclosure, dissemination or distribution is strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return e-mail, delete this communication and destroy all copies.
************************************************************************

________________________________
List Archive <http://listserv.meduniwien.ac.at/archives/mqser-l.html> - Manage Your List Settings <http://listserv.meduniwien.ac.at/cgi-bin/wa?SUBED1=mqser-l&A=1> - Unsubscribe <mailto:[EMAIL PROTECTED]&BODY=sign off%20mqseries>

Instructions for managing your mailing list subscription are provided in the Listserv General Users Guide available at http://www.lsoft.com <http://www.lsoft.com/resources/manuals.asp>
