Re: AJP communication failures
Hi.

Thank you for all the very detailed information provided.

From what I can see in the logs, my impression at this point is that this is a problem buried fairly deep in the TCP/IP stack, and that both Apache+mod_proxy_ajp and Tomcat may just be suffering the consequences of an underlying TCP/IP issue (or of a Windows NLB feature).

In the logs, you have messages like:

java.net.SocketException: Software caused connection abort: socket write error

which is something that comes from the JVM running Tomcat (and probably even from native code in the JVM).

Similarly, messages in Apache httpd's logs like

[Tue May 29 15:29:43 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. : ajp_ilink_receive() can't receive header
[Tue May 29 15:29:43 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 15:29:43 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 10.11.102.223:9109 (10.11.102.223) failed

look to me like OS-level error conditions, just forwarded by Apache to the logs (at least the (OS 10060) prefix looks like a Windows error code).

I've read a bit about Windows NLB (just now, to find out what it is), and it seems to me that there is at least /a possibility/ that combining it with another kind of load balancing (as you do with mod_proxy_ajp) may not be the most stable configuration.

From the logs, it really looks as if both Apache and Tomcat occasionally find themselves with a suddenly non-existent connection, where ping packets are not being returned, and/or a read or write socket suddenly becomes unresponsive.

I know that you mentioned that these httpd/Tomcat connections are made on the respective hosts' private addresses, and I can see in the logs that the problems happen even on the host's local loopback address 127.0.0.1.
But on the other hand, setting up NLB seems to involve a common IP stack driver buried fairly deep in the protocol stack of each host (and affinity parameters), and who knows what that thing is doing, or not doing.

Just to give an idea - and I realise that this article may have no direct relevance whatsoever to the present issue - see:

http://support.microsoft.com/kb/905179

In this case, they are talking about the installation of some software package indirectly shortening the packet MTU, and this in turn causing problems with some webserver functions. Just to say that you may be faced with some deep issue like this, because of the NLB implementation.

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
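A note for readers working from the same kind of logs: the httpd error messages quoted in this thread can be tallied per backend and per hour, to see whether the failures cluster in time or favor one BalancerMember. The sketch below is one possible way to do that in Python. It assumes the Apache 2.2 error-log line format exactly as quoted in this thread; the function name count_dialog_failures and the log path given on the command line are illustrative and not part of any Apache or Tomcat tooling. For reference, OS 10060 is the Winsock WSAETIMEDOUT code.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
# [Tue May 29 14:43:59 2012] [error] (70007)The timeout specified has
# expired: proxy: dialog to 127.0.0.1:9109 (127.0.0.1) failed
DIALOG_FAILED = re.compile(
    r"^\[(?P<day>\w{3} \w{3} +\d{1,2}) "
    r"(?P<hour>\d{2}):\d{2}:\d{2} (?P<year>\d{4})\] "
    r"\[error\] .*proxy: dialog to (?P<backend>\S+) \(\S+\) failed"
)


def count_dialog_failures(lines):
    """Count 'dialog to <backend> failed' entries.

    Returns a Counter keyed by (backend, 'Tue May 29 2012', 'HH').
    Lines that do not match (favicon 404s and so on) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = DIALOG_FAILED.match(line.strip())
        if match:
            day = f"{match['day']} {match['year']}"
            counts[(match["backend"], day, match["hour"])] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python tally.py E:/Apache/logs/error.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (backend, day, hour), n in sorted(count_dialog_failures(log).items()):
            print(f"{day} {hour}h  {backend}  {n} failure(s)")
```

Only the final "dialog to ... failed" line of each three-line error group is counted, so each failed request is counted once.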
RE: AJP communication failures
Hello Warnier,

Setting disablereuse=On just made things worse, maybe due to the high frequency/quantity of opened connections. I'll look into the possibility of disabling the MS NLB.

Thanks,
Roney
Re: AJP communication failures
Roney Duilio Stein wrote:
> Hello Warnier,
>
> Setting disablereuse=On just made things worse, maybe due to the high frequency/quantity of opened connections. I'll look into the possibility of disabling the MS NLB.

According to my superficial reading of a couple of MS KB pages about the NLB, it should be relatively easy to take a host out of the NLB at least temporarily (e.g. for software updates). You may want to try that first, with your two hosts A/B. I would also - in a separate step if possible - completely disable the Firewall Service, just in case.

Anyway, whatever solves your problem, please report it here, so that someone else may profit from it by searching the list archives.
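To compare behavior before and after each of these steps (host taken out of the NLB, Firewall Service disabled), it may help to probe the AJP connectors directly, without httpd in between. The following is a minimal sketch, assuming Python 3 is available on the hosts and that the connectors answer a bare AJP13 CPing (no secret configured on the connector). The names ajp_ping and is_cpong are made up for illustration; the backend addresses in the usage comment are the ones from the httpd.conf posted in this thread.

```python
import socket
import sys

# AJP13 CPing request: magic 0x12 0x34, payload length 1, type 0x0A.
CPING = bytes([0x12, 0x34, 0x00, 0x01, 0x0A])
# Expected CPong reply: magic "AB", payload length 1, type 0x09.
CPONG = bytes([0x41, 0x42, 0x00, 0x01, 0x09])


def is_cpong(data):
    """True if data is exactly one AJP13 CPong packet."""
    return data == CPONG


def ajp_ping(host, port, timeout=5.0):
    """Send one CPing and wait for the CPong. Returns (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(CPING)
            reply = b""
            while len(reply) < len(CPONG):
                chunk = sock.recv(len(CPONG) - len(reply))
                if not chunk:  # peer closed the connection early
                    break
                reply += chunk
    except OSError as exc:  # covers refused, reset, aborted and timed out
        return False, f"{type(exc).__name__}: {exc}"
    if is_cpong(reply):
        return True, "CPong received"
    return False, f"unexpected reply: {reply!r}"


if __name__ == "__main__":
    # Hypothetical usage:
    #   python ajp_ping.py 10.11.102.224:9109 127.0.0.1:9109
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        ok, detail = ajp_ping(host, int(port))
        print(f"{target} {'OK' if ok else 'FAIL'} ({detail})")
```

Run from a scheduled task every minute or so, the FAIL lines would show whether the connector becomes unreachable even when httpd is not involved, which is the kind of evidence that would point at the TCP/IP stack or NLB rather than at mod_proxy_ajp.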
Re: AJP communication failures
On 30 May 2012 16:01, André Warnier a...@ice-sa.com wrote:
> Roney Duilio Stein wrote:
>> Hello Warnier,
>>
>> Setting disablereuse=On just made things worse, maybe due to the high frequency/quantity of opened connections. I'll look into the possibility of disabling the MS NLB.
>
> According to my superficial reading of a couple of MS KB pages about the NLB, it should be relatively easy to take a host out of the NLB at least temporarily (e.g. for software updates). You may want to try that first, with your two hosts A/B. I would also - in a separate step if possible - completely disable the Firewall Service, just in case.
>
> Anyway, whatever solves your problem, please report it here, so that someone else may profit from it by searching the list archives.

Hi,

If any network hardware (switches or firewalls) is involved, it needs to have multicast MAC enabled - often this is not enabled by default. Failing to do this can result in issues similar to those you have described.

Just something else to check :)

--
Best Regards,
Brett Delle Grazie
RE: AJP communication failures
I forgot to mention that these errors are infrequent, 6 per day on average. Usually everything works fine; sometimes these errors happen and put the worker in error state. However, system load is still very low.

Thanks,
Roney

-----Original Message-----
From: Roney Duilio Stein [mailto:roney.st...@sondait.com.br]
Sent: Tuesday, 29 May 2012 21:36
To: users@tomcat.apache.org
Subject: AJP communication failures

Hello.

I hope someone there can help me with this issue. I have been dealing with it for the past 2 weeks and can neither solve it completely nor locate the root cause.

I have an environment with 2 boxes load balanced with mod_proxy_ajp. Each box has 1 Apache HTTP and 1 Tomcat. To illustrate this:

Box A:
. Tomcat A (6.0.29, x64)
. Apache A (2.2.22)

Box B:
. Tomcat B (6.0.29, x64)
. Apache B (2.2.22)

Apache A and B have identical setups; each one balances to both Tomcat A and B. Also, there's a third box, Box C, with another application that is proxied by Apache A and Apache B.

Box C:
. Tomcat C (6.0.20, x86)

All boxes run Windows 2008 R2 x64 and have the Windows Firewall started but not enabled. Box A and Box B are part of a Windows NLB Domain, but all references in the proxy configuration are made using the hosts' private IP addresses. Users use the NLB IP address to connect to Apache.
The Apache proxy is configured like this:

== begin httpd.conf ==
Timeout 600
LimitRequestFieldSize 20480
ProxyIOBufferSize 21504

ProxyRequests Off
ProxyPreserveHost On

<Proxy *>
  Order deny,allow
  Allow from all
</Proxy>

<Proxy balancer://wlb>
  BalancerMember ajp://10.11.102.224:9109 route=wt1 loadfactor=50 max=85 ttl=120 retry=5 connectiontimeout=5000ms ping=5000ms
  BalancerMember ajp://127.0.0.1:9109 route=wt2 loadfactor=50 max=85 ttl=120 retry=5 connectiontimeout=5000ms ping=5000ms
</Proxy>

ProxyPass /app1 balancer://wlb/app1 stickysession=JSESSIONID nofailover=On
ProxyPass /app2 ajp://10.11.102.219:8009/app2
== end httpd.conf ==

The AJP connector on Tomcat A and Tomcat B is configured like this:

<Connector port="9109" protocol="AJP/1.3" redirectPort="8443" packetSize="22528" maxThreads="200" connectionTimeout="12" />

Tomcat C is configured like this:

<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />

The load is not high and there are only a few users using the applications. This is the production environment, and I could not find an operation that reproduces this behavior in a controlled environment.

The Box C application app2 shown above runs fine, without a single error message.

The timeout parameters for app1 (Tomcat A and B) were configured in an attempt to solve the problem shown here. When using the defaults (no connectiontimeout, no ping, no ttl, no retry), other communication failures were happening.

Now, the problem: the AJP communication between Apache A/B and Tomcat A/B is bad. The following can be seen in the Apache logs:

== begin apache log ==
[Tue May 29 14:43:59 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. : ajp_ilink_receive() can't receive header
[Tue May 29 14:43:59 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 14:43:59 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 127.0.0.1:9109 (127.0.0.1) failed
[Tue May 29 14:44:42 2012] [error] [client 10.45.7.78] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 14:45:08 2012] [error] [client 10.45.6.233] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 14:45:17 2012] [error] [client 10.45.6.100] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 14:45:39 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. : ajp_ilink_receive() can't receive header
[Tue May 29 14:45:39 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 14:45:39 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 127.0.0.1:9109 (127.0.0.1) failed
[Tue May 29 14:45:39 2012] [error] proxy: BALANCER: (balancer://wlb). All workers are in error state for route (wt1)
[Tue May 29 14:54:40 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. : ajp_ilink_receive() can't receive header
[Tue May 29 14:54:40 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 14:54:40 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 127.0.0.1:9109 (127.0.0.1) failed
[Tue May 29 15:05:15 2012] [error] [client 200.251.3.133] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 15:07:15 2012] [error] [client 10.45.6.54] File does not exist: