Re: AJP communication failures

2012-05-30 Thread André Warnier

Hi.

Thank you for all the very detailed information provided.

From what I can see in the logs, my impression at this point is that this problem is
buried fairly deep in the TCP/IP stack, and that both Apache+mod_proxy_ajp and Tomcat
may just be suffering the consequences of an underlying TCP/IP issue (or of a Windows
NLB feature).


In the logs, you have messages like:

java.net.SocketException: Software caused connection abort: socket write error

which is something that comes from the JVM running Tomcat (and probably even from native
code in the JVM).


Similarly, messages in Apache httpd's logs like

[Tue May 29 15:29:43 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  : ajp_ilink_receive() can't receive header

[Tue May 29 15:29:43 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 15:29:43 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 10.11.102.223:9109 (10.11.102.223) failed


look to me like OS-level error conditions, just forwarded by Apache to the logs (at least 
the (OS 10060) prefix looks like a Windows error code).
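To see how often, and against which backend, these failures occur, a minimal sketch that pulls the Windows error code (10060 is the Winsock code WSAETIMEDOUT) and the failing "host:port" out of httpd 2.2 error-log lines like the ones above. The regular expressions assume exactly the line format quoted in this thread; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts error codes and failed AJP backends from error-log lines. */
final class AjpLogScan {
    // Matches e.g. "proxy: dialog to 10.11.102.223:9109 (10.11.102.223) failed"
    private static final Pattern DIALOG_FAILED =
            Pattern.compile("proxy: dialog to (\\S+:\\d+) \\(.*?\\) failed");
    // Matches the Windows error prefix, e.g. "(OS 10060)"; does not match "(70007)"
    private static final Pattern OS_ERROR = Pattern.compile("\\(OS (\\d+)\\)");

    /** Returns the backend "host:port" of a failed-dialog line, or null if the line is not one. */
    static String failedBackend(String line) {
        Matcher m = DIALOG_FAILED.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the Windows OS error code found in a line, or -1 if there is none. */
    static int osErrorCode(String line) {
        Matcher m = OS_ERROR.matcher(line);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Counts failed dialogs per backend across all given lines (sorted by backend). */
    static Map<String, Integer> countFailures(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String backend = failedBackend(line);
            if (backend != null) {
                counts.merge(backend, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Feeding it a day's error log (one line per entry) would show whether the failures cluster on the loopback member, the remote member, or both.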


I've read a bit about Windows NLB (just now, to find out what it is), and it seems to me
that there is at least /a possibility/ that combining it with another kind of
load-balancing (as you do with mod_proxy_ajp) may not be the most stable configuration.
From the logs, it really looks as if both Apache and Tomcat occasionally find themselves
with a connection that has suddenly vanished, where ping packets are not being returned,
and/or a socket suddenly becomes unresponsive to reads or writes.
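The "ping packets" here are AJP13 CPing/CPong messages, which mod_proxy_ajp sends when a BalancerMember has the `ping` parameter set (as in the original post's configuration). A minimal sketch of a manual probe that could be run on each box against each Tomcat, assuming the AJP13 packet layout: web-server-to-container packets start with 0x12 0x34, container replies start with 'A' 'B', CPing is message type 10 and the CPong reply is type 9. The class name, and the host/port/timeout values in the usage note, are placeholders.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical manual CPing probe for a Tomcat AJP/1.3 connector. */
final class AjpPing {
    private static final byte CPING = 10;      // AJP13 CPing request type
    private static final byte CPONG_REPLY = 9; // AJP13 CPong reply type

    /** CPing packet: magic 0x1234, payload length 1, message type CPING. */
    static byte[] cpingPacket() {
        return new byte[] {0x12, 0x34, 0x00, 0x01, CPING};
    }

    /** True if the bytes form a CPong reply: "AB", payload length 1, type 9. */
    static boolean isCpong(byte[] reply) {
        return reply.length == 5
                && reply[0] == 'A' && reply[1] == 'B'
                && reply[2] == 0x00 && reply[3] == 0x01
                && reply[4] == CPONG_REPLY;
    }

    /**
     * Connects with a connect timeout, sends one CPing and waits for the CPong.
     * A java.net.SocketTimeoutException here plays a role similar to httpd's
     * "(70007)The timeout specified has expired".
     */
    static boolean ping(String host, int port, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // read timeout while waiting for the CPong
            OutputStream out = socket.getOutputStream();
            out.write(cpingPacket());
            out.flush();
            byte[] reply = new byte[5];
            new DataInputStream(socket.getInputStream()).readFully(reply);
            return isCpong(reply);
        }
    }
}
```

Usage, e.g. from a small loop run every few seconds: `AjpPing.ping("127.0.0.1", 9109, 5000)`. If the probe also times out occasionally, that points below httpd and Tomcat.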


I know that you mentioned that these httpd/Tomcat connections use the respective hosts'
private addresses, and I can see in the logs that the problems happen even on the host's
local loopback address 127.0.0.1. But on the other hand, setting up NLB seems to involve
a common IP stack driver buried fairly deep in the protocol stack of each host (plus
affinity parameters), and who knows what that thing is doing, or not doing.


Just to give an idea - and I realise that this article may have no direct relevance
whatsoever to the present issue - see: http://support.microsoft.com/kb/905179
In this case, they are talking about the installation of some software package indirectly
resulting in a shorter packet MTU, which in turn causes problems with some webserver
functions. I mention it only to say that you may be facing some deep issue like this,
because of the NLB implementation.



-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



RE: AJP communication failures

2012-05-30 Thread Roney Duilio Stein
Hello Warnier,

The disablereuse=On just made things worse, maybe due to the high
frequency/quantity of opened connections.
I'll look into the possibility of disabling the MS NLB.

Thanks,
Roney






Re: AJP communication failures

2012-05-30 Thread André Warnier

Roney Duilio Stein wrote:

> Hello Warnier,
>
> The disablereuse=On just made things worse, maybe due to the high
> frequency/quantity of opened connections.
> I'll look into the possibility of disabling the MS NLB.

According to my superficial reading of a couple of MS KB pages about NLB, it should be
relatively easy to take a host out of the NLB at least temporarily (e.g. for software
updates). You may want to try that first, with your two hosts A/B.
I would also - in a separate step if possible - completely disable the Firewall Service,
just in case.


Anyway, whatever solves your problem, please report it here, so that someone else may 
profit from it by searching the list archives.





Re: AJP communication failures

2012-05-30 Thread Brett Delle Grazie
On 30 May 2012 16:01, André Warnier <a...@ice-sa.com> wrote:
> Roney Duilio Stein wrote:
>>
>> Hello Warnier,
>>
>> The disablereuse=On just made things worse, maybe due to the high
>> frequency/quantity of opened connections.
>> I'll look into the possibility of disabling the MS NLB.
>
> According to my superficial reading of a couple of MS KB pages about
> NLB, it should be relatively easy to take a host out of the NLB at least
> temporarily (e.g. for software updates). You may want to try that first,
> with your two hosts A/B.
> I would also - in a separate step if possible - completely disable the
> Firewall Service, just in case.
>
> Anyway, whatever solves your problem, please report it here, so that someone
> else may profit from it by searching the list archives.

Hi,
If any network hardware (switches or firewalls) is involved, it needs to have
multicast MAC support enabled; often this is not enabled by default. Failing to
do this can result in issues similar to those you have described.
Just something else to check :)
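Some background, as commonly documented for Windows NLB (worth verifying against the cluster's actual settings): in multicast mode the cluster MAC address starts with 03-BF, in unicast mode with 02-BF. A MAC address is a group (multicast) address when the least significant bit of its first octet is set, which is why switches may flood or drop such frames unless configured for them. A small sketch of that bit test; the class name and sample addresses are hypothetical.

```java
/** Hypothetical helper: checks the I/G (individual/group) bit of a MAC address. */
final class MacBits {
    /** Parses "03-BF-0A-0B-66-DF" or "03:bf:0a:0b:66:df" into 6 bytes. */
    static byte[] parse(String mac) {
        String[] parts = mac.split("[-:]");
        if (parts.length != 6) {
            throw new IllegalArgumentException("expected 6 octets: " + mac);
        }
        byte[] out = new byte[6];
        for (int i = 0; i < 6; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    /** True if the least significant bit of the first octet is set (group/multicast address). */
    static boolean isMulticast(String mac) {
        return (parse(mac)[0] & 0x01) != 0;
    }
}
```

For example, `MacBits.isMulticast("03-BF-0A-0B-66-DF")` is true, while the unicast-mode form `02-BF-...` gives false.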







-- 
Best Regards,

Brett Delle Grazie




RE: AJP communication failures

2012-05-29 Thread Roney Duilio Stein
I forgot to mention that these errors are infrequent, 6 per day on average.
Usually everything works fine; sometimes these errors happen and put the
worker in error state.
System load, however, is still very low.

Thanks,
Roney


-----Original Message-----
From: Roney Duilio Stein [mailto:roney.st...@sondait.com.br] 
Sent: Tuesday, May 29, 2012 21:36
To: users@tomcat.apache.org
Subject: AJP communication failures

Hello.

I hope someone there can help me with this issue. I've been dealing with it for the
past 2 weeks and cannot solve it completely or locate the root cause.

I have an environment with 2 boxes load balanced with mod_proxy_ajp. Each box
has 1 Apache HTTP and 1 Tomcat. To illustrate this:

Box A:
 . Tomcat A (6.0.29, x64)
 . Apache A (2.2.22)

Box B:
 . Tomcat B (6.0.29, x64)
 . Apache B (2.2.22)

Apache A and B have identical setups; each one balances to both Tomcat A and B.

Also, there's a third box, Box C, running another application that is
proxied by Apache A and Apache B.
Box C:
 . Tomcat C (6.0.20, x86)

All boxes run Windows 2008 R2 x64 and have the Windows Firewall started but not
enabled.
Box A and Box B are part of a Windows NLB Domain, but all references in the
proxy configuration use the hosts' private IP addresses. Users use
the NLB IP address to connect to Apache.

Apache proxy is configured as follows:
== begin httpd.conf ==
Timeout 600
LimitRequestFieldSize 20480
ProxyIOBufferSize 21504
ProxyRequests Off
ProxyPreserveHost On
<Proxy *>
  Order deny,allow
  Allow from all
</Proxy>
<Proxy balancer://wlb>
  BalancerMember ajp://10.11.102.224:9109 route=wt1 loadfactor=50 max=85 ttl=120 retry=5 connectiontimeout=5000ms ping=5000ms
  BalancerMember ajp://127.0.0.1:9109 route=wt2 loadfactor=50 max=85 ttl=120 retry=5 connectiontimeout=5000ms ping=5000ms
</Proxy>
ProxyPass /app1 balancer://wlb/app1 stickysession=JSESSIONID nofailover=On
ProxyPass /app2 ajp://10.11.102.219:8009/app2
== end httpd.conf ==
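How `stickysession=JSESSIONID` with `nofailover=On` behaves, as I understand mod_proxy_balancer in 2.2: the route is the text after the first '.' of the session id (where Tomcat appends its jvmRoute) and is matched against each BalancerMember's `route=`; with `nofailover=On`, a request whose route points at a member in error state is refused instead of being sent to the other member. That is consistent with the "All workers are in error state for route (wt1)" entry in the log. A simplified model of that selection, not the actual httpd code; the class and method names are hypothetical.

```java
import java.util.Map;

/** Hypothetical, simplified model of sticky-route selection in a balancer. */
final class StickyRoute {
    /** Route suffix after the first '.' of a session id, or null if there is none. */
    static String routeOf(String sessionId) {
        int dot = sessionId.indexOf('.');
        return (dot >= 0 && dot + 1 < sessionId.length()) ? sessionId.substring(dot + 1) : null;
    }

    /**
     * Returns the member URL to use, or null meaning "request fails (e.g. 503)".
     * usable maps route name to member URL for members NOT in error state.
     * Assumes any route present in the session id belongs to some configured member.
     */
    static String pick(String sessionId, Map<String, String> usable, boolean nofailover) {
        String route = routeOf(sessionId);
        if (route != null && usable.containsKey(route)) {
            return usable.get(route); // sticky hit on a healthy member
        }
        if (route != null && nofailover) {
            return null; // sticky member is in error state and failover is disabled
        }
        // No route, or failover allowed: any usable member (real lbmethod logic omitted)
        return usable.isEmpty() ? null : usable.values().iterator().next();
    }
}
```

With only `wt2` usable, a session id ending in `.wt1` gets `null` when `nofailover` is true, and is moved to `wt2` (losing its session) when it is false.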

The AJP connectors of Tomcat A and Tomcat B are configured as follows:
<Connector port="9109" protocol="AJP/1.3" redirectPort="8443" packetSize="22528" maxThreads="200" connectionTimeout="12"/>

Tomcat C is configured as follows:
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />

The load is not high and only a few users are using the applications. This is
the production environment, and I have not been able to identify an operation that
reproduces this behavior in a controlled environment.
The Box C application app2 shown above runs fine, without a single error
message.

The timeout parameters for app1 (Tomcat A and B) were configured in an attempt
to solve the problem shown here. When using the defaults (no connectiontimeout,
no ping, no ttl, no retry), other communication failures were happening.
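For reference, the documented meaning of `retry=5` in mod_proxy: once a worker is put in error state, httpd sends it no requests until the retry timeout (in seconds) has elapsed, after which the worker is tried again. A minimal sketch of just that rule; the class and method names are hypothetical, and the real implementation keeps more state than this.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical model of mod_proxy's retry window for a worker in error state. */
final class RetryWindow {
    private final Duration retry;
    private Instant errorSince; // null while the worker is healthy

    RetryWindow(Duration retry) {
        this.retry = retry;
    }

    /** Called when a dialog with the backend fails (e.g. a ping or read timeout). */
    void markError(Instant now) {
        errorSince = now;
    }

    /** Healthy workers are usable; a worker in error becomes eligible again once the retry period has passed. */
    boolean usable(Instant now) {
        return errorSince == null || !now.isBefore(errorSince.plus(retry));
    }

    /** Called after a successful request to clear the error state. */
    void markOk() {
        errorSince = null;
    }
}
```

With `retry=5`, a member that failed at 14:45:39 would not be tried again before 14:45:44; combined with `nofailover=On`, sticky requests for that route fail during that window.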

Now, the problem: the AJP communication between Apache A/B and Tomcat A/B is
unreliable. The following can be seen in the Apache logs:

== begin apache log ==
[Tue May 29 14:43:59 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  : ajp_ilink_receive() can't receive header
[Tue May 29 14:43:59 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 14:43:59 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 127.0.0.1:9109 (127.0.0.1) failed
[Tue May 29 14:44:42 2012] [error] [client 10.45.7.78] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 14:45:08 2012] [error] [client 10.45.6.233] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 14:45:17 2012] [error] [client 10.45.6.100] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 14:45:39 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  : ajp_ilink_receive() can't receive header
[Tue May 29 14:45:39 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 14:45:39 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 127.0.0.1:9109 (127.0.0.1) failed
[Tue May 29 14:45:39 2012] [error] proxy: BALANCER: (balancer://wlb). All workers are in error state for route (wt1)
[Tue May 29 14:54:40 2012] [error] (OS 10060)A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.  : ajp_ilink_receive() can't receive header
[Tue May 29 14:54:40 2012] [error] ajp_read_header: ajp_ilink_receive failed
[Tue May 29 14:54:40 2012] [error] (70007)The timeout specified has expired: proxy: dialog to 127.0.0.1:9109 (127.0.0.1) failed
[Tue May 29 15:05:15 2012] [error] [client 200.251.3.133] File does not exist: E:/Apache/htdocs/favicon.ico
[Tue May 29 15:07:15 2012] [error] [client 10.45.6.54] File does not exist: