Hello all,
I am going to do my best to describe my problem.  Hopefully someone will
have some sort of insight.

Tomcat 7.0.41 (working on updating that)
Java 1.6 (Working on getting this updated to the latest minor release)
RHEL Linux

I inherited a multi-tenant setup.  Individual user accounts on the system
each have their own Tomcat instance, and each instance is started using
sysinit.  This is done to keep each website isolated under its own
permissions so one website can't interfere with another's data.

There are two load-balanced Apache proxies at the edge that point to one
Tomcat server (I know, I know, but again, I inherited this).

Apache sits in front of Tomcat to terminate SSL and uses AJP to
ProxyPass to each Tomcat instance based on the user's assigned port.

Things have run fine for years (so I am being told anyway) until recently.
Let me give an example of an outage.

User 1, user 2 and user 3 all use unique databases on a shared database
server, SQL server 10.

User 4 runs on a Windows JBoss server and also has a database on shared
database server 10.

Users 5-50 all run on the Linux server mentioned above using Tomcat and
have databases on various *other* shared database servers, which have
nothing to do with database server 10.

User 4 had a stored proc go wild on database server 10, basically knocking
it offline.

Now, one would expect sites 1-4 to experience an interruption of service
because they share a DBMS platform.  However:

Every single site goes down. I monitor the connections for each site with a
custom tool.  When this happens, the connections start stacking up across
all the components (from the proxies all the way through the stack).

Looking at the AJP connection pool threads for user 9 shows that the user
has exhausted them.  They are maxed out at 300, yet that user doesn't have
high activity at all. The CPU load, memory usage and traffic for everything
except SQL server 10 are stable during this outage.  The proxies consume
more and more memory the longer the outage lasts, but that's expected as
the connection counts stack up into the thousands.  After a short time,
every site's Apache / SSL termination layer starts throwing AJP timeout
errors.  Shortly after that, the edge proxies naturally start throwing
timeout errors of their own.

So far I am only watching user 9, using a tool that gives me insight into
what's going on through JMX metrics, but I suspect that once I get all the
others instrumented I will see the same thing: maxed-out AJP connection
pools.
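
For anyone who wants to look at the same numbers on their own instance,
here is a minimal sketch of that kind of JMX check.  It is not the tool I
use, just an illustration.  It assumes remote JMX is enabled on the Tomcat
instance, that the MBeans live in the default "Catalina" domain, and that
the connector pool exposes the currentThreadsBusy and maxThreads
attributes.  The class name AjpPoolCheck and the host/port arguments are
made up for the example.

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, written to compile on Java 6.
class AjpPoolCheck {

    // True when every worker thread in a connector pool is in use.
    // A non-positive max (e.g. when a shared executor is used) is ignored.
    static boolean isExhausted(int busy, int max) {
        return max > 0 && busy >= max;
    }

    // Builds one "name: busy/max" line per connector thread pool.
    // Assumption: MBeans are registered under the default "Catalina" domain.
    static String report(MBeanServerConnection conn) throws Exception {
        StringBuilder out = new StringBuilder();
        Set<ObjectName> pools =
            conn.queryNames(new ObjectName("Catalina:type=ThreadPool,*"), null);
        for (ObjectName pool : pools) {
            int busy = ((Number) conn.getAttribute(pool, "currentThreadsBusy")).intValue();
            int max = ((Number) conn.getAttribute(pool, "maxThreads")).intValue();
            out.append(pool.getKeyProperty("name"))
               .append(": ").append(busy).append('/').append(max)
               .append(isExhausted(busy, max) ? " EXHAUSTED" : "")
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: AjpPoolCheck <host> <jmx-port>");
            return;
        }
        // Host and port come from the command line; both are placeholders here.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + args[0] + ":" + args[1] + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            System.out.print(report(connector.getMBeanServerConnection()));
        } finally {
            connector.close();
        }
    }
}
```

Pointed at each user's instance in turn, something like this would show
whether the other JVMs hit their own 300-thread limit at the same moment
user 9 does.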

Aren't those supposed to be unique per user / JVM? Am I missing something
in the docs?

Any assistance from the Tomcat gods is much appreciated.


Thanks in advance.
TCD
