ActiveMQ slave appears to have brought our site down

bpardee Thu, 12 May 2011 08:53:22 -0700

Our site appears to have been brought down by our ActiveMQ slave last night.
Its CPU was running at 100%, so we theorize that our master ActiveMQ server
was hung trying to replicate to the slave and our web servers were hung
attempting to publish messages.  Web requests weren't releasing their db
connections, so eventually all db connections in the pool were used up.  I
killed the ActiveMQ slave and things immediately started working again.  I
wish I had done a binary dump first, but I was too busy panicking and trying
to get the site back up.

We have hyperic monitoring ActiveMQ, and it shows that the CPU on the slave
climbed steadily from its normal range of about 2% up to 100% over the
2 hours prior. See the attached image (image times are in EDT, while the log
times below are in UTC). Our web server threads and database connections
spiked when the CPU hit 100%.
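
Since the graph is in EDT and the logs are in UTC, lining the two up means shifting the log timestamps by four hours. A minimal sketch of that conversion, assuming the log4j-style timestamp format shown in the log excerpts below; the function name `log_time_to_edt` is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# EDT is UTC-4. A fixed offset is enough for one incident in May;
# this does not handle daylight-saving transitions.
EDT = timezone(timedelta(hours=-4), "EDT")


def log_time_to_edt(stamp: str) -> datetime:
    """Convert a UTC log timestamp such as '2011-05-12 02:58:40,468' to EDT."""
    utc = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=timezone.utc)
    return utc.astimezone(EDT)


if __name__ == "__main__":
    # The OutOfMemoryError timestamp from the master log.
    print(log_time_to_edt("2011-05-12 02:58:40,468").strftime("%Y-%m-%d %H:%M:%S %Z"))
```

For the 02:58:40 UTC entry in the master log this gives 22:58:40 EDT on the previous day, which is the time to look for on the graph.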

Our slave log shows many of the following types of messages. These messages
have been occurring for several
days but the frequency seems to have increased quite a bit in the hours
leading up to the problem time.

2011-05-11 17:10:30,732 | WARN | Duplicate message add attempt rejected.
Destination: intercept_responses, Message id:
ID:app1.mysite.com-33692-1305033332639-0:1:1:59006:1 |
org.apache.activemq.store.kahadb.MessageDatabase | VMTransport:
vm://localhost#1
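
To put a number on "the frequency seems to have increased", the warnings can be bucketed by hour. This is only a sketch: it assumes each entry sits on a single line in the actual log file (the excerpt above is wrapped by the mail client), and the names `ENTRY` and `duplicates_per_hour` are hypothetical.

```python
import re
from collections import Counter

# Matches the start of an entry such as:
# 2011-05-11 17:10:30,732 | WARN | Duplicate message add attempt rejected. ...
# Group 1 captures the date and hour ("2011-05-11 17").
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} \| WARN \| "
    r"Duplicate message add attempt rejected"
)


def duplicates_per_hour(lines):
    """Count 'Duplicate message add attempt rejected' warnings per hour (log time, UTC)."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line)
        if match:
            counts[match.group(1) + ":00"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2011-05-11 17:10:30,732 | WARN | Duplicate message add attempt rejected. "
        "Destination: intercept_responses, Message id: ...",
    ]
    for hour, count in sorted(duplicates_per_hour(sample).items()):
        print(hour, count)
```

Passing an open file object for the slave log instead of `sample` would produce one count per hour, which can then be compared against the CPU graph.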

The master log shows many of the following types of messages. We were
getting so many of these NPEs that the log reached its rotate size, rotated
several times, and purged the older files, so we only have info going back
to 1 hour prior to the event.

2011-05-12 01:17:32,109 | ERROR | Slave Failed |
org.apache.activemq.broker.ft.MasterBroker | ActiveMQ Broker[localhost]
Scheduler java.lang.NullPointerException

Our master log shows the following message at the same time as our site went
down. I'm not sure if this is
referring to itself or the slave? Maybe we're not dedicating enough heap
space but based on the CPU problems
above it would seem to just be a symptom? We haven't restarted the master
since the event and it seems to be running fine.

2011-05-12 02:58:40,468 | ERROR | Slave Failed |
org.apache.activemq.broker.ft.MasterBroker | ActiveMQ Transport:
tcp:///10.180.78.158:50957
java.lang.OutOfMemoryError: Java heap space
    at org.apache.activemq.openwire.OpenWireFormat.<init>(OpenWireFormat.java:60)
    at org.apache.activemq.openwire.OpenWireFormat.<init>(OpenWireFormat.java:66)
...

We've only been using ActiveMQ for a few weeks. Here are some of our
settings:

RHEL5
apache-activemq-5.4.2-fuse-03-09-bin.tar.gz
jre1.6.0_24
ACTIVEMQ_OPTS_MEMORY="-Xms256M -Xmx1024M
-Dorg.apache.activemq.UseDedicatedTaskRunner=false"
failover://(tcp://msg1:61616,tcp://msg2:61616)?randomize=false&timeout=60000&initialReconnectDelay=100&useExponentialBackOff=true

Process:
java -Xms256M -Xmx1024M -Dorg.apache.activemq.UseDedicatedTaskRunner=false
-Dorg.apache.activemq.UseDedicatedTaskRunner=true
-Djava.util.logging.config.file=logging.properties
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote
-Dactivemq.classpath=/opt/activemq/conf; -Dactivemq.home=/opt/activemq
-Dactivemq.base=/opt/activemq -jar /opt/activemq/bin/run.jar start

Our current thoughts are to increase the heap size, have hyperic send an
email if the CPU gets to 50%, and automatically restart the slave if it
gets to 90%.
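
The alerting rule described above amounts to a two-threshold check. A minimal sketch in plain Python follows; it is not Hyperic configuration, and the function name `action_for_cpu` and the returned labels are hypothetical:

```python
def action_for_cpu(cpu_percent: float,
                   warn_at: float = 50.0,
                   restart_at: float = 90.0) -> str:
    """Map a CPU reading (in percent) to the planned response.

    Thresholds default to the values from the post: email at 50%,
    restart at 90%.
    """
    if cpu_percent >= restart_at:
        return "restart"
    if cpu_percent >= warn_at:
        return "email"
    return "ok"


if __name__ == "__main__":
    for reading in (2.0, 50.0, 100.0):
        print(reading, action_for_cpu(reading))
```

With a normal load of about 2%, a single reading above 50% is already far outside the usual range, so this sketch acts on one sample. Requiring several consecutive readings above a threshold would be a reasonable variation if short spikes turn out to be common.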

So any thoughts on what we could do to diagnose the problem would be greatly
appreciated. We are currently checking into
commercial support options.

Thanks,
Brad
http://activemq.2283324.n4.nabble.com/file/n3517958/ActiveMQ_CPU.gif

--
View this message in context:
http://activemq.2283324.n4.nabble.com/ActiveMQ-slave-appears-to-have-brought-our-site-down-tp3517958p3517958.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
