Our site appears to have been brought down by our ActiveMQ slave last night. Its CPU was running at 100%, so we theorize that our master ActiveMQ server was hung trying to replicate to the slave and that our web servers were hung attempting to publish messages. Web requests weren't releasing their db connections, so eventually all db connections in the pool were used up. I killed the ActiveMQ slave and things immediately started working again. I wish I had done a binary dump first, but I was too busy panicking and trying to get the site back up.
We have Hyperic monitoring ActiveMQ, and it shows that the CPU on the slave climbed steadily from its normal range of about 2% up to 100% over the 2 hours prior. See the attached image (image times are in EDT, while the log times below are in UTC). Our web server threads and database connections spiked when the CPU hit 100%.

Our slave log shows many messages like the following. They have been occurring for several days, but their frequency seems to have increased quite a bit in the hours leading up to the problem:

2011-05-11 17:10:30,732 | WARN | Duplicate message add attempt rejected. Destination: intercept_responses, Message id: ID:app1.mysite.com-33692-1305033332639-0:1:1:59006:1 | org.apache.activemq.store.kahadb.MessageDatabase | VMTransport: vm://localhost#1

The master log shows many messages like the following. We were getting enough of these NPEs that the log reached its rotation size, rotated several times, and was purged, so we only have information going back to 1 hour before the event:

2011-05-12 01:17:32,109 | ERROR | Slave Failed | org.apache.activemq.broker.ft.MasterBroker | ActiveMQ Broker[localhost] Scheduler
java.lang.NullPointerException

Our master log shows the following message at the same time our site went down. I'm not sure whether it refers to the master itself or to the slave. Maybe we're not dedicating enough heap space, but given the CPU problems above, that would seem to be just a symptom. We haven't restarted the master since the event and it seems to be running fine.

2011-05-12 02:58:40,468 | ERROR | Slave Failed | org.apache.activemq.broker.ft.MasterBroker | ActiveMQ Transport: tcp:///10.180.78.158:50957
java.lang.OutOfMemoryError: Java heap space
        at org.apache.activemq.openwire.OpenWireFormat.<init>(OpenWireFormat.java:60)
        at org.apache.activemq.openwire.OpenWireFormat.<init>(OpenWireFormat.java:66)
        ...

We've only been using ActiveMQ for a few weeks.
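To put numbers on the "frequency seems to have increased" observation, here is a minimal Python sketch that counts the duplicate-message warnings per hour. It assumes every log entry starts with a timestamp and level in the same "date time,millis | LEVEL | message" layout as the WARN line quoted above; the function name count_per_hour and the regular expression are my own illustration, not anything shipped with ActiveMQ.

```python
import re
from collections import Counter

# Assumed layout, taken from the quoted WARN line:
#   2011-05-11 17:10:30,732 | WARN | Duplicate message add attempt rejected. ...
# Group 1 captures the date and hour, group 2 the level, group 3 the rest.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} \| (\w+)\s*\| (.*)$"
)


def count_per_hour(lines, needle="Duplicate message add attempt rejected"):
    """Return a Counter mapping 'YYYY-MM-DD HH' to the number of matching entries."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Lines without a leading timestamp (e.g. stack trace lines) are skipped.
        if match and needle in match.group(3):
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_per_hour and printing the sorted items gives one count per hour (in the log's own time zone, UTC here), which can then be lined up against the CPU graph. The log file location is whatever your install uses; I have not assumed a path.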
Here are some of our settings:

RHEL5
apache-activemq-5.4.2-fuse-03-09-bin.tar.gz
jre1.6.0_24
ACTIVEMQ_OPTS_MEMORY="-Xms256M -Xmx1024M -Dorg.apache.activemq.UseDedicatedTaskRunner=false"
failover://(tcp://msg1:61616,tcp://msg2:61616)?randomize=false&timeout=60000&initialReconnectDelay=100&useExponentialBackOff=true

Process:
java -Xms256M -Xmx1024M -Dorg.apache.activemq.UseDedicatedTaskRunner=false -Dorg.apache.activemq.UseDedicatedTaskRunner=true -Djava.util.logging.config.file=logging.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dactivemq.classpath=/opt/activemq/conf; -Dactivemq.home=/opt/activemq -Dactivemq.base=/opt/activemq -jar /opt/activemq/bin/run.jar start

Our current plan is to increase the heap size, have Hyperic send an email if the CPU gets to 50%, and trigger an automatic restart if it gets to 90%. Any thoughts on what we could do to diagnose the problem would be greatly appreciated. We are currently checking into commercial support options.

Thanks,
Brad

http://activemq.2283324.n4.nabble.com/file/n3517958/ActiveMQ_CPU.gif
