We are running CentOS 5.4, Chukwa 0.3.0, and java version "1.6.0_17", and are feeding a steady stream of data into our CDH3u3 Hadoop cluster. We have 6 Chukwa agent machines feeding 3 Chukwa collectors. Any time the cluster gets busy with a big job or with decommissioning a node, the Chukwa agent and collector start to back up and I start seeing "WaitingQueue - MemLimitQueue is full" messages in agent.log, as shown below. As soon as Hadoop cluster activity dies down, the MemLimitQueue messages go away and everything goes back to normal.

[root@COLL5 chukwa]# ps auxf | grep chukwa
root 11258 0.0 0.0 61172 732 pts/0 S+ 15:15 0:00 \_ grep chukwa
root 29248 1.2 2.1 415572 86928 ? Sl 04:03 8:04 /usr/java/default/bin/java -Xms32M -Xmx64M -DAPP=agent -Dlog4j.configuration=chukwa-log4j.properties -DCHUKWA_HOME=/usr/local/chukwa/bin/.. -DCHUKWA_CONF_DIR=/usr/local/chukwa/bin/../conf -DCHUKWA_LOG_DIR=/usr/local/chukwa/logs -classpath /usr/local/chukwa/bin/../conf::/usr/local/chukwa/bin/../chukwa-agent-0.3.0.jar:/usr/local/chukwa/bin/../chukwa-core-0.3.0.jar:/usr/local/chukwa/bin/../hadoopjars/hadoop-0.20.0-core.jar:/usr/local/chukwa/bin/../lib/NagiosAppender-1.5.0.jar:/usr/local/chukwa/bin/../lib/ant-1.7.1.jar:/usr/local/chukwa/bin/../lib/ant-launcher-1.7.1.jar:/usr/local/chukwa/bin/../lib/asm-3.1.jar:/usr/local/chukwa/bin/../lib/commons-beanutils-1.8.0.jar:/usr/local/chukwa/bin/../lib/commons-cli-2.0-SNAPSHOT.jar:/usr/local/chukwa/bin/../lib/commons-codec-1.3.jar:/usr/local/chukwa/bin/../lib/commons-collections-3.1.jar:/usr/local/chukwa/bin/../lib/commons-fileupload-1.2.jar:/usr/local/chukwa/bin/../lib/commons-httpclient-3.0.1.jar:/usr/local/chukwa/bin/../lib/commons-io-1.4.jar:/usr/local/chukwa/bin/../lib/commons-lang-2.4.jar:/usr/local/chukwa/bin/../lib/commons-logging-1.1.1.jar:/usr/local/chukwa/bin/../lib/commons-logging-api-1.0.4.jar:/usr/local/chukwa/bin/../lib/commons-net-1.4.1.jar:/usr/local/chukwa/bin/../lib/core-3.1.1.jar:/usr/local/chukwa/bin/../lib/ezmorph-1.0.6.jar:/usr/local/chukwa/bin/../lib/jchronic-0.2.3.jar:/usr/local/chukwa/bin/../lib/jersey-bundle-1.1.0-ea.jar:/usr/local/chukwa/bin/../lib/jetty-6.1.11.jar:/usr/local/chukwa/bin/../lib/jetty-util-6.1.11.jar:/usr/local/chukwa/bin/../lib/json-lib-2.2.3-jdk15.jar:/usr/local/chukwa/bin/../lib/json.jar:/usr/local/chukwa/bin/../lib/jsp-2.1-6.1.11.jar:/usr/local/chukwa/bin/../lib/jsp-api-2.1-6.1.11.jar:/usr/local/chukwa/bin/../lib/jsr311-api-1.0.jar:/usr/local/chukwa/bin/../lib/junit-3.8.1.jar:/usr/local/chukwa/bin/../lib/log4j-1.2.13.jar:/usr/local/chukwa/bin/../lib/mysql-connector-java-5.1.6.jar:/usr/local/chukwa/bin/../lib/prefuse.jar:/usr/local/chukwa/bin/../lib/servlet-api-2.5-6.1.11.jar org.apache.hadoop.chukwa.datacollection.agent.ChukwaAgent

agent.log
........
2012-11-10 14:56:14,470 INFO Timer-0 ChukwaAgent - writing checkpoint 7257
2012-11-10 14:56:18,655 INFO Timer-1 HttpConnector - # http chunks ACK'ed since last report: 547
2012-11-10 14:56:20,163 INFO HTTP post thread ChukwaHttpSender - >>>>>> HTTP Got success back from http://10.5.200.204:8080/chukwa; response length 832
2012-11-10 14:56:20,163 INFO HTTP post thread HttpConnector - sent 13 chunks, got back 13 acks
2012-11-10 14:56:20,163 INFO HTTP post thread ChukwaHttpSender - collected 13 chunks
*2012-11-10 14:56:20,163 INFO Thread-6 WaitingQueue - MemLimitQueue is full [8119214]*
2012-11-10 14:56:20,166 INFO HTTP post thread ChukwaHttpSender - >>>>>> HTTP post to http://10.5.200.204:8080/ length = 2286662
2012-11-10 14:56:24,474 INFO Timer-0 ChukwaAgent - writing checkpoint 7258
2012-11-10 14:56:27,293 INFO HTTP post thread ChukwaHttpSender - >>>>>> HTTP Got success back from http://10.5.200.204:8080/chukwa; response length 832
2012-11-10 14:56:27,294 INFO HTTP post thread HttpConnector - sent 13 chunks, got back 13 acks
2012-11-10 14:56:27,294 INFO HTTP post thread ChukwaHttpSender - collected 13 chunks
*2012-11-10 14:56:27,295 INFO Thread-6 WaitingQueue - MemLimitQueue is full [8091188]*
2012-11-10 14:56:27,302 INFO HTTP post thread ChukwaHttpSender - >>>>>> HTTP post to http://10.5.200.204:8080/ length = 2214008
2012-11-10 14:56:29,476 INFO Timer-0 ChukwaAgent - writing checkpoint 7259

Any ideas?
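
In case it helps narrow this down, here is a rough Python sketch for measuring how long each HTTP post to the collector takes, using only the line formats visible in the agent.log excerpt above. It pairs every "HTTP post to ... length = N" line with the next "HTTP Got success back" line, which assumes a single HTTP post thread per agent, as the excerpt suggests. The function names (parse_timestamp, post_round_trips) are made up for illustration and are not part of Chukwa. In the excerpt, the one complete pair is a 2286662-byte post that takes about 7.1 seconds.

```python
import re
from datetime import datetime

# Timestamp layout copied from the agent.log excerpt, e.g. "2012-11-10 14:56:20,166".
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")
# Message wording copied from the excerpt; other Chukwa versions may log differently.
POST_RE = re.compile(r"HTTP post to \S+ length = (\d+)")
ACK_MARKER = "HTTP Got success back"


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None if there is none."""
    match = TS_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), TS_FORMAT)


def post_round_trips(lines):
    """Return a list of (bytes_posted, seconds) for each post followed by a success line.

    Assumption: posts are not interleaved, so the next success line
    belongs to the most recent post.
    """
    results = []
    pending = None  # (start_time, bytes_posted) of the post awaiting a reply
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue
        post = POST_RE.search(line)
        if post:
            pending = (stamp, int(post.group(1)))
        elif ACK_MARKER in line and pending is not None:
            started, size = pending
            results.append((size, (stamp - started).total_seconds()))
            pending = None
    return results
```

Calling post_round_trips(open("agent.log")) on a busy period and on a quiet period and comparing the seconds per post should show whether the collector round trip is what slows down when the cluster is under load.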

--
Logan Hardy | Operations Engineer
33Across <http://www.33across.com/>
