We are using Storm 0.10 with the following configuration:
- 1 Nimbus node
- 6 Supervisor nodes, each with 2 worker slots and 8 cores
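For reference, each supervisor is set up with its two worker slots in storm.yaml, roughly like the following. This is only an illustrative sketch: the hostname is a placeholder and the ports are just the defaults, not necessarily our actual values.

    # storm.yaml on each supervisor node (illustrative sketch)
    nimbus.host: "nimbus.example.com"
    supervisor.slots.ports:
        - 6700
        - 6701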
Our topology has a KafkaSpout that forwards to a bolt where we transform the message and insert it into Cassandra. Our topic has 50 partitions, so we have configured the KafkaSpout with 50 executors/tasks. Our bolt has 150 executors/tasks. We have also added the storm-graphite metrics consumer (https://github.com/verisign/storm-graphite) to our topology so that Storm's metrics are sent to our Graphite cluster. (I've included a rough sketch of the wiring in a P.S. at the end of this message.)

Yesterday we were running a 2,000 tuple/sec load test and everything was fine for a few hours, until we noticed that we were no longer receiving metrics from Storm in Graphite. I verified that it's not a connectivity issue between Storm and Graphite. Looking in Storm UI, the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer bolt hadn't received a single tuple in the prior 10-minute or 3-hour window.

Since the metrics consumer bolt was assigned to one executor, I took thread dumps of that JVM. I saw the following stack trace for the metrics consumer thread:

"Thread-23-__metricscom.verisign.storm.metrics.GraphiteMetricsConsumer" #56 prio=5 os_prio=0 tid=0x00007fb4a13f1000 nid=0xe45 waiting on condition [0x00007fb3a7af9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000000a9ea23e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
        at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:83)
        at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:54)
        at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:102)
        at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
        at backtype.storm.daemon.executor$fn__5694$fn__5707$fn__5758.invoke(executor.clj:819)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

I also saw a number of threads stuck in _waitForFreeSlotAt_ on the disruptor queue; I'm not sure if that's an issue or not:

"user-timer" #33 daemon prio=10 os_prio=0 tid=0x00007fb4a1579800 nid=0xe1f runnable [0x00007fb445665000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
        at com.lmax.disruptor.AbstractMultithreadedClaimStrategy.waitForFreeSlotAt(AbstractMultithreadedClaimStrategy.java:99)
        at com.lmax.disruptor.AbstractMultithreadedClaimStrategy.incrementAndGet(AbstractMultithreadedClaimStrategy.java:49)
        at com.lmax.disruptor.Sequencer.next(Sequencer.java:127)
        at backtype.storm.utils.DisruptorQueue.publishDirect(DisruptorQueue.java:181)
        at backtype.storm.utils.DisruptorQueue.publish(DisruptorQueue.java:174)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:66)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:68)
        at backtype.storm.daemon.executor$setup_metrics_BANG_$fn__5544.invoke(executor.clj:295)
        at backtype.storm.timer$schedule_recurring$this__3721.invoke(timer.clj:102)
        at backtype.storm.timer$mk_timer$fn__3704$fn__3705.invoke(timer.clj:50)
        at backtype.storm.timer$mk_timer$fn__3704.invoke(timer.clj:42)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

"Thread-36-disruptor-executor[49 49]-send-queue" #69 prio=5 os_prio=0 tid=0x00007fb4a0c36800 nid=0xe5a runnable [0x00007fb3a6dec000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
        at com.lmax.disruptor.AbstractMultithreadedClaimStrategy.waitForFreeSlotAt(AbstractMultithreadedClaimStrategy.java:99)
        at com.lmax.disruptor.AbstractMultithreadedClaimStrategy.incrementAndGet(AbstractMultithreadedClaimStrategy.java:49)
        at com.lmax.disruptor.Sequencer.next(Sequencer.java:127)
        at backtype.storm.utils.DisruptorQueue.publishDirect(DisruptorQueue.java:181)
        at backtype.storm.utils.DisruptorQueue.publish(DisruptorQueue.java:174)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:66)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:68)
        at backtype.storm.daemon.worker$mk_transfer_fn$transfer_fn__6886.invoke(worker.clj:141)
        at backtype.storm.daemon.executor$start_batch_transfer__GT_worker_handler_BANG_$fn__5534.invoke(executor.clj:279)
        at backtype.storm.disruptor$clojure_handler$reify__5189.onEvent(disruptor.clj:58)
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:132)
        at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:106)
        at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
        at backtype.storm.disruptor$consume_loop_STAR_$fn__5202.invoke(disruptor.clj:94)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

What's also interesting is that at the same time the metrics stopped being emitted, we started seeing a lot of tuples failing at the spout. Looking at our average CPU usage graph below, you can see that all of the workers were engaged in doing work and then about half of them went idle: http://i.imgur.com/Y9kBDFD.png

In Storm UI, the bolt latency did not increase and there were no failures, so it seems like there was an issue getting tuples from the spout to the bolt.

Some more info about the state of the system before we stopped receiving metrics:
- The average execute/process latency for the bolt was ~18ms: http://i.imgur.com/BzmT1gz.png
- The average number of spout acks per second was ~2,400: http://i.imgur.com/DTywOyj.png
- The average spout lag was under 10 and the average spout latency was 50ms: http://i.imgur.com/omEy11t.png

There were no errors or warnings in any of the logs, and none of the workers were restarted during this time.

Any thoughts on what could be causing this and how to diagnose further?

Thank you

--
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
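P.S. In case it helps, here is roughly how the topology is wired and how the metrics consumer is registered. This is a simplified sketch against the Storm 0.10 / storm-kafka APIs, not our actual code: the component names, ZooKeeper quorum, topic/ids, and the TransformAndWriteBolt class are placeholders, the Cassandra write is elided, and the Graphite host/port settings for the metrics consumer are omitted.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class TopologySketch {

        // Placeholder for our real bolt: transforms the message and inserts it into Cassandra.
        public static class TransformAndWriteBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String message = tuple.getString(0);
                // ... transform `message` and insert it into Cassandra (details omitted) ...
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt -- nothing is emitted downstream
            }
        }

        public static void main(String[] args) throws Exception {
            // Kafka spout reading from the 50-partition topic (placeholder ZK quorum, topic and ids)
            ZkHosts zkHosts = new ZkHosts("zk1:2181,zk2:2181,zk3:2181");
            SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "our-topic", "/kafka-spout", "our-topology");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            // 50 spout executors -- one per Kafka partition
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 50);
            // 150 bolt executors that transform the message and write it to Cassandra
            builder.setBolt("transform-bolt", new TransformAndWriteBolt(), 150)
                   .shuffleGrouping("kafka-spout");

            Config conf = new Config();
            // Register the storm-graphite metrics consumer with a single executor, which matches the
            // single __metrics...GraphiteMetricsConsumer executor we see in Storm UI.
            // (Graphite host/port configuration is omitted from this sketch.)
            conf.registerMetricsConsumer(com.verisign.storm.metrics.GraphiteMetricsConsumer.class, 1);

            StormSubmitter.submitTopology("our-topology", conf, builder.createTopology());
        }
    }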