We are using Storm 0.10 with the following configuration:

   - 1 Nimbus node
   - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor has 8 cores.


Our topology has a KafkaSpout that forwards to a bolt where we transform
the message and insert it into Cassandra.  Our topic has 50 partitions, so
we have configured the KafkaSpout with 50 executors/tasks.  Our bolt has
150 executors/tasks.
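
For reference, the topology is wired up roughly like the sketch below.  The
component IDs, ZooKeeper/topic strings and the bolt class are placeholders
rather than our exact code; the parallelism hints are the real values (50
spout executors, 150 bolt executors, and 12 workers assuming we fill every
slot):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class CassandraIngestTopology {

    // Placeholder for our real bolt: transforms the Kafka message and
    // writes it to Cassandra.
    public static class TransformAndInsertBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // transform + Cassandra insert omitted
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, no output stream
        }
    }

    public static void main(String[] args) throws Exception {
        // Kafka spout reading a 50-partition topic, so 50 spout executors/tasks.
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("zk1:2181,zk2:2181,zk3:2181"),  // placeholder ZK quorum
                "events",                                   // placeholder topic name
                "/kafka-spout",                             // ZK root for offsets
                "cassandra-ingest");                        // consumer id

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 50);
        builder.setBolt("cassandra-writer", new TransformAndInsertBolt(), 150)
               .shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.setNumWorkers(12);  // assumes we use all 6 supervisors x 2 slots

        StormSubmitter.submitTopology("cassandra-ingest", conf, builder.createTopology());
    }
}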

We have also added the storm-graphite metrics consumer (
https://github.com/verisign/storm-graphite) to our topology so that Storm's
metrics are sent to our Graphite cluster.
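
For completeness, here is roughly how we register the consumer on the
topology config.  The Graphite host/port/prefix key names are from the
storm-graphite README as I remember them, so treat them as approximate; the
parallelism of 1 matches the single __metrics... executor we see in Storm UI:

import backtype.storm.Config;
import com.verisign.storm.metrics.GraphiteMetricsConsumer;

public class MetricsConfig {

    // Attaches the storm-graphite metrics consumer to a topology Config.
    static Config withGraphiteMetrics(Config conf) {
        // Reporter settings (key names per my recollection of the storm-graphite docs).
        conf.put("metrics.graphite.host", "graphite.example.com");  // placeholder host
        conf.put("metrics.graphite.port", "2003");
        conf.put("metrics.graphite.prefix", "storm.prod");          // placeholder prefix

        // One consumer executor, matching what we see in Storm UI.
        conf.registerMetricsConsumer(GraphiteMetricsConsumer.class, 1);
        return conf;
    }
}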

Yesterday we were running a 2000 tuple/sec load test and everything was
fine for a few hours until we noticed that we were no longer receiving
metrics from Storm in Graphite.

I verified that it's not a connectivity issue between Storm and
Graphite.  Looking in Storm UI,
the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer bolt hadn't
received a single tuple in the prior 10-minute or 3-hour window.

Since the metrics consumer bolt was assigned to one executor, I took thread
dumps of that JVM.  I saw the following stack trace for the metrics
consumer thread:

"Thread-23-__metricscom.verisign.storm.metrics.GraphiteMetricsConsumer" #56
prio=5 os_prio=0 tid=0x00007fb4a13f1000 nid=0xe45 waiting on condition
[0x00007fb3a7af9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000a9ea23e8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
        at
com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:83)
        at
com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:54)
        at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:102)
        at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
        at
backtype.storm.daemon.executor$fn__5694$fn__5707$fn__5758.invoke(executor.clj:819)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

I also saw a number of threads stuck in _waitForFreeSlotAt_ on the
disruptor queue; I'm not sure if that's an issue or not:

"user-timer" #33 daemon prio=10 os_prio=0 tid=0x00007fb4a1579800 nid=0xe1f
runnable [0x00007fb445665000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.waitForFreeSlotAt(AbstractMultithreadedClaimStrategy.java:99)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.incrementAndGet(AbstractMultithreadedClaimStrategy.java:49)
        at com.lmax.disruptor.Sequencer.next(Sequencer.java:127)
        at
backtype.storm.utils.DisruptorQueue.publishDirect(DisruptorQueue.java:181)
        at
backtype.storm.utils.DisruptorQueue.publish(DisruptorQueue.java:174)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:66)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:68)
        at
backtype.storm.daemon.executor$setup_metrics_BANG_$fn__5544.invoke(executor.clj:295)
        at
backtype.storm.timer$schedule_recurring$this__3721.invoke(timer.clj:102)
        at
backtype.storm.timer$mk_timer$fn__3704$fn__3705.invoke(timer.clj:50)
        at backtype.storm.timer$mk_timer$fn__3704.invoke(timer.clj:42)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

"Thread-36-disruptor-executor[49 49]-send-queue" #69 prio=5 os_prio=0
tid=0x00007fb4a0c36800 nid=0xe5a runnable [0x00007fb3a6dec000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.waitForFreeSlotAt(AbstractMultithreadedClaimStrategy.java:99)
        at
com.lmax.disruptor.AbstractMultithreadedClaimStrategy.incrementAndGet(AbstractMultithreadedClaimStrategy.java:49)
        at com.lmax.disruptor.Sequencer.next(Sequencer.java:127)
        at
backtype.storm.utils.DisruptorQueue.publishDirect(DisruptorQueue.java:181)
        at
backtype.storm.utils.DisruptorQueue.publish(DisruptorQueue.java:174)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:66)
        at backtype.storm.disruptor$publish.invoke(disruptor.clj:68)
        at
backtype.storm.daemon.worker$mk_transfer_fn$transfer_fn__6886.invoke(worker.clj:141)
        at
backtype.storm.daemon.executor$start_batch_transfer__GT_worker_handler_BANG_$fn__5534.invoke(executor.clj:279)
        at
backtype.storm.disruptor$clojure_handler$reify__5189.onEvent(disruptor.clj:58)
        at
backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:132)
        at
backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:106)
        at
backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
        at
backtype.storm.disruptor$consume_loop_STAR_$fn__5202.invoke(disruptor.clj:94)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

What's also interesting is that at the same time the metrics stopped being
emitted, we started seeing a lot of tuples failing at the spout.

Looking at our average CPU usage graph below, you can see that all of the
workers were busy doing work and then about half of them went idle:
http://i.imgur.com/Y9kBDFD.png

In Storm UI, the bolt latency did not increase and there were no failures
at the bolt, so it seems like there was an issue getting tuples from the
spout to the bolt.

Some more info about the state of the system before we stopped receiving
metrics:

The average execute / processing latency for the bolt was ~18ms:
http://i.imgur.com/BzmT1gz.png

The average number of spout acks / second was ~2400:
http://i.imgur.com/DTywOyj.png

The average spout lag was under 10 and the average spout latency was 50ms:
http://i.imgur.com/omEy11t.png

There were no errors or warnings in any of the logs, and none of the
workers were restarted during this time.

Any thoughts on what could be causing this and how to diagnose further?

Thank you

-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
