You might want to take a thread dump and verify whether some bolt is stuck somewhere.
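A minimal sketch of what to look for in such a dump. Operationally you would run `jstack <worker-pid>` against the worker JVM; the in-process equivalent below (class name `ThreadDumpSketch` is just illustrative) prints the same per-thread state and stack frames. A bolt genuinely stuck on an external call (e.g. a Cassandra insert) shows up as a thread repeatedly BLOCKED or WAITING in the same frame across several dumps taken a few seconds apart:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        // Same information jstack reports, obtained via the management API
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.printf("\"%s\" state=%s%n",
                    info.getThreadName(), info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```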
Excuse typos
On Apr 15, 2016 11:08 PM, "Kevin Conaway" <kevin.a.cona...@gmail.com> wrote:

> Was the bolt really "stuck", though, given that the failure was at the spout
> level (because the spout couldn't connect to the Kafka broker)?
>
> Additionally, we restarted the Kafka broker and the spout seemed to
> reconnect, but we never saw messages come through on the metrics consumer
> until we killed and restarted the topology.
>
> On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <abhishc...@gmail.com> wrote:
>
>> Kevin,
>> That would explain it. A stuck bolt will stall the whole topology. The
>> MetricsConsumer runs as a bolt, so it will be blocked as well.
>>
>> Excuse typos
>> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <kevin.a.cona...@gmail.com> wrote:
>>
>>> Two more data points on this:
>>>
>>> 1.) We are registering the Graphite MetricsConsumer on our topology
>>> Config, not globally in storm.yaml. I don't know if this makes a
>>> difference.
>>>
>>> 2.) We re-ran another test last night and it ran fine for about 6 hours
>>> until the Kafka brokers ran out of disk space (oops), which halted the
>>> test. This exact time also coincided with when the Graphite instance
>>> stopped receiving metrics from Storm. Given that we weren't processing
>>> any tuples while Kafka was down, I understand why we didn't get those
>>> metrics, but shouldn't the __system metrics (like heap size and GC time)
>>> still have been sent?
>>>
>>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <kevin.a.cona...@gmail.com> wrote:
>>>
>>>> Thank you for taking the time to respond.
>>>>
>>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>>>> track the latency of individual operations in the bolt). The metric
>>>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS,
>>>> which we have set to 60s.
>>>>
>>>> The topology did not hang completely, but it did degrade severely.
>>>> Without metrics it was hard to tell, but it looked like some of the
>>>> tasks for certain Kafka partitions either stopped emitting tuples or
>>>> never got acknowledgements for the tuples they did emit. Some tuples
>>>> were definitely making it through, though, because data was continuously
>>>> being inserted into Cassandra. After I killed and resubmitted the
>>>> topology, there were still messages left over in the topic, but only
>>>> for certain partitions.
>>>>
>>>> What queue configuration are you looking for?
>>>>
>>>> I don't believe the problem was that the Graphite metrics consumer
>>>> wasn't "keeping up". In Storm UI, the processing latency for that
>>>> pseudo-bolt was very low, as was its capacity. Storm UI simply showed
>>>> that no tuples were being delivered to the bolt.
>>>>
>>>> Thanks!
>>>>
>>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>
>>>>> Kevin,
>>>>>
>>>>> Do you register custom metrics? If so, how long (and how varied) are
>>>>> their intervals?
>>>>> Did your topology stop working completely? (I mean, did all tuples
>>>>> start failing after that time?)
>>>>> And could you share your queue configuration?
>>>>>
>>>>> You could also replace storm-graphite with the LoggingMetricsConsumer
>>>>> and see if that helps. If changing the consumer resolves the issue, we
>>>>> can guess that storm-graphite cannot keep up with the metrics.
>>>>>
>>>>> By the way, I'm addressing metrics consumer issues (asynchronous
>>>>> delivery, filtering). You can track the progress here:
>>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>>
>>>>> I'm afraid they may not be ported to 0.10.x, but the asynchronous
>>>>> metrics consumer bolt <https://issues.apache.org/jira/browse/STORM-1698>
>>>>> is a simple patch, so you could apply it, build a custom 0.10.0, and
>>>>> give it a try.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>>
>>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <ddebarbi...@norsys.fr> wrote:
>>>>>
>>>>>> Hi Kevin,
>>>>>>
>>>>>> I have a similar issue with Storm 0.9.6 (see the following topic:
>>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser).
>>>>>>
>>>>>> It is still open, so please keep me informed of your progress.
>>>>>>
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On 14/04/2016 at 15:54, Kevin Conaway wrote:
>>>>>>
>>>>>> We are using Storm 0.10 with the following configuration:
>>>>>>
>>>>>> - 1 Nimbus node
>>>>>> - 6 Supervisor nodes, each with 2 worker slots. Each supervisor
>>>>>>   has 8 cores.
>>>>>>
>>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>>> transform the message and insert it into Cassandra. Our topic has 50
>>>>>> partitions, so we have configured the number of executors/tasks for
>>>>>> the KafkaSpout to be 50. Our bolt has 150 executors/tasks.
>>>>>>
>>>>>> We have also added the storm-graphite metrics consumer
>>>>>> (https://github.com/verisign/storm-graphite) to our topology so that
>>>>>> Storm's metrics are sent to our Graphite cluster.
>>>>>>
>>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>>> was fine for a few hours until we noticed that we were no longer
>>>>>> receiving metrics from Storm in Graphite.
>>>>>>
>>>>>> I verified that it's not a connectivity issue between Storm and
>>>>>> Graphite. Looking in Storm UI, the
>>>>>> __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>>>> received a single tuple in the prior 10 minute or 3 hour window.
>>>>>>
>>>>>> Since the metrics consumer bolt was assigned to one executor, I took
>>>>>> thread dumps of that JVM.
>>>>>> I saw the following stack trace for the metrics consumer thread:
>>>>>>
>>>>
>>>> --
>>>> Kevin Conaway
>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>> https://github.com/kevinconaway
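For reference, the two registrations discussed in this thread (the per-topology Graphite consumer and a custom ReducedMetric) can be sketched roughly as follows against the 0.10.x (`backtype.storm`) API. This is a sketch, not the poster's actual code: the class name, metric name, and the 60 s interval are illustrative, and the snippet assumes the storm-graphite jar is on the classpath.

```java
import backtype.storm.Config;
import backtype.storm.metric.api.MeanReducer;
import backtype.storm.metric.api.ReducedMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.base.BaseRichBolt;
import java.util.Map;

public class MetricsSketch {

    // Per-topology registration (the approach described above), rather
    // than a global topology.metrics.consumer.register entry in storm.yaml.
    // Parallelism hint 1 = one consumer pseudo-bolt executor, matching the
    // single executor seen in Storm UI.
    static Config buildConfig() {
        Config conf = new Config();
        conf.registerMetricsConsumer(
                com.verisign.storm.metrics.GraphiteMetricsConsumer.class, 1);
        return conf;
    }

    // Inside a bolt's prepare(): a ReducedMetric tracking mean operation
    // latency, flushed on the same 60 s bucket as the built-in metrics
    // (TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS in this setup).
    public abstract static class InsertBolt extends BaseRichBolt {
        private transient ReducedMetric insertLatencyMs;

        @Override
        public void prepare(Map conf, TopologyContext context,
                            OutputCollector collector) {
            insertLatencyMs = context.registerMetric(
                    "cassandra-insert-latency-ms",
                    new ReducedMetric(new MeanReducer()), 60);
        }
    }
}
```

Note that a metrics consumer registered this way runs as a system bolt inside the topology, which is why a stalled topology can also stall metrics delivery, as discussed above.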