You might want to check the thread dump and verify whether some bolt is stuck
somewhere.

Excuse typos
On Apr 15, 2016 11:08 PM, "Kevin Conaway" <kevin.a.cona...@gmail.com> wrote:

> Was the bolt really "stuck" though given that the failure was at the spout
> level (because the spout couldn't connect to the Kafka broker)?
>
> Additionally, we restarted the Kafka broker and the spout seemed to
> reconnect, but we never saw messages come through on the metrics consumer
> until we killed and restarted the topology.
>
> On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <abhishc...@gmail.com>
> wrote:
>
>> Kevin,
>> That would explain it. A stuck bolt will stall the whole topology.
>> The MetricsConsumer runs as a bolt, so it will be blocked as well.
>>
>> Excuse typos
>> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <kevin.a.cona...@gmail.com>
>> wrote:
>>
>>> Two more data points on this:
>>>
>>> 1.) We are registering the Graphite MetricsConsumer on our Topology
>>> Config, not globally in storm.yaml (see the sketch after these two
>>> points).  I don't know if this makes a difference.
>>>
>>> 2.) We re-ran another test last night and it ran fine for about 6 hours
>>> until the Kafka brokers ran out of disk space (oops), which halted the
>>> test.  That time coincided exactly with when the Graphite instance
>>> stopped receiving metrics from Storm.  Given that we weren't processing
>>> any tuples while processing was halted, I understand why we didn't get
>>> those metrics, but shouldn't the __system metrics (like heap size and GC
>>> time) still have been sent?
>>>
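>>> For reference, the per-topology registration looks roughly like this (a
>>> sketch, not our exact code; the parallelism hint of 1 is illustrative):
>>>
>>>     import backtype.storm.Config;
>>>     import com.verisign.storm.metrics.GraphiteMetricsConsumer;
>>>
>>>     Config conf = new Config();
>>>     // Register the consumer on the topology Config instead of storm.yaml;
>>>     // with a hint of 1 a single executor handles the metrics stream.
>>>     conf.registerMetricsConsumer(GraphiteMetricsConsumer.class, 1);
>>>     // 60s bucket for built-in metrics, matching our custom metric interval.
>>>     conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 60);
>>>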
>>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <
>>> kevin.a.cona...@gmail.com> wrote:
>>>
>>>> Thank you for taking the time to respond.
>>>>
>>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric used
>>>> to track the latency of an individual operation in the bolt; see the
>>>> sketch below).  The metric interval for each is the same as
>>>> TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, which we have set to 60s.
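>>>>
>>>> Each one looks roughly like this (a sketch; the metric name and the
>>>> operation being timed are just placeholders, not our exact code):
>>>>
>>>>     import backtype.storm.metric.api.MeanReducer;
>>>>     import backtype.storm.metric.api.ReducedMetric;
>>>>
>>>>     private transient ReducedMetric cassandraWriteLatency;
>>>>
>>>>     @Override
>>>>     public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>>>>         // 60-second bucket, same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
>>>>         cassandraWriteLatency = context.registerMetric(
>>>>             "cassandra-write-latency", new ReducedMetric(new MeanReducer()), 60);
>>>>     }
>>>>
>>>>     // in execute(): cassandraWriteLatency.update(elapsedMillis);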
>>>>
>>>> The topology did not hang completely, but it did degrade severely.
>>>> Without metrics it was hard to tell, but it looked like some of the
>>>> tasks for certain Kafka partitions either stopped emitting tuples or
>>>> never got acknowledgements for the tuples they did emit.  Some tuples
>>>> were definitely making it through, though, because data was continuously
>>>> being inserted into Cassandra.  After I killed and resubmitted the
>>>> topology, there were still messages left over in the topic, but only for
>>>> certain partitions.
>>>>
>>>> What queue configuration are you looking for?
>>>>
>>>> I don't believe the issue was that the Graphite metrics consumer wasn't
>>>> "keeping up".  In Storm UI, the processing latency for that pseudo-bolt
>>>> was very low, as was its capacity.  Storm UI simply showed that no
>>>> tuples were being delivered to the bolt.
>>>>
>>>> Thanks!
>>>>
>>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <kabh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Kevin,
>>>>>
>>>>> Do you register custom metrics? If so, what are their intervals, and do
>>>>> they vary?
>>>>> Did your topology stop working completely? (I mean, did all tuples start
>>>>> failing after that time?)
>>>>> And could you share your queue configuration?
>>>>>
>>>>> You could also replace storm-graphite with the LoggingMetricsConsumer
>>>>> and see if that helps. If changing the consumer resolves the issue, we
>>>>> can guess that storm-graphite cannot keep up with the metrics.
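>>>>>
>>>>> For example, something like this (just a sketch; register it the same
>>>>> way you currently register the Graphite consumer):
>>>>>
>>>>>     import backtype.storm.metric.LoggingMetricsConsumer;
>>>>>
>>>>>     // Swap the Graphite consumer for the built-in logging consumer;
>>>>>     // it writes each metrics data point through the worker's logging config.
>>>>>     conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);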
>>>>>
>>>>> Btw, I'm addressing metrics consumer issues (asynchronous, filter).
>>>>> You can track the progress here:
>>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>>
>>>>> I'm afraid they may not be ported to 0.10.x, but the asynchronous
>>>>> metrics consumer bolt <https://issues.apache.org/jira/browse/STORM-1698>
>>>>> is a simple patch, so you could apply it to a custom build of 0.10.0 and
>>>>> give it a try.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>>
>>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <ddebarbi...@norsys.fr>
>>>>> wrote:
>>>>>
>>>>>> Hi Kevin,
>>>>>>
>>>>>> I have a similar issue with storm 0.9.6 (see the following topic
>>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>>>> ).
>>>>>>
>>>>>> It is still open. So, please, keep me informed on your progress.
>>>>>>
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On 14/04/2016 15:54, Kevin Conaway wrote:
>>>>>>
>>>>>> We are using Storm 0.10 with the following configuration:
>>>>>>
>>>>>>    - 1 Nimbus node
>>>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor
>>>>>>    has 8 cores.
>>>>>>
>>>>>>
>>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>>> transform the message and insert it into Cassandra.  Our topic has 50
>>>>>> partitions, so we have configured the number of executors/tasks for the
>>>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
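>>>>>>
>>>>>> The wiring looks roughly like this (a sketch, not our exact code; the
>>>>>> component ids, ZooKeeper connect string, bolt class name, and grouping
>>>>>> are illustrative):
>>>>>>
>>>>>>     import backtype.storm.topology.TopologyBuilder;
>>>>>>     import storm.kafka.KafkaSpout;
>>>>>>     import storm.kafka.SpoutConfig;
>>>>>>     import storm.kafka.ZkHosts;
>>>>>>
>>>>>>     SpoutConfig spoutConfig = new SpoutConfig(
>>>>>>         new ZkHosts("zk1:2181"), "our-topic", "/kafka-spout", "our-id");
>>>>>>
>>>>>>     TopologyBuilder builder = new TopologyBuilder();
>>>>>>     // 50 spout executors, one per topic partition
>>>>>>     builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 50);
>>>>>>     // 150 bolt executors transform each message and write it to Cassandra
>>>>>>     builder.setBolt("cassandra-writer", new TransformAndInsertBolt(), 150)
>>>>>>            .shuffleGrouping("kafka-spout");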
>>>>>>
>>>>>> We have also added the storm-graphite metrics consumer (
>>>>>> https://github.com/verisign/storm-graphite) to our topology so that
>>>>>> Storm's metrics are sent to our Graphite cluster.
>>>>>>
>>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>>> was fine for a few hours until we noticed that we were no longer
>>>>>> receiving metrics from Storm in Graphite.
>>>>>>
>>>>>> I verified that it's not a connectivity issue between Storm and
>>>>>> Graphite.  Looking in Storm UI,
>>>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer
>>>>>> pseudo-bolt hadn't received a single tuple in the prior 10-minute or
>>>>>> 3-hour window.
>>>>>>
>>>>>> Since the metrics consumer bolt was assigned to one executor, I took
>>>>>> thread dumps of that JVM.  I saw the following stack trace for the 
>>>>>> metrics
>>>>>> consumer thread:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Kevin Conaway
>>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>>> https://github.com/kevinconaway
>>>>
>>>
>>>
>>>
>>> --
>>> Kevin Conaway
>>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>>> https://github.com/kevinconaway
>>>
>>
>
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
>
