I tried to reproduce this locally and could not.  I stood up a local Storm
0.10 cluster and deployed my topology with the Graphite metrics consumer
configured.  After it was up and running, I killed my local Kafka broker to
observe what happened.  Although the Kafka spout tasks were printing
errors, metrics were still being sent to Graphite during that time.

However, what I did notice was that every time the metrics were sent, I saw
the following two messages in the worker logs:

*2016-04-16 17:37:39.758 b.s.m.n.Server [INFO] Getting metrics for server
on port 6701*

*2016-04-16 17:37:39.758 b.s.m.n.Client [INFO] Getting metrics for client
connection to Netty-Client-/192.168.1.11:6700*

I went back through the worker logs from our load testing cluster and
noticed that those log messages stopped being printed at the exact same
time the metrics stopped being reported to Graphite.  Both of those log
messages are logged in the implementation of *IStatefulObject.getState()*
(in *backtype.storm.messaging.netty.Server* and
*backtype.storm.messaging.netty.Client*), so whatever class is responsible
for invoking that method stopped working.  At first guess, that would
appear to be whatever process is responsible for collecting metrics via
*IMetric.getValueAndReset()*.
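
As a point of reference, here is a minimal sketch of that contract (the
class name and behavior are purely illustrative): whatever polls the metric
calls getValueAndReset() once per bucket interval and hands the result to
the registered metrics consumer, so if that caller stops running, so does
the reporting.

import backtype.storm.metric.api.IMetric;

// Illustrative only: tracks the most recent latency observed by a task.
public class LastLatencyMetric implements IMetric {
    private volatile long lastLatencyMs = 0;

    public void update(long latencyMs) {
        lastLatencyMs = latencyMs;
    }

    // Expected to be invoked once per metrics bucket; returns the current
    // value and resets the state for the next interval.
    @Override
    public Object getValueAndReset() {
        long value = lastLatencyMs;
        lastLatencyMs = 0;
        return value;
    }
}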

Does that provide any further insight into what happened?  I will keep
digging on my end.
Thanks,

Kevin

On Fri, Apr 15, 2016 at 2:17 PM, Kevin Conaway <kevin.a.cona...@gmail.com>
wrote:

> I took thread dumps of the worker where the Graphite consumer bolt
> executor was running, but I didn't see any BLOCKED threads or anything out
> of the ordinary.  This is the thread dump for the Graphite metrics consumer
> bolt:
>
> "Thread-23-__metricscom.verisign.storm.metrics.GraphiteMetricsConsumer"
> #56 prio=5 os_prio=0 tid=0x00007f0b8555c800 nid=0x9a2 waiting on condition
> [0x00007f0abaeed000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at
> backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713)
>         at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477)
>         at clojure.lang.AFn.run(AFn.java:22)
>         at java.lang.Thread.run(Thread.java:745)
>
> Would a "stuck" bolt on some other worker JVM have the same effect?
>
>
> On Fri, Apr 15, 2016 at 2:10 PM, Abhishek Agarwal <abhishc...@gmail.com>
> wrote:
>
>> You might want to check the thread dump and verify whether some bolt is
>> stuck somewhere.
>>
>> Excuse typos
>> On Apr 15, 2016 11:08 PM, "Kevin Conaway" <kevin.a.cona...@gmail.com>
>> wrote:
>>
>>> Was the bolt really "stuck" though given that the failure was at the
>>> spout level (because the spout couldn't connect to the Kafka broker)?
>>>
>>> Additionally, we restarted the Kafka broker and it seemed like the spout
>>> was able to reconnect, but we never saw messages come through on the metrics
>>> consumer until we killed and restarted the topology.
>>>
>>> On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <abhishc...@gmail.com>
>>> wrote:
>>>
>>>> Kevin,
>>>> That would explain it. A stuck bolt will stall the whole topology.
>>>> The MetricsConsumer runs as a bolt, so it will be blocked as well.
>>>>
>>>> Excuse typos
>>>> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <kevin.a.cona...@gmail.com>
>>>> wrote:
>>>>
>>>>> Two more data points on this:
>>>>>
>>>>> 1.) We are registering the Graphite MetricsConsumer on our Topology
>>>>> Config, not globally in storm.yaml (a simplified sketch of that
>>>>> registration is below).  I don't know if this makes a difference.
>>>>>
>>>>> 2.) We re-ran another test last night and it ran fine for about 6
>>>>> hours until the Kafka brokers ran out of disk space (oops), which halted
>>>>> the test.  This exact time also coincided with when the Graphite instance
>>>>> stopped receiving metrics from Storm.  Given that we weren't processing
>>>>> any tuples while Kafka was down, I understand why we didn't get those
>>>>> metrics, but shouldn't the __system metrics (like heap size, GC time)
>>>>> still have been sent?
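>>>>>
>>>>> Roughly how we are registering the consumer on the topology config (a
>>>>> simplified sketch; the actual Graphite host/port settings are omitted):
>>>>>
>>>>> import backtype.storm.Config;
>>>>> import com.verisign.storm.metrics.GraphiteMetricsConsumer;
>>>>>
>>>>> Config conf = new Config();
>>>>> // Registered on the topology Config rather than in storm.yaml; the
>>>>> // parallelism hint of 1 gives a single metrics consumer bolt instance.
>>>>> conf.registerMetricsConsumer(GraphiteMetricsConsumer.class, 1);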
>>>>>
>>>>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <
>>>>> kevin.a.cona...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for taking the time to respond.
>>>>>>
>>>>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>>>>>> track the latency of individual operations in the bolt).  The metric
>>>>>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS,
>>>>>> which we have set to 60s.
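>>>>>>
>>>>>> For reference, the registration looks roughly like this (the metric name
>>>>>> and variable names here are placeholders, not our real ones):
>>>>>>
>>>>>> import backtype.storm.metric.api.MeanReducer;
>>>>>> import backtype.storm.metric.api.ReducedMetric;
>>>>>>
>>>>>> // In the bolt's prepare(Map conf, TopologyContext context, OutputCollector collector):
>>>>>> ReducedMetric writeLatencyMs = new ReducedMetric(new MeanReducer());
>>>>>> // 60s bucket to match TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS.
>>>>>> context.registerMetric("cassandra-write-latency-ms", writeLatencyMs, 60);
>>>>>>
>>>>>> // Later, inside execute(), after timing the Cassandra write:
>>>>>> writeLatencyMs.update(elapsedMillis);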
>>>>>>
>>>>>> The topology did not hang completely but it did degrade severely.
>>>>>> Without metrics it was hard to tell, but it looked like some of the tasks
>>>>>> for certain Kafka partitions either stopped emitting tuples or never got
>>>>>> acknowledgements for the tuples they did emit.  Some tuples were definitely
>>>>>> making it through, though, because data was continuously being inserted
>>>>>> into Cassandra.  After I killed and resubmitted the topology, there were
>>>>>> still messages left over in the topic, but only for certain partitions.
>>>>>>
>>>>>> What queue configuration are you looking for?
>>>>>>
>>>>>> I don't believe the issue was that the Graphite metrics consumer
>>>>>> wasn't "keeping up".  In Storm UI, the processing latency was very low for
>>>>>> that pseudo-bolt, as was the capacity.  Storm UI just showed that no
>>>>>> tuples were being delivered to the bolt.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <kabh...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Kevin,
>>>>>>>
>>>>>>> Do you register custom metrics? If so, what are their intervals, and
>>>>>>> do they vary?
>>>>>>> Did your topology stop working completely? (I mean, did all tuples
>>>>>>> start failing after that time?)
>>>>>>> And could you share your queue configuration?
>>>>>>>
>>>>>>> You could also replace storm-graphite with LoggingMetricsConsumer and
>>>>>>> see if that helps.  If changing the consumer resolves the issue, we can
>>>>>>> guess that storm-graphite cannot keep up with the metrics.
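>>>>>>>
>>>>>>> For example, something along these lines on the topology config should
>>>>>>> be enough to try it (a sketch):
>>>>>>>
>>>>>>> import backtype.storm.Config;
>>>>>>> import backtype.storm.metric.LoggingMetricsConsumer;
>>>>>>>
>>>>>>> Config conf = new Config();
>>>>>>> // Swap in the built-in logging consumer; the same metrics are then
>>>>>>> // written to the worker's log files instead of being sent to Graphite.
>>>>>>> conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);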
>>>>>>>
>>>>>>> Btw, I'm addressing metrics consumer issues (asynchronous delivery,
>>>>>>> filtering).  You can track the progress here:
>>>>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>>>>
>>>>>>> I'm afraid they may not be ported to 0.10.x, but the asynchronous
>>>>>>> metrics consumer bolt
>>>>>>> <https://issues.apache.org/jira/browse/STORM-1698> is a simple
>>>>>>> patch, so you could apply it, build a custom 0.10.0, and give it a try.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <ddebarbi...@norsys.fr>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Kevin,
>>>>>>>>
>>>>>>>> I have a similar issue with Storm 0.9.6 (see the following thread:
>>>>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>>>>>> ).
>>>>>>>>
>>>>>>>> It is still open, so please keep me informed of your progress.
>>>>>>>>
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On 14/04/2016 at 15:54, Kevin Conaway wrote:
>>>>>>>>
>>>>>>>> We are using Storm 0.10 with the following configuration:
>>>>>>>>
>>>>>>>>    - 1 Nimbus node
>>>>>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each
>>>>>>>>    supervisor has 8 cores.
>>>>>>>>
>>>>>>>>
>>>>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>>>>> transform the message and insert it into Cassandra.  Our topic has 50
>>>>>>>> partitions, so we have configured the number of executors/tasks for the
>>>>>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
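>>>>>>>>
>>>>>>>> A simplified sketch of the wiring (host names, topic, component ids and
>>>>>>>> the bolt class are placeholders; the real SpoutConfig settings are omitted):
>>>>>>>>
>>>>>>>> import backtype.storm.topology.TopologyBuilder;
>>>>>>>> import storm.kafka.KafkaSpout;
>>>>>>>> import storm.kafka.SpoutConfig;
>>>>>>>> import storm.kafka.ZkHosts;
>>>>>>>>
>>>>>>>> SpoutConfig spoutConfig = new SpoutConfig(
>>>>>>>>         new ZkHosts("zk1:2181"), "our-topic", "/kafka-spout", "our-topology");
>>>>>>>>
>>>>>>>> TopologyBuilder builder = new TopologyBuilder();
>>>>>>>> // 50 spout executors, one per Kafka partition.
>>>>>>>> builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 50);
>>>>>>>> // 150 executors for the transform-and-insert bolt.
>>>>>>>> builder.setBolt("cassandra-bolt", new CassandraWriterBolt(), 150)
>>>>>>>>         .shuffleGrouping("kafka-spout");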
>>>>>>>>
>>>>>>>> We have also added the storm-graphite metrics consumer
>>>>>>>> (https://github.com/verisign/storm-graphite) to our topology so
>>>>>>>> that Storm's metrics are sent to our Graphite cluster.
>>>>>>>>
>>>>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>>>>> was fine for a few hours until we noticed that we were no longer
>>>>>>>> receiving metrics from Storm in Graphite.
>>>>>>>>
>>>>>>>> I verified that it's not a connectivity issue between Storm and
>>>>>>>> Graphite.  Looking in Storm UI,
>>>>>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>>>>>> received a single tuple in the prior 10-minute or 3-hour window.
>>>>>>>>
>>>>>>>> Since the metrics consumer bolt was assigned to one executor, I
>>>>>>>> took thread dumps of that JVM.  I saw the following stack trace for the
>>>>>>>> metrics consumer thread:
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>



-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
