Got to the bottom of the problem and thought I'd share, and maybe gain more
insight on why...
One of the bolts that used to run in around 35ms was taking 450+ms to
process a message.
After capturing some timing inside the bolt, the culprit turned out to be
collector.reportError():
it was taking 400+ms to report an error, and since the whole batch of
messages was going that route, the average time for this bolt was 450+ms.
Switching back to logging errors to a file instead of to the UI brought the
processing time back down to the 35ms range.
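
A minimal sketch of the kind of timing check described above (class and field
names are illustrative, not our actual bolt; the commented-out LOG.error line
is the log-to-file alternative):

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class TimedBolt extends BaseRichBolt {
    private static final Logger LOG = LoggerFactory.getLogger(TimedBolt.class);
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        long start = System.currentTimeMillis();
        try {
            // ... actual per-message work goes here (about 35ms on its own) ...
        } catch (Exception e) {
            long before = System.currentTimeMillis();
            // reportError() sends the error to the Storm UI; this call was the 400+ms hit.
            collector.reportError(e);
            // Log-to-file alternative that brought execute() back to the 35ms range:
            // LOG.error("failed to process tuple", e);
            LOG.warn("reportError took {} ms", System.currentTimeMillis() - before);
        }
        collector.ack(tuple);
        LOG.info("execute took {} ms", System.currentTimeMillis() - start);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this sketch declares no output streams
    }
}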
Could it be that running with 1 netty thread is causing this, or is reporting
an error in general a costly operation?

Thanks,
Tomas


On Wed, Jul 23, 2014 at 2:35 PM, Tomas Mazukna <tomas.mazu...@gmail.com>
wrote:

> No, it's on RHEL on real hardware...
> Although we went through a network upgrade a couple of days ago and all
> sorts of things started failing....
>
>
> On Wed, Jul 23, 2014 at 1:43 PM, Andrew Montalenti <and...@parsely.com>
> wrote:
>
>> Tomas,
>>
>> You don't happen to be running Ubuntu 14.04 on a Xen kernel, do you? E.g.
>> on Amazon EC2.
>>
>> I discovered an issue where running Storm across many workers on that OS
>> led to me hitting an annoying network driver bug that would cause timeouts
>> and topology freezes like you are seeing. Check dmesg for odd messages from
>> your network stack. Just a guess.
>>
>> (copied from my reply to another similar thread)
>>
>> -AM
>> On Jul 23, 2014 10:07 AM, "Tomas Mazukna" <tomas.mazu...@gmail.com>
>> wrote:
>>
>>> I am really puzzled why processing stopped in the topology.
>>> It looks like the acking threads all stopped communicating. The only hint
>>> I saw was this netty exception (below).
>>> Any hints on how to prevent this from happening again?
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Closing Netty Client
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9703
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Waiting for pending batchs to
>>> be sent with
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9703...,
>>> timeout: 600000ms, pendings: 0
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Closing Netty Client
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9700
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Waiting for pending batchs to
>>> be sent with
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9700...,
>>> timeout: 600000ms, pendings: 0
>>>
>>> 2014-07-23 08:56:03 b.s.util [ERROR] Async loop died!
>>>
>>> java.lang.RuntimeException: java.lang.RuntimeException: Client is being
>>> closed, and does not take requests any more
>>>
>>>         at
>>> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>>>
>>>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
>>>
>>> Caused by: java.lang.RuntimeException: Client is being closed, and does
>>> not take requests any more
>>>
>>>         at backtype.storm.messaging.netty.Client.send(Client.java:194)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         at
>>> backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125)
>>> ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>
>>>         ... 6 common frames omitted
>>>
>>> 2014-07-23 08:56:03 b.s.util [INFO] Halting process: ("Async loop
>>> died!")
>>>
>>>
>>> Configuration:
>>>
>>> worker.childopts: "-Xmx2048m -Xss256k -XX:MaxPermSize=256m
>>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>>> -XX:NewSize=128m -XX:CMSInitiatingOccupancyFraction=70
>>> -XX:-CMSConcurrentMTEnabled -Djava.net.preferIPv4Stack=true"
>>>
>>> supervisor.childopts: "-Xmx256m -Djava.net.preferIPv4Stack=true"
>>>
>>> nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
>>>
>>> ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
>>>
>>> nimbus.thrift.threads: 256
>>>
>>>
>>> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>
>>> storm.messaging.netty.server_worker_threads: 1
>>>
>>> storm.messaging.netty.client_worker_threads: 1
>>>
>>> storm.messaging.netty.buffer_size: 5242880
>>>
>>> storm.messaging.netty.max_retries: 100
>>>
>>> storm.messaging.netty.max_wait_ms: 1000
>>>
>>> storm.messaging.netty.min_wait_ms: 100
>>>
>>>
>>> Thanks,
>>> --
>>> Tomas Mazukna
>>> 678-557-3834
>>>
>>
>
>
> --
> Tomas Mazukna
> 678-557-3834
>



-- 
Tomas Mazukna
678-557-3834
