@varun - I still see workers waiting, reconnecting, closing connections, and
dying when using a longer max_wait_ms and a shorter
supervisor.worker.start.timeout.secs.

@derek - based on that bug, I will check whether running a single worker per
node (currently 4 workers per node) makes a difference.

Thanks
Tyson
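
A rough sketch of the retry-window arithmetic discussed in this thread (the "30 x 10s = ~300 seconds" estimate quoted below). It assumes every wait between attempts is capped at max_wait_ms and ignores the time spent inside each connection attempt, so it gives only a ceiling; the real backoff may wait less per attempt. The function name is made up for illustration, and keeping max_retries at 30 alongside the 15000 ms setting is an assumption.

```python
def max_reconnect_window_ms(max_retries: int, max_wait_ms: int) -> int:
    """Ceiling on total time spent waiting between reconnect attempts.

    Assumes each of the max_retries waits is capped at max_wait_ms.
    """
    return max_retries * max_wait_ms


# Values from the netty settings quoted in this thread.
original = max_reconnect_window_ms(max_retries=30, max_wait_ms=10000)

# Assumption: same retry count, with max_wait_ms raised to 15000.
raised = max_reconnect_window_ms(max_retries=30, max_wait_ms=15000)

print(original // 1000, "seconds")
print(raised // 1000, "seconds")
```

Read this way, raising max_wait_ms lengthens the window during which the other workers keep retrying before they give up.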


On Sep 26, 2014, at 12:10 PM, Derek Dagit <der...@yahoo-inc.com> wrote:

> This could be https://issues.apache.org/jira/browse/STORM-510
> 
> The send thread is blocked on a connection attempt, and so no messages get 
> sent out until the connection is re-established or it times out.
> 
> -- 
> Derek
> 
> On 9/26/14 13:47, Varun Vijayaraghavan wrote:
>> I first tried increasing the max_retries to a much higher number (300)
>> but that did not make a difference.
>> 
>> On Fri, Sep 26, 2014 at 2:46 PM, Varun Vijayaraghavan
>> <varun....@gmail.com> wrote:
>> 
>>    Hey,
>> 
>>    I've been facing the same issues in my topologies. It seems that a
>>    crash in a single worker triggers reconnect attempts from the other
>>    workers for up to max_retries x max_wait_ms (30 x 10s = ~300 seconds
>>    in your case) before they crash themselves, leading to a
>>    catastrophic failure of the topology.
>> 
>>    There is a patch in 0.9.3 related to exponential backoff for netty
>>    connections, which may address the issue. Until then I did two
>>    things: a) increased max_wait_ms to 15000 and b) decreased
>>    supervisor.worker.start.timeout.secs to 30, so that workers restart
>>    earlier.
>> 
>>    On Fri, Sep 26, 2014 at 2:06 PM, Tyson Norris <tnor...@adobe.com>
>>    wrote:
>> 
>>        Hi -
>>        We are seeing workers dying and restarting quite a bit,
>>        apparently from netty connection issues.
>> 
>>        For example, the log below shows:
>>        * Reconnect for worker at 121:6700
>>        * connection established to 121:6700
>>        * closing connection to 121:6700
>>        * Reconnect started to 121:6700
>> 
>>        all within 1 second.
>> 
>>        We have netty config updated to:
>>        storm.messaging.netty.max_retries: 30
>>        storm.messaging.netty.max_wait_ms: 10000
>>        storm.messaging.netty.min_wait_ms: 1000
>> 
>>        And the workers die pretty quickly because 30 retries often do
>>        not end with a connection.
>> 
>>        Any suggestions for how to prevent netty from closing a
>>        connection immediately? I could not see any obvious reason in
>>        the code that this would happen.
>> 
>>        Thanks
>>        Tyson
>> 
>>        2014-09-26 09:32:03 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.13.121:6700... [5]
>>        2014-09-26 09:32:04 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.13.121:6701... [6]
>>        2014-09-26 09:32:11 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.10.180:6701... [6]
>>        2014-09-26 09:32:12 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.10.180:6702... [6]
>>        2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.13.121:6700... [6]
>>        2014-09-26 09:32:14 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.13.121:6701... [7]
>>        2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.13.121:6700... [7]
>>        2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established
>>        to a remote host Netty-Client-/10.27.13.121:6700, [id: 0xb8b33bef,
>>        /10.27.10.180:33880 => /10.27.13.121:6700]
>>        2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client
>>        Netty-Client-/10.27.13.121:6700
>>        2014-09-26 09:32:18 b.s.m.n.Client [INFO] Waiting for pending
>>        batchs to be sent with Netty-Client-/10.27.13.121:6700...,
>>        timeout: 600000ms, pendings: 0
>>        2014-09-26 09:32:19 b.s.m.n.Client [INFO] New Netty Client,
>>        connect to 10.27.13.121, 6700, config: , buffer_size: 5242880
>>        2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for
>>        Netty-Client-/10.27.13.121:6700... [0]
>>        2014-09-26 09:32:19 b.s.m.n.Client [INFO] connection established
>>        to a remote host Netty-Client-/10.27.13.121:6700, [id: 0x9dc224e6,
>>        /10.27.10.180:33881 => /10.27.13.121:6700]
>> 
>> 
>> 
>> 
>>    --
>>    - varun :)
>> 
>> 
>> 
>> 
>> --
>> - varun :)
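
One way to check how often the pattern in the quoted log occurs (a connection established and then closed to the same peer within about a second) is to scan the worker log for it. The following is only a sketch: it assumes each log entry sits on a single line (the entries above are wrapped by the mail client), and the function name, regular expression, and one-second window are illustrative choices, not part of Storm.

```python
import re
from datetime import datetime, timedelta

# Matches only the two event types of interest and captures the peer address.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"(?P<event>connection established|Closing Netty Client).*?"
    r"Netty-Client-/(?P<peer>[\d.]+:\d+)"
)


def find_flaps(lines, window=timedelta(seconds=1)):
    """Return (peer, established_at, closed_at) for quick open/close pairs."""
    last_established = {}
    flaps = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # reconnect attempts and other messages are ignored
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        peer = m.group("peer")
        if m.group("event") == "connection established":
            last_established[peer] = ts
        elif peer in last_established and ts - last_established[peer] <= window:
            flaps.append((peer, last_established.pop(peer), ts))
    return flaps


# Entries taken from the log quoted above, joined back onto single lines.
sample = [
    "2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for "
    "Netty-Client-/10.27.13.121:6700... [7]",
    "2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established to a "
    "remote host Netty-Client-/10.27.13.121:6700, [id: 0xb8b33bef, "
    "/10.27.10.180:33880 => /10.27.13.121:6700]",
    "2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client "
    "Netty-Client-/10.27.13.121:6700",
]

flaps = find_flaps(sample)
for peer, opened, closed in flaps:
    print(peer, "opened", opened.time(), "closed", closed.time())
```

Counting these pairs per peer before and after a config change (for example, one worker per node instead of four) would show whether the change actually reduces the open/close churn.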
