I first tried increasing storm.messaging.netty.max_retries to a much higher
number (300), but that did not make a difference.
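
As a rough sanity check on the retry window mentioned in the quoted message
below, here is a small Python sketch of the longest time a client could keep
retrying. It assumes every attempt waits the full max_wait_ms, which is only an
upper bound; the actual Netty client backoff may wait less per attempt. The
function name is made up for illustration.

```python
def worst_case_retry_window_s(max_retries: int, max_wait_ms: int) -> float:
    """Upper bound on total reconnect time, assuming each retry waits max_wait_ms."""
    return max_retries * max_wait_ms / 1000.0


# Settings from the original report: 30 retries, 10 s cap per wait.
print(worst_case_retry_window_s(30, 10000))
# Same cap with max_retries raised to 300.
print(worst_case_retry_window_s(300, 10000))
```

Under that assumption the original settings give a ceiling of about 300
seconds, and raising max_retries to 300 only lengthens the ceiling; it does not
change why a freshly established connection is closed right away.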

On Fri, Sep 26, 2014 at 2:46 PM, Varun Vijayaraghavan <varun....@gmail.com>
wrote:

> Hey,
>
> I've been facing the same issues in my topologies. It seems that a crash
> in a single worker triggers reconnect attempts from the other workers for
> some amount of time (30 x 10s = ~300 seconds in your case) before they
> crash themselves, leading to a catastrophic failure of the topology.
>
> There is a patch in 0.9.3 related to exponential backoff for netty
> connections, which may address the issue. Until then I did two things:
> a) increased storm.messaging.netty.max_wait_ms to 15000, and b) decreased
> supervisor.worker.start.timeout.secs to 30, so that workers restart
> earlier.
>
> On Fri, Sep 26, 2014 at 2:06 PM, Tyson Norris <tnor...@adobe.com> wrote:
>
>> Hi -
>> We are seeing workers dying and restarting quite a bit, apparently from
>> netty connection issues.
>>
>> For example, the log below shows:
>> * Reconnect for worker at 121:6700
>> * connection established to 121:6700
>> * closing connection to 121:6700
>> * Reconnect started to 121:6700
>>
>> all within 1 second.
>>
>> We have netty config updated to:
>> storm.messaging.netty.max_retries: 30
>> storm.messaging.netty.max_wait_ms: 10000
>> storm.messaging.netty.min_wait_ms: 1000
>>
>> And the workers die pretty quickly because 30 retries often do not end
>> up with a connection.
>>
>> Any suggestions for how to prevent netty from closing a connection
>> immediately? I could not see any obvious reason in the code why this would
>> happen.
>>
>> Thanks
>> Tyson
>>
>> 2014-09-26 09:32:03 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.13.121:6700... [5]
>> 2014-09-26 09:32:04 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.13.121:6701... [6]
>> 2014-09-26 09:32:11 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.10.180:6701... [6]
>> 2014-09-26 09:32:12 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.10.180:6702... [6]
>> 2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.13.121:6700... [6]
>> 2014-09-26 09:32:14 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.13.121:6701... [7]
>> 2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.13.121:6700... [7]
>> 2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established to a
>> remote host Netty-Client-/10.27.13.121:6700, [id: 0xb8b33bef, /
>> 10.27.10.180:33880 => /10.27.13.121:6700]
>> 2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client
>> Netty-Client-/10.27.13.121:6700
>> 2014-09-26 09:32:18 b.s.m.n.Client [INFO] Waiting for pending batchs to
>> be sent with Netty-Client-/10.27.13.121:6700..., timeout: 600000ms,
>> pendings: 0
>> 2014-09-26 09:32:19 b.s.m.n.Client [INFO] New Netty Client, connect to
>> 10.27.13.121, 6700, config: , buffer_size: 5242880
>> 2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for
>> Netty-Client-/10.27.13.121:6700... [0]
>> 2014-09-26 09:32:19 b.s.m.n.Client [INFO] connection established to a
>> remote host Netty-Client-/10.27.13.121:6700, [id: 0x9dc224e6, /
>> 10.27.10.180:33881 => /10.27.13.121:6700]
>>
>>
>
>
> --
> - varun :)
>



-- 
- varun :)
