This could be https://issues.apache.org/jira/browse/STORM-510

The send thread is blocked on a connection attempt, and so no messages get sent out until the connection is re-established or it times out.
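
As a rough illustration of that behavior (a toy sketch, not Storm's actual Client code; the class name BlockedSendDemo and the latch standing in for the connection attempt are made up for this example), a single thread that is stuck connecting cannot run the sends queued behind it:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; it does not use or mirror Storm's Client code.
class BlockedSendDemo {

    // Returns {sentWhileConnecting, sentAfterConnect}.
    static boolean[] run() {
        // One thread handles both connecting and sending.
        ExecutorService sendThread = Executors.newSingleThreadExecutor();
        CountDownLatch connected = new CountDownLatch(1);
        try {
            // Simulated connection attempt: blocks until the latch is released.
            sendThread.submit(() -> {
                connected.await();
                return null;
            });
            // A message queued behind the connection attempt.
            Future<String> send = sendThread.submit(() -> "message sent");

            // The send is queued behind the connect, so it cannot have run yet.
            boolean sentWhileConnecting = send.isDone();
            // Connection established (or timed out): the thread is free again.
            connected.countDown();
            send.get();
            boolean sentAfterConnect = send.isDone();
            return new boolean[] {sentWhileConnecting, sentAfterConnect};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            sendThread.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("sent while connecting: " + r[0]);
        System.out.println("sent after connect:    " + r[1]);
    }
}
```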

--
Derek

On 9/26/14 13:47, Varun Vijayaraghavan wrote:
I first tried increasing the max_retries to a much higher number (300)
but that did not make a difference.
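
For reference, the retry window estimated in the quoted message below (max_retries x max_wait_ms) can be worked out as follows. This is only a sketch of that arithmetic: RetryWindow is a made-up helper, it assumes every retry waits the full max_wait_ms, and it does not reproduce Storm's actual retry logic, so real windows may well be shorter.

```java
// Hypothetical helper; the formula is the thread's own estimate, not Storm's retry code.
class RetryWindow {

    // Upper bound in seconds, assuming every retry waits the full maxWaitMs.
    static long worstCaseSeconds(int maxRetries, long maxWaitMs) {
        return maxRetries * maxWaitMs / 1000;
    }

    public static void main(String[] args) {
        // Settings from the original report: 30 retries, 10000 ms max wait.
        System.out.println(worstCaseSeconds(30, 10_000));
        // Assumption for illustration: 300 retries with the same 10000 ms max wait.
        System.out.println(worstCaseSeconds(300, 10_000));
    }
}
```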

On Fri, Sep 26, 2014 at 2:46 PM, Varun Vijayaraghavan
<varun....@gmail.com> wrote:

    Hey,

    I've been facing the same issues in my topologies. It seems that a
    crash in a single worker triggers reconnect attempts from the other
    workers for some time (30 x 10s = ~300 seconds in your case) before
    they crash themselves, leading to a catastrophic failure of the
    topology.

    There is a patch in 0.9.3 related to exponential backoff for netty
    connections, which may address the issue. Until then I did two
    things: a) increased storm.messaging.netty.max_wait_ms to 15000 and
    b) decreased supervisor.worker.start.timeout.secs to 30, so that
    workers restart earlier.

    On Fri, Sep 26, 2014 at 2:06 PM, Tyson Norris <tnor...@adobe.com>
    wrote:

        Hi -
        We are seeing workers dying and restarting quite a bit,
        apparently from netty connection issues.

        For example, the log below shows:
        * Reconnect for worker at 121:6700
        * connection established to 121:6700
        * closing connection to 121:6700
        * Reconnect started to 121:6700

        all within 1 second.

        We have netty config updated to:
        storm.messaging.netty.max_retries: 30
        storm.messaging.netty.max_wait_ms: 10000
        storm.messaging.netty.min_wait_ms: 1000

        And the workers die pretty quickly because 30 retries often do
        not end up with a connection.

        Any suggestions for how to prevent netty from closing a
        connection immediately? I could not see any obvious reason in
        the code that this would happen.

        Thanks
        Tyson

        2014-09-26 09:32:03 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.13.121:6700... [5]
        2014-09-26 09:32:04 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.13.121:6701... [6]
        2014-09-26 09:32:11 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.10.180:6701... [6]
        2014-09-26 09:32:12 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.10.180:6702... [6]
        2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.13.121:6700... [6]
        2014-09-26 09:32:14 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.13.121:6701... [7]
        2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.13.121:6700... [7]
        2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established
        to a remote host Netty-Client-/10.27.13.121:6700, [id: 0xb8b33bef,
        /10.27.10.180:33880 => /10.27.13.121:6700]
        2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client
        Netty-Client-/10.27.13.121:6700
        2014-09-26 09:32:18 b.s.m.n.Client [INFO] Waiting for pending
        batchs to be sent with Netty-Client-/10.27.13.121:6700...,
        timeout: 600000ms, pendings: 0
        2014-09-26 09:32:19 b.s.m.n.Client [INFO] New Netty Client,
        connect to 10.27.13.121, 6700, config: , buffer_size: 5242880
        2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for
        Netty-Client-/10.27.13.121:6700... [0]
        2014-09-26 09:32:19 b.s.m.n.Client [INFO] connection established
        to a remote host Netty-Client-/10.27.13.121:6700, [id: 0x9dc224e6,
        /10.27.10.180:33881 => /10.27.13.121:6700]




    --
    - varun :)




--
- varun :)
