Re: netty reconnects
Hey, I've been facing the same issue in my topologies. It seems like a crash in a single worker triggers reconnect attempts from the other workers for a bounded window of time (30 x 10s = ~300 seconds in your case) before they crash themselves, leading to a catastrophic failure of the whole topology.

There is a patch in 0.9.3 related to exponential backoff for Netty connections, which may address the issue. Until then, I did two things: a) increased max_wait_ms to 15000, and b) decreased supervisor.worker.start.timeout.secs to 30, so that workers restart earlier.

On Fri, Sep 26, 2014 at 2:06 PM, Tyson Norris tnor...@adobe.com wrote:

Hi -
We are seeing workers dying and restarting quite a bit, apparently from Netty connection issues. For example, the log below shows:
* Reconnect started for worker at 121:6700
* connection established to 121:6700
* closing connection to 121:6700
* Reconnect started to 121:6700
all within 1 second.

We have the Netty config updated to:
storm.messaging.netty.max_retries: 30
storm.messaging.netty.max_wait_ms: 1
storm.messaging.netty.min_wait_ms: 1000

And the workers die pretty quickly, because 30 retries often do not end with a connection. Any suggestions for how to prevent Netty from closing a connection immediately? I could not see any obvious reason in the code that this would happen.

Thanks
Tyson

2014-09-26 09:32:03 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700... [5]
2014-09-26 09:32:04 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6701... [6]
2014-09-26 09:32:11 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.10.180:6701... [6]
2014-09-26 09:32:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.10.180:6702... [6]
2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700... [6]
2014-09-26 09:32:14 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6701... [7]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700... [7]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-/10.27.13.121:6700, [id: 0xb8b33bef, /10.27.10.180:33880 = /10.27.13.121:6700]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-/10.27.13.121:6700
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-/10.27.13.121:6700..., timeout: 60ms, pendings: 0
2014-09-26 09:32:19 b.s.m.n.Client [INFO] New Netty Client, connect to 10.27.13.121, 6700, config: , buffer_size: 5242880
2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-/10.27.13.121:6700... [0]
2014-09-26 09:32:19 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-/10.27.13.121:6700, [id: 0x9dc224e6, /10.27.10.180:33881 = /10.27.13.121:6700]

--
- varun :)
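Varun's ~300-second figure comes from the retry cap multiplied by the per-retry wait. A rough sketch of a capped exponential backoff shows the arithmetic (a hypothetical illustration, not Storm's actual retry code; 1000 ms is the min wait from the thread, and 10000 ms is the 10 s per-retry cap Varun assumes):

```python
def backoff_ms(retry, min_wait_ms=1000, max_wait_ms=10000):
    """Wait before reconnect attempt `retry` (0-based), doubling up to a cap."""
    return min(max_wait_ms, min_wait_ms * (2 ** retry))

def total_window_ms(max_retries=30, min_wait_ms=1000, max_wait_ms=10000):
    """Worst-case time spent retrying before the client gives up entirely."""
    return sum(backoff_ms(r, min_wait_ms, max_wait_ms) for r in range(max_retries))

# The wait saturates at the cap after a few retries, so the window is
# dominated by (retries remaining) x (max wait).
print(total_window_ms())  # 275000 ms, i.e. close to the ~300 s described above
```

Under these assumed values the wait reaches the cap on the fifth attempt, so almost the entire window is spent at 10 s per retry; raising max_wait_ms (as Varun did) stretches the window, while lowering the worker start timeout makes workers give up and restart sooner.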
Re: netty reconnects
I first tried increasing max_retries to a much higher number (300), but that did not make a difference.

On Fri, Sep 26, 2014 at 2:46 PM, Varun Vijayaraghavan varun@gmail.com wrote:
> Hey, I've been facing the same issues in my topologies. [...]

--
- varun :)
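For reference, Varun's two mitigations combined with the retry settings quoted in the thread would look like this in storm.yaml (the values are the ones mentioned above; whether they suit a given cluster depends on the workload):

```yaml
# Netty retry settings from the original mail.
storm.messaging.netty.max_retries: 30
storm.messaging.netty.min_wait_ms: 1000
# (a) raise the per-retry cap so retries span a longer window.
storm.messaging.netty.max_wait_ms: 15000
# (b) lower the start timeout so replacement workers come up sooner.
supervisor.worker.start.timeout.secs: 30
```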
Re: netty reconnects
This could be https://issues.apache.org/jira/browse/STORM-510

The send thread is blocked on a connection attempt, and so no messages get sent out until the connection is re-established or it times out.

--
Derek

On 9/26/14 13:47, Varun Vijayaraghavan wrote:
> I first tried increasing the max_retries to a much higher number (300) but that did not make a difference. [...]
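Derek's point, that a send thread stuck inside a connection attempt stalls everything queued behind it, can be shown with a toy sketch (hypothetical, not Storm's implementation): producers keep enqueueing while the single send loop is blocked, and nothing goes out until the simulated connect returns.

```python
import queue
import threading
import time

out = queue.Queue()  # outbound messages waiting on the send thread

def blocking_send_loop(connect_delay_s, sent):
    """A send loop that blocks in its (re)connect path before draining."""
    time.sleep(connect_delay_s)  # stands in for a blocking connection attempt
    while not out.empty():
        sent.append(out.get())   # only now do queued messages go out

sent = []
for i in range(5):
    out.put(f"msg-{i}")  # producers enqueue while the send thread is stuck

t = threading.Thread(target=blocking_send_loop, args=(0.1, sent))
t.start()
t.join()
print(len(sent))  # prints 5: all messages were delayed behind the connect
```

This is why, in the logs above, bursts of activity appear only once a connection is re-established: the messages were not lost, just held up behind the blocked reconnect.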
Re: netty reconnects
@varun - I still see workers waiting, reconnecting, closing connections, and dying when using a longer max_wait_ms and a shorter worker start timeout.

@derek - based on that bug, I will try to see if using a single worker per node (currently 4 workers per node) makes a difference.

Thanks
Tyson

On Sep 26, 2014, at 12:10 PM, Derek Dagit der...@yahoo-inc.com wrote:
> This could be https://issues.apache.org/jira/browse/STORM-510
> The send thread is blocked on a connection attempt, and so no messages get sent out until the connection is re-established or it times out. [...]
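One way to test the one-worker-per-node idea is to reduce each supervisor to a single slot in storm.yaml (a sketch under assumptions: the port number is just an example, and the topology's requested worker count has to be reduced to match the total slots available):

```yaml
# Each listed port is one worker slot; a single port limits this node
# to one worker process.
supervisor.slots.ports:
    - 6700
```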
RE: netty reconnects
We see exactly the same thing in our worker logs. I don't know if this is correct behavior, but just acknowledging that we see the same thing.

Richard Gunderson
Mobile: (612) 860-1676

-----Original Message-----
From: Tyson Norris [mailto:tnor...@adobe.com]
Sent: Friday, September 26, 2014 1:06 PM
To: user@storm.incubator.apache.org
Subject: netty reconnects

> Hi - We are seeing workers dying and restarting quite a bit, apparently from netty connection issues. [...]