Re: nette reconnects

2014-09-26 Thread Varun Vijayaraghavan
Hey,

I've been facing the same issue in my topologies. It seems that a crash in
a single worker makes the other workers keep trying to reconnect for some
time (30 x 10s = ~300 seconds in your case) before crashing themselves,
which leads to a catastrophic failure of the topology.

There is a patch in 0.9.3 related to exponential backoff for netty
connections, which may address the issue. Until then I did two things:
a) increased storm.messaging.netty.max_wait_ms to 15000, and
b) decreased supervisor.worker.start.timeout.secs to 30, so that workers
restart earlier.
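
The arithmetic above (30 retries at roughly 10 s each) can be sketched in
Python. This is only an illustration, not Storm's retry code: both function
names are made up, and the doubling schedule capped at max_wait_ms is an
assumption about how an exponential backoff might use min_wait_ms and
max_wait_ms.

```python
from typing import List


def fixed_retry_window_s(max_retries: int, wait_ms: int) -> float:
    """Upper bound on time spent retrying when every attempt waits wait_ms."""
    return max_retries * wait_ms / 1000.0


def capped_backoff_waits_ms(max_retries: int, min_wait_ms: int,
                            max_wait_ms: int) -> List[int]:
    """Hypothetical schedule: start at min_wait_ms, double per retry, cap at max_wait_ms."""
    waits = []
    wait = min_wait_ms
    for _ in range(max_retries):
        waits.append(min(wait, max_wait_ms))  # never wait longer than the cap
        wait = min(wait * 2, max_wait_ms)
    return waits


if __name__ == "__main__":
    # The estimate from this thread: 30 retries at roughly 10 s each.
    print(fixed_retry_window_s(30, 10_000))  # 300.0

    # Same retry count, doubling from 1000 ms up to a 15000 ms cap.
    waits = capped_backoff_waits_ms(30, 1_000, 15_000)
    print(waits[:6])            # [1000, 2000, 4000, 8000, 15000, 15000]
    print(sum(waits) / 1000.0)  # 405.0
```

In this model, 30 retries with a 15000 ms cap add up to about 405 s of
waiting. The model also clamps every wait to the cap, so a cap below the
starting wait (for example max_wait_ms: 1) would make all 30 waits 1 ms.
Whether Storm itself behaves that way is not something this sketch
establishes.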

On Fri, Sep 26, 2014 at 2:06 PM, Tyson Norris tnor...@adobe.com wrote:

 Hi -
 We are seeing workers dying and restarting quite a bit, apparently from
 netty connection issues.

 For example, the log below shows:
 * Reconnect for worker at 121:6700
 * connection established to 121:6700
 * closing connection to 121:6700
 * Reconnect started to 121:6700

 all within 1 second.

 We have netty config updated to:
 storm.messaging.netty.max_retries: 30
 storm.messaging.netty.max_wait_ms: 1
 storm.messaging.netty.min_wait_ms: 1000

 And the workers die pretty quickly because often 30 retries does not end
 up with a connection.

 Any suggestions for how to prevent netty from closing a connection
 immediately? I could not see any obvious reason in the code that this would
 happen.

 Thanks
 Tyson

 2014-09-26 09:32:03 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.13.121:6700... [5]
 2014-09-26 09:32:04 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.13.121:6701... [6]
 2014-09-26 09:32:11 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.10.180:6701... [6]
 2014-09-26 09:32:12 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.10.180:6702... [6]
 2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.13.121:6700... [6]
 2014-09-26 09:32:14 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.13.121:6701... [7]
 2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.13.121:6700... [7]
 2014-09-26 09:32:18 b.s.m.n.Client [INFO] connection established to a
 remote host Netty-Client-/10.27.13.121:6700, [id: 0xb8b33bef, /
 10.27.10.180:33880 = /10.27.13.121:6700]
 2014-09-26 09:32:18 b.s.m.n.Client [INFO] Closing Netty Client
 Netty-Client-/10.27.13.121:6700
 2014-09-26 09:32:18 b.s.m.n.Client [INFO] Waiting for pending batchs to be
 sent with Netty-Client-/10.27.13.121:6700..., timeout: 60ms,
 pendings: 0
 2014-09-26 09:32:19 b.s.m.n.Client [INFO] New Netty Client, connect to
 10.27.13.121, 6700, config: , buffer_size: 5242880
 2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for
 Netty-Client-/10.27.13.121:6700... [0]
 2014-09-26 09:32:19 b.s.m.n.Client [INFO] connection established to a
 remote host Netty-Client-/10.27.13.121:6700, [id: 0x9dc224e6, /
 10.27.10.180:33881 = /10.27.13.121:6700]
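
To follow the attempt counter per target in a log like the one above, a
small Python helper along these lines may be useful. It is a sketch: the
regular expression is written against the line format shown in this thread
only (including lines wrapped by the mail client), and the function name is
made up.

```python
import re
from collections import defaultdict

# Matches "Reconnect started for Netty-Client-/host:port... [n]" even when the
# mail client wrapped the line after "for".
RECONNECT = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ \[INFO\] "
    r"Reconnect started for\s+Netty-Client-/(\S+?)\.\.\. \[(\d+)\]"
)


def reconnect_attempts(log_text):
    """Map each 'host:port' target to its (timestamp, attempt) pairs, in log order."""
    attempts = defaultdict(list)
    for timestamp, target, attempt in RECONNECT.findall(log_text):
        attempts[target].append((timestamp, int(attempt)))
    return dict(attempts)


# Three entries for one target, copied from the log in this thread.
SAMPLE = """\
2014-09-26 09:32:13 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6700... [6]
2014-09-26 09:32:18 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6700... [7]
2014-09-26 09:32:19 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-/10.27.13.121:6700... [0]
"""

if __name__ == "__main__":
    for target, pairs in reconnect_attempts(SAMPLE).items():
        print(target, pairs)
```

For 10.27.13.121:6700 the counter goes 6, 7 and then back to 0 after the
"New Netty Client" line, which matches the close-and-reconnect sequence
described in the message.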




-- 
- varun :)


Re: nette reconnects

2014-09-26 Thread Varun Vijayaraghavan
I first tried increasing the max_retries to a much higher number (300) but
that did not make a difference.

-- 
- varun :)


Re: nette reconnects

2014-09-26 Thread Derek Dagit

This could be https://issues.apache.org/jira/browse/STORM-510

The send thread is blocked on a connection attempt, and so no messages 
get sent out until the connection is re-established or it times out.
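
A toy model of that description, in Python, may help picture it. It is not
Storm's code: the function name, the simulated clock and the 10 s connection
timeout are assumptions made for illustration, and the host addresses are
simply taken from the log earlier in the thread.

```python
from collections import deque


def drain_single_sender(messages, down_hosts, connect_timeout_s, send_cost_s=0.0):
    """Simulate one send thread that handles queued messages strictly in order.

    Returns (destination, sent_at) pairs using a simulated clock in seconds.
    A message for a host in down_hosts blocks the thread for connect_timeout_s
    and is then recorded as unsent (None); everything queued behind it waits.
    """
    clock = 0.0
    results = []
    queue = deque(messages)
    while queue:
        dest = queue.popleft()
        if dest in down_hosts:
            clock += connect_timeout_s  # thread is stuck in the connection attempt
            results.append((dest, None))
        else:
            clock += send_cost_s
            results.append((dest, clock))
    return results


if __name__ == "__main__":
    msgs = ["10.27.10.180:6701", "10.27.13.121:6700", "10.27.10.180:6701"]
    for dest, sent_at in drain_single_sender(msgs, {"10.27.13.121:6700"}, 10.0):
        print(dest, sent_at)
    # 10.27.10.180:6701 0.0
    # 10.27.13.121:6700 None
    # 10.27.10.180:6701 10.0
```

The third message goes to a healthy worker but is still delayed by the full
timeout, because it sits behind the blocked connection attempt.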


--
Derek



Re: nette reconnects

2014-09-26 Thread Tyson Norris
@varun - I still see workers waiting, reconnecting, closing connections, and
dying when using a longer max_wait_ms and a shorter
supervisor.worker.start.timeout.secs.

@derek - based on that bug, I will try to see if using a single worker per node 
(currently 4 workers per node) makes a difference.

Thanks
Tyson





RE: nette reconnects

2014-09-26 Thread Gunderson, Richard-CW
We see exactly the same thing in our worker logs. I don't know if this is 
correct behavior, but just acknowledging that we see the same thing.

Richard Gunderson
Mobile: (612) 860-1676

-Original Message-
From: Tyson Norris [mailto:tnor...@adobe.com] 
Sent: Friday, September 26, 2014 1:06 PM
To: user@storm.incubator.apache.org
Subject: nette reconnects
