Thanks Erik. I *think* the issue for me was that I had defined more workers than the cluster could actually host. I have 5 machines with 4 cores each, so I should have had at most 20 workers, but I ended up defining 25. One of the machines then went out of circulation (its disk filled up), which left me with only 16 usable worker slots while the topology still asked for 25. I'm not sure whether that alone could be the reason, but after correcting the worker count and migrating to 0.9.5, the issue seems to have disappeared. I'm still validating, though.
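For anyone else hitting this, here is a minimal sketch of the arithmetic I mean (Java against the 0.9.x Config API; the class name and the slot ports in the comment are made up for illustration - your storm.yaml may differ):

import backtype.storm.Config;

public class WorkerCountSketch {
    public static void main(String[] args) {
        // Each supervisor offers one worker slot per entry in supervisor.slots.ports
        // in its storm.yaml, e.g. a 4-core box with [6700, 6701, 6702, 6703] has 4 slots.
        // 5 supervisors x 4 slots = 20 slots in total.
        Config conf = new Config();
        conf.setNumWorkers(20); // keep this <= the slots actually available; I had asked for 25
    }
}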
Thanks
Kashyap

On Mon, Sep 14, 2015 at 6:53 PM, Erik Weathers <[email protected]> wrote:

> That exception is certainly a *result* of the original worker death you're experiencing. As noted in earlier responses on this thread, it seems like you're experiencing a cascading set of connection exceptions which are obscuring the original root cause / original worker death. This is one of the pain points with storm: it can be hard to find the original exception / reason for a spiraling set of worker deaths. You should look at the worker and supervisor logs to find out if "myserver1.personal.com/10.2.72.176:6701" was already dead before you saw this exception.
>
> Notably, with storm-0.9.3 we needed to revert to zero-mq instead of netty to overcome a similar issue. We haven't experienced the problems after upgrading to 0.9.4 with netty (0.9.5 has also worked for us). When we were experiencing problems with 0.9.3 and netty, the original worker process that was dying and invoking the cascading failures was "timing out". i.e., the supervisor wasn't receiving heartbeats from the worker within the 30 second window, and then the supervisor *killed* the worker. We noted that the workers were supposed to write to their heartbeat file once a second, but the frequency consistently increased, going from 1 second, to 2 seconds, to 5 seconds, ..., to eventually being longer than 30 seconds, causing the supervisor to kill the worker.
>
> So long story short: if you're experiencing the same thing as we were, just upgrading to 0.9.4 or 0.9.5 might solve it.
>
> But before doing that you should find the initial worker death's cause (be it a heartbeat timeout or an exception within the worker).
>
> - Erik
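(Side note for the archives: these are the knobs Erik is describing above, sketched here via the 0.9.x Config constants so the key names are concrete. The values shown are the stock defaults he mentions, not a recommendation, and the first two are really cluster-side settings in storm.yaml rather than things you would normally override per topology.)

import backtype.storm.Config;

public class HeartbeatSettingsSketch {
    public static void main(String[] args) {
        Config conf = new Config();
        // worker.heartbeat.frequency.secs: how often each worker writes its local heartbeat.
        conf.put(Config.WORKER_HEARTBEAT_FREQUENCY_SECS, 1);
        // supervisor.worker.timeout.secs: how long the supervisor waits before killing a silent worker.
        conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
        // storm.messaging.transport: the netty transport used by 0.9.x.
        conf.put(Config.STORM_MESSAGING_TRANSPORT, "backtype.storm.messaging.netty.Context");
    }
}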
> On Fri, Sep 11, 2015 at 1:26 PM, Kashyap Mhaisekar <[email protected]> wrote:
>
>> Ganesh, All
>> Do you know if the answer to this is an upgrade to 0.9.4 or 0.9.5, or to version 0.10.0-beta1? My topology runs fine for 15 mins and then gives up with this -
>>
>> 2015-09-11 15:19:51 b.s.m.n.Client [INFO] failed to send requests to myserver1.personal.com/10.2.72.176:6701: java.nio.channels.ClosedChannelException: null
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:405) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
>>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
>>
>> and then with ...
>>
>> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver5.personal.com/10.2.72.176:6701... [1]
>> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver7.personal.com/10.2.72.72:6704... [1]
>> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver3.personal.com/10.2.72.77:6702... [1]
>>
>> It restarts again and the whole thing repeats.
>>
>> Thanks
>> kashyap
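(On the "Reconnect started for Netty-Client" spam above: the retry/backoff behaviour of that Netty client is controlled by a few settings. A minimal sketch of where they live is below; the values are made up purely for illustration, and tuning them only changes how noisily the surviving workers retry - it doesn't fix whatever killed the worker in the first place.)

import backtype.storm.Config;

public class NettyRetrySketch {
    public static void main(String[] args) {
        Config conf = new Config();
        // How many times the Netty client retries a dead peer before giving up on it.
        conf.put(Config.STORM_MESSAGING_NETTY_MAX_RETRIES, 30);    // illustrative value
        // Bounds of the backoff between successive "Reconnect started ..." attempts.
        conf.put(Config.STORM_MESSAGING_NETTY_MIN_SLEEP_MS, 100);  // illustrative value
        conf.put(Config.STORM_MESSAGING_NETTY_MAX_SLEEP_MS, 1000); // illustrative value
    }
}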
>> On Fri, Sep 4, 2015 at 11:33 AM, Ganesh Chandrasekaran <[email protected]> wrote:
>>
>>> Kashyap,
>>>
>>> Yes, you will need to upgrade the Storm version on the cluster as well. Personally, I would run tests to see if it fixes the existing issue before upgrading.
>>>
>>> Thanks,
>>> Ganesh
>>>
>>> From: Joseph Beard [mailto:[email protected]]
>>> Sent: Friday, September 04, 2015 12:07 PM
>>> To: [email protected]
>>> Subject: Re: Netty reconnect
>>>
>>> We also ran into the same issue with Storm 0.9.4. We chose to upgrade to 0.10.0-beta1, which solved the problem and has been otherwise stable for our needs.
>>>
>>> Joe
>>> --
>>> Joseph Beard
>>> [email protected]
>>>
>>> On Sep 3, 2015, at 10:03 AM, Kashyap Mhaisekar <[email protected]> wrote:
>>>
>>> Thanks for the advice. Will upgrade from 0.9.3 to 0.9.4. A lame question - does it mean that the existing clusters need to be rebuilt with 0.9.4?
>>>
>>> Thanks
>>> Kashyap
>>>
>>> On Sep 3, 2015 08:32, "Nick R. Katsipoulakis" <[email protected]> wrote:
>>>
>>> Ganesh,
>>>
>>> No, I am not.
>>>
>>> Cheers,
>>> Nick
>>>
>>> 2015-09-03 9:25 GMT-04:00 Ganesh Chandrasekaran <[email protected]>:
>>>
>>> Are you using the multilang protocol? I know that after upgrading to 0.9.4 it seemed like I was being affected by this bug - https://issues.apache.org/jira/browse/STORM-738 - and rolled back to the previous stable version, 0.8.2. I did not verify this thoroughly on my cluster, though.
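(For anyone reading the archive: "using the multilang protocol" here means running bolts or spouts through ShellBolt/ShellSpout, i.e. the component logic lives in an external process such as a Python script. A generic sketch of what that looks like follows - it is not taken from anyone's topology on this thread, and the class and script names are hypothetical.)

import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import java.util.Map;

// A bolt whose logic runs in a subprocess and talks to the worker over the multilang protocol.
public class SplitSentenceBolt extends ShellBolt implements IRichBolt {
    public SplitSentenceBolt() {
        super("python", "splitsentence.py"); // hypothetical script name
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}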
>>> From: Nick R. Katsipoulakis [mailto:[email protected]]
>>> Sent: Thursday, September 03, 2015 9:08 AM
>>> To: [email protected]
>>> Subject: Re: Netty reconnect
>>>
>>> Hello again,
>>>
>>> I read STORM-404 and saw that it is resolved in version 0.9.4. However, I have version 0.9.4 installed in my cluster, and I have seen similar behavior in my workers.
>>>
>>> In fact, at random times I would see that some workers were considered dead (Netty was dropping messages) and they would be restarted by the nimbus.
>>>
>>> Currently, I only see dropped messages but not restarted workers.
>>>
>>> FYI, my cluster has the following setup:
>>>
>>> - 3X AWS m4.xlarge instances for ZooKeeper and Nimbus
>>> - 4X AWS m4.xlarge instances for Supervisors (each one with 2 workers)
>>>
>>> Thanks,
>>> Nick
>>>
>>> 2015-09-03 8:38 GMT-04:00 Ganesh Chandrasekaran <[email protected]>:
>>>
>>> Agreed with Jitendra. We were using version 0.9.3 and facing the same issue of netty reconnects, which was STORM-404. Upgrading to 0.9.4 fixed the issue.
>>>
>>> Thanks,
>>> Ganesh
>>>
>>> From: Jitendra Yadav [mailto:[email protected]]
>>> Sent: Thursday, September 03, 2015 8:20 AM
>>> To: [email protected]
>>> Subject: Re: Netty reconnect
>>>
>>> I don't know your Storm version, but it's worth checking these JIRAs to see if a similar scenario is occurring:
>>>
>>> https://issues.apache.org/jira/browse/STORM-404
>>> https://issues.apache.org/jira/browse/STORM-450
>>>
>>> Thanks
>>> Jitendra
>>>
>>> On Thu, Sep 3, 2015 at 5:22 PM, John Yost <[email protected]> wrote:
>>>
>>> Hi Everyone,
>>>
>>> When I see this, it is evidence that one or more of the workers are not starting up, which results in connections either not occurring, or reconnects occurring when supervisors kill workers that don't start up properly. I recommend checking the supervisor and nimbus logs to see if there are any root causes other than network issues causing the connect/reconnect.
>>>
>>> --John
>>>
>>> On Thu, Sep 3, 2015 at 7:32 AM, Nick R. Katsipoulakis <[email protected]> wrote:
>>>
>>> Hello Kashyap,
>>>
>>> I have been having the same issue for some time now on my AWS cluster. To be honest, I do not know how to resolve it.
>>>
>>> Regards,
>>> Nick
>>>
>>> 2015-09-03 0:07 GMT-04:00 Kashyap Mhaisekar <[email protected]>:
>>>
>>> Hi,
>>> Has anyone experienced Netty reconnects repeatedly? My workers seem to be eternally in a reconnect state and the topology doesn't serve messages at all. It gets connected once in a while and then goes back to reconnecting.
>>>
>>> Any fixes for this?
>>>
>>> "Reconnect started for Netty-Client"
>>>
>>> Thanks
>>> Kashyap
>>>
>>> --
>>> Nikolaos Romanos Katsipoulakis,
>>> University of Pittsburgh, PhD candidate
