Thanks Erik. I *think* the issue for me was that I had defined more workers than the cluster could actually host. I have 5 machines with 4 cores each, so I should have had at most 20 workers, but I ended up defining 25. One of the machines then went out of circulation (its disk filled up), which left me with only 16 usable worker slots while the topology still asked for 25. I'm not sure whether that alone could be the reason, but after correcting the worker count and migrating to 0.9.5, the issue seems to have disappeared. I'm still validating, though.
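For anyone else hitting this, here is a minimal sketch of the arithmetic I mean (Java against the 0.9.x Config API; the class name and the slot ports in the comment are made up for illustration - your storm.yaml may differ):

import backtype.storm.Config;

public class WorkerCountSketch {
    public static void main(String[] args) {
        // Each supervisor offers one worker slot per entry in supervisor.slots.ports
        // in its storm.yaml, e.g. a 4-core box with [6700, 6701, 6702, 6703] has 4 slots.
        // 5 supervisors x 4 slots = 20 slots in total.
        Config conf = new Config();
        conf.setNumWorkers(20); // keep this <= the slots actually available; I had asked for 25
    }
}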
Thanks
Kashyap

On Mon, Sep 14, 2015 at 6:53 PM, Erik Weathers <[email protected]> wrote:

> That exception is certainly a *result* of the original worker death you're experiencing. As noted in earlier responses on this thread, it seems like you're experiencing a cascading set of connection exceptions which are obscuring the original root cause / original worker death. This is one of the pain points with storm: it can be hard to find the original exception / reason for a spiraling set of worker deaths. You should look at the worker and supervisor logs to find out if "myserver1.personal.com/10.2.72.176:6701" was already dead before you saw this exception.
>
> Notably, with storm-0.9.3 we needed to revert to zero-mq instead of netty to overcome a similar issue. We haven't experienced the problems after upgrading to 0.9.4 with netty (0.9.5 has also worked for us). When we were experiencing problems with 0.9.3 and netty, the original worker process that was dying and invoking the cascading failures was "timing out". i.e., the supervisor wasn't receiving heartbeats from the worker within the 30 second window, and then the supervisor *killed* the worker. We noted that the workers were supposed to write to their heartbeat file once a second, but the frequency consistently increased, going from 1 second, to 2 seconds, to 5 seconds, ..., to eventually being longer than 30 seconds, causing the supervisor to kill the worker.
>
> So long story short: if you're experiencing the same thing as we were, just upgrading to 0.9.4 or 0.9.5 might solve it.
>
> But before doing that you should find the initial worker death's cause (be it a heartbeat timeout or an exception within the worker).
>
> - Erik
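(Side note for the archives: these are the knobs Erik is describing above, sketched here via the 0.9.x Config constants so the key names are concrete. The values shown are the stock defaults he mentions, not a recommendation, and the first two are really cluster-side settings in storm.yaml rather than things you would normally override per topology.)

import backtype.storm.Config;

public class HeartbeatSettingsSketch {
    public static void main(String[] args) {
        Config conf = new Config();
        // worker.heartbeat.frequency.secs: how often each worker writes its local heartbeat.
        conf.put(Config.WORKER_HEARTBEAT_FREQUENCY_SECS, 1);
        // supervisor.worker.timeout.secs: how long the supervisor waits before killing a silent worker.
        conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
        // storm.messaging.transport: the netty transport used by 0.9.x.
        conf.put(Config.STORM_MESSAGING_TRANSPORT, "backtype.storm.messaging.netty.Context");
    }
}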
> On Fri, Sep 11, 2015 at 1:26 PM, Kashyap Mhaisekar <[email protected]> wrote:
>
>> Ganesh, All
>> Do you know if the answer to this is an upgrade to 0.9.4 or 0.9.5, or to version 0.10.0-beta1? My topology runs fine for 15 mins and then gives up with this -
>>
>> 2015-09-11 15:19:51 b.s.m.n.Client [INFO] failed to send requests to myserver1.personal.com/10.2.72.176:6701: java.nio.channels.ClosedChannelException: null
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:405) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
>>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
>>
>> and then with ...
>>
>> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver5.personal.com/10.2.72.176:6701... [1]
>> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver7.personal.com/10.2.72.72:6704... [1]
>> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver3.personal.com/10.2.72.77:6702... [1]
>>
>> It restarts again and the whole thing repeats.
>>
>> Thanks
>> kashyap
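(On the "Reconnect started for Netty-Client" spam above: the retry/backoff behaviour of that Netty client is controlled by a few settings. A minimal sketch of where they live is below; the values are made up purely for illustration, and tuning them only changes how noisily the surviving workers retry - it doesn't fix whatever killed the worker in the first place.)

import backtype.storm.Config;

public class NettyRetrySketch {
    public static void main(String[] args) {
        Config conf = new Config();
        // How many times the Netty client retries a dead peer before giving up on it.
        conf.put(Config.STORM_MESSAGING_NETTY_MAX_RETRIES, 30);    // illustrative value
        // Bounds of the backoff between successive "Reconnect started ..." attempts.
        conf.put(Config.STORM_MESSAGING_NETTY_MIN_SLEEP_MS, 100);  // illustrative value
        conf.put(Config.STORM_MESSAGING_NETTY_MAX_SLEEP_MS, 1000); // illustrative value
    }
}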
>> On Fri, Sep 4, 2015 at 11:33 AM, Ganesh Chandrasekaran <[email protected]> wrote:
>>
>>> Kashyap,
>>>
>>> Yes, you will need to upgrade the Storm version on the cluster as well. Personally, I would run tests to see if it fixes the existing issue before upgrading.
>>>
>>> Thanks,
>>> Ganesh
>>>
>>> From: Joseph Beard [mailto:[email protected]]
>>> Sent: Friday, September 04, 2015 12:07 PM
>>> To: [email protected]
>>> Subject: Re: Netty reconnect
>>>
>>> We also ran into the same issue with Storm 0.9.4. We chose to upgrade to 0.10.0-beta1, which solved the problem and has been otherwise stable for our needs.
>>>
>>> Joe
>>> --
>>> Joseph Beard
>>> [email protected]
>>>
>>> On Sep 3, 2015, at 10:03 AM, Kashyap Mhaisekar <[email protected]> wrote:
>>>
>>> Thanks for the advice. Will upgrade from 0.9.3 to 0.9.4. A lame question - does it mean that the existing clusters need to be rebuilt with 0.9.4?
>>>
>>> Thanks
>>> Kashyap
>>>
>>> On Sep 3, 2015 08:32, "Nick R. Katsipoulakis" <[email protected]> wrote:
>>>
>>> Ganesh,
>>>
>>> No, I am not.
>>>
>>> Cheers,
>>> Nick
>>>
>>> 2015-09-03 9:25 GMT-04:00 Ganesh Chandrasekaran <[email protected]>:
>>>
>>> Are you using the multilang protocol? I know that after upgrading to 0.9.4 it seemed like I was being affected by this bug - https://issues.apache.org/jira/browse/STORM-738 - and rolled back to the previous stable version, 0.8.2. I did not verify this thoroughly on my cluster, though.
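(For anyone reading the archive: "using the multilang protocol" here means running bolts or spouts through ShellBolt/ShellSpout, i.e. the component logic lives in an external process such as a Python script. A generic sketch of what that looks like follows - it is not taken from anyone's topology on this thread, and the class and script names are hypothetical.)

import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import java.util.Map;

// A bolt whose logic runs in a subprocess and talks to the worker over the multilang protocol.
public class SplitSentenceBolt extends ShellBolt implements IRichBolt {
    public SplitSentenceBolt() {
        super("python", "splitsentence.py"); // hypothetical script name
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}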
>>> From: Nick R. Katsipoulakis [mailto:[email protected]]
>>> Sent: Thursday, September 03, 2015 9:08 AM
>>> To: [email protected]
>>> Subject: Re: Netty reconnect
>>>
>>> Hello again,
>>>
>>> I read STORM-404 and saw that it is resolved in version 0.9.4. However, I have version 0.9.4 installed in my cluster, and I have seen similar behavior in my workers.
>>>
>>> In fact, at random times I would see that some workers were considered dead (Netty was dropping messages) and they would be restarted by the nimbus.
>>>
>>> Currently, I only see dropped messages but not restarted workers.
>>>
>>> FYI, my cluster has the following setup:
>>>
>>> - 3X AWS m4.xlarge instances for ZooKeeper and Nimbus
>>> - 4X AWS m4.xlarge instances for Supervisors (each one with 2 workers)
>>>
>>> Thanks,
>>> Nick
>>>
>>> 2015-09-03 8:38 GMT-04:00 Ganesh Chandrasekaran <[email protected]>:
>>>
>>> Agreed with Jitendra. We were using version 0.9.3 and facing the same issue of netty reconnects, which was STORM-404. Upgrading to 0.9.4 fixed the issue.
>>>
>>> Thanks,
>>> Ganesh
>>>
>>> From: Jitendra Yadav [mailto:[email protected]]
>>> Sent: Thursday, September 03, 2015 8:20 AM
>>> To: [email protected]
>>> Subject: Re: Netty reconnect
>>>
>>> I don't know your Storm version, but it's worth checking these JIRAs to see if a similar scenario is occurring:
>>>
>>> https://issues.apache.org/jira/browse/STORM-404
>>> https://issues.apache.org/jira/browse/STORM-450
>>>
>>> Thanks
>>> Jitendra
>>>
>>> On Thu, Sep 3, 2015 at 5:22 PM, John Yost <[email protected]> wrote:
>>>
>>> Hi Everyone,
>>>
>>> When I see this, it is evidence that one or more of the workers are not starting up, which results in connections either not occurring, or reconnects occurring when supervisors kill workers that don't start up properly. I recommend checking the supervisor and nimbus logs to see if there are any root causes other than network issues causing the connect/reconnect.
>>>
>>> --John
>>>
>>> On Thu, Sep 3, 2015 at 7:32 AM, Nick R. Katsipoulakis <[email protected]> wrote:
>>>
>>> Hello Kashyap,
>>>
>>> I have been having the same issue for some time now on my AWS cluster. To be honest, I do not know how to resolve it.
>>>
>>> Regards,
>>> Nick
>>>
>>> 2015-09-03 0:07 GMT-04:00 Kashyap Mhaisekar <[email protected]>:
>>>
>>> Hi,
>>> Has anyone experienced Netty reconnects repeatedly? My workers seem to be eternally in a reconnect state and the topology doesn't serve messages at all. It gets connected once in a while and then goes back to reconnecting.
>>>
>>> Any fixes for this?
>>>
>>> "Reconnect started for Netty-Client"
>>>
>>> Thanks
>>> Kashyap
>>>
>>> --
>>> Nikolaos Romanos Katsipoulakis,
>>> University of Pittsburgh, PhD candidate
