[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user asfgit closed the pull request at: https://github.com/apache/storm/pull/802 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user danielschonfeld commented on the pull request: https://github.com/apache/storm/pull/802#issuecomment-148807939

@Parth-Brahmbhatt that's a tricky one. I haven't found a way to reproduce it on demand, but after leaving nimbus running for a day or so with more than one nimbus and a good load on the system, we see the number of ZK nodes/keys under /leader-lock go up to (X*nimbuses)+1. When that happens, we have problems trying to do anything, because no nimbus thinks it's the leader, which is exactly what's described in CURATOR-202.

If you can think of a way to programmatically drop the ZK connection and then reconnect using the same session, you'll have a reproduction of this bug, as it always starts showing up after something like the following log lines:

```
2015-10-16 18:16:13 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6668ms for sessionid 0x1506caf14ab005f, closing socket connection and attempting reconnect
2015-10-16 18:16:14 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 6672ms for sessionid 0x1506caf14ab0060, closing socket connection and attempting reconnect
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server 10.101.1.2/10.101.1.2:2181. Will not attempt to authenticate using SASL (unknown error)
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to 10.101.1.2/10.101.1.2:2181, initiating session
2015-10-16 18:16:15 o.a.s.s.o.a.z.ClientCnxn [INFO] Session establishment complete on server 10.101.1.2/10.101.1.2:2181, sessionid = 0x1506caf14ab005f, negotiated timeout = 2
2015-10-16 18:16:15 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: RECONNECTED
```
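The symptom described above (more keys under /leader-lock than there are nimbuses, with a session keeping its id across a reconnect) could be checked for with something like the sketch below. It is only an illustration under stated assumptions: the child node names are hypothetical, the mapping from node name to ephemeral-owner session id is assumed to have been read from ZooKeeper by some other means, and it assumes that when one session owns several latch nodes, the one with the highest sequence number is the live one. The function name `find_stale_latch_nodes` is made up for this example and is not part of Storm or Curator.

```python
from collections import defaultdict


def sequence_number(node_name):
    # ZooKeeper appends a 10-digit, zero-padded counter to sequential nodes.
    return int(node_name[-10:])


def find_stale_latch_nodes(owners):
    """Return {session_id: [stale node names]} for sessions owning >1 node.

    `owners` maps each child node name under the latch path to the session id
    that owns it (the ephemeral owner). Assumption: for a given session, only
    the node with the highest sequence number is still in use.
    """
    by_session = defaultdict(list)
    for name, session_id in owners.items():
        by_session[session_id].append(name)

    stale = {}
    for session_id, names in by_session.items():
        if len(names) > 1:
            names.sort(key=sequence_number)
            stale[session_id] = names[:-1]  # everything except the newest node
    return stale


# Hypothetical snapshot: two nimbuses, three keys. The session ids are the
# ones from the log lines above; the node names are invented.
snapshot = {
    "_c_a-latch-0000000003": 0x1506CAF14AB005F,
    "_c_b-latch-0000000007": 0x1506CAF14AB005F,
    "_c_c-latch-0000000005": 0x1506CAF14AB0060,
}
stale_nodes = find_stale_latch_nodes(snapshot)
```

In this hypothetical snapshot the leftover node has the lowest sequence number overall, so neither live node would be first in line, which would match the "no nimbus thinks it's the leader" behaviour reported here.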
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user Parth-Brahmbhatt commented on the pull request: https://github.com/apache/storm/pull/802#issuecomment-148763402

@revans2 The log concerns were about the original PR from @danielschonfeld, which he has since fixed, but I guess he force-pushed the branch. I am +1 on this change too.

@danielschonfeld On a side note, can you provide any steps to reproduce this locally?
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/802#issuecomment-148734244

@danielschonfeld It looks fine to me. Personally I am +1 on it, but I want to hear back from @Parth-Brahmbhatt. He has a concern about unnecessary log statements. I assume that these are coming directly out of curator itself, so I am not really sure how he wants to handle this.
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user danielschonfeld commented on the pull request: https://github.com/apache/storm/pull/802#issuecomment-148731922

@revans2 what can I do to make this PR approval ready?
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/802#issuecomment-148728541

@danielschonfeld the test failure is because we have too many tests that don't use ephemeral ports. In all likelihood the JDK8 test was running on the same box at the same time and got the port before this test could. JDK8 tends to run through tests faster than JDK7 does.
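For readers unfamiliar with the ephemeral-port point above: binding to port 0 asks the operating system to pick a free port, so two test runs on the same box cannot race for one hard-coded number. The following is a generic, minimal sketch of that technique, not code from Storm's test suite; the helper name `bind_ephemeral_port` is made up for illustration.

```python
import socket


def bind_ephemeral_port(host="127.0.0.1"):
    """Bind a listening TCP socket to an OS-assigned port and return both."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 means "let the OS choose a free port"
    sock.listen(1)
    port = sock.getsockname()[1]  # the port the OS actually assigned
    return sock, port


# Two "test servers" started at the same time get distinct ports.
first_sock, first_port = bind_ephemeral_port()
second_sock, second_port = bind_ephemeral_port()
ports_differ = first_port != second_port
first_sock.close()
second_sock.close()
```

A test would then pass the returned port to whatever client it starts, rather than assuming a fixed number.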
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user danielschonfeld commented on the pull request: https://github.com/apache/storm/pull/802#issuecomment-148527966

Any idea why the JDK7 test fails? I'm not entirely clear on why the DRPC server is having problems starting up.
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
Github user Parth-Brahmbhatt commented on the pull request: https://github.com/apache/storm/pull/802#issuecomment-148524703

There are a lot of unnecessary log statements; can you remove them?
[GitHub] storm pull request: [STORM-1115] Stale leader-lock key effectively...
GitHub user danielschonfeld opened a pull request: https://github.com/apache/storm/pull/802

[STORM-1115] Stale leader-lock key effectively bans all nodes from becoming leaders

From the issue:

```
I believe this curator bug is what's in play causing the above described situation.
https://issues.apache.org/jira/browse/CURATOR-202
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/schonfeld/storm update-curator

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/802.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #802

commit 580de48c91ab32a1777f0e375d065d3a01b8e3e9
Author: Michael Schonfeld
Date: 2015-10-06T17:55:17Z
more logging to nimbusinfo

commit ea2c3a81855dded8c48d946a7017c5ee1ded5e5b
Author: Michael Schonfeld
Date: 2015-10-09T19:21:52Z
debug logging

commit 25ad78e3c6c47e3da3c6cab86d155274ce5ad561
Author: Michael Schonfeld
Date: 2015-10-09T19:57:45Z
wrong encasing

commit b7db4434f10b78c6259b82f6b6edaac676900bef
Author: Michael Schonfeld
Date: 2015-10-09T20:36:53Z
wrap in do()

commit 9ca4d5d4821959b7b9df002fbdc589f6232c37da
Author: Michael Schonfeld
Date: 2015-10-09T20:39:49Z
missing )

commit 24c4ce0056d40f82f17c15603e5ab83a3b9596ab
Author: Michael Schonfeld
Date: 2015-10-09T21:15:21Z
more debug

commit 8253f83ece8a63c076a9a5e833e2871728419c83
Author: Michael Schonfeld
Date: 2015-10-09T21:33:23Z
no need for ()

commit 278206baf7567f270b5bcca41dbeec75c765aaf4
Author: Michael Schonfeld
Date: 2015-10-09T21:39:05Z
blha

commit 9f3e8d55b9083d59423ac9153f0aff8c912242bb
Author: Michael Schonfeld
Date: 2015-10-09T23:11:28Z
dont use local var

commit faf9bda82e1c4e1875c87a4920fb2c1c2a33ba56
Author: Daniel Schonfeld
Date: 2015-10-15T21:11:45Z
update curator to version 2.9.0 to combat stale leader issue as described in https://issues.apache.org/jira/browse/CURATOR-202