[
https://issues.apache.org/jira/browse/STORM-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15360833#comment-15360833
]
ASF GitHub Bot commented on STORM-1941:
---------------------------------------
Github user harshach commented on a diff in the pull request:
https://github.com/apache/storm/pull/1535#discussion_r69408836
--- Diff: storm-core/src/jvm/org/apache/storm/cluster/StormClusterStateImpl.java ---
@@ -219,6 +219,8 @@ public void stateChanged(CuratorFramework curatorFramework, ConnectionState conn
     LOG.info("Connection state listener invoked, zookeeper connection state has changed to {}", connectionState);
     if (connectionState.equals(ConnectionState.RECONNECTED)) {
         LOG.info("Connection state has changed to reconnected so setting nimbuses entry one more time");
+        // explicit delete for ephmeral node to ensure this session creates the entry.
+        stateStorage.delete_node(ClusterUtils.nimbusPath(nimbusId));
--- End diff --
Aren't we using an ephemeral node for this? If not, I think we should do that
instead of manually deleting the node.
> Nimbus discovery can fail when zookeeper reconnect happens.
> -----------------------------------------------------------
>
> Key: STORM-1941
> URL: https://issues.apache.org/jira/browse/STORM-1941
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 1.0.0, 1.0.1
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Critical
>
> When a ZooKeeper reconnect happens, the nimbus registry entry can be deleted
> even though nimbus is alive.
> Below is the ZooKeeper node for the nimbus registry, first as created and then
> after the reconnect.
> {code}
> get /storm/nimbuses/<host>:6627
> ?f`d``??????M?-?-.?/??5??/H?+.IL???ON??``b`?|???^^???????
> ?'h?g?g?g?g
> t-?,[??Q
> cZxid = 0x4000005ae
> ctime = Fri Jul 01 11:43:51 UTC 2016
> mZxid = 0x4000005ae
> mtime = Fri Jul 01 11:43:51 UTC 2016
> pZxid = 0x4000005ae
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x255a62e310c0005
> dataLength = 98
> numChildren = 0
> {code}
> {code}
> get /storm/nimbuses/<host>:6627
> ?f`d``??????M?-?-.?/??5??/H?+.IL???ON??``b`?|???^^???????
> ?'h?g?g?g?g
> t-?,[??Q
> cZxid = 0x4000005ae
> ctime = Fri Jul 01 11:43:51 UTC 2016
> mZxid = 0x50000000e
> mtime = Fri Jul 01 11:46:08 UTC 2016
> pZxid = 0x4000005ae
> cversion = 0
> dataVersion = 1
> aclVersion = 0
> ephemeralOwner = 0x255a62e310c0005
> dataLength = 98
> numChildren = 0
> {code}
> Below is the transaction log for that node.
> {code}
> 7/1/16 11:43:51 AM UTC session 0x255a62e310c0005 cxid 0xd zxid 0x4000005ae
> create
> '/storm/nimbuses/<host>:6627,#1fffffff8b80000000ffffffe36660646060ffffff90ffffffcfffffffcaffffffc9ffffffccffffffd54dffffffcc2dffffffd62d2effffffc92fffffffcaffffffd535ffffffd2ffffffcb2f48ffffffcd2b2e494cffffffceffffffceffffffc94f4effffffccffffffe160606260ffffff907cffffffccffffffc1ffffffc01c5e165effffffceffffffc4ffffffc0ffffffc2ffffffc0ffffffcdffffffc0affffffd42768ffffffa867ffffffa067ffffffa867ffffffa467affffffa4d742dffffff8c2c1805b14ffffffc2ffffffaf51000,v{s{31,s{'world,'anyone}}},T,10
> 7/1/16 11:46:08 AM UTC session 0x355a647bd8c0000 cxid 0x3 zxid 0x50000000e
> setData
> '/storm/nimbuses/<host>:6627,#1fffffff8b80000000ffffffe36660646060ffffff90ffffffcfffffffcaffffffc9ffffffccffffffd54dffffffcc2dffffffd62d2effffffc92fffffffcaffffffd535ffffffd2ffffffcb2f48ffffffcd2b2e494cffffffceffffffceffffffc94f4effffffccffffffe160606260ffffff907cffffffccffffffc1ffffffc01c5e165effffffceffffffc4ffffffc0ffffffc2ffffffc0ffffffcdffffffc0affffffd42768ffffffa867ffffffa067ffffffa867ffffffa467affffffa4d742dffffff8c2c1805b14ffffffc2ffffffaf51000,1
> {code}
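> Please take a look at ctime, mtime, and ephemeralOwner.
> The ephemeral owner session was already closed on the nimbus side, but the
> node is not necessarily deleted immediately. As a result, the new session does
> not create a new node; it only sets the value on an ephemeral node that belongs
> to another session, which is already closed. Once ZooKeeper removes that closed
> session's ephemeral nodes, the entry disappears even though nimbus is alive.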
> {code}
> 2016-07-01 11:45:05.675 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Disconnecting client
> for session: 0x255a62e310c0005
> 2016-07-01 11:45:05.675 o.a.s.s.o.a.z.ZooKeeper [INFO] Session:
> 0x255a62e310c0005 closed
> {code}
> We can delete the node first and then create the ephemeral node when the
> reconnect event handler is called.
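
The failure and the proposed fix can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the class EphemeralReconnectSketch and its methods setOrCreate, deleteThenCreate, and cleanUpSession are hypothetical names, not part of ZooKeeper, Curator, or Storm, and the "set if the node exists, otherwise create" logic is assumed from the behaviour described in the issue rather than taken from Storm's code. The session ids come from the transaction log above; the host in the path is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of ephemeral nodes; NOT the ZooKeeper, Curator, or Storm API. */
class EphemeralReconnectSketch {
    /** A node remembers the session that created it, like ephemeralOwner. */
    static final class Node {
        final long ephemeralOwner;
        byte[] data;
        int dataVersion = 0;

        Node(long ephemeralOwner, byte[] data) {
            this.ephemeralOwner = ephemeralOwner;
            this.data = data;
        }
    }

    final Map<String, Node> nodes = new HashMap<>();

    /** Assumed old behaviour: update the node if it exists, otherwise create it. */
    void setOrCreate(long session, String path, byte[] data) {
        Node existing = nodes.get(path);
        if (existing != null) {
            existing.data = data;      // a data update never changes the owner
            existing.dataVersion++;
        } else {
            nodes.put(path, new Node(session, data));
        }
    }

    /** Proposed behaviour: delete first so the current session owns the node. */
    void deleteThenCreate(long session, String path, byte[] data) {
        nodes.remove(path);
        nodes.put(path, new Node(session, data));
    }

    /** The server drops a session's ephemeral nodes once it cleans up that session. */
    void cleanUpSession(long session) {
        nodes.values().removeIf(node -> node.ephemeralOwner == session);
    }

    public static void main(String[] args) {
        final long oldSession = 0x255a62e310c0005L; // ids from the transaction log
        final long newSession = 0x355a647bd8c0000L;
        final String path = "/storm/nimbuses/host:6627"; // placeholder host
        byte[] info = {1};

        EphemeralReconnectSketch buggy = new EphemeralReconnectSketch();
        buggy.setOrCreate(oldSession, path, info); // initial registration
        buggy.setOrCreate(newSession, path, info); // reconnect: only the data is set
        buggy.cleanUpSession(oldSession);          // delayed cleanup of the old session
        System.out.println("set-or-create keeps entry: " + buggy.nodes.containsKey(path));

        EphemeralReconnectSketch fixed = new EphemeralReconnectSketch();
        fixed.setOrCreate(oldSession, path, info);
        fixed.deleteThenCreate(newSession, path, info);
        fixed.cleanUpSession(oldSession);
        System.out.println("delete-then-create keeps entry: " + fixed.nodes.containsKey(path));
    }
}
```

In this model the first scenario loses the entry because the node is still owned by the old session when that session is cleaned up, while the second keeps it because the delete forces the new session to become the owner.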