Could you please confirm which Helix version you are using, Dimuthu? We fixed several potential ZKHelixManager concurrency issues in 0.8.2. Basically, the problem was a race condition in which the disconnect method could get a zkclient that is non-null but already disconnected. In that case, the handler reset never finishes.
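For illustration only, here is a minimal sketch of the kind of guard that avoids this class of hang. It is not the Helix code: GuardedManager, its fields, and the timeout parameter are all hypothetical names made up for this example. The idea is that only the first disconnect call proceeds, and the cleanup waits for the lock for a bounded time instead of blocking forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a manager class; NOT the real ZKHelixManager.
class GuardedManager {
    private final AtomicBoolean disconnecting = new AtomicBoolean(false);
    private final ReentrantLock handlerLock = new ReentrantLock();
    private final AtomicInteger resetCount = new AtomicInteger();

    /** Returns true only if this call performed the cleanup. */
    boolean disconnect(long timeoutMs) {
        // Only the first caller gets past this point; repeated calls return at once.
        if (!disconnecting.compareAndSet(false, true)) {
            return false;
        }
        boolean locked;
        try {
            // Bounded wait, so a stuck lock holder cannot hang the caller forever.
            locked = handlerLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            locked = false;
        }
        if (!locked) {
            disconnecting.set(false); // give up for now, allow a later retry
            return false;
        }
        try {
            resetCount.incrementAndGet(); // placeholder for the real handler reset
            return true;
        } finally {
            handlerLock.unlock();
        }
    }

    int resetCount() {
        return resetCount.get();
    }
}
```

With this shape, a second disconnect(...) call after a successful one returns false without touching the lock, and a call that cannot get the lock within the timeout returns false instead of blocking.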
Please let us know if you are already using 0.8.2 or a later version. That would probably mean we have a new bug to fix.

Cheers,
-Jiajun

> On Apr 1, 2019, at 13:15, kishore g <[email protected]> wrote:
>
> This is a good catch. @Wang Jiajun the stack trace is good enough to fix
> this, right? We just have to look at all the paths by which we can get into
> this method and make sure resetHandler is thread safe and validates the
> state of the zkConnection and handlers.
>
> On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun <[email protected]> wrote:
> Hi Dimuthu,
>
> Did you stop the controller while the connection was flapping or while it
> was normal?
> Could you please list all the steps that you took, in order?
>
> Best Regards,
> Jiajun
>
>
> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <[email protected]> wrote:
>
> > Hi Folks,
> >
> > In the Helix controller, we have seen the log line below. Looking at the
> > code, I understood that it is due to ZKHelixManager failing to connect
> > to ZooKeeper 5 times. So I tried to stop the controller, and the stop
> > logic calls the ZKHelixManager.disconnect() method, which hangs. I
> > got a thread dump and you can see where it is waiting. Can you please
> > advise on a better approach to solve this?
> >
> > I noticed that ZKHelixManager disconnects [1] itself when flapping is
> > detected. Is calling disconnect() twice the reason for that?
> >
> > 2019-03-29 15:19:56,832 [ZkClient-EventThread-14-api.staging.scigap.org:2181]
> > ERROR o.a.h.m.zk.ZKHelixManager - instanceName: helixcontroller is
> > flapping. disconnect it. maxDisconnectThreshold: 5 disconnects in
> > 300000ms.
> >
> > Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
> > nativeId (decimal):25585 - state:BLOCKED
> > stackTrace:
> > java.lang.Thread.State: BLOCKED (on object monitor)
> > at org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
> > - waiting to lock <0x00000006c7e08110> (a org.apache.helix.manager.zk.ZKHelixManager)
> > at org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
> > at org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
> > at org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown Source)
> > at java.lang.Thread.run(Thread.java:748)
> > Locked ownable synchronizers:
> > - None
> >
> > [1]
> > https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
> >
> > Thanks
> > Dimuthu
