Could you please confirm which Helix version you are using, Dimuthu?
We fixed several potential ZKHelixManager concurrency issues in 0.8.2.
Basically, there was a race condition in which the disconnect method could
get a disconnected but non-null zkclient. In that case, resetHandlers() never
finishes.
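
To illustrate the kind of guard that avoids this sort of hang, here is a
minimal sketch. It is not the actual 0.8.2 change and not Helix code:
GuardedDisconnect, its lock, and the timeout parameter are all hypothetical
names. It shows a disconnect that only the first caller performs, and that
waits a bounded time for the handler lock instead of blocking indefinitely
on a monitor.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a manager class; this is NOT Helix code.
class GuardedDisconnect {
    // Explicit lock so callers can give up after a timeout.
    private final ReentrantLock handlerLock = new ReentrantLock();
    // Records whether a disconnect has already been claimed.
    private final AtomicBoolean disconnected = new AtomicBoolean(false);
    // Counts how many times the handlers were actually reset.
    private final AtomicInteger resets = new AtomicInteger();

    /**
     * Returns true if this call performed the disconnect, and false if
     * another call already did, or if the lock could not be acquired
     * within timeoutMs.
     */
    public boolean disconnect(long timeoutMs) {
        // Only the first caller proceeds; repeated calls are no-ops.
        if (!disconnected.compareAndSet(false, true)) {
            return false;
        }
        boolean locked = false;
        try {
            locked = handlerLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (!locked) {
                disconnected.set(false); // allow a later retry
                return false;
            }
            resets.incrementAndGet(); // placeholder for the handler reset
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            disconnected.set(false);
            return false;
        } finally {
            if (locked) {
                handlerLock.unlock();
            }
        }
    }

    public int resetCount() {
        return resets.get();
    }
}
```

With this shape, a second disconnect(...) call returns false right away
instead of queuing behind the first one.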

Please let us know if you are already using 0.8.2 or a later version. If so,
we probably have a new bug to fix.

Cheers,
-Jiajun

> On Apr 1, 2019, at 13:15, kishore g <[email protected]> wrote:
> 
> This is a good catch. @Wang Jiajun <mailto:[email protected]> the stack 
> trace is good enough to fix this, right? We just have to look at all the paths 
> by which we can get into this method and make sure resetHandlers is thread
> safe and validates the state of the zkConnection and handlers.
> 
> On Mon, Apr 1, 2019 at 12:41 PM Wang Jiajun <[email protected] 
> <mailto:[email protected]>> wrote:
> Hi Dimuthu,
> 
> Did you stop the controller while the connection was flapping or while it
> was normal?
> Could you please list, in order, all the steps that you took?
> 
> Best Regards,
> Jiajun
> 
> 
> On Sat, Mar 30, 2019 at 5:54 AM DImuthu Upeksha <[email protected] 
> <mailto:[email protected]>>
> wrote:
> 
> > Hi Folks,
> >
> > In the Helix controller, we have seen the log line below. Looking at the
> > code, I understand that it is caused by ZKHelixManager failing to connect
> > to ZooKeeper 5 times. So I tried to stop the controller, and the stop
> > logic calls the ZKHelixManager.disconnect() method, which hangs. I took a
> > thread dump, and you can see where it is waiting. Can you please advise
> > on a better approach to solve this?
> >
> > I noticed that ZKHelixManager disconnects itself [1] when flapping is
> > detected. Is calling disconnect() twice the reason for the hang?
> >
> > 2019-03-29 15:19:56,832 [ZkClient-EventThread-14-api.staging.scigap.org:2181]
> > ERROR o.a.h.m.zk.ZKHelixManager  - instanceName: helixcontroller is
> > flapping. disconnect it.  maxDisconnectThreshold: 5 disconnects in
> > 300000ms.
> >
> > Thread-5 - priority:5 - threadId:0x00007f5c740023f0 - nativeId:0x63f1 -
> > nativeId (decimal):25585 - state:BLOCKED
> > stackTrace:
> > java.lang.Thread.State: BLOCKED (on object monitor)
> > at org.apache.helix.manager.zk.ZKHelixManager.resetHandlers(ZKHelixManager.java:903)
> > - waiting to lock <0x00000006c7e08110> (a org.apache.helix.manager.zk.ZKHelixManager)
> > at org.apache.helix.manager.zk.ZKHelixManager.disconnect(ZKHelixManager.java:693)
> > at org.apache.airavata.helix.impl.controller.HelixController.disconnect(HelixController.java:103)
> > at org.apache.airavata.helix.impl.controller.HelixController$$Lambda$2/846492085.run(Unknown Source)
> > at java.lang.Thread.run(Thread.java:748)
> > Locked ownable synchronizers:
> > - None
> >
> > [1] https://github.com/apache/helix/blob/helix-0.8.2/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixManager.java#L991
> >
> > Thanks
> > Dimuthu
> >
