We increased the maximum number of connections allowed per client on the ZooKeeper server side, and the problem is gone now.
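
The thread does not name the setting that was changed. In ZooKeeper, the per-client connection limit is controlled by `maxClientCnxns` in the server's `zoo.cfg`, so that is presumably what is meant here. A minimal sketch, with an illustrative value that is an assumption and not taken from this thread:

```properties
# zoo.cfg (ZooKeeper server configuration)
# Maximum number of concurrent connections a single client, identified by
# IP address, may open to this server. Setting it to 0 removes the limit.
# The value below is illustrative only; it is not from the thread.
maxClientCnxns=200
```

Because the limit is applied per client IP address, multiple participants running in the same JVM all count against the same limit.
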
On Tue, May 10, 2016 at 2:50 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
> Hi Kanak, thanks for the reply.
>
> The problem is gone if we set a constraint of 1 on "STATE_TRANSITION"
> for the resource. If we allow multiple state transitions to be
> executed in the resource, then this zklock problem occurs.
>
> btw, we run multiple participants in the same JVM in our test. In
> other words, there are multiple Java threads in the same JVM competing
> for zklock.
>
> We haven't profiled ZKHelixLock._listener.lockAcquired() since we
> bypassed this problem using the constraint. Will revisit it later.
>
> On Mon, May 9, 2016 at 8:28 PM, Kanak Biscuitwala <kana...@hotmail.com> wrote:
>> Hi,
>>
>> ZKHelixLock is a thin wrapper around the ZooKeeper WriteLock recipe (which
>> was last changed over 5 years ago). Though we haven't extensively tested it
>> in production, we haven't seen it fail to return as described.
>>
>> Do you know if ZKHelixLock._listener.lockAcquired() is ever called?
>>
>> Feel free to examine the code here:
>> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/ZKHelixLock.java
>>
>>> From: neutronsh...@gmail.com
>>> Date: Mon, 9 May 2016 14:26:43 -0700
>>> Subject: calling ZKHelixLock from state machine transition
>>> To: dev@helix.apache.org
>>>
>>> Hi Helix team,
>>>
>>> We observed an issue in a state machine transition handler:
>>>
>>> // statemodel.java:
>>>
>>> public void offlineToSlave(Message message, NotificationContext context) {
>>>
>>>   // do work to start a local shard
>>>
>>>   // we want to save the new shard info to resource config
>>>
>>>   ZKHelixLock zklock = new ZKHelixLock(clusterId, resource, zkclient);
>>>   try {
>>>     zklock.lock(); // ==> will be blocked here
>>>
>>>     ZNRecord record = zkclient.readData(scope.getZkPath(), true);
>>>     // update record fields
>>>     zkclient.writeData(scope.getZkPath(), record);
>>>   } finally {
>>>     zklock.unlock();
>>>   }
>>> }
>>>
>>> After several invocations of this method, zklock.lock() doesn't
>>> return (so the lock is not acquired). State machine threads become
>>> blocked.
>>>
>>> At zk path "<cluster>/LOCKS/RESOURCE_resource" I see several znodes
>>> there, representing outstanding lock requests.
>>>
>>> Is there any special care we should take with the zk lock? Thanks.
>>>
>>> -neutronsharc