Hi Helix team, I uploaded a PR to fix this bug: https://github.com/apache/helix/pull/44
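For readers skimming the thread below: the deadlock boils down to a predecessor znode vanishing between listing the lock children and calling exists() to watch it. Here is a minimal, self-contained Java sketch of the retry logic the fix aims for — all names are illustrative stand-ins (a TreeSet for the LOCKS children, a local exists() for zookeeper.exists()), not the actual WriteLock code:

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch only. The point is the control flow: if the
// predecessor vanished between listing the children and registering the
// watch, re-evaluate from scratch instead of falling through to wait()
// on a watch that will never fire.
public class LockRetrySketch {
    // Stands in for the children of the LOCKS znode.
    static final SortedSet<String> children = new TreeSet<>();

    /** Stands in for zookeeper.exists(): true if the znode is still there. */
    static boolean exists(String node) {
        return children.contains(node);
    }

    /**
     * Returns the predecessor znode to watch, or null if myId is now the
     * smallest node and therefore owns the lock.
     */
    static String findNodeToWatch(String myId) {
        while (true) {
            // Like getChildren(): snapshot the nodes smaller than ours.
            SortedSet<String> lessThanMe = new TreeSet<>(children.headSet(myId));
            if (lessThanMe.isEmpty()) {
                return null; // we are the smallest: lock acquired
            }
            String predecessor = lessThanMe.last();
            if (exists(predecessor)) {
                return predecessor; // safe to set a watch and wait
            }
            // Predecessor was deleted concurrently (the WARN at
            // WriteLock:239 in the logs below). Loop and re-evaluate
            // instead of blocking in wait().
        }
    }
}
```

In the buggy version, the exists()-returned-null branch falls through to wait() with no watch registered, which is exactly the permanent block described in the logs.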
Thanks.

On Wed, May 18, 2016 at 11:01 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
> Hi Kanak,
>
> The same problem with the zk Helix lock has reappeared. I found some
> clues about the potential bug. It causes all threads competing for the
> same zk Helix lock to become blocked.
>
> In my test, two Java threads are blocked while trying to grab the zk
> lock (thread 15 and thread 19).
>
> Here are the related logs from before the threads blocked (inlined
> with my comments):
>
> [INFO 2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70]
> rebalancer thread 15 before zklock
> => T15 enters.
>
> [DEBUG 2016-05-18 22:19:54,069 org.apache.helix.lock.zk.WriteLock:193]
> Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
> => T15 creates its znode. T15 is the smallest, so it owns the lock.
>
> [INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70]
> rebalancer thread 19 before zklock
> => T19 enters.
>
> [INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72]
> rebalancer thread 15 start computing for controller host1_admin
> => T15 performs its work.
>
> [DEBUG 2016-05-18 22:19:54,080 org.apache.helix.lock.zk.WriteLock:193]
> Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
> => T19 creates its znode.
>
> [DEBUG 2016-05-18 22:19:54,081 org.apache.helix.lock.zk.WriteLock:233]
> watching less than me node:
> /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
> => T19 found its predecessor to wait for, which is T15.
>
> [WARN 2016-05-18 22:19:54,084 org.apache.helix.lock.zk.WriteLock:239]
> Could not find the stats for less than me:
> /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
> => T19 calls zookeeper.exists() to register a watcher on T15's znode,
> but T15 has called unlock() and deleted that znode at the same moment.
> So T19 continues the while (id == null) loop. Because T19's id is not
> null now, T19's LockZooKeeperOperation.execute() returns false.
T19
> will then block in wait(), hoping somebody else will notify it. But
> since T19 is now the smallest, nobody else can grab the lock and wake
> up T19; T19 blocks, and every subsequent caller also blocks.
>
> The code that leads to the problem is here:
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L238
>
> One possible fix is to just set id to null at line 240 and let the
> while() loop retry:
> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L240
>
> [INFO 2016-05-18 22:19:54,092 com.hcd.hcdadmin.M1Rebalancer:125]
> rebalancer thread 15 released zklock and returned
> => T15 deleted its znode a short while ago and returns from this method.
>
> [INFO 2016-05-18 22:19:54,179 com.hcd.hcdadmin.M1Rebalancer:70]
> rebalancer thread 15 before zklock
> => T15 calls this method again.
>
> [DEBUG 2016-05-18 22:19:54,191 org.apache.helix.lock.zk.WriteLock:193]
> Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911663-0000000080
> => T15 creates a znode.
>
> [DEBUG 2016-05-18 22:19:54,193 org.apache.helix.lock.zk.WriteLock:233]
> watching less than me node:
> /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
> => T15 finds T19 to be the smallest, so it waits for T19. Nobody will
> wake up T19, so T15 is also blocked.
>
> Any comments appreciated. Thanks.
>
> -Neutronsharc
>
> On Sat, May 14, 2016 at 5:20 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
>> We increased the max connections allowed per client on the zk server
>> side. The problem is gone now.
>>
>> On Tue, May 10, 2016 at 2:50 PM, Neutron sharc <neutronsh...@gmail.com>
>> wrote:
>>> Hi Kanak, thanks for the reply.
>>>
>>> The problem goes away if we set a constraint of 1 on "STATE_TRANSITION"
>>> for the resource. If we allow multiple state transitions to be
>>> executed in the resource, then this zklock problem occurs.
>>>
>>> By the way, we run multiple participants in the same JVM in our
>>> test. In other words, multiple Java threads in the same JVM compete
>>> for the zklock.
>>>
>>> We haven't profiled ZKHelixLock._listener.lockAcquired() since we
>>> bypassed this problem using the constraint. We will revisit it later.
>>>
>>> On Mon, May 9, 2016 at 8:28 PM, Kanak Biscuitwala <kana...@hotmail.com>
>>> wrote:
>>>> Hi,
>>>>
>>>> ZKHelixLock is a thin wrapper around the ZooKeeper WriteLock recipe
>>>> (which was last changed over 5 years ago). We haven't extensively
>>>> tested it in production, but we haven't seen it fail to return as
>>>> described.
>>>>
>>>> Do you know if ZKHelixLock._listener.lockAcquired() is ever called?
>>>>
>>>> Feel free to examine the code here:
>>>> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/ZKHelixLock.java
>>>>
>>>>> From: neutronsh...@gmail.com
>>>>> Date: Mon, 9 May 2016 14:26:43 -0700
>>>>> Subject: calling ZKHelixLock from state machine transition
>>>>> To: dev@helix.apache.org
>>>>>
>>>>> Hi Helix team,
>>>>>
>>>>> We observed an issue in a state machine transition handler:
>>>>>
>>>>> // statemodel.java:
>>>>>
>>>>> public void offlineToSlave(Message message, NotificationContext context) {
>>>>>   // Do work to start a local shard.
>>>>>   // We want to save the new shard info to the resource config.
>>>>>   ZKHelixLock zklock = new ZKHelixLock(clusterId, resource, zkclient);
>>>>>   try {
>>>>>     zklock.lock(); // ==> will be blocked here
>>>>>     ZNRecord record = zkclient.readData(scope.getZkPath(), true);
>>>>>     // ... update record fields ...
>>>>>     zkclient.writeData(scope.getZkPath(), record);
>>>>>   } finally {
>>>>>     zklock.unlock();
>>>>>   }
>>>>> }
>>>>>
>>>>> After several invocations of this method, the zklock.lock() call
>>>>> doesn't return (so the lock is not acquired). State machine threads
>>>>> become blocked.
>>>>>
>>>>> At the zk path "<cluster>/LOCKS/RESOURCE_resource" I see several
>>>>> znodes, representing outstanding lock requests.
>>>>>
>>>>> Is there any special care we should take with the zk lock? Thanks.
>>>>>
>>>>> -neutronsharc
>>>>
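The handler quoted above is a lock-guarded read-modify-write: acquire, read the record, update it, write it back, and release in a finally block. As a hedged illustration of that pattern alone, here is a sketch using a java.util.concurrent ReentrantLock and a plain map as stand-ins for ZKHelixLock and the ZooKeeper data tree (none of this is Helix API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Stand-ins: a ReentrantLock instead of ZKHelixLock, and a HashMap instead
// of ZooKeeper. Only the shape of the handler is mirrored: acquire, read,
// modify, write, and always release in finally.
public class GuardedConfigUpdate {
    static final Lock zklock = new ReentrantLock();
    static final Map<String, String> store = new HashMap<>();

    static void updateShardInfo(String zkPath, String field) {
        zklock.lock(); // in the real handler, this is where threads blocked
        try {
            String record = store.getOrDefault(zkPath, ""); // like readData()
            String updated = record + field + ";";          // update fields
            store.put(zkPath, updated);                     // like writeData()
        } finally {
            zklock.unlock(); // release even if the update throws
        }
    }
}
```

The try/finally shape is the important part: if the update throws, the lock is still released, so a failed transition cannot leave the lock znode behind to block every later caller.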