Sorry I missed this. It looked good to me. Will merge it.

On Wed, May 25, 2016 at 7:41 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
> Hi Kishore, Kanak, any updates?
>
> On Thu, May 19, 2016 at 4:13 PM, kishore g <g.kish...@gmail.com> wrote:
> > Thanks Shawn. Will review it tonight. Kanak, it will be great if you can
> > take a look at it as well.
> >
> > On Thu, May 19, 2016 at 3:45 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
> >> Hi Helix team,
> >>
> >> I uploaded a PR to fix this bug: https://github.com/apache/helix/pull/44
> >>
> >> Thanks.
> >>
> >> On Wed, May 18, 2016 at 11:01 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
> >> > Hi Kanak,
> >> >
> >> > The same problem with the zk helix lock has re-appeared. I found some
> >> > clues about the potential bug. It causes all threads competing for the
> >> > same zk helix lock to become blocked.
> >> >
> >> > In my test, two java threads are blocked while trying to grab the zk
> >> > lock (thread 15 and thread 19).
> >> >
> >> > Here are the related logs from before the threads blocked (inlined
> >> > with my comments):
> >> >
> >> > [INFO 2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70]
> >> > rebalancer thread 15 before zklock
> >> > => T15 enters
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,069 org.apache.helix.lock.zk.WriteLock:193]
> >> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
> >> > => T15 creates its znode; T15 is the smallest, so it owns the lock
> >> >
> >> > [INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70]
> >> > rebalancer thread 19 before zklock
> >> > => T19 enters
> >> >
> >> > [INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72]
> >> > rebalancer thread 15 start computing for controller host1_admin
> >> > => T15 performs its work
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,080 org.apache.helix.lock.zk.WriteLock:193]
> >> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
> >> > => T19 creates its znode
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,081 org.apache.helix.lock.zk.WriteLock:233]
> >> > watching less than me node:
> >> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
> >> > => T19 found its predecessor to wait for, which is T15
> >> >
> >> > [WARN 2016-05-18 22:19:54,084 org.apache.helix.lock.zk.WriteLock:239]
> >> > Could not find the stats for less than me:
> >> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
> >> > => T19 calls zookeeper.exists() to register a watcher on T15's znode,
> >> > but T15 has called unlock() and deleted that znode at the same moment.
> >> > So T19 goes back to the while (id == null) loop. Because T19's id is
> >> > no longer null, T19's LockZooKeeperOperation.execute() returns false.
> >> > T19 then blocks in wait(), hoping somebody else will notify it. But
> >> > since T19 is now the smallest node, nobody else can grab the lock and
> >> > wake up T19; T19 blocks, and every subsequent caller also blocks.
> >> >
> >> > The code that leads to the problem is here:
> >> >
> >> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L238
> >> >
> >> > One possible fix is to just set id to null at line 240 and let the
> >> > while() loop retry.
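The race and the suggested retry fix can be sketched as a pure control-flow simulation. This is not the Helix code: a sorted set of sequence numbers stands in for the sequential lock znodes, and all names (`step`, `Outcome`, `WriteLockRetrySketch`) are illustrative. It shows that when the predecessor's stat cannot be found, re-reading the children (the effect of setting `id` back to null) lets the caller acquire the lock instead of waiting forever.

```java
import java.util.HashSet;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory simulation of one acquire attempt in a WriteLock-style
// recipe; a sorted set of sequence numbers stands in for the sequential znodes.
public class WriteLockRetrySketch {

    public enum Outcome { ACQUIRED, WAIT, RETRY }

    // snapshot: the (possibly stale) result of getChildren();
    // live: what exists() would actually see on the server right now.
    public static Outcome step(Set<Long> live, NavigableSet<Long> snapshot, long myId) {
        Long predecessor = snapshot.lower(myId);
        if (predecessor == null) {
            return Outcome.ACQUIRED;      // smallest node: we own the lock
        }
        if (live.contains(predecessor)) {
            return Outcome.WAIT;          // exists() set a watch, so wait() is safe
        }
        // Predecessor was deleted between getChildren() and exists().
        // The buggy path effectively waits here with no watch registered;
        // the fix retries with a fresh getChildren() instead.
        return Outcome.RETRY;
    }

    public static void main(String[] args) {
        // T19's stale child list still contains T15's node (...-0000000078),
        // but T15 has already unlocked, so only T19's node (...-0000000079) is live.
        NavigableSet<Long> stale = new TreeSet<>();
        stale.add(78L);
        stale.add(79L);
        Set<Long> live = new HashSet<>();
        live.add(79L);
        System.out.println(step(live, stale, 79L));               // RETRY
        System.out.println(step(live, new TreeSet<>(live), 79L)); // ACQUIRED
    }
}
```

With the retry, the second attempt sees a fresh child list in which T19 is the smallest node, so it acquires the lock rather than blocking in wait().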
> >> >
> >> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L240
> >> >
> >> > [INFO 2016-05-18 22:19:54,092 com.hcd.hcdadmin.M1Rebalancer:125]
> >> > rebalancer thread 15 released zklock and returned
> >> > => T15 deleted its znode a short while ago and returns from this method
> >> >
> >> > [INFO 2016-05-18 22:19:54,179 com.hcd.hcdadmin.M1Rebalancer:70]
> >> > rebalancer thread 15 before zklock
> >> > => T15 calls this method again
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,191 org.apache.helix.lock.zk.WriteLock:193]
> >> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911663-0000000080
> >> > => T15 creates a new znode
> >> >
> >> > [DEBUG 2016-05-18 22:19:54,193 org.apache.helix.lock.zk.WriteLock:233]
> >> > watching less than me node:
> >> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
> >> > => T15 found T19 to be the smallest, so it waits for T19. Nobody will
> >> > wake up T19, so T15 is also blocked.
> >> >
> >> > Any comments appreciated. Thanks.
> >> >
> >> > -Neutronsharc
> >> >
> >> > On Sat, May 14, 2016 at 5:20 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
> >> >> We increased the max connections allowed per client on the zk server
> >> >> side. The problem is gone now.
> >> >>
> >> >> On Tue, May 10, 2016 at 2:50 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
> >> >>> Hi Kanak, thanks for the reply.
> >> >>>
> >> >>> The problem is gone if we set a constraint of 1 on "STATE_TRANSITION"
> >> >>> for the resource. If we allow multiple state transitions to be
> >> >>> executed on the resource, then this zklock problem occurs.
> >> >>>
> >> >>> btw, we run multiple participants in the same jvm in our test. In
> >> >>> other words, there are multiple java threads in the same jvm competing
> >> >>> for the zklock.
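The hang described above is a classic lost wakeup: the thread enters wait() expecting a notify() that has already fired (or, in this case, was never armed, because the watch registration failed). Independent of the Helix fix, the standard defense is to wait on a predicate in a loop, re-checking shared state before and after every wait(). A minimal sketch of that pattern; the names (`Gate`, `open`) are illustrative, not Helix code:

```java
// Guarded-wait sketch: the waiter re-checks a predicate around wait(), so a
// notify() that fires before the waiter reaches wait() is not lost. The buggy
// WriteLock path has no such predicate to re-check: it waits on a watch
// notification that was never registered, so nothing can ever wake it.
public class Gate {
    private boolean open = false;

    public synchronized void open() {
        open = true;
        notifyAll();                 // wake any thread already waiting
    }

    public synchronized void await() throws InterruptedException {
        while (!open) {              // predicate re-checked on every wakeup
            wait();                  // releases the monitor while blocked
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Gate gate = new Gate();
        gate.open();                 // the "notify" fires before anyone waits...
        gate.await();                // ...yet await() returns immediately
        System.out.println("not blocked");
    }
}
```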
> >> >>>
> >> >>> We haven't profiled ZKHelixLock._listener.lockAcquired(), since we
> >> >>> bypassed this problem using the constraint. Will revisit it later.
> >> >>>
> >> >>> On Mon, May 9, 2016 at 8:28 PM, Kanak Biscuitwala <kana...@hotmail.com> wrote:
> >> >>>> Hi,
> >> >>>>
> >> >>>> ZkHelixLock is a thin wrapper around the ZooKeeper WriteLock recipe
> >> >>>> (which was last changed over 5 years ago). Though we haven't
> >> >>>> extensively tested it in production, we haven't seen it fail to
> >> >>>> return as described.
> >> >>>>
> >> >>>> Do you know if ZKHelixLock._listener.lockAcquired() is ever called?
> >> >>>>
> >> >>>> Feel free to examine the code here:
> >> >>>> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/ZKHelixLock.java
> >> >>>>
> >> >>>>> From: neutronsh...@gmail.com
> >> >>>>> Date: Mon, 9 May 2016 14:26:43 -0700
> >> >>>>> Subject: calling ZKHelixLock from state machine transition
> >> >>>>> To: dev@helix.apache.org
> >> >>>>>
> >> >>>>> Hi Helix team,
> >> >>>>>
> >> >>>>> We observed an issue in a state machine transition handler:
> >> >>>>>
> >> >>>>> // statemodel.java:
> >> >>>>>
> >> >>>>> public void offlineToSlave(Message message, NotificationContext context) {
> >> >>>>>
> >> >>>>>   // do work to start a local shard
> >> >>>>>
> >> >>>>>   // we want to save the new shard info to the resource config
> >> >>>>>
> >> >>>>>   ZKHelixLock zklock = new ZKHelixLock(clusterId, resource, zkclient);
> >> >>>>>   try {
> >> >>>>>     zklock.lock();  // ==> will be blocked here
> >> >>>>>
> >> >>>>>     ZNRecord record = zkclient.readData(scope.getZkPath(), true);
> >> >>>>>     // update record fields
> >> >>>>>     zkclient.writeData(scope.getZkPath(), record);
> >> >>>>>   } finally {
> >> >>>>>     zklock.unlock();
> >> >>>>>   }
> >> >>>>> }
> >> >>>>>
> >> >>>>> After several invocations of this method, the zklock.lock() call
> >> >>>>> doesn't return (so the lock is not acquired). State machine threads
> >> >>>>> become blocked.
> >> >>>>>
> >> >>>>> At the zk path "<cluster>/LOCKS/RESOURCE_resource" I see several
> >> >>>>> znodes, representing outstanding lock requests.
> >> >>>>>
> >> >>>>> Is there any special care we should take with the zk lock? Thanks.
> >> >>>>>
> >> >>>>> -neutronsharc
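As a general defensive pattern for a handler like the one quoted above, a bounded wait turns a silent hang into a detectable, loggable failure. Nothing in this thread shows a timeout variant of ZKHelixLock.lock(), so this sketch uses java.util.concurrent's ReentrantLock purely as a local stand-in; the class and method names are illustrative:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Defensive read-modify-write sketch: bound the wait for the lock so a lost
// wakeup surfaces as a failed acquire instead of a permanently blocked state
// transition thread. ReentrantLock stands in for the distributed lock here.
public class BoundedLockSketch {
    private final ReentrantLock lock = new ReentrantLock();

    public boolean updateWithTimeout(Runnable criticalSection) throws InterruptedException {
        if (!lock.tryLock(5, TimeUnit.SECONDS)) {
            return false;                 // could not acquire: log and retry later
        }
        try {
            criticalSection.run();        // e.g. readData / modify / writeData
            return true;
        } finally {
            lock.unlock();                // always release, as in the try/finally above
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedLockSketch s = new BoundedLockSketch();
        boolean ok = s.updateWithTimeout(() -> System.out.println("critical section ran"));
        System.out.println(ok);           // true: an uncontended acquire succeeds
    }
}
```

The transition handler can then surface the failed acquire to the caller (or throw) rather than leaving the state machine thread blocked indefinitely.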