This feature is available on 0.7.x only.

On Wed, May 25, 2016 at 9:20 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
Thanks Kishore. I tested this patch on tag 0.7.1 because we are using 0.7.1. The code is the same in the master branch, though.

Where do you want this patch to land?

On Wed, May 25, 2016 at 8:35 PM, kishore g <g.kish...@gmail.com> wrote:

Sorry I missed this. It looked good to me. Will merge it.

On Wed, May 25, 2016 at 7:41 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Kishore, Kanak, any updates?

On Thu, May 19, 2016 at 4:13 PM, kishore g <g.kish...@gmail.com> wrote:

Thanks Shawn. Will review it tonight. Kanak, it would be great if you can take a look at it as well.

On Thu, May 19, 2016 at 3:45 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Helix team,

I uploaded a PR to fix this bug: https://github.com/apache/helix/pull/44

Thanks.

On Wed, May 18, 2016 at 11:01 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Kanak,

The same problem with the zk helix lock has re-appeared. I found some clues about a potential bug. This potential bug causes all threads competing for the same zk helix lock to become blocked.

In my test there are two Java threads blocked while trying to grab the zk lock (thread 15 and thread 19).

Here are the related logs captured before the threads blocked (inlined with my comments):

[INFO 2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70] rebalancer thread 15 before zklock
=> T15 enters.

[DEBUG 2016-05-18 22:19:54,069 org.apache.helix.lock.zk.WriteLock:193] Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
=> T15 creates its znode; T15's znode is the smallest, so T15 owns the lock.

[INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70] rebalancer thread 19 before zklock
=> T19 enters.

[INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72] rebalancer thread 15 start computing for controller host1_admin
=> T15 performs its work.

[DEBUG 2016-05-18 22:19:54,080 org.apache.helix.lock.zk.WriteLock:193] Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
=> T19 creates its znode.

[DEBUG 2016-05-18 22:19:54,081 org.apache.helix.lock.zk.WriteLock:233] watching less than me node: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
=> T19 finds the predecessor it must wait for, which is T15's znode.

[WARN 2016-05-18 22:19:54,084 org.apache.helix.lock.zk.WriteLock:239] Could not find the stats for less than me: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
=> T19 calls zookeeper.exists() to register a watcher on T15's znode, but T15 has called unlock() and deleted that znode at the same moment, so no watch is set. T19 falls back to the while(id == null) check, and because T19's id is no longer null, T19's LockZooKeeperOperation.execute() returns false. T19 then blocks in wait(), hoping somebody else will notify it. But T19's znode is now the smallest, so nobody else can grab the lock and wake T19 up; T19 blocks, and every subsequent caller also blocks.
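[Editor's note] To make the race concrete, below is a minimal, self-contained sketch of that check-and-watch step. It is not the actual org.apache.helix.lock.zk.WriteLock code (that is linked just below); the class, fields, and helpers (WriteLockSketch, findOurExistingChild, sessionPrefix, sequenceOf) are illustrative approximations. The commented-out "id = null" line corresponds to the retry fix proposed below.

import java.util.logging.Logger;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustration only: a stripped-down analogue of the predecessor-watch step of a
// WriteLock-style lock recipe. NOT the org.apache.helix.lock.zk.WriteLock source.
class WriteLockSketch {
  private static final Logger LOG = Logger.getLogger(WriteLockSketch.class.getName());

  private final ZooKeeper zookeeper;
  private final String dir;        // e.g. "/shawn1/LOCKS/RESOURCE_Pool0"
  private final Watcher watcher;   // wakes the waiting thread when the predecessor znode goes away
  private String id;               // full path of our znode, null until created
  private String ownerId;          // set once we hold the lock

  WriteLockSketch(ZooKeeper zookeeper, String dir, Watcher watcher) {
    this.zookeeper = zookeeper;
    this.dir = dir;
    this.watcher = watcher;
  }

  // Returns true if the lock was acquired; false means "park in wait() until notified".
  boolean tryAcquire() throws KeeperException, InterruptedException {
    do {
      if (id == null) {
        // Re-find a znode this session already created before creating a new one
        // (the real recipe does this in findPrefixInChildren). This is what makes
        // "retry with id = null" safe: no duplicate znode gets created.
        id = findOurExistingChild();
        if (id == null) {
          id = zookeeper.create(dir + "/" + sessionPrefix(), new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        }
      }
      // Find the predecessor: the child with the largest sequence number below ours.
      long ourSeq = sequenceOf(id);
      String predecessor = null;
      long predecessorSeq = -1;
      for (String child : zookeeper.getChildren(dir, false)) {
        long seq = sequenceOf(child);
        if (seq < ourSeq && seq > predecessorSeq) {
          predecessorSeq = seq;
          predecessor = child;
        }
      }
      if (predecessor == null) {
        ownerId = id;              // our znode has the smallest sequence: lock acquired
        return true;
      }
      // Race window: the predecessor (T15) may delete its znode between the
      // getChildren() above and this exists() call. exists() then returns null and
      // registers no watch, so nothing will ever notify this thread (T19).
      Stat stat = zookeeper.exists(dir + "/" + predecessor, watcher);
      if (stat != null) {
        return false;              // watch registered; the caller will wait() for it to fire
      }
      LOG.warning("Could not find the stats for less than me: " + predecessor);
      // Without a fix, id is non-null here, the do/while exits, execute() returns
      // false, and the caller parks in wait() with nobody left to notify it.
      // The fix proposed below: clear id so the loop re-evaluates ownership.
      // id = null;
    } while (id == null);
    return false;
  }

  private String sessionPrefix() {
    return "x-" + zookeeper.getSessionId() + "-";
  }

  private String findOurExistingChild() throws KeeperException, InterruptedException {
    for (String child : zookeeper.getChildren(dir, false)) {
      if (child.startsWith(sessionPrefix())) {
        return dir + "/" + child;
      }
    }
    return null;
  }

  private static long sequenceOf(String path) {
    return Long.parseLong(path.substring(path.lastIndexOf('-') + 1));
  }
}

In the real recipe the retry can be expressed simply as "id = null" at the point of the warning, because the id == null branch re-discovers the session's existing znode before creating a new one.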
The code that leads to the problem is here:
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L238

One possible fix is to just set id to null at line 240 and let the while() loop retry:
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L240

[INFO 2016-05-18 22:19:54,092 com.hcd.hcdadmin.M1Rebalancer:125] rebalancer thread 15 released zklock and returned
=> T15 deleted its znode a short while ago and has returned from this method.

[INFO 2016-05-18 22:19:54,179 com.hcd.hcdadmin.M1Rebalancer:70] rebalancer thread 15 before zklock
=> T15 calls this method again.

[DEBUG 2016-05-18 22:19:54,191 org.apache.helix.lock.zk.WriteLock:193] Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911663-0000000080
=> T15 creates a new znode.

[DEBUG 2016-05-18 22:19:54,193 org.apache.helix.lock.zk.WriteLock:233] watching less than me node: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
=> T15 finds T19's znode to be its predecessor, so it waits for T19. Nobody will ever wake up T19, so T15 is also blocked.

Any comments appreciated. Thanks.

-Neutronsharc

On Sat, May 14, 2016 at 5:20 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

We increased the max connections allowed per client on the zk server side. The problem is gone now.

On Tue, May 10, 2016 at 2:50 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Kanak, thanks for the reply.

The problem is gone if we set a constraint of 1 on "STATE_TRANSITION" for the resource. If we allow multiple state transitions to be executed on the resource, then this zklock problem occurs.

By the way, we run multiple participants in the same JVM in our test. In other words, there are multiple Java threads in the same JVM competing for the zklock.

We haven't profiled ZKHelixLock._listener.lockAcquired() since we bypassed this problem using the constraint. Will revisit it later.

On Mon, May 9, 2016 at 8:28 PM, Kanak Biscuitwala <kana...@hotmail.com> wrote:

Hi,

ZkHelixLock is a thin wrapper around the ZooKeeper WriteLock recipe (which was last changed over 5 years ago). Though we haven't extensively tested it in production, we haven't seen it fail to return as described.

Do you know if ZKHelixLock._listener.lockAcquired() is ever called?
Feel free to examine the code here:
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/ZKHelixLock.java

From: neutronsh...@gmail.com
Date: Mon, 9 May 2016 14:26:43 -0700
Subject: calling ZKHelixLock from state machine transition
To: dev@helix.apache.org

Hi Helix team,

We observed an issue in a state machine transition handler:

// statemodel.java:

public void offlineToSlave(Message message, NotificationContext context) {
  // do work to start a local shard

  // we want to save the new shard info to the resource config
  ZKHelixLock zklock = new ZKHelixLock(clusterId, resource, zkclient);
  try {
    zklock.lock();  // ==> will be blocked here

    ZNRecord record = zkclient.readData(scope.getZkPath(), true);
    // ... update record fields ...
    zkclient.writeData(scope.getZkPath(), record);
  } finally {
    zklock.unlock();
  }
}

After several invocations of this method, zklock.lock() does not return (so the lock is not acquired), and the state machine threads become blocked.

At the zk path "<cluster>/LOCKS/RESOURCE_resource" I see several znodes, representing outstanding lock requests.

Is there any special care we should take with the zk lock? Thanks.

-neutronsharc
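[Editor's note] For reference, the workaround mentioned earlier in the thread (allowing only one concurrent state transition for the resource) can be expressed as a Helix message constraint. Below is a minimal sketch, assuming the ConstraintItemBuilder / HelixAdmin.setConstraint API and illustrative ZooKeeper address, cluster, resource, and constraint names:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintAttribute;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.builder.ConstraintItemBuilder;

public class ThrottleStateTransitions {
  public static void main(String[] args) {
    // Illustrative names only; substitute your own zk address, cluster, and resource.
    String zkAddr = "localhost:2181";
    String clusterName = "shawn1";
    String resourceName = "Pool0";

    HelixAdmin admin = new ZKHelixAdmin(zkAddr);

    // Allow at most one STATE_TRANSITION message in flight for this resource,
    // which is the workaround described earlier in the thread.
    ConstraintItemBuilder builder = new ConstraintItemBuilder();
    builder.addConstraintAttribute(ConstraintAttribute.MESSAGE_TYPE.toString(), "STATE_TRANSITION")
        .addConstraintAttribute(ConstraintAttribute.RESOURCE.toString(), resourceName)
        .addConstraintAttribute(ConstraintAttribute.CONSTRAINT_VALUE.toString(), "1");

    admin.setConstraint(clusterName, ConstraintType.MESSAGE_CONSTRAINT,
        "throttle_" + resourceName, builder.build());
  }
}

With such a constraint in place the controller dispatches at most one state-transition message at a time for the resource, so only one transition handler contends for the zk lock, which matches the observation in the thread that the hang no longer appears.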