This feature is available on 0.7.x only.

On Wed, May 25, 2016 at 9:20 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
Thanks Kishore. I tested this patch on tag 0.7.1 because we are using 0.7.1. The code is the same in the master branch, though.

Where do you want this patch to land?

On Wed, May 25, 2016 at 8:35 PM, kishore g <g.kish...@gmail.com> wrote:

Sorry I missed this. It looked good to me. Will merge it.

On Wed, May 25, 2016 at 7:41 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Kishore, Kanak, any updates?

On Thu, May 19, 2016 at 4:13 PM, kishore g <g.kish...@gmail.com> wrote:

Thanks Shawn. Will review it tonight. Kanak, it would be great if you can take a look at it as well.

On Thu, May 19, 2016 at 3:45 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Helix team,

I uploaded a PR to fix this bug: https://github.com/apache/helix/pull/44

Thanks.

On Wed, May 18, 2016 at 11:01 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Kanak,

The same problem with the zk helix lock has re-appeared. I found some clues about a potential bug. This potential bug causes all threads competing for the same zk helix lock to become blocked.

In my test there are two Java threads blocked while trying to grab the zk lock (thread 15 and thread 19).

Here are the related logs captured before the threads blocked (inlined with my comments):

[INFO 2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70] rebalancer thread 15 before zklock
=> T15 enters.

[DEBUG 2016-05-18 22:19:54,069 org.apache.helix.lock.zk.WriteLock:193] Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
=> T15 creates its znode; T15's znode is the smallest, so T15 owns the lock.

[INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70] rebalancer thread 19 before zklock
=> T19 enters.

[INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72] rebalancer thread 15 start computing for controller host1_admin
=> T15 performs its work.

[DEBUG 2016-05-18 22:19:54,080 org.apache.helix.lock.zk.WriteLock:193] Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
=> T19 creates its znode.

[DEBUG 2016-05-18 22:19:54,081 org.apache.helix.lock.zk.WriteLock:233] watching less than me node: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
=> T19 finds the predecessor it must wait for, which is T15's znode.

[WARN 2016-05-18 22:19:54,084 org.apache.helix.lock.zk.WriteLock:239] Could not find the stats for less than me: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
=> T19 calls zookeeper.exists() to register a watcher on T15's znode, but T15 has called unlock() and deleted that znode at the same moment, so no watch is set. T19 falls back to the while(id == null) check, and because T19's id is no longer null, T19's LockZooKeeperOperation.execute() returns false. T19 then blocks in wait(), hoping somebody else will notify it. But T19's znode is now the smallest, so nobody else can grab the lock and wake T19 up; T19 blocks, and every subsequent caller also blocks.
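[Editor's note] To make the race concrete, below is a minimal, self-contained sketch of that check-and-watch step. It is not the actual org.apache.helix.lock.zk.WriteLock code (that is linked just below); the class, fields, and helpers (WriteLockSketch, findOurExistingChild, sessionPrefix, sequenceOf) are illustrative approximations. The commented-out "id = null" line corresponds to the retry fix proposed below.

import java.util.logging.Logger;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustration only: a stripped-down analogue of the predecessor-watch step of a
// WriteLock-style lock recipe. NOT the org.apache.helix.lock.zk.WriteLock source.
class WriteLockSketch {
  private static final Logger LOG = Logger.getLogger(WriteLockSketch.class.getName());

  private final ZooKeeper zookeeper;
  private final String dir;        // e.g. "/shawn1/LOCKS/RESOURCE_Pool0"
  private final Watcher watcher;   // wakes the waiting thread when the predecessor znode goes away
  private String id;               // full path of our znode, null until created
  private String ownerId;          // set once we hold the lock

  WriteLockSketch(ZooKeeper zookeeper, String dir, Watcher watcher) {
    this.zookeeper = zookeeper;
    this.dir = dir;
    this.watcher = watcher;
  }

  // Returns true if the lock was acquired; false means "park in wait() until notified".
  boolean tryAcquire() throws KeeperException, InterruptedException {
    do {
      if (id == null) {
        // Re-find a znode this session already created before creating a new one
        // (the real recipe does this in findPrefixInChildren). This is what makes
        // "retry with id = null" safe: no duplicate znode gets created.
        id = findOurExistingChild();
        if (id == null) {
          id = zookeeper.create(dir + "/" + sessionPrefix(), new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        }
      }
      // Find the predecessor: the child with the largest sequence number below ours.
      long ourSeq = sequenceOf(id);
      String predecessor = null;
      long predecessorSeq = -1;
      for (String child : zookeeper.getChildren(dir, false)) {
        long seq = sequenceOf(child);
        if (seq < ourSeq && seq > predecessorSeq) {
          predecessorSeq = seq;
          predecessor = child;
        }
      }
      if (predecessor == null) {
        ownerId = id;              // our znode has the smallest sequence: lock acquired
        return true;
      }
      // Race window: the predecessor (T15) may delete its znode between the
      // getChildren() above and this exists() call. exists() then returns null and
      // registers no watch, so nothing will ever notify this thread (T19).
      Stat stat = zookeeper.exists(dir + "/" + predecessor, watcher);
      if (stat != null) {
        return false;              // watch registered; the caller will wait() for it to fire
      }
      LOG.warning("Could not find the stats for less than me: " + predecessor);
      // Without a fix, id is non-null here, the do/while exits, execute() returns
      // false, and the caller parks in wait() with nobody left to notify it.
      // The fix proposed below: clear id so the loop re-evaluates ownership.
      // id = null;
    } while (id == null);
    return false;
  }

  private String sessionPrefix() {
    return "x-" + zookeeper.getSessionId() + "-";
  }

  private String findOurExistingChild() throws KeeperException, InterruptedException {
    for (String child : zookeeper.getChildren(dir, false)) {
      if (child.startsWith(sessionPrefix())) {
        return dir + "/" + child;
      }
    }
    return null;
  }

  private static long sequenceOf(String path) {
    return Long.parseLong(path.substring(path.lastIndexOf('-') + 1));
  }
}

In the real recipe the retry can be expressed simply as "id = null" at the point of the warning, because the id == null branch re-discovers the session's existing znode before creating a new one.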
The code that leads to the problem is here:
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L238

One possible fix is to just set id to null at line 240 and let the while() loop retry:
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L240

[INFO 2016-05-18 22:19:54,092 com.hcd.hcdadmin.M1Rebalancer:125] rebalancer thread 15 released zklock and returned
=> T15 deleted its znode a short while ago and has returned from this method.

[INFO 2016-05-18 22:19:54,179 com.hcd.hcdadmin.M1Rebalancer:70] rebalancer thread 15 before zklock
=> T15 calls this method again.

[DEBUG 2016-05-18 22:19:54,191 org.apache.helix.lock.zk.WriteLock:193] Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911663-0000000080
=> T15 creates a new znode.

[DEBUG 2016-05-18 22:19:54,193 org.apache.helix.lock.zk.WriteLock:233] watching less than me node: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
=> T15 finds T19's znode to be its predecessor, so it waits for T19. Nobody will ever wake up T19, so T15 is also blocked.

Any comments appreciated. Thanks.

-Neutronsharc

On Sat, May 14, 2016 at 5:20 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

We increased the max connections allowed per client on the zk server side. The problem is gone now.

On Tue, May 10, 2016 at 2:50 PM, Neutron sharc <neutronsh...@gmail.com> wrote:

Hi Kanak, thanks for the reply.

The problem is gone if we set a constraint of 1 on "STATE_TRANSITION" for the resource. If we allow multiple state transitions to be executed on the resource, then this zklock problem occurs.

By the way, we run multiple participants in the same JVM in our test. In other words, there are multiple Java threads in the same JVM competing for the zklock.

We haven't profiled ZKHelixLock._listener.lockAcquired() since we bypassed this problem using the constraint. Will revisit it later.

On Mon, May 9, 2016 at 8:28 PM, Kanak Biscuitwala <kana...@hotmail.com> wrote:

Hi,

ZkHelixLock is a thin wrapper around the ZooKeeper WriteLock recipe (which was last changed over 5 years ago). Though we haven't extensively tested it in production, we haven't seen it fail to return as described.

Do you know if ZKHelixLock._listener.lockAcquired() is ever called?
Feel free to examine the code here:
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/ZKHelixLock.java

From: neutronsh...@gmail.com
Date: Mon, 9 May 2016 14:26:43 -0700
Subject: calling ZKHelixLock from state machine transition
To: dev@helix.apache.org

Hi Helix team,

We observed an issue in a state machine transition handler:

// statemodel.java:

public void offlineToSlave(Message message, NotificationContext context) {
  // do work to start a local shard

  // we want to save the new shard info to the resource config
  ZKHelixLock zklock = new ZKHelixLock(clusterId, resource, zkclient);
  try {
    zklock.lock();  // ==> will be blocked here

    ZNRecord record = zkclient.readData(scope.getZkPath(), true);
    // ... update record fields ...
    zkclient.writeData(scope.getZkPath(), record);
  } finally {
    zklock.unlock();
  }
}

After several invocations of this method, zklock.lock() does not return (so the lock is not acquired), and the state machine threads become blocked.

At the zk path "<cluster>/LOCKS/RESOURCE_resource" I see several znodes, representing outstanding lock requests.

Is there any special care we should take with the zk lock? Thanks.

-neutronsharc
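[Editor's note] For reference, the workaround mentioned earlier in the thread (allowing only one concurrent state transition for the resource) can be expressed as a Helix message constraint. Below is a minimal sketch, assuming the ConstraintItemBuilder / HelixAdmin.setConstraint API and illustrative ZooKeeper address, cluster, resource, and constraint names:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintAttribute;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.builder.ConstraintItemBuilder;

public class ThrottleStateTransitions {
  public static void main(String[] args) {
    // Illustrative names only; substitute your own zk address, cluster, and resource.
    String zkAddr = "localhost:2181";
    String clusterName = "shawn1";
    String resourceName = "Pool0";

    HelixAdmin admin = new ZKHelixAdmin(zkAddr);

    // Allow at most one STATE_TRANSITION message in flight for this resource,
    // which is the workaround described earlier in the thread.
    ConstraintItemBuilder builder = new ConstraintItemBuilder();
    builder.addConstraintAttribute(ConstraintAttribute.MESSAGE_TYPE.toString(), "STATE_TRANSITION")
        .addConstraintAttribute(ConstraintAttribute.RESOURCE.toString(), resourceName)
        .addConstraintAttribute(ConstraintAttribute.CONSTRAINT_VALUE.toString(), "1");

    admin.setConstraint(clusterName, ConstraintType.MESSAGE_CONSTRAINT,
        "throttle_" + resourceName, builder.build());
  }
}

With such a constraint in place the controller dispatches at most one state-transition message at a time for the resource, so only one transition handler contends for the zk lock, which matches the observation in the thread that the hang no longer appears.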