David Mollitor created HIVE-21469:
-------------------------------------
Summary: Review of ZooKeeperHiveLockManager
Key: HIVE-21469
URL: https://issues.apache.org/jira/browse/HIVE-21469
Project: Hive
Issue Type: Improvement
Components: Locking
Affects Versions: 4.0.0, 3.2.0
Reporter: David Mollitor
Assignee: David Mollitor
Attachments: HIVE-21469.1.patch
A lot of sins in this class to resolve:
{code:java}
@Override
public void setContext(HiveLockManagerCtx ctx) throws LockException {
try {
curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
try{
curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" +
parent, new byte[0]);
} catch (Exception e) {
// ignore if the parent already exists
if (!(e instanceof KeeperException) || ((KeeperException)e).code() !=
KeeperException.Code.NODEEXISTS) {
LOG.warn("Unexpected ZK exception when creating parent node /" +
parent, e);
}
}
{code}
Every time a new session is created and this {{setContext}} method is called,
it attempts to create the root node. I have seen that, even though the root
node exists, an create node action is written to the ZK logs. Check first if
the node exists before trying to create it.
{code:java}
try {
curatorFramework.delete().forPath(zLock.getPath());
} catch (InterruptedException ie) {
curatorFramework.delete().forPath(zLock.getPath());
}
{code}
There has historically been a quite a few bugs regarding leaked locks. The
Driver will signal the session {{Thread}} by performing an interrupt. That
interrupt can happen any time and it can kill a create/delete action within the
ZK framework. We can see one example of workaround for this. If the ZK action
is interrupted, simply do it again. Well, what if it's interrupted yet again?
The lock will be leaked anyway. Also, when the {{InterruptedException}} is
caught in the try block, the thread's interrupted flag is cleared. The flag is
not reset in this code and therefore we lose the fact that this thread has been
interrupted.
{code:java}
if (tryNum > 1) {
Thread.sleep(sleepTime);
}
unlockPrimitive(hiveLock, parent, curatorFramework);
break;
} catch (Exception e) {
if (tryNum >= numRetriesForUnLock) {
String name = ((ZooKeeperHiveLock)hiveLock).getPath();
throw new LockException("Node " + name + " can not be deleted after "
+ numRetriesForUnLock + " attempts.",
e);
}
}
{code}
... related... the sleep here may be interrupted, but we still need to delete
the lock (again, for fear of leaking it). This sleep should be
uninterruptible. If we need to get the lock deleted, and there's a problem,
interrupting the sleep will cause the code to eventually exit and locks will be
leaked.
It also requires a bunch more TLC.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)