[ https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Mollitor updated HIVE-21469: ---------------------------------- Status: Patch Available (was: Open) I was using {{Collections.unmodifableList()}} and it did *not* like that. > Review of ZooKeeperHiveLockManager > ---------------------------------- > > Key: HIVE-21469 > URL: https://issues.apache.org/jira/browse/HIVE-21469 > Project: Hive > Issue Type: Improvement > Components: Locking > Affects Versions: 4.0.0, 3.2.0 > Reporter: David Mollitor > Assignee: David Mollitor > Priority: Major > Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, > HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch > > > A lot of sins in this class to resolve: > {code:java} > @Override > public void setContext(HiveLockManagerCtx ctx) throws LockException { > try { > curatorFramework = CuratorFrameworkSingleton.getInstance(conf); > parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE); > try{ > curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" > + parent, new byte[0]); > } catch (Exception e) { > // ignore if the parent already exists > if (!(e instanceof KeeperException) || ((KeeperException)e).code() != > KeeperException.Code.NODEEXISTS) { > LOG.warn("Unexpected ZK exception when creating parent node /" + > parent, e); > } > } > {code} > Every time a new session is created and this {{setContext}} method is called, > it attempts to create the root node. I have seen that, even though the root > node exists, an create node action is written to the ZK logs. Check first if > the node exists before trying to create it. > {code:java} > try { > curatorFramework.delete().forPath(zLock.getPath()); > } catch (InterruptedException ie) { > curatorFramework.delete().forPath(zLock.getPath()); > } > {code} > There has historically been a quite a few bugs regarding leaked locks. The > Driver will signal the session {{Thread}} by performing an interrupt. That > interrupt can happen any time and it can kill a create/delete action within > the ZK framework. We can see one example of workaround for this. If the ZK > action is interrupted, simply do it again. Well, what if it's interrupted > yet again? The lock will be leaked. Also, when the {{InterruptedException}} > is caught in the try block, the thread's interrupted flag is cleared. The > flag is not reset in this code and therefore we lose the fact that this > thread has been interrupted. This flag should be preserved so that other > code paths will know that it's time to exit. > {code:java} > if (tryNum > 1) { > Thread.sleep(sleepTime); > } > unlockPrimitive(hiveLock, parent, curatorFramework); > break; > } catch (Exception e) { > if (tryNum >= numRetriesForUnLock) { > String name = ((ZooKeeperHiveLock)hiveLock).getPath(); > throw new LockException("Node " + name + " can not be deleted after > " + numRetriesForUnLock + " attempts.", > e); > } > } > {code} > ... related... the sleep here may be interrupted, but we still need to delete > the lock (again, for fear of leaking it). This sleep should be > uninterruptible. If we need to get the lock deleted, and there's a problem, > interrupting the sleep will cause the code to eventually exit and locks will > be leaked. > It also requires a bunch more TLC. -- This message was sent by Atlassian JIRA (v7.6.3#76005)