[ https://issues.apache.org/jira/browse/HIVE-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Vary updated HIVE-16487: ------------------------------ Status: Patch Available (was: Open) > Serious Zookeeper exception is logged when a race condition happens > ------------------------------------------------------------------- > > Key: HIVE-16487 > URL: https://issues.apache.org/jira/browse/HIVE-16487 > Project: Hive > Issue Type: Bug > Components: Locking > Affects Versions: 3.0.0 > Reporter: Peter Vary > Assignee: Peter Vary > Attachments: HIVE-16487.patch > > > A customer started to see this in the logs, but happily everything was > working as intended: > {code} > 2017-03-30 12:01:59,446 ERROR ZooKeeperHiveLockManager: > [HiveServer2-Background-Pool: Thread-620]: Serious Zookeeper exception: > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = > NoNode for /hive_zookeeper_namespace/<TABLE_NAME>/LOCK-SHARED- > {code} > This was happening, because a race condition between the lock releasing, and > lock acquiring. The thread releasing the lock removes the parent ZK node just > after the thread acquiring the lock made sure, that the parent node exists. > Since this can happen without any real problem, I plan to add NODEEXISTS, and > NONODE as a transient ZooKeeper exception, so the users are not confused. > Also, the original author of ZooKeeperHiveLockManager maybe planned to handle > different ZooKeeperExceptions differently, and the code is hard to > understand. See the {{continue}} and the {{break}}. The {{break}} only breaks > the switch, and not the loop which IMHO is not intuitive: > {code} > do { > try { > [..] > ret = lockPrimitive(key, mode, keepAlive, parentCreated, > } catch (Exception e1) { > if (e1 instanceof KeeperException) { > KeeperException e = (KeeperException) e1; > switch (e.code()) { > case CONNECTIONLOSS: > case OPERATIONTIMEOUT: > LOG.debug("Possibly transient ZooKeeper exception: ", e); > continue; > default: > LOG.error("Serious Zookeeper exception: ", e); > break; > } > } > [..] > } > } while (tryNum < numRetriesForLock); > {code} > If we do not want to try again in case of a "Serious Zookeeper exception:", > then we should add a label to the do loop, and break it in the switch. > If we do want to try regardless of the type of the ZK exception, then we > should just change the {{continue;}} to {{break;}} and move the lines part of > the code which did not run in case of {{continue}} to the {{default}} switch, > so it is easier to understand the code. > Any suggestions or ideas [~ctang.ma] or [~szehon]? -- This message was sent by Atlassian JIRA (v6.3.15#6346)