Peter Vary created HIVE-16487:
---------------------------------

             Summary: Serious Zookeeper exception is logged when a race 
condition happens
                 Key: HIVE-16487
                 URL: https://issues.apache.org/jira/browse/HIVE-16487
             Project: Hive
          Issue Type: Bug
          Components: Locking
    Affects Versions: 3.0.0
            Reporter: Peter Vary
            Assignee: Peter Vary


A customer started to see this in the logs, but happily everything was working 
as intended:
{code}
2017-03-30 12:01:59,446 ERROR ZooKeeperHiveLockManager: 
[HiveServer2-Background-Pool: Thread-620]: Serious Zookeeper exception: 
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode 
for /hive_zookeeper_namespace/<TABLE_NAME>/LOCK-SHARED-
{code}
This was happening, because a race condition between the lock releasing, and 
lock acquiring. The thread releasing the lock removes the parent ZK node just 
after the thread acquiring the lock made sure, that the parent node exists.

Since this can happen without any real problem, I plan to add NODEEXISTS, and 
NONODE as a transient ZooKeeper exception, so the users are not confused.

Also, the original author of ZooKeeperHiveLockManager maybe planned to handle 
different ZooKeeperExceptions differently, and the code is hard to understand. 
See the {{continue}} and the {{break}}. The {{break}} only breaks the switch, 
and not the loop which IMHO is not intuitive:
{code}
    do {
      try {
[..]
        ret = lockPrimitive(key, mode, keepAlive, parentCreated, 
      } catch (Exception e1) {
        if (e1 instanceof KeeperException) {
          KeeperException e = (KeeperException) e1;
          switch (e.code()) {
          case CONNECTIONLOSS:
          case OPERATIONTIMEOUT:
            LOG.debug("Possibly transient ZooKeeper exception: ", e);
            continue;
          default:
            LOG.error("Serious Zookeeper exception: ", e);
            break;
          }
        }
[..]
      }
    } while (tryNum < numRetriesForLock);
{code}

If we do not want to try again in case of a "Serious Zookeeper exception:", 
then we should add a label to the do loop, and break it in the switch.

If we do want to try regardless of the type of the ZK exception, then we should 
just change the {{continue;}} to {{break;}} and move the lines part of the code 
which did not run in case of {{continue}} to the {{default}} switch, so it is 
easier to understand the code.

Any suggestions or ideas [~ctang.ma] or [~szehon]?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to