[ 
https://issues.apache.org/jira/browse/HIVE-16487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Vary updated HIVE-16487:
------------------------------
    Status: Patch Available  (was: Open)

> Serious Zookeeper exception is logged when a race condition happens
> -------------------------------------------------------------------
>
>                 Key: HIVE-16487
>                 URL: https://issues.apache.org/jira/browse/HIVE-16487
>             Project: Hive
>          Issue Type: Bug
>          Components: Locking
>    Affects Versions: 3.0.0
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>         Attachments: HIVE-16487.patch
>
>
> A customer started to see this in the logs, but happily everything was 
> working as intended:
> {code}
> 2017-03-30 12:01:59,446 ERROR ZooKeeperHiveLockManager: 
> [HiveServer2-Background-Pool: Thread-620]: Serious Zookeeper exception: 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /hive_zookeeper_namespace/<TABLE_NAME>/LOCK-SHARED-
> {code}
> This was happening, because a race condition between the lock releasing, and 
> lock acquiring. The thread releasing the lock removes the parent ZK node just 
> after the thread acquiring the lock made sure, that the parent node exists.
> Since this can happen without any real problem, I plan to add NODEEXISTS, and 
> NONODE as a transient ZooKeeper exception, so the users are not confused.
> Also, the original author of ZooKeeperHiveLockManager maybe planned to handle 
> different ZooKeeperExceptions differently, and the code is hard to 
> understand. See the {{continue}} and the {{break}}. The {{break}} only breaks 
> the switch, and not the loop which IMHO is not intuitive:
> {code}
>     do {
>       try {
> [..]
>         ret = lockPrimitive(key, mode, keepAlive, parentCreated, 
>       } catch (Exception e1) {
>         if (e1 instanceof KeeperException) {
>           KeeperException e = (KeeperException) e1;
>           switch (e.code()) {
>           case CONNECTIONLOSS:
>           case OPERATIONTIMEOUT:
>             LOG.debug("Possibly transient ZooKeeper exception: ", e);
>             continue;
>           default:
>             LOG.error("Serious Zookeeper exception: ", e);
>             break;
>           }
>         }
> [..]
>       }
>     } while (tryNum < numRetriesForLock);
> {code}
> If we do not want to try again in case of a "Serious Zookeeper exception:", 
> then we should add a label to the do loop, and break it in the switch.
> If we do want to try regardless of the type of the ZK exception, then we 
> should just change the {{continue;}} to {{break;}} and move the lines part of 
> the code which did not run in case of {{continue}} to the {{default}} switch, 
> so it is easier to understand the code.
> Any suggestions or ideas [~ctang.ma] or [~szehon]?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to