Josh Elser created HBASE-21796:
----------------------------------

             Summary: RecoverableZooKeeper indefinitely retries a client stuck 
in AUTH_FAILED
                 Key: HBASE-21796
                 URL: https://issues.apache.org/jira/browse/HBASE-21796
             Project: HBase
          Issue Type: Bug
          Components: Zookeeper
            Reporter: Josh Elser
            Assignee: Josh Elser
             Fix For: 1.5.0


We've observed the following situation inside a RegionServer: an HConnection 
is left in a broken state after its ZooKeeper client receives AUTH_FAILED in 
the Phoenix secondary indexing code path. The HConnection used to write the 
secondary index updates failed every time the client re-attempted the write, 
yet the HConnection gave no outward sign that anything was wrong with that 
instance.

The ZooKeeper programmer's guide tells us that if a ZooKeeper instance goes to 
the {{AUTH_FAILED}} state, we must open a new ZooKeeper instance: 
[https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions]

When a new HConnection (or one without a cached meta location) tries to access 
ZooKeeper to find meta's location or the cluster ID, it spins indefinitely: 
the request can never reach ZooKeeper because our client is stuck in 
AUTH_FAILED. For the Phoenix use case (where we're trying to use this 
HConnection within the RS), this breaks things pretty fast.
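
To illustrate why retrying cannot help here, the following is a minimal sketch 
using only the Java standard library. The retry helper and the "read meta 
location" operation are hypothetical stand-ins, not HBase or ZooKeeper code; 
the only assumption taken from the description above is that a client in 
AUTH_FAILED fails every call made through it.

```java
import java.util.function.BooleanSupplier;

final class StuckClientRetryDemo {
    // Hypothetical retry helper: runs the operation up to maxAttempts times and
    // returns the attempt number that succeeded, or -1 if every attempt failed.
    static int retry(BooleanSupplier operation, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (operation.getAsBoolean()) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for a client handle stuck in AUTH_FAILED: nothing ever
        // clears this flag, because the same handle is reused on every attempt.
        boolean[] authFailed = { true };
        BooleanSupplier readMetaLocation = () -> !authFailed[0];

        // The attempt budget is bounded here only so the demo terminates;
        // raising it changes nothing about the outcome.
        System.out.println(retry(readMetaLocation, 1000));
    }
}
```

However large the retry budget, the operation never succeeds, because nothing 
in the loop replaces the broken handle.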

The circumstances that caused us to observe this are not an HBase (or Phoenix 
or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of 
networking issues on a user's system. Despite this, we can make our handling of 
this situation better.

We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper 
object when we need one (e.g. session expired/closed). We can extend this same 
logic to also re-create the ZK client object if we observe an AUTH_FAILED state.
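
A rough sketch of that idea follows. It is not the RecoverableZooKeeper 
implementation: ClientState, ZkHandle, RecreatingZkHolder, and FakeZkHandle 
are hypothetical names introduced for illustration, with ClientState loosely 
modeled on the real client's ZooKeeper.States enum. The sketch assumes that 
CLOSED and AUTH_FAILED are the states from which a handle cannot recover.

```java
import java.util.function.Supplier;

// Hypothetical subset of client states, loosely modeled on ZooKeeper.States.
enum ClientState { CONNECTING, CONNECTED, CLOSED, AUTH_FAILED }

// Hypothetical minimal client handle; not the real ZooKeeper class.
interface ZkHandle {
    ClientState state();
    void close();
}

// Holds one client handle and replaces it once it reaches a terminal state.
final class RecreatingZkHolder {
    private final Supplier<ZkHandle> factory;
    private ZkHandle current;
    private int created = 0;

    RecreatingZkHolder(Supplier<ZkHandle> factory) {
        this.factory = factory;
    }

    // A handle in one of these states never becomes usable again.
    static boolean isTerminal(ClientState state) {
        return state == ClientState.CLOSED || state == ClientState.AUTH_FAILED;
    }

    // Returns a usable handle, discarding and re-creating a terminal one.
    synchronized ZkHandle getOrRecreate() {
        if (current != null && isTerminal(current.state())) {
            current.close(); // release the broken handle's resources
            current = null;
        }
        if (current == null) {
            current = factory.get();
            created++;
        }
        return current;
    }

    synchronized int createdCount() {
        return created;
    }
}

// Test double whose state can be flipped by hand.
final class FakeZkHandle implements ZkHandle {
    ClientState state = ClientState.CONNECTED;
    boolean closed = false;

    @Override public ClientState state() { return state; }
    @Override public void close() { closed = true; }
}
```

Callers would go through getOrRecreate() before each operation, so a handle 
that has fallen into AUTH_FAILED is closed and replaced on the next call 
instead of being retried forever.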



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
