Andor Molnar created HBASE-28339:
------------------------------------

             Summary: HBaseReplicationEndpoint creates new ZooKeeper client 
every time it tries to reconnect
                 Key: HBASE-28339
                 URL: https://issues.apache.org/jira/browse/HBASE-28339
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 2.5.7, 3.0.0-beta-1, 2.4.17, 2.6.0, 2.7.0
            Reporter: Andor Molnar
            Assignee: Andor Molnar


The abstract base class {{HBaseReplicationEndpoint}}, and therefore 
{{HBaseInterClusterReplicationEndpoint}}, creates a new ZooKeeper client 
instance every time a communication error occurs and it tries to reconnect. 
This was not a problem with ZooKeeper 3.4.x versions, because the TGT Login 
thread was a static reference and only created once for all clients in the same 
JVM. With the upgrade to ZooKeeper 3.5.x the login thread is dedicated to the 
client instance, hence we have a new login thread every time the replication 
endpoint reconnects.
{code:java}
/**
 * A private method used to re-establish a zookeeper session with a peer cluster.
 */
protected void reconnect(KeeperException ke) {
  if (
    ke instanceof ConnectionLossException || ke instanceof SessionExpiredException
      || ke instanceof AuthFailedException
  ) {
    String clusterKey = ctx.getPeerConfig().getClusterKey();
    LOG.warn("Lost the ZooKeeper connection for peer " + clusterKey, ke);
    try {
      reloadZkWatcher();
    } catch (IOException io) {
      LOG.warn("Creation of ZookeeperWatcher failed for peer " + clusterKey, io);
    }
  }
}{code}
{code:java}
/**
 * Closes the current ZKW (if not null) and creates a new one
 * @throws IOException If anything goes wrong connecting
 */
synchronized void reloadZkWatcher() throws IOException {
  if (zkw != null) zkw.close();
  zkw = new ZKWatcher(ctx.getConfiguration(), "connection to cluster: " + ctx.getPeerId(), this);
  getZkw().registerListener(new PeerRegionServerListener(this));
} {code}
If the target cluster of replication is unavailable for some reason, the 
replication endpoint keeps trying to reconnect to ZooKeeper, constantly 
destroying and creating new Login threads, which will carpet bomb the KDC host 
with login requests.
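
To make the difference concrete, here is a toy model in plain Java. It is only 
a sketch: {{FakeLogin}}, {{LoginThreadSketch}} and both methods are hypothetical 
names made up for illustration, not ZooKeeper or HBase APIs, and it assumes 
each new Login results in exactly one KDC request. It counts how many logins 
happen when a client is recreated N times, once with a single shared login (the 
3.4.x behaviour described above) and once with a login per client instance (the 
3.5.x behaviour).

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only; none of these types exist in ZooKeeper or HBase. */
final class LoginThreadSketch {

  /** Stands in for a Kerberos login; each construction models one KDC request. */
  static final class FakeLogin {
    FakeLogin(AtomicInteger kdcRequests) {
      kdcRequests.incrementAndGet();
    }
  }

  /** One login shared by every client: created on first use, then reused. */
  static int kdcRequestsWithSharedLogin(int reconnects) {
    AtomicInteger kdcRequests = new AtomicInteger();
    FakeLogin shared = null; // plays the role of the static reference
    // One initial connect plus `reconnects` recreated clients.
    for (int i = 0; i <= reconnects; i++) {
      if (shared == null) {
        shared = new FakeLogin(kdcRequests);
      }
      // A new client would reuse `shared` here, so no extra KDC request.
    }
    return kdcRequests.get();
  }

  /** One login per client instance: every recreated client logs in again. */
  static int kdcRequestsWithPerClientLogin(int reconnects) {
    AtomicInteger kdcRequests = new AtomicInteger();
    for (int i = 0; i <= reconnects; i++) {
      new FakeLogin(kdcRequests); // each new client brings its own login
    }
    return kdcRequests.get();
  }

  public static void main(String[] args) {
    int reconnects = 5;
    System.out.println("shared login:     " + kdcRequestsWithSharedLogin(reconnects));
    System.out.println("per-client login: " + kdcRequestsWithPerClientLogin(reconnects));
  }
}
```

In this model the shared variant stays at one login no matter how often the 
client is recreated, while the per-client variant grows linearly with the 
number of reconnect attempts.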
 
I'm not sure how to fix this yet; I'm trying to create a unit test first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
