Andor Molnar created HBASE-28339:
------------------------------------

             Summary: HBaseReplicationEndpoint creates new ZooKeeper client every time it tries to reconnect
                 Key: HBASE-28339
                 URL: https://issues.apache.org/jira/browse/HBASE-28339
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 2.5.7, 3.0.0-beta-1, 2.4.17, 2.6.0, 2.7.0
            Reporter: Andor Molnar
            Assignee: Andor Molnar
Abstract base class {{HBaseReplicationEndpoint}}, and therefore {{HBaseInterClusterReplicationEndpoint}}, creates a new ZooKeeper client instance every time a communication error occurs and it tries to reconnect. This was not a problem with ZooKeeper 3.4.x versions, because the TGT Login thread was a static reference and was only created once for all clients in the same JVM. With the upgrade to ZooKeeper 3.5.x, the login thread is dedicated to the client instance, hence we get a new login thread every time the replication endpoint reconnects.

{code:java}
  /**
   * A private method used to re-establish a zookeeper session with a peer cluster.
   */
  protected void reconnect(KeeperException ke) {
    if (
      ke instanceof ConnectionLossException || ke instanceof SessionExpiredException
        || ke instanceof AuthFailedException
    ) {
      String clusterKey = ctx.getPeerConfig().getClusterKey();
      LOG.warn("Lost the ZooKeeper connection for peer " + clusterKey, ke);
      try {
        reloadZkWatcher();
      } catch (IOException io) {
        LOG.warn("Creation of ZookeeperWatcher failed for peer " + clusterKey, io);
      }
    }
  }
{code}

{code:java}
  /**
   * Closes the current ZKW (if not null) and creates a new one
   * @throws IOException If anything goes wrong connecting
   */
  synchronized void reloadZkWatcher() throws IOException {
    if (zkw != null) zkw.close();
    zkw = new ZKWatcher(ctx.getConfiguration(), "connection to cluster: " + ctx.getPeerId(), this);
    getZkw().registerListener(new PeerRegionServerListener(this));
  }
{code}

If the target cluster of the replication is unavailable for some reason, the replication endpoint keeps trying to reconnect to ZooKeeper, destroying and creating new login threads constantly, which will carpet-bomb the KDC host with login requests. I'm not sure how to fix this yet; trying to create a unit test first.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
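To make the thread churn concrete, here is a minimal standalone sketch (not HBase code; {{FakeZkClient}}, {{simulateReconnects}} and the counter are hypothetical stand-ins) of the pattern {{reloadZkWatcher()}} follows: closing and recreating the client on every error means each transient error produces one more per-client login thread.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

public class ReconnectSketch {
  /** Hypothetical stand-in for a ZooKeeper 3.5.x client whose constructor starts a dedicated login thread. */
  static class FakeZkClient {
    static final AtomicInteger loginThreadsStarted = new AtomicInteger();

    FakeZkClient() {
      // One login thread per client instance (the 3.5.x behavior described above).
      loginThreadsStarted.incrementAndGet();
    }

    void close() {
      // The real client would also stop its per-client login thread here.
    }
  }

  /** Mimics reloadZkWatcher(): close the old client and build a fresh one on every error. */
  static int simulateReconnects(int errors) {
    FakeZkClient.loginThreadsStarted.set(0);
    FakeZkClient client = new FakeZkClient();   // initial connection
    for (int i = 0; i < errors; i++) {
      client.close();
      client = new FakeZkClient();              // every reconnect -> another login thread
    }
    client.close();
    return FakeZkClient.loginThreadsStarted.get();
  }

  public static void main(String[] args) {
    // 5 transient errors -> 6 client instances -> 6 separate login attempts against the KDC.
    System.out.println("login threads started: " + simulateReconnects(5));
  }
}
{code}

With an unavailable peer cluster the error count is effectively unbounded, which is why the KDC sees a constant stream of login requests.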