[ https://issues.apache.org/jira/browse/HBASE-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Scott Wilson updated HBASE-19816:
---------------------------------
    Attachment:     (was: HBASE-19816.master.002.patch)

> Replication sink list is not updated on UnknownHostException
> ------------------------------------------------------------
>
>                 Key: HBASE-19816
>                 URL: https://issues.apache.org/jira/browse/HBASE-19816
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.2.0
>         Environment: We have two clusters set up with bi-directional replication. The clusters are around 400 nodes each and hosted in AWS.
>            Reporter: Scott Wilson
>            Priority: Major
>         Attachments: HBASE-19816.master.001.patch
>
>
> We have two clusters, call them 1 and 2. Cluster 1 was the current "primary" cluster, taking all live traffic, which is replicated to cluster 2. We decommissioned several instances in cluster 2, which involves deleting the instance and its DNS record. After this happened, most of the region servers in cluster 1 showed this message in their logs repeatedly.
>
> {code}
> 2018-01-12 23:49:36,507 WARN org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:
> java.net.UnknownHostException: data-017b.hbase-2.prod
>         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.<init>(AbstractRpcClient.java:315)
>         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.createBlockingRpcChannel(AbstractRpcClient.java:267)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getAdmin(ConnectionManager.java:1737)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getAdmin(ConnectionManager.java:1719)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager.getReplicationSink(ReplicationSinkManager.java:119)
>         at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:339)
>         at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:326)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> The host data-017b.hbase-2.prod was one of those that had been removed from cluster 2. Next we observed that our replication lag from cluster 1 to cluster 2 was elevated. Some region servers reported ageOfLastShippedOperation to be close to an hour.
> The only way we found to clear the message was to restart the region servers that showed this message in the log. Once we did, replication returned to normal.
> Restarting the affected region servers in cluster 1 took several days because we could not bring the cluster down.
> From reading the code, it appears the cause was the ZooKeeper watch not being triggered for the region server list change in cluster 2. We verified that the list in ZooKeeper for cluster 2 was correct and did not include the removed nodes.
> One concrete improvement would be to force a refresh of the sink cluster's region server list when an {{UnknownHostException}} is encountered. This is already done for a {{ConnectException}} in {{HBaseInterClusterReplicationEndpoint.java}}:
> {code:java}
> } else if (ioe instanceof ConnectException) {
>   LOG.warn("Peer is unavailable, rechecking all sinks: ", ioe);
>   replicationSinkMgr.chooseSinks();
> {code}
> I propose that this be extended to cover {{UnknownHostException}}.
> We observed this behavior on 1.2.0-cdh-5.11.1, but the same code appears to still exist on the current master branch.
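
The proposal in the report (treat UnknownHostException like ConnectException when deciding whether to re-select sinks) might be sketched roughly as follows. This is a standalone illustration, not the attached patch and not HBase code: SinkRefreshSketch, SinkChooser, shouldRefreshSinks and handleReplicationError are hypothetical names, with SinkChooser standing in for the replicationSinkMgr object quoted in the report. Whether calling chooseSinks() alone is enough when the ZooKeeper watch has been missed is not established by the report and is not assumed here.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.UnknownHostException;

/**
 * Illustrative sketch only. All names in this class are hypothetical;
 * none of them are taken from the HBase code base.
 */
public class SinkRefreshSketch {

    /** Hypothetical stand-in for the object whose chooseSinks() re-selects replication sinks. */
    interface SinkChooser {
        void chooseSinks();
    }

    /** Returns true when the error suggests the currently chosen sinks may be stale. */
    static boolean shouldRefreshSinks(IOException ioe) {
        // ConnectException is the case already handled according to the report;
        // UnknownHostException is the case the report proposes adding.
        return ioe instanceof ConnectException || ioe instanceof UnknownHostException;
    }

    /** Re-selects sinks for the two error types above; returns whether a refresh happened. */
    static boolean handleReplicationError(IOException ioe, SinkChooser sinks) {
        if (shouldRefreshSinks(ioe)) {
            System.err.println("Peer is unavailable, rechecking all sinks: " + ioe);
            sinks.chooseSinks();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] refreshes = {0};
        SinkChooser counter = () -> refreshes[0]++;

        handleReplicationError(new UnknownHostException("data-017b.hbase-2.prod"), counter);
        handleReplicationError(new ConnectException("Connection refused"), counter);
        handleReplicationError(new IOException("some other failure"), counter);

        // Only the first two errors trigger a refresh.
        System.out.println("sink refreshes: " + refreshes[0]);
    }
}
```

Keeping the decision in a single predicate means both exception types share one log message and one refresh path, which mirrors the shape of the existing ConnectException branch quoted in the report.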