[ 
https://issues.apache.org/jira/browse/HBASE-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329741#comment-16329741
 ] 

Hadoop QA commented on HBASE-19816:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
41s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
42s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
54s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
26m  0s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}112m 22s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
19s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}157m 34s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.replication.TestReplicationDroppedTables |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-19816 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12906488/HBASE-19816.master.001.patch
 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 5b5b4faf01d9 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 
15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git revision | master / c1a8dc09d6 |
| maven | version: Apache Maven 3.5.2 
(138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/11090/artifact/patchprocess/patch-unit-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/11090/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/11090/console |
| Powered by | Apache Yetus 0.6.0   http://yetus.apache.org |


This message was automatically generated.



> Replication sink list is not updated on UnknownHostException
> ------------------------------------------------------------
>
>                 Key: HBASE-19816
>                 URL: https://issues.apache.org/jira/browse/HBASE-19816
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.2.0
>         Environment: We have two clusters set up with bi-directional 
> replication. The clusters are around 400 nodes each and hosted in AWS.
>            Reporter: Scott Wilson
>            Assignee: Scott Wilson
>            Priority: Major
>         Attachments: HBASE-19816.master.001.patch
>
>
> We have two clusters, call them 1 and 2. Cluster 1 was the current "primary" 
> cluster and taking all live traffic which is replicated to cluster 2. We 
> decommissioned several instances in cluster 2 which involves deleting the 
> instance and its DNS record. After this happened most of the regions servers 
> in cluster 1 showed this message in their logs repeatedly. 
>  
> {code}
> 2018-01-12 23:49:36,507 WARN 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
>  Can't replicate because of a local or network error:
> java.net.UnknownHostException: data-017b.hbase-2.prod
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.<init>(AbstractRpcClient.java:315)
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.createBlockingRpcChannel(AbstractRpcClient.java:267)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getAdmin(ConnectionManager.java:1737)
> at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getAdmin(ConnectionManager.java:1719)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager.getReplicationSink(ReplicationSinkManager.java:119)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:339)
> at 
> org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:326)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> The host data-017b.hbase-2.prod was one of those that had been removed from 
> cluster 2. Next we observed our replication lag from cluster 1 to cluster 2 
> was elevated. Some region servers reported ageOfLastShippedOperation to be 
> close to an hour.
> The only way we found to clear the message was to restart the region servers 
> that showed this message in the log. Once we did replication returned to 
> normal. Restarting the affected region servers in cluster 1 took several days 
> because we could not bring the cluster down.
> From reading the code it appears the cause was the zookeeper watch not being 
> triggered for the region server list change in cluster 2. We verified the 
> list in zookeeper for cluster 2 was correct and did not include the removed 
> nodes.
> One concrete improvement to make would be to force a refresh of the sink 
> cluster region server list when an {{UnknownHostException}} is found. This is 
> already done if the there is a {{ConnectException}} in 
> {{HBaseInterClusterReplicationEndpoint.java}}
> {code:java}
> } else if (ioe instanceof ConnectException) {
>   LOG.warn("Peer is unavailable, rechecking all sinks: ", ioe);
>   replicationSinkMgr.chooseSinks();
> {code}
> I propose that should be extended to cover {{UnknownHostException}}.
> We observed this behavior on 1.2.0-cdh-5.11.1 but it appears the same code 
> still exists on the current master branch.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to