[ https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709993#comment-13709993 ]
Jean-Daniel Cryans commented on HBASE-8919:
-------------------------------------------

This is a different failure mode, the source cluster became completely unavailable.

> TestReplicationQueueFailover (and Compressed) can fail because the recovered queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8919
>                 URL: https://issues.apache.org/jira/browse/HBASE-8919
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBASE-8919.patch
>
>
> Looking at this build: https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not completely done because the source fails like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO [Thread-1259] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to hemera.apache.org/140.211.11.27:38614 failed on local exception: java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379] regionserver.ReplicationSource(661): Can't replicate because of an error on the remote cluster:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1594 actions: FailedServerException: 1594 times,
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
> 	at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
> 	at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
> 	at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
> 	at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
> 	at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
> 	at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
> 	at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
> 	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
> 	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
> 	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
> 	at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt, it seems it still needs to replicate more data.
> I'll start by adding the stack trace for the message when it fails to replicate on a "local exception". Also I found a thread that wasn't shut down properly that I'm going to fix to help with debugging.
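
For readers unfamiliar with the exception in the log above: java.nio raises ClosedByInterruptException when a thread is interrupted while in (or just before) a blocking operation on an interruptible channel, and the channel is closed as a side effect. The following is a minimal sketch of that behavior, not code from the HBASE-8919 patch. It assumes a java.nio Pipe as a stand-in for the RPC socket and java.util.logging in place of HBase's own logging; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical demo class; not part of HBase.
class InterruptedChannelDemo {
    private static final Logger LOG = Logger.getLogger("InterruptedChannelDemo");

    /** Attempts a blocking read with an interrupt pending and reports what happened. */
    static String readWithInterruptPending() {
        try {
            // Assumption: a Pipe stands in for the socket used by the RPC client.
            Pipe pipe = Pipe.open();
            try {
                // Simulate another code path (e.g. a shutdown) interrupting this thread.
                Thread.currentThread().interrupt();
                // With the interrupt status set, the blocking read closes the
                // channel and throws instead of waiting for data.
                pipe.source().read(ByteBuffer.allocate(16));
                return "read returned normally";
            } catch (ClosedByInterruptException e) {
                // Passing the exception as the last argument logs the full stack
                // trace, rather than only e.getMessage().
                LOG.log(Level.INFO, "Call failed on local exception", e);
                return "ClosedByInterruptException, channel open: " + pipe.source().isOpen();
            } finally {
                Thread.interrupted(); // clear the interrupt status before cleanup
                pipe.sink().close();
                pipe.source().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readWithInterruptPending());
    }
}
```

The sketch only shows why the socket ends up closed once the thread is interrupted; it does not identify which thread issued the interrupt in the failing test, which is what the extra stack trace is meant to help find.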