[ 
https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709993#comment-13709993
 ] 

Jean-Daniel Cryans commented on HBASE-8919:
-------------------------------------------

This is a different failure mode, the source cluster became completely 
unavailable.
                
> TestReplicationQueueFailover (and Compressed) can fail because the recovered 
> queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8919
>                 URL: https://issues.apache.org/jira/browse/HBASE-8919
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBASE-8919.patch
>
>
> Looking at this build: 
> https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not 
> completely done because the source fails like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO  [Thread-1259] 
> regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to 
> hemera.apache.org/140.211.11.27:38614 failed on local exception: 
> java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN  
> [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379]
>  regionserver.ReplicationSource(661): Can't replicate because of an error on 
> the remote cluster: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException):
>  org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 
> 1594 actions: FailedServerException: 1594 times, 
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
>       at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
>       at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
>       at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
>       at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
>       at 
> org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
>       at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
>       at 
> org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt, it seems it still needs 
> to replicate more data. I'll start by adding the stack trace for the message 
> when it fails to replicate on a "local exception". Also I found a thread that 
> wasn't shutdown properly that I'm going to fix to help with debugging.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to