[ https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713211#comment-13713211 ]

Jean-Daniel Cryans commented on HBASE-8919:
-------------------------------------------

[~stack] So I dug into that failure you pasted on Jul 16th, and it wasn't the 
source cluster that was bad but the destination. The RS that was killed there 
was never fully recovered; one log was still being replayed when the test 
timed out. It could be related to this (the log itself says THIS IS NOT 
SUPPOSED TO HAPPEN):

{noformat}
2013-07-16 17:14:25,933 INFO  [IPC Server handler 2 on 56710] 
blockmanagement.BlockInfoUnderConstruction(248): BLOCK* 
blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, 
replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], 
ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} recovery started, 
primary=127.0.0.1:47006
2013-07-16 17:14:25,933 WARN  [IPC Server handler 2 on 56710] 
namenode.FSNamesystem(3135): DIR* NameSystem.internalReleaseLease: File 
/user/ec2-user/hbase/.logs/ip-10-197-55-49.us-west-1.compute.internal,39939,1373994850314-splitting/ip-10-197-55-49.us-west-1.compute.internal%2C39939%2C1373994850314.1373994862658
 has not been closed. Lease recovery is in progress. RecoveryId = 1040 for 
block blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, 
primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], 
ReplicaUnderConstruction[127.0.0.1:51438|RBW]]}
...
2013-07-16 17:14:26,666 INFO  [IPC Server handler 5 on 56710] 
blockmanagement.BlockManager(2174): BLOCK* addStoredBlock: blockMap updated: 
127.0.0.1:51438 is added to 
blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, 
replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], 
ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} size 0
2013-07-16 17:14:26,666 INFO  [IPC Server handler 4 on 56710] 
blockmanagement.BlockManager(2174): BLOCK* addStoredBlock: blockMap updated: 
127.0.0.1:47006 is added to 
blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, 
replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], 
ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} size 0
...
2013-07-16 17:14:30,338 INFO  [IPC Server handler 8 on 56710] 
blockmanagement.BlockInfoUnderConstruction(248): BLOCK* 
blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, 
replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], 
ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} recovery started, 
primary=127.0.0.1:51438
2013-07-16 17:14:30,339 WARN  [IPC Server handler 8 on 56710] 
namenode.FSNamesystem(3135): DIR* NameSystem.internalReleaseLease: File 
/user/ec2-user/hbase/.logs/ip-10-197-55-49.us-west-1.compute.internal,39939,1373994850314-splitting/ip-10-197-55-49.us-west-1.compute.internal%2C39939%2C1373994850314.1373994862658
 has not been closed. Lease recovery is in progress. RecoveryId = 1041 for 
block blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, 
primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], 
ReplicaUnderConstruction[127.0.0.1:51438|RBW]]}
...
2013-07-16 17:14:33,340 ERROR [IPC Server handler 7 on 35081] 
security.UserGroupInformation(1481): PriviledgedActionException as:ec2-user 
(auth:SIMPLE) cause:java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: 
replica.getGenerationStamp() >= recoveryId = 1041, 
block=blk_4297992342878601848_1041, replica=FinalizedReplica, 
blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current
  getBlockFile()    = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
2013-07-16 17:14:33,341 WARN  
[org.apache.hadoop.hdfs.server.datanode.DataNode$2@64a1fcba] 
datanode.DataNode(1894): Failed to obtain replica info for block 
(=BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041) from 
datanode (=127.0.0.1:47006)
java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: 
replica.getGenerationStamp() >= recoveryId = 1041, 
block=blk_4297992342878601848_1041, replica=FinalizedReplica, 
blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current
  getBlockFile()    = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1462)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1422)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:1801)
        at 
org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
        at 
org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:2198)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
        at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:1814)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1880)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:215)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:1786)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): THIS IS 
NOT SUPPOSED TO HAPPEN: replica.getGenerationStamp() >= recoveryId = 1041, 
block=blk_4297992342878601848_1041, replica=FinalizedReplica, 
blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current
  getBlockFile()    = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1462)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1422)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:1801)
        at 
org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
        at 
org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:2198)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)

        at org.apache.hadoop.ipc.Client.call(Client.java:1235)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:1812)
        ... 4 more
2013-07-16 17:14:33,342 WARN  
[org.apache.hadoop.hdfs.server.datanode.DataNode$2@64a1fcba] 
datanode.DataNode(1894): Failed to obtain replica info for block 
(=BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041) from 
datanode (=127.0.0.1:51438)
java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: 
replica.getGenerationStamp() >= recoveryId = 1041, 
block=blk_4297992342878601848_1041, replica=FinalizedReplica, 
blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data1/current
  getBlockFile()    = 
/home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data1/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1462)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1422)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:1801)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:1812)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1880)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:215)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:1786)
        at java.lang.Thread.run(Thread.java:662)
2013-07-16 17:14:33,342 WARN  
[org.apache.hadoop.hdfs.server.datanode.DataNode$2@64a1fcba] 
datanode.DataNode$2(1788): recoverBlocks FAILED: 
RecoveringBlock{BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041;
 getBlockSize()=794; corrupt=false; offset=-1; locs=[127.0.0.1:47006, 
127.0.0.1:51438]}
java.io.IOException: All datanodes failed: 
block=BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041, 
datanodeids=[127.0.0.1:47006, 127.0.0.1:51438]
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1901)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:215)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:1786)
        at java.lang.Thread.run(Thread.java:662)
{noformat}
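
The check that throws that error is the one in the stack trace, 
FsDatasetImpl.initReplicaRecovery: a replica whose generation stamp is already 
at or past the requested recoveryId gets rejected. Below is a hypothetical, 
stripped-down sketch of that condition as the log describes it; the class, 
enum, and timeline are illustrative assumptions, not the actual HDFS code:

{noformat}
import java.io.IOException;

// Hypothetical sketch only: models the sanity check reported in the log above
// ("replica.getGenerationStamp() >= recoveryId"), not the real FsDatasetImpl
// implementation.
public class ReplicaRecoverySketch {

  enum ReplicaState { RBW, FINALIZED }

  static class Replica {
    final long generationStamp;
    final ReplicaState state;
    Replica(long generationStamp, ReplicaState state) {
      this.generationStamp = generationStamp;
      this.state = state;
    }
  }

  // Rejects a recovery request whose id does not advance the replica's
  // generation stamp, which is exactly the condition hit above.
  static void initReplicaRecovery(Replica replica, long recoveryId) throws IOException {
    if (replica.generationStamp >= recoveryId) {
      throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN: "
          + "replica.getGenerationStamp() >= recoveryId = " + recoveryId
          + ", replica state = " + replica.state);
    }
    // otherwise the datanode would move the replica into recovery here
  }

  public static void main(String[] args) {
    // Timeline implied by the log: recovery 1040 starts, recovery 1041 starts,
    // and the replica ends up FINALIZED at generation stamp 1041; a retry of
    // recovery 1041 then fails this check on both datanodes.
    Replica finalized = new Replica(1041, ReplicaState.FINALIZED);
    try {
      initReplicaRecovery(finalized, 1041);
    } catch (IOException expected) {
      System.out.println(expected.getMessage());
    }
  }
}
{noformat}

If that's what happened, both datanodes reject the recovery, recoverBlocks gives 
up ("All datanodes failed"), and the lease recovery on that splitting WAL stays 
stuck, which would fit the log still being replayed at the timeout.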
                
> TestReplicationQueueFailover (and Compressed) can fail because the recovered 
> queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8919
>                 URL: https://issues.apache.org/jira/browse/HBASE-8919
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBASE-8919.patch
>
>
> Looking at this build: 
> https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not 
> completely done because the source fails like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO  [Thread-1259] 
> regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to 
> hemera.apache.org/140.211.11.27:38614 failed on local exception: 
> java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN  
> [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379]
>  regionserver.ReplicationSource(661): Can't replicate because of an error on 
> the remote cluster: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException):
>  org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 
> 1594 actions: FailedServerException: 1594 times, 
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
>       at 
> org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
>       at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
>       at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
>       at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
>       at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
>       at 
> org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
>       at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
>       at 
> org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt; it seems it still needs 
> to replicate more data. I'll start by adding the stack trace to the message 
> logged when it fails to replicate on a "local exception". I also found a 
> thread that wasn't shut down properly, which I'm going to fix to help with 
> debugging.
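
For reference, one likely mechanism behind the ClosedByInterruptException 
itself is Java NIO's behavior: interrupting a thread blocked on an 
interruptible channel closes the channel before the exception surfaces. The 
sketch below shows the general pattern of passing the throwable to the logger 
so the full stack trace gets printed, which is what the description proposes 
for the "local exception" message; the class and method names are 
hypothetical, not the actual ReplicationSource code or patch, and it assumes 
commons-logging on the classpath:

{noformat}
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Hypothetical sketch only: illustrates logging the cause alongside the
// message so the stack trace appears in the RS log, rather than the
// one-line "Slave cluster looks down" message seen in the test output.
public class LocalExceptionLoggingSketch {
  private static final Log LOG = LogFactory.getLog(LocalExceptionLoggingSketch.class);

  static void reportSlaveDown(IOException ioe) {
    // Passing the throwable as the second argument makes commons-logging
    // emit its full stack trace along with the message.
    LOG.info("Slave cluster looks down: " + ioe.getMessage(), ioe);
  }

  public static void main(String[] args) {
    reportSlaveDown(new IOException("Call failed on local exception",
        new ClosedByInterruptException()));
  }
}
{noformat}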

