[ https://issues.apache.org/jira/browse/HBASE-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713211#comment-13713211 ]
Jean-Daniel Cryans commented on HBASE-8919:
-------------------------------------------

[~stack] So I dug into that failure you pasted on Jul 16th, and it wasn't the source cluster that was bad but the destination. The RS that was killed there was never fully recovered; one log was still being replayed when the test timed out. It could be related to this ("THIS IS NOT SUPPOSED TO HAPPEN", it says):

{noformat}
2013-07-16 17:14:25,933 INFO [IPC Server handler 2 on 56710] blockmanagement.BlockInfoUnderConstruction(248): BLOCK* blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} recovery started, primary=127.0.0.1:47006
2013-07-16 17:14:25,933 WARN [IPC Server handler 2 on 56710] namenode.FSNamesystem(3135): DIR* NameSystem.internalReleaseLease: File /user/ec2-user/hbase/.logs/ip-10-197-55-49.us-west-1.compute.internal,39939,1373994850314-splitting/ip-10-197-55-49.us-west-1.compute.internal%2C39939%2C1373994850314.1373994862658 has not been closed. Lease recovery is in progress. RecoveryId = 1040 for block blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], ReplicaUnderConstruction[127.0.0.1:51438|RBW]]}
...
2013-07-16 17:14:26,666 INFO [IPC Server handler 5 on 56710] blockmanagement.BlockManager(2174): BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:51438 is added to blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} size 0
2013-07-16 17:14:26,666 INFO [IPC Server handler 4 on 56710] blockmanagement.BlockManager(2174): BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:47006 is added to blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} size 0
...
2013-07-16 17:14:30,338 INFO [IPC Server handler 8 on 56710] blockmanagement.BlockInfoUnderConstruction(248): BLOCK* blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], ReplicaUnderConstruction[127.0.0.1:51438|RBW]]} recovery started, primary=127.0.0.1:51438
2013-07-16 17:14:30,339 WARN [IPC Server handler 8 on 56710] namenode.FSNamesystem(3135): DIR* NameSystem.internalReleaseLease: File /user/ec2-user/hbase/.logs/ip-10-197-55-49.us-west-1.compute.internal,39939,1373994850314-splitting/ip-10-197-55-49.us-west-1.compute.internal%2C39939%2C1373994850314.1373994862658 has not been closed. Lease recovery is in progress. RecoveryId = 1041 for block blk_4297992342878601848_1031{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[127.0.0.1:47006|RBW], ReplicaUnderConstruction[127.0.0.1:51438|RBW]]}
...
2013-07-16 17:14:33,340 ERROR [IPC Server handler 7 on 35081] security.UserGroupInformation(1481): PriviledgedActionException as:ec2-user (auth:SIMPLE) cause:java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: replica.getGenerationStamp() >= recoveryId = 1041, block=blk_4297992342878601848_1041, replica=FinalizedReplica, blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current
  getBlockFile()    = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
2013-07-16 17:14:33,341 WARN [org.apache.hadoop.hdfs.server.datanode.DataNode$2@64a1fcba] datanode.DataNode(1894): Failed to obtain replica info for block (=BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041) from datanode (=127.0.0.1:47006)
java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: replica.getGenerationStamp() >= recoveryId = 1041, block=blk_4297992342878601848_1041, replica=FinalizedReplica, blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current
  getBlockFile()    = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1462)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1422)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:1801)
        at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
        at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:2198)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:1814)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1880)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:215)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:1786)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): THIS IS NOT SUPPOSED TO HAPPEN: replica.getGenerationStamp() >= recoveryId = 1041, block=blk_4297992342878601848_1041, replica=FinalizedReplica, blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current
  getBlockFile()    = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data3/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1462)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1422)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:1801)
        at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
        at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:2198)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
        at org.apache.hadoop.ipc.Client.call(Client.java:1235)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at com.sun.proxy.$Proxy24.initReplicaRecovery(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.initReplicaRecovery(InterDatanodeProtocolTranslatorPB.java:83)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:1812)
        ... 4 more
2013-07-16 17:14:33,342 WARN [org.apache.hadoop.hdfs.server.datanode.DataNode$2@64a1fcba] datanode.DataNode(1894): Failed to obtain replica info for block (=BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041) from datanode (=127.0.0.1:51438)
java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: replica.getGenerationStamp() >= recoveryId = 1041, block=blk_4297992342878601848_1041, replica=FinalizedReplica, blk_4297992342878601848_1041, FINALIZED
  getNumBytes()     = 794
  getBytesOnDisk()  = 794
  getVisibleLength()= 794
  getVolume()       = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data1/current
  getBlockFile()    = /home/ec2-user/jenkins/workspace/HBase-0.95-Hadoop-2/hbase-server/target/test-data/f2763e32-fe49-4988-ac94-eeca82431821/dfscluster_643a635e-4e39-4aa5-974c-25e01db16ff7/dfs/data/data1/current/BP-1477359609-10.197.55.49-1373994849464/current/finalized/blk_4297992342878601848
  unlinked          =false
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1462)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1422)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:1801)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:1812)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1880)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:215)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:1786)
        at java.lang.Thread.run(Thread.java:662)
2013-07-16 17:14:33,342 WARN [org.apache.hadoop.hdfs.server.datanode.DataNode$2@64a1fcba] datanode.DataNode$2(1788): recoverBlocks FAILED: RecoveringBlock{BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041; getBlockSize()=794; corrupt=false; offset=-1; locs=[127.0.0.1:47006, 127.0.0.1:51438]}
java.io.IOException: All datanodes failed: block=BP-1477359609-10.197.55.49-1373994849464:blk_4297992342878601848_1041, datanodeids=[127.0.0.1:47006, 127.0.0.1:51438]
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1901)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:215)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:1786)
        at java.lang.Thread.run(Thread.java:662)
{noformat}

> TestReplicationQueueFailover (and Compressed) can fail because the recovered queue gets stuck on ClosedByInterruptException
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8919
>                 URL: https://issues.apache.org/jira/browse/HBASE-8919
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>         Attachments: HBASE-8919.patch
>
>
> Looking at this build:
> https://builds.apache.org/job/hbase-0.95-on-hadoop2/173/testReport/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
> The only thing I can find that went wrong is that the recovered queue was not completely done because the source fails like this:
> {noformat}
> 2013-07-10 11:53:51,538 INFO [Thread-1259] regionserver.ReplicationSource$2(799): Slave cluster looks down: Call to hemera.apache.org/140.211.11.27:38614 failed on local exception:
> java.nio.channels.ClosedByInterruptException
> {noformat}
> And just before that it got:
> {noformat}
> 2013-07-10 11:53:51,290 WARN [ReplicationExecutor-0.replicationSource,2-hemera.apache.org,43669,1373457208379] regionserver.ReplicationSource(661): Can't replicate because of an error on the remote cluster:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1594 actions: FailedServerException: 1594 times,
>         at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:158)
>         at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:146)
>         at org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:692)
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:2106)
>         at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:689)
>         at org.apache.hadoop.hbase.client.HTable.batchCallback(HTable.java:697)
>         at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:682)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:239)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:161)
>         at org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:173)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.replicateWALEntry(HRegionServer.java:3735)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:14402)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2122)
>         at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1829)
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1369)
>         at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
>         at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:15177)
>         at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:94)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:642)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:376)
> {noformat}
> I wonder what's closing the socket with an interrupt; it seems it still needs to replicate more data. I'll start by adding the stack trace for the message when it fails to replicate on a "local exception". Also, I found a thread that wasn't shut down properly that I'm going to fix to help with debugging.
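For context on the destination-side log above: the DataNode aborts the recovery attempt because the replica's generation stamp has already caught up with the recoveryId the NameNode handed out for that attempt (here both are 1041, and the replica is already FINALIZED). The following is a minimal, self-contained sketch of that sanity check; it paraphrases the condition reported by FsDatasetImpl#initReplicaRecovery rather than reproducing the actual HDFS source, and the class and method names are illustrative only.

{noformat}
import java.io.IOException;

/**
 * Illustrative sketch only (not the real HDFS code): the essence of the
 * sanity check that produced the "THIS IS NOT SUPPOSED TO HAPPEN" message
 * in the log above. Each lease-recovery attempt gets a fresh recoveryId
 * (a new generation stamp) from the NameNode; if the local replica already
 * carries a generation stamp >= that recoveryId, the attempt is stale
 * (an earlier attempt already bumped or finalized the replica) and the
 * DataNode refuses to proceed.
 */
public class ReplicaRecoveryCheckSketch {

  static void checkRecoveryId(long replicaGenerationStamp, long recoveryId)
      throws IOException {
    if (replicaGenerationStamp >= recoveryId) {
      throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN: "
          + "replica.getGenerationStamp() >= recoveryId = " + recoveryId);
    }
    // Otherwise the real code would stop any writer, mark the replica as
    // under recovery, and return its info to the primary DataNode.
  }

  public static void main(String[] args) {
    // Values taken from the log: the second recovery attempt used
    // recoveryId 1041, but the replica had already been finalized as
    // blk_4297992342878601848_1041, so the check fires on both DataNodes
    // and recoverBlocks fails.
    try {
      checkRecoveryId(1041L, 1041L);
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
{noformat}

If that reading is right, it would be consistent with the destination RS's log split never completing before the test timed out, as described above.

On the plan at the end of the quoted description ("adding the stack trace for the message when it fails to replicate on a local exception"), a hypothetical sketch of that kind of logging change is below. This is not the attached HBASE-8919.patch; it only shows the general idea of passing the caught exception as the throwable argument so the full stack trace (including the ClosedByInterruptException cause) ends up in the log instead of just the message.

{noformat}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/** Hypothetical sketch of the logging change described above; names are illustrative. */
public class LocalExceptionLoggingSketch {
  private static final Log LOG = LogFactory.getLog(LocalExceptionLoggingSketch.class);

  static void reportSlaveClusterDown(IOException ioe) {
    // Before: LOG.info("Slave cluster looks down: " + ioe.getMessage());
    // After: include the throwable so commons-logging prints its stack trace too.
    LOG.info("Slave cluster looks down: " + ioe.getMessage(), ioe);
  }

  public static void main(String[] args) {
    reportSlaveClusterDown(new IOException("failed on local exception",
        new java.nio.channels.ClosedByInterruptException()));
  }
}
{noformat}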