[ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201540#comment-16201540 ]
Jiandan Yang commented on HDFS-12638:
--------------------------------------

DataNode block recovery failed because the new block size is Long.MAX_VALUE:

{code:java}
2017-10-09 19:19:17,054 INFO [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at et2btsm1.et2.tbsite.net/11.251.159.136:8020 calls recoverBlock(BP-1721125339-xx.xxx.xx.xxx-1505883414013:blk_1084203820_11907141, targets=[DatanodeInfoWithStorage[xx.xxx.xx.aaa:50010,null,null], DatanodeInfoWithStorage[xx.xxx.xx.bbb:50010,null,null], DatanodeInfoWithStorage[xx.xxx.xx.ccc:50010,null,null]], newGenerationStamp=11907145, newBlock=blk_1084203824_11907145)
2017-10-09 19:19:17,055 INFO [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1084203820_11907141, recoveryId=11907145, replica=FinalizedReplica, blk_1084203820_11907141, FINALIZED
  getNumBytes()     = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()       = /dump/10/dfs/data/current
  getBlockFile()    = /dump/10/dfs/data/current/BP-1721125339-xx.xxx.xx.xxx-1505883414013/current/finalized/subdir31/subdir3/blk_1084203820
2017-10-09 19:19:17,055 INFO [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: changing replica state for blk_1084203820_11907141 from FINALIZED to RUR
2017-10-09 19:19:17,058 WARN [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to updateBlock (newblock=BP-1721125339-xx.xxx.xx.xxx-1505883414013:blk_1084203824_11907145, datanode=DatanodeInfoWithStorage[xx.xxx.xx.aaa:50010,null,null])
org.apache.hadoop.ipc.RemoteException(java.io.IOException): rur.getNumBytes() < newlength = 9223372036854775807, rur=ReplicaUnderRecovery, blk_1084203820_11907141, RUR
  getNumBytes()     = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()       = /dump/9/dfs/data/current
  getBlockFile()    = /dump/9/dfs/data/current/BP-1721125339-11.251.159.136-1505883414013/current/finalized/subdir31/subdir3/blk_1084203820
  recoveryId=11907145
  original=FinalizedReplica, blk_1084203820_11907141, FINALIZED
  getNumBytes()     = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()       = /dump/9/dfs/data/current
  getBlockFile()    = /dump/9/dfs/data/current/BP-1721125339-11.251.159.136-1505883414013/current/finalized/subdir31/subdir3/blk_1084203820
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2736)
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.updateReplicaUnderRecovery(FsDatasetImpl.java:2678)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.updateReplicaUnderRecovery(DataNode.java:2776)
	at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:78)
	at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3107)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1483)
	at org.apache.hadoop.ipc.Client.call(Client.java:1429)
	at org.apache.hadoop.ipc.Client.call(Client.java:1339)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy22.updateReplicaUnderRecovery(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
	at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:77)
	at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$600(BlockRecoveryWorker.java:60)
	at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:283)
	at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:175)
	at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:382)
	at java.lang.Thread.run(Thread.java:834)
....
2017-10-09 19:19:17,060 WARN [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@437346ab] org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED: RecoveringBlock{BP-1721125339-xx.xxx.xx.xxx-1505883414013:blk_1084203820_11907141; getBlockSize()=7; corrupt=false; offset=-1; locs=[DatanodeInfoWithStorage[xx.xxx.xx.aaa:50010,null,null], DatanodeInfoWithStorage[xx.xxx.xx.bbb:50010,null,null], DatanodeInfoWithStorage[xx.xxx.xx.ccc:50010,null,null]]}
java.io.IOException: Cannot recover BP-1721125339-xx.xxx.xx.xxx-1505883414013:blk_1084203820_11907141, the following 3 data-nodes failed {
  DatanodeInfoWithStorage[xx.xxx.xx.aaa:50010,null,null]
  DatanodeInfoWithStorage[xx.xxx.xx.bbb:50010,null,null]
  DatanodeInfoWithStorage[xx.xxx.xx.ccc:50010,null,null]
}
{code}

> NameNode exits due to ReplicationMonitor thread received Runtime exception in
> ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang
>
> The active NameNode exited due to an NPE. I can confirm that the BlockCollection passed in when creating ReplicationWork is null, but I do not know why it is null. Looking through the history, I found that [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> 	at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
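The recovery failure in the comment above hinges on the check quoted in the log, "rur.getNumBytes() < newlength": the replica under recovery has 7 bytes, but the recovery request carries a new length of Long.MAX_VALUE, so the update is rejected. The sketch below is a hypothetical, simplified stand-in for that guard in FsDatasetImpl#updateReplicaUnderRecovery; the class and method names are illustrative, not actual HDFS source.

{code:java}
import java.io.IOException;

public class RecoveryLengthCheck {

    // Simplified stand-in for the guard that produced the log message:
    // a replica under recovery may be truncated, never extended, so a
    // new length larger than the bytes we actually have is rejected.
    static void checkRecoveryLength(long rurNumBytes, long newLength)
            throws IOException {
        if (rurNumBytes < newLength) {
            throw new IOException(
                "rur.getNumBytes() < newlength = " + newLength);
        }
    }

    public static void main(String[] args) throws IOException {
        // Same length: accepted, nothing thrown.
        checkRecoveryLength(7L, 7L);
        try {
            // The failing case from the log: 7 bytes on disk,
            // newlength = Long.MAX_VALUE (9223372036854775807).
            checkRecoveryLength(7L, Long.MAX_VALUE);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
{code}

Because all three replicas hold 7 bytes, every datanode rejects the update the same way, which is why recoverBlocks reports all 3 data-nodes failed.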
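The NPE in the quoted issue can be reproduced in miniature. Assuming, as the description states, that HDFS-9754 removed the null check on the BlockCollection handed to ReplicationWork's constructor, chooseTargets then dereferences the null field directly and the exception propagates up and kills the ReplicationMonitor thread. The classes below are simplified hypothetical stand-ins for the real HDFS types, not actual source.

{code:java}
public class ChooseTargetsSketch {

    // Minimal stand-in for the real BlockCollection interface.
    static class BlockCollection {
        String getStoragePolicyID() { return "HOT"; }
    }

    // Minimal stand-in for ReplicationWork.
    static class ReplicationWork {
        private final BlockCollection bc;

        ReplicationWork(BlockCollection bc) {
            // Per the issue description, the null check here was
            // removed by HDFS-9754, so a null bc is stored as-is.
            this.bc = bc;
        }

        String chooseTargets() {
            // NPE is thrown here when bc is null, matching the
            // ReplicationWork.chooseTargets frame in the stack trace.
            return bc.getStoragePolicyID();
        }
    }

    public static void main(String[] args) {
        ReplicationWork work = new ReplicationWork(null);
        try {
            work.chooseTargets();
        } catch (NullPointerException e) {
            System.out.println("NPE in chooseTargets, as in the NN log");
        }
    }
}
{code}

One possible mitigation, under the same assumption, is to skip scheduling replication work for a block whose BlockCollection is null (or restore the removed check) rather than letting the exception terminate the ReplicationMonitor thread and bring down the NameNode.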