Hello folks,

I'm running Apache Hadoop 2.6.0 and hitting a weird problem: I keep seeing corrupt replicas. Example:

    2016-11-15 06:42:38,104 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: *blk_1073747320_231160*{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-11d5d492-a608-4bc0-9a04-048b8127bb32:NORMAL:10.0.8.185:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0, corrupt replicas: 2*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.185:50010 10.0.8.148:50010 10.0.8.149:50010, Current Datanode: 10.0.8.185:50010, Is current datanode decommissioning: true
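To map that block back to a file, these are the lookups I've been attempting (a sketch of my session, not a recipe: the NameNode log path below is just where my install keeps it, and the -blockId form is, I believe, only in releases newer than 2.6.0):

```shell
# fsck skips files that are open for write unless -openforwrite is given;
# since the log says "Is Open File: true", include it when grepping:
hadoop fsck / -files -blocks -locations -openforwrite | grep blk_1073747320_231160

# The NameNode log records the file path when a block is allocated
# (log location is an assumption; adjust for your install):
grep blk_1073747320 /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log*

# Newer releases (2.7.0+, I believe) can resolve a block ID to its file directly:
# hdfs fsck -blockId blk_1073747320
```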
But I can't figure out which file this block belongs to:

    hadoop fsck / -files -blocks -locations | grep blk_1073747320_231160

returns nothing, so I'm unable to delete the file. My concern is that this seems to be blocking decommissioning of my datanode (going on for ~18 hours now), since, looking at the code in BlockManager.java, we would not mark the DN as decommissioned while it still holds blocks with no live replicas.

My questions are:

1. What causes corrupt replicas, and how can I avoid them? I seem to be seeing these frequently (examples from prior runs):

    hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1074063633_2846521{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-7b8e7b76-6066-43fb-8340-d93f7ab9c6ea:NORMAL:10.0.8.75:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0*, *corrupt replicas: 4*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.75:50010 10.0.8.156:50010 10.0.8.188:50010 10.0.8.34:50010 10.0.8.74:50010, Current Datanode: 10.0.8.75:50010, Is current datanode decommissioning: true

    hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0, corrupt replicas: 3*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.153:50010, Is current datanode decommissioning: true

    hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, *live replicas: 0, corrupt replicas: 3*, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.7:50010, Is current datanode decommissioning: true

2. Is this possibly a JIRA that's been fixed in recent versions (I realize I'm running a very old version)?

3. Is there anything I can do to "force" decommissioning of such nodes, apart from forcefully terminating them?

Thanks,
Hari
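P.S. For what it's worth, this is how I'm watching the stuck decommission in the meantime (stock dfsadmin output, as I understand it; the grep context count is just what lines up with my report format):

```shell
# Per-datanode report; "Decommission Status" should read
# "Decommission in progress" for the stuck node:
hdfs dfsadmin -report | grep -B 2 'Decommission Status'
```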