zhouyingchao created HDFS-10240:
-----------------------------------

             Summary: Race between close/recoverLease leads to missing block
                 Key: HDFS-10240
                 URL: https://issues.apache.org/jira/browse/HDFS-10240
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: zhouyingchao
            Assignee: zhouyingchao


We got a missing block in our cluster, and logs related to the missing block 
are as follows:
2016-03-28,10:00:06,188 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocateBlock: XXXXXXXXXX. BP-219149063-10.108.84.25-1446859315800 
blk_1226490256_153006345{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
2016-03-28,10:00:06,205 INFO BlockStateChange: BLOCK* 
blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
 recovery started, 
primary=ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]
2016-03-28,10:00:06,205 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.internalReleaseLease: File XXXXXXXXXX has not been closed. Lease 
recovery is in progress. RecoveryId = 153006357 for block 
blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
2016-03-28,10:00:06,248 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* checkFileProgress: 
blk_1226490256_153006345{blockUCState=COMMITTED, primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]}
 has not reached minimal replication 1
2016-03-28,10:00:06,358 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.114.5.53:11402 is added to 
blk_1226490256_153006345{blockUCState=COMMITTED, primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
 
ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]}
 size 139
2016-03-28,10:00:06,441 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.114.5.44:11402 is added to blk_1226490256_153006345 size 139
2016-03-28,10:00:06,660 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap 
updated: 10.114.6.14:11402 is added to blk_1226490256_153006345 size 139
2016-03-28,10:00:08,808 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
commitBlockSynchronization(lastblock=BP-219149063-10.108.84.25-1446859315800:blk_1226490256_153006345,
 newgenerationstamp=153006357, newlength=139, newtargets=[10.114.6.14:11402, 
10.114.5.53:11402, 10.114.5.44:11402], closeFile=true, deleteBlock=false)
2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on 
10.114.6.14:11402 by /10.114.6.14 because block is COMPLETE and reported 
genstamp 153006357 does not match genstamp in block map 153006345
2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on 
10.114.5.53:11402 by /10.114.5.53 because block is COMPLETE and reported 
genstamp 153006357 does not match genstamp in block map 153006345
2016-03-28,10:00:08,837 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on 
10.114.5.44:11402 by /10.114.5.44 because block is COMPLETE and reported 
genstamp 153006357 does not match genstamp in block map 153006345

>From the log, I guess this is what has happened in order:
1  Process A open a file F for write.
2  Somebody else called recoverLease against F.
3  A closed F.

The root cause of the missing block is that recoverLease increased gen count of 
blocks on datanode whereas the gen count on Namenode is reset by close in step 
3. I think we should check if the last block is under recovery when trying to 
close a file. If so we should just throw an exception to client quickly.

Although the issue is found on a 2.4 HDFS, it looks like the issue also exist 
on the trunk from code. I'll post a patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to