[ https://issues.apache.org/jira/browse/HDFS-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295930#comment-15295930 ]
Harsh J commented on HDFS-10240: -------------------------------- This seems similar to the situation described in HDFS-3584. The comment there from [~umamaheswararao] suggests a probable better approach of altering the lease of the file to the recovering client upfront, to prevent the older client from coming back and closing the file out concurrently (given that close call does already check for active lease). Do you think such an approach would be better? > Race between close/recoverLease leads to missing block > ------------------------------------------------------ > > Key: HDFS-10240 > URL: https://issues.apache.org/jira/browse/HDFS-10240 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: zhouyingchao > Assignee: zhouyingchao > Attachments: HDFS-10240-001.patch > > > We got a missing block in our cluster, and logs related to the missing block > are as follows: > 2016-03-28,10:00:06,188 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocateBlock: XXXXXXXXXX. BP-219149063-10.108.84.25-1446859315800 > blk_1226490256_153006345{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]} > 2016-03-28,10:00:06,205 INFO BlockStateChange: BLOCK* > blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]} > recovery started, > primary=ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW] > 2016-03-28,10:00:06,205 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File XXXXXXXXXX has not been closed. Lease > recovery is in progress. RecoveryId = 153006357 for block > blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]} > 2016-03-28,10:00:06,248 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > checkFileProgress: blk_1226490256_153006345{blockUCState=COMMITTED, > primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]} > has not reached minimal replication 1 > 2016-03-28,10:00:06,358 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.114.5.53:11402 is added to > blk_1226490256_153006345{blockUCState=COMMITTED, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]} > size 139 > 2016-03-28,10:00:06,441 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.114.5.44:11402 is added to blk_1226490256_153006345 size > 139 > 2016-03-28,10:00:06,660 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.114.6.14:11402 is added to blk_1226490256_153006345 size > 139 > 2016-03-28,10:00:08,808 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > commitBlockSynchronization(lastblock=BP-219149063-10.108.84.25-1446859315800:blk_1226490256_153006345, > newgenerationstamp=153006357, newlength=139, newtargets=[10.114.6.14:11402, > 10.114.5.53:11402, 10.114.5.44:11402], closeFile=true, deleteBlock=false) > 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on > 10.114.6.14:11402 by /10.114.6.14 because block is COMPLETE and reported > genstamp 153006357 does not match genstamp in block map 153006345 > 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on > 10.114.5.53:11402 by /10.114.5.53 because block is COMPLETE and reported > genstamp 153006357 does not match genstamp in block map 153006345 > 2016-03-28,10:00:08,837 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on > 10.114.5.44:11402 by /10.114.5.44 because block is COMPLETE and reported > genstamp 153006357 does not match genstamp in block map 153006345 > From the log, I guess this is what has happened in order: > 1 Process A open a file F for write. > 2 Somebody else called recoverLease against F. > 3 A closed F. > The root cause of the missing block is that recoverLease increased gen count > of blocks on datanode whereas the gen count on Namenode is reset by close in > step 3. I think we should check if the last block is under recovery when > trying to close a file. If so we should just throw an exception to client > quickly. > Although the issue is found on a 2.4 HDFS, it looks like the issue also exist > on the trunk from code. I'll post a patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org