Stephen O'Donnell created HDFS-15725:
----------------------------------------

             Summary: Lease Recovery never completes for a committed block 
which the DNs never finalize
                 Key: HDFS-15725
                 URL: https://issues.apache.org/jira/browse/HDFS-15725
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.4.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


In a very rare condition, the HDFS client process can get killed right at the 
time it is completing a block / file.

The client sends the "complete" call to the namenode, moving the block into a 
committed state, but it dies before it can send the final packet to the 
Datanodes telling them to finalize the block.

This means the blocks are stuck on the datanodes in RBW state and nothing will 
ever tell them to move out of that state.

The namenode / lease manager will retry forever to close the file, but it will 
always complain it is waiting for blocks to reach minimal replication.

I have a simple test and patch to fix this, but I think it warrants some 
discussion on whether this is the correct thing to do, or if I need to put the 
fix behind a config switch.

My idea is that if lease recovery occurs, and the block is still waiting on 
"minimal replication", just put the file back to UNDER_CONSTRUCTION so that on 
the next lease recovery attempt, BLOCK RECOVERY will happen, close the file and 
move the replicas to FINALIZED.
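
To make the proposal easier to discuss, here is a toy model of the state 
transitions described above. It is only a sketch and not the actual namenode 
code: the class LeaseRecoverySketch, its methods and the revertCommittedBlock 
flag are all hypothetical names, and block recovery is collapsed into a single 
step that finalizes every replica.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of the stuck lease recovery and the proposed fix (hypothetical, not NameNode code). */
public class LeaseRecoverySketch {
    /** Simplified namenode-side states of the last block of the file. */
    public enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    /** Simplified datanode-side replica states. */
    public enum ReplicaState { RBW, FINALIZED }

    // The client sent "complete" and then died, so the block starts out COMMITTED.
    private BlockState lastBlock = BlockState.COMMITTED;
    // The datanodes never received the final packet, so every replica starts in RBW.
    private final List<ReplicaState> replicas;
    private final int minReplication;
    // Hypothetical switch standing in for the proposed behaviour.
    private final boolean revertCommittedBlock;

    public LeaseRecoverySketch(int replicaCount, int minReplication,
                               boolean revertCommittedBlock) {
        this.replicas = new ArrayList<>(Collections.nCopies(replicaCount, ReplicaState.RBW));
        this.minReplication = minReplication;
        this.revertCommittedBlock = revertCommittedBlock;
    }

    public BlockState lastBlockState() {
        return lastBlock;
    }

    public int finalizedCount() {
        int count = 0;
        for (ReplicaState r : replicas) {
            if (r == ReplicaState.FINALIZED) {
                count++;
            }
        }
        return count;
    }

    /** One lease recovery attempt. Returns true if the file could be closed. */
    public boolean recoverLease() {
        if (lastBlock == BlockState.COMPLETE) {
            return true;
        }
        if (lastBlock == BlockState.COMMITTED) {
            if (finalizedCount() >= minReplication) {
                lastBlock = BlockState.COMPLETE;
                return true;
            }
            // Still waiting on minimal replication. Without the fix nothing changes,
            // so every later attempt ends up here again.
            if (revertCommittedBlock) {
                lastBlock = BlockState.UNDER_CONSTRUCTION;
            }
            return false;
        }
        // UNDER_CONSTRUCTION: block recovery runs, modelled here as finalizing
        // all replicas and completing the block in one step.
        Collections.fill(replicas, ReplicaState.FINALIZED);
        lastBlock = BlockState.COMPLETE;
        return true;
    }

    public static void main(String[] args) {
        for (boolean fix : new boolean[] {false, true}) {
            LeaseRecoverySketch file = new LeaseRecoverySketch(3, 1, fix);
            for (int attempt = 1; attempt <= 3; attempt++) {
                boolean closed = file.recoverLease();
                System.out.println("fix=" + fix + " attempt=" + attempt
                    + " closed=" + closed + " block=" + file.lastBlockState());
            }
        }
    }
}
```

In this model, a file without the fix stays COMMITTED no matter how many times 
recoverLease() is called, while with the fix the first attempt reverts the 
block and the second attempt closes the file.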



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

