[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697456#comment-17697456
 ] 

ASF GitHub Bot commented on HDFS-16942:
---------------------------------------

sodonnel opened a new pull request, #5460:
URL: https://github.com/apache/hadoop/pull/5460

   ### Description of PR
   
   When a datanode (DN) sends a full block report (FBR) to the namenode (NN), 
it needs a lease to send it. On a couple of busy clusters, we have seen an 
issue where the DN is somehow delayed in sending the FBR after requesting the 
lease. The NN then rejects the FBR and logs a message to that effect, but the 
DN believes the report was successful and does not try to send another report 
until the default 6-hour interval has passed.
   
   If this happens to a few DNs, there can be missing and under-replicated 
blocks, further adding to the cluster load. Even worse, I have seen DNs join 
the cluster with zero blocks, so it is not obvious that the under-replication 
is caused by a lost FBR, as all DNs appear to be up and running.
   
   I believe we should propagate an error back to the DN if the FBR is 
rejected, so that the DN can request a new lease and try again.
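
The proposed behaviour can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: the class names (`FbrLeaseSketch`, `NameNode`, `DataNode`), the `InvalidLeaseException` type, and all method names are hypothetical stand-ins, and the retry loop stands in for whatever scheduling the DN really uses between attempts. It only shows the idea that a rejected FBR surfaces as an error, the DN drops its stale lease ID, and the next attempt asks for a fresh lease.

```java
import java.util.HashSet;
import java.util.Set;

public class FbrLeaseSketch {

    /** Hypothetical error the NN sends back when the lease is not valid. */
    static class InvalidLeaseException extends Exception {
        InvalidLeaseException(String msg) { super(msg); }
    }

    /** Toy namenode: hands out lease IDs and rejects reports carrying unknown ones. */
    static class NameNode {
        private long nextLeaseId = 1;
        private final Set<Long> validLeases = new HashSet<>();
        int acceptedReports = 0;

        long requestLease() {
            long id = nextLeaseId++;
            validLeases.add(id);
            return id;
        }

        /** Simulates a lease timing out before the DN gets to use it. */
        void expireLease(long id) { validLeases.remove(id); }

        /** Accepts the report only if the lease is still valid; a lease is single-use. */
        void blockReport(long leaseId) throws InvalidLeaseException {
            if (!validLeases.remove(leaseId)) {
                throw new InvalidLeaseException("lease " + leaseId + " is not valid");
            }
            acceptedReports++;
        }
    }

    /** Toy datanode: on rejection it forgets the lease and requests a new one. */
    static class DataNode {
        private final NameNode nn;
        long leaseId = 0;   // 0 means "no lease held" in this sketch
        int attempts = 0;

        DataNode(NameNode nn) { this.nn = nn; }

        boolean sendFullBlockReport(int maxAttempts) {
            for (int i = 0; i < maxAttempts; i++) {
                attempts++;
                if (leaseId == 0) {
                    leaseId = nn.requestLease();
                }
                try {
                    nn.blockReport(leaseId);
                    leaseId = 0;        // lease consumed by the successful report
                    return true;
                } catch (InvalidLeaseException e) {
                    leaseId = 0;        // stale lease: drop it and retry with a new one
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        NameNode nn = new NameNode();
        DataNode dn = new DataNode(nn);

        // The DN obtains a lease, then is delayed long enough for it to expire.
        dn.leaseId = nn.requestLease();
        nn.expireLease(dn.leaseId);

        boolean ok = dn.sendFullBlockReport(3);
        System.out.println("accepted=" + ok + " attempts=" + dn.attempts);
    }
}
```

Without the error path, the first attempt in this sketch would look successful to the DN and the report would be silently lost until the next scheduled FBR.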
   
   ### How was this patch tested?
   
   Modified a test and added a new test. Also validated manually by changing 
the code to always throw the InvalidLease exception to ensure the DN handled 
it, reset the lease ID, and retried.




> Send error to datanode if FBR is rejected due to bad lease
> ----------------------------------------------------------
>
>                 Key: HDFS-16942
>                 URL: https://issues.apache.org/jira/browse/HDFS-16942
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> When a datanode (DN) sends a full block report (FBR) to the namenode (NN), 
> it needs a lease to send it. On a couple of busy clusters, we have seen an 
> issue where the DN is somehow delayed in sending the FBR after requesting 
> the lease. The NN then rejects the FBR and logs a message to that effect, 
> but the DN believes the report was successful and does not try to send 
> another report until the default 6-hour interval has passed.
> If this happens to a few DNs, there can be missing and under-replicated 
> blocks, further adding to the cluster load. Even worse, I have seen DNs 
> join the cluster with zero blocks, so it is not obvious that the 
> under-replication is caused by a lost FBR, as all DNs appear to be up and 
> running.
> I believe we should propagate an error back to the DN if the FBR is rejected, 
> so that the DN can request a new lease and try again.



