[ 
https://issues.apache.org/jira/browse/HDFS-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698112#comment-17698112
 ] 

ASF GitHub Bot commented on HDFS-16942:
---------------------------------------

virajjasani commented on code in PR #5460:
URL: https://github.com/apache/hadoop/pull/5460#discussion_r1130195994


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java:
##########
@@ -791,6 +792,9 @@ private void offerService() throws Exception {
           shouldServiceRun = false;
           return;
         }
+        if (InvalidBlockReportLeaseException.class.getName().equals(reClass)) {
+          fullBlockReportLeaseId = 0;

Review Comment:
   Or maybe `reportId` and `leaseId` can be added as constructor argument to 
`InvalidBlockReportLeaseException`. This way `RemoteException in offerService` 
log will likely print them anyways?





> Send error to datanode if FBR is rejected due to bad lease
> ----------------------------------------------------------
>
>                 Key: HDFS-16942
>                 URL: https://issues.apache.org/jira/browse/HDFS-16942
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>
> When a datanode sends a FBR to the namenode, it requires a lease to send it. 
> On a couple of busy clusters, we have seen an issue where the DN is somehow 
> delayed in sending the FBR after requesting the least. Then the NN rejects 
> the FBR and logs a message to that effect, but from the Datanodes point of 
> view, it thinks the report was successful and does not try to send another 
> report until the 6 hour default interval has passed.
> If this happens to a few DNs, there can be missing and under replicated 
> blocks, further adding to the cluster load. Even worse, I have see the DNs 
> join the cluster with zero blocks, so it is not obvious the under replication 
> is caused by lost a FBR, as all DNs appear to be up and running.
> I believe we should propagate an error back to the DN if the FBR is rejected, 
> that way, the DN can request a new lease and try again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to