[ https://issues.apache.org/jira/browse/HDFS-16379?focusedWorklogId=694359&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-694359 ]
ASF GitHub Bot logged work on HDFS-16379:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Dec/21 02:50
            Start Date: 11/Dec/21 02:50
    Worklog Time Spent: 10m
      Work Description: tomscut opened a new pull request #3787:
URL: https://github.com/apache/hadoop/pull/3787

JIRA: [HDFS-16379](https://issues.apache.org/jira/browse/HDFS-16379).

Recently we encountered full block report (FBR) problems in our production environment, which were solved by introducing HDFS-12914 and HDFS-14314. However, the following situation can still occur:

1. The DN obtains a `fullBlockReportLeaseId` via heartbeat.
2. The DN triggers a blockReport, but an exception occurs (this is rare, but possible). The DN then retries several times without resetting `fullBlockReportLeaseId`, because the ID is currently reset only when the report succeeds.
3. After a while the exception clears, but by then the `fullBlockReportLeaseId` has expired. Since the NN does not throw an exception for an expired lease, the DN considers the blockReport successful. In fact the blockReport was not processed this time, and the blocks are not reported until the next full block report.

Therefore, should we consider resetting the `fullBlockReportLeaseId` in the finally block? The advantage is that the DN no longer retries with an expired lease. The downside is that, while the exception persists, each heartbeat will request a new `fullBlockReportLeaseId`, but I think this cost is negligible.
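The scenario and the proposed change can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code from the pull request: `LeaseResetSketch`, `FakeNameNode`, `FakeDataNode`, the logical clock, and the `LEASE_TTL` value are all hypothetical names invented for illustration. Only the field name `fullBlockReportLeaseId` and the behavior described above (the lease is reset only on success; the NN does not throw for an expired lease) come from the description.

```java
// Illustrative simulation only: none of these classes are Hadoop's real code.
final class LeaseResetSketch {

  /** Hypothetical stand-in for the NameNode's lease handling. */
  static final class FakeNameNode {
    static final long LEASE_TTL = 3;  // lease lifetime in logical ticks (assumed value)
    long now = 0;                     // logical clock, advanced once per DN tick
    long nextLeaseId = 1;
    long activeLeaseId = 0;
    long activeLeaseExpiry = 0;
    boolean rpcFailing = false;       // when true, blockReport calls throw
    int reportsProcessed = 0;

    long requestLease() {
      activeLeaseId = nextLeaseId++;
      activeLeaseExpiry = now + LEASE_TTL;
      return activeLeaseId;
    }

    /** Throws on a simulated RPC failure; silently ignores a stale lease. */
    void blockReport(long leaseId) throws java.io.IOException {
      if (rpcFailing) {
        throw new java.io.IOException("simulated RPC failure");
      }
      if (leaseId == activeLeaseId && now < activeLeaseExpiry) {
        reportsProcessed++;
      }
      // Expired or unknown lease: no exception, so the caller sees "success".
    }
  }

  /** Hypothetical stand-in for the DataNode's heartbeat and report loop. */
  static final class FakeDataNode {
    final FakeNameNode nn;
    final boolean resetInFinally;
    long fullBlockReportLeaseId = 0;  // 0 means "no lease held"
    boolean reportPending = true;

    FakeDataNode(FakeNameNode nn, boolean resetInFinally) {
      this.nn = nn;
      this.resetInFinally = resetInFinally;
    }

    void tick() {
      nn.now++;
      // Heartbeat: request a lease only when a report is due and none is held.
      if (reportPending && fullBlockReportLeaseId == 0) {
        fullBlockReportLeaseId = nn.requestLease();
      }
      if (!reportPending) {
        return;
      }
      try {
        nn.blockReport(fullBlockReportLeaseId);
        reportPending = false;        // the DN believes the report succeeded
        if (!resetInFinally) {
          fullBlockReportLeaseId = 0; // described current behavior: reset on success only
        }
      } catch (java.io.IOException e) {
        // Report failed; it will be retried on the next tick.
      } finally {
        if (resetInFinally) {
          fullBlockReportLeaseId = 0; // proposed behavior: always drop the lease
        }
      }
    }
  }

  /** Returns how many reports the fake NN actually processed. */
  static int simulate(boolean resetInFinally) {
    FakeNameNode nn = new FakeNameNode();
    FakeDataNode dn = new FakeDataNode(nn, resetInFinally);
    nn.rpcFailing = true;
    for (int i = 0; i < 5; i++) dn.tick();  // failures outlast LEASE_TTL
    nn.rpcFailing = false;
    for (int i = 0; i < 3; i++) dn.tick();  // the exception has cleared
    return nn.reportsProcessed;
  }

  public static void main(String[] args) {
    System.out.println("reset on success only: " + simulate(false));
    System.out.println("reset in finally:      " + simulate(true));
  }
}
```

In this model, `simulate(false)` returns 0: the DN keeps retrying with its first lease, which has expired by the time the RPC works again, so the fake NN drops the report without an error and the DN stops trying. `simulate(true)` returns 1, at the cost of one extra lease request per failed attempt.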
Issue Time Tracking
-------------------

            Worklog Id:     (was: 694359)
    Remaining Estimate: 0h
            Time Spent: 10m

> Reset fullBlockReportLeaseId after any exceptions
> -------------------------------------------------
>
>                 Key: HDFS-16379
>                 URL: https://issues.apache.org/jira/browse/HDFS-16379
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: tomscut
>            Assignee: tomscut
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Recently we encountered full block report (FBR) problems in our production
> environment, which were solved by introducing HDFS-12914 and HDFS-14314.
> However, the following situation can still occur:
> 1. The DN obtains a *fullBlockReportLeaseId* via heartbeat.
> 2. The DN triggers a blockReport, but an exception occurs (this is rare, but
> possible). The DN then retries several times *without resetting the
> leaseID*, because the leaseID is currently reset only when the report succeeds.
> 3. After a while the exception clears, but by then the leaseID has expired.
> *Since the NN does not throw an exception for an expired lease, the DN
> considers the blockReport successful.* In fact the blockReport was not
> processed this time, and the blocks are not reported until the next full
> block report.
> Therefore, *should we consider resetting the fullBlockReportLeaseId in the
> finally block*? The advantage is that the DN no longer retries with an
> expired lease. The downside is that, while the exception persists, each
> heartbeat will request a new fullBlockReportLeaseId, but I think this cost is
> negligible.