[ https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775983#comment-16775983 ]
Hadoop QA commented on HDFS-14314:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 6s{color} | {color:red} HDFS-14314 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-14314 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959914/HDFS-14314.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26312/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> fullBlockReportLeaseId should be reset after registering to NN
> --------------------------------------------------------------
>
> Key: HDFS-14314
> URL: https://issues.apache.org/jira/browse/HDFS-14314
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.8.4
> Environment:
>
>
> Reporter: star
> Priority: Critical
> Fix For: 2.8.4
>
> Attachments: HDFS-14314.patch
>
>
> Since HDFS-7923, to rate-limit DN block reports, a DN asks the active NN
> for a full block report lease id before sending a full block report. The
> DN then sends the full block report together with that lease id. If the
> lease id is invalid, the NN rejects the full block report and logs
> "not in the pending set".
> Consider the case where the NN is restarted while a DN is doing a full
> block report. The DN may later send a full block report with a lease id
> acquired from the previous NN instance, which is invalid for the new NN
> instance. Although the DN recognizes the new NN instance by heartbeat and
> re-registers itself, it does not reset the lease id from the previous
> instance.
> The issue may cause DNs to temporarily go dead, making it unsafe to
> restart the NN, especially in a Hadoop cluster with a large number of DNs.
> HDFS-12914 reported the issue without identifying why it occurred, and it
> remains unresolved.
> To make this clear, look at the code below. It is taken from the method
> offerService of class BPServiceActor, with some code removed to focus on
> the current issue. fullBlockReportLeaseId is a local variable that holds
> the lease id from the NN. While the NN is restarting, the blockReport call
> throws an exception, which is caught by the catch block in the while loop,
> so fullBlockReportLeaseId is not set to 0. After the NN has restarted, the
> DN sends a full block report with the stale lease id, and the new NN
> instance rejects it. The DN then does not send another full block report
> until the next scheduled full block report, about an hour later.
> The solution is simple: reset fullBlockReportLeaseId to 0 after any
> exception or after registering to the NN. The DN will then ask the new NN
> instance for a valid fullBlockReportLeaseId.
> {code:java}
>   private void offerService() throws Exception {
>     long fullBlockReportLeaseId = 0;
>     //
>     // Now loop for a long time....
>     //
>     while (shouldRun()) {
>       try {
>         final long startTime = scheduler.monotonicNow();
>         //
>         // Every so often, send heartbeat or block-report
>         //
>         final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>         HeartbeatResponse resp = null;
>         if (sendHeartbeat) {
>
>           boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
>               scheduler.isBlockReportDue(startTime);
>           scheduler.scheduleNextHeartbeat();
>           if (!dn.areHeartbeatsDisabledForTests()) {
>             resp = sendHeartBeat(requestBlockReportLease);
>             assert resp != null;
>             if (resp.getFullBlockReportLeaseId() != 0) {
>               if (fullBlockReportLeaseId != 0) {
>                 LOG.warn(nnAddr + " sent back a full block report lease " +
>                     "ID of 0x" +
>                     Long.toHexString(resp.getFullBlockReportLeaseId()) +
>                     ", but we already have a lease ID of 0x" +
>                     Long.toHexString(fullBlockReportLeaseId) + ". " +
>                     "Overwriting old lease ID.");
>               }
>               fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>             }
>
>           }
>         }
>
>
>         if ((fullBlockReportLeaseId != 0) || forceFullBr) {
>           // An exception occurs here while the NN is restarting
>           cmds = blockReport(fullBlockReportLeaseId);
>           fullBlockReportLeaseId = 0;
>         }
>
>       } catch(RemoteException re) {
>         // fullBlockReportLeaseId is not reset on this path
>       }
>     } // while (shouldRun())
>   } // offerService
> {code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
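
The failure and the proposed fix described in the issue can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the attached patch and not HDFS code: FakeNameNode, grantLease, acceptFullBlockReport and reportAcceptedAfterRestart are hypothetical names, lease ids are made-up numbers, and an IllegalStateException stands in for whatever exception the real blockReport call raises while the NN is restarting.

```java
// Hypothetical simulation of the stale-lease problem; none of these types exist in HDFS.
class LeaseResetSketch {

    /** Stand-in for one NameNode process. A lease is only valid on the instance that issued it. */
    static final class FakeNameNode {
        private final long instanceId;
        private long issuedLease = 0;   // 0 means "no lease outstanding"

        FakeNameNode(long instanceId) {
            this.instanceId = instanceId;
        }

        /** Issues a lease id that differs between instances (made-up numbering). */
        long grantLease() {
            issuedLease = instanceId * 1000 + 1;
            return issuedLease;
        }

        /** Accepts the report only if the lease was issued by this instance. */
        boolean acceptFullBlockReport(long leaseId) {
            boolean ok = leaseId != 0 && leaseId == issuedLease;
            issuedLease = 0;
            return ok;
        }
    }

    /**
     * Runs a simplified DataNode loop across one NameNode restart and returns
     * whether the full block report was accepted by the new instance.
     */
    static boolean reportAcceptedAfterRestart(boolean resetLeaseOnFailure) {
        FakeNameNode nn = new FakeNameNode(1);
        long fullBlockReportLeaseId = 0;
        boolean reportDue = true;
        boolean accepted = false;

        for (int round = 0; round < 3 && reportDue; round++) {
            try {
                // Heartbeat: a lease is requested only when none is held.
                if (fullBlockReportLeaseId == 0) {
                    fullBlockReportLeaseId = nn.grantLease();
                }
                if (round == 0) {
                    // The NameNode restarts while the report is in flight.
                    nn = new FakeNameNode(2);
                    throw new IllegalStateException("NN restarting");
                }
                accepted = nn.acceptFullBlockReport(fullBlockReportLeaseId);
                fullBlockReportLeaseId = 0;
                reportDue = false;   // next full report waits for the next schedule
            } catch (IllegalStateException e) {
                if (resetLeaseOnFailure) {
                    fullBlockReportLeaseId = 0;   // the reset proposed in the issue
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("without reset: accepted=" + reportAcceptedAfterRestart(false));
        System.out.println("with reset:    accepted=" + reportAcceptedAfterRestart(true));
    }
}
```

In this sketch, the run without the reset carries the lease from the first instance into the next round and is rejected, while the run with the reset requests a fresh lease from the second instance and is accepted.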