[ https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779013#comment-16779013 ]
He Xiaoqiao commented on HDFS-14314:
------------------------------------

Thanks [~starphin] for your contribution, it seems to be getting close to the truth. Please fix the code style issues listed in https://builds.apache.org/job/PreCommit-HDFS-Build/26332/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt and check whether the failed unit tests (https://builds.apache.org/job/PreCommit-HDFS-Build/26332/testReport/) are related to your patch; running them locally may be a good choice. FYI, all the links I just mentioned come from the Jenkins result shown in the last comment.

> fullBlockReportLeaseId should be reset after registering to NN
> --------------------------------------------------------------
>
>                 Key: HDFS-14314
>                 URL: https://issues.apache.org/jira/browse/HDFS-14314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.4
>         Environment:
>            Reporter: star
>            Priority: Critical
>             Fix For: 2.8.4
>
>         Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch,
> HDFS-14314-trunk.002.patch, HDFS-14314.0.patch, HDFS-14314.2.patch,
> HDFS-14314.patch
>
>
> Since HDFS-7923, to rate-limit DN block reports, a DN asks the active NN
> for a full block report lease ID before sending a full block report. The DN
> then sends the full block report together with the lease ID. If the lease ID
> is invalid, the NN rejects the full block report and logs "not in the
> pending set".
> Consider the case where the NN is restarted while a DN is doing full block
> reporting. The DN may later send a full block report with a lease ID
> acquired from the previous NN instance, which is invalid for the new NN
> instance. Though the DN recognizes the new NN instance via heartbeat and
> re-registers itself, it does not reset the lease ID from the previous
> instance.
> The issue may cause DNs to temporarily go dead, making it unsafe to restart
> the NN, especially in a Hadoop cluster with a large number of DNs.
> HDFS-12914 reported the issue without any clue as to why it occurred, and it
> remains unsolved.
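
The stale-lease scenario described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Hadoop code: ToyNameNode, ToyDataNode, and reportAcceptedAfterRestart are hypothetical names invented for illustration, and the real NN lease bookkeeping is more involved. It shows a DN that caches a lease ID from one NN instance, then talks to a fresh instance with and without resetting the cached ID on registration.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the full block report lease handshake. Not Hadoop code. */
class LeaseResetSketch {

    /** Hypothetical stand-in for one NameNode process instance. */
    static class ToyNameNode {
        private final long idBase;
        private long issued = 0;
        private final Set<Long> pendingLeases = new HashSet<>();

        ToyNameNode(long idBase) {
            this.idBase = idBase;
        }

        /** Hands out a non-zero lease ID and remembers it as pending. */
        long requestLease() {
            long id = idBase + (++issued);
            pendingLeases.add(id);
            return id;
        }

        /** Accepts a report only if this instance issued the lease ID. */
        boolean acceptFullBlockReport(long leaseId) {
            return pendingLeases.remove(leaseId);
        }
    }

    /** Hypothetical stand-in for the DataNode-side lease state. */
    static class ToyDataNode {
        long fullBlockReportLeaseId = 0;
        private final boolean resetOnRegister;

        ToyDataNode(boolean resetOnRegister) {
            this.resetOnRegister = resetOnRegister;
        }

        /** Mirrors the condition in the issue: ask only when the cached ID is 0. */
        void acquireLeaseIfNeeded(ToyNameNode nn) {
            if (fullBlockReportLeaseId == 0) {
                fullBlockReportLeaseId = nn.requestLease();
            }
        }

        /** Re-registration; the proposed fix drops the cached lease ID here. */
        void register() {
            if (resetOnRegister) {
                fullBlockReportLeaseId = 0;
            }
        }
    }

    /** Returns whether the new NN instance accepts the DN's full block report. */
    static boolean reportAcceptedAfterRestart(boolean resetOnRegister) {
        ToyNameNode oldNn = new ToyNameNode(1000);
        ToyDataNode dn = new ToyDataNode(resetOnRegister);
        dn.acquireLeaseIfNeeded(oldNn);
        // The old NN goes away before the report is delivered, so the
        // cached lease ID is never cleared by a successful report.
        ToyNameNode newNn = new ToyNameNode(2000);
        dn.register();
        dn.acquireLeaseIfNeeded(newNn);
        return newNn.acceptFullBlockReport(dn.fullBlockReportLeaseId);
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + reportAcceptedAfterRestart(false));
        System.out.println("with reset:    " + reportAcceptedAfterRestart(true));
    }
}
```

In this model, reportAcceptedAfterRestart(false) returns false because the new instance never issued the cached ID, while reportAcceptedAfterRestart(true) returns true because the DN requests a fresh lease after registering.
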
> To make it clear, look at the code below. It is taken from the method
> offerService of class BPServiceActor, with some code removed to focus on
> the current issue. fullBlockReportLeaseId is a local variable that holds the
> lease ID from the NN. While the NN is restarting, the blockReport call throws
> an exception, which is caught by the catch block in the while loop, so
> fullBlockReportLeaseId is not reset to 0. After the NN has restarted, the DN
> sends a full block report that is rejected by the new NN instance. The DN
> will not send another full block report until the next scheduled full block
> report, about an hour later.
> The solution is simple: reset fullBlockReportLeaseId to 0 after any
> exception or after registering to the NN. The DN will then ask the new NN
> instance for a valid fullBlockReportLeaseId.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time....
>   //
>   while (shouldRun()) {
>     try {
>       final long startTime = scheduler.monotonicNow();
>       //
>       // Every so often, send heartbeat or block-report
>       //
>       final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>       HeartbeatResponse resp = null;
>       if (sendHeartbeat) {
>
>         boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
>             scheduler.isBlockReportDue(startTime);
>         scheduler.scheduleNextHeartbeat();
>         if (!dn.areHeartbeatsDisabledForTests()) {
>           resp = sendHeartBeat(requestBlockReportLease);
>           assert resp != null;
>           if (resp.getFullBlockReportLeaseId() != 0) {
>             if (fullBlockReportLeaseId != 0) {
>               LOG.warn(nnAddr + " sent back a full block report lease " +
>                   "ID of 0x" +
>                   Long.toHexString(resp.getFullBlockReportLeaseId()) +
>                   ", but we already have a lease ID of 0x" +
>                   Long.toHexString(fullBlockReportLeaseId) + ". " +
>                   "Overwriting old lease ID.");
>             }
>             fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>           }
>
>         }
>       }
>
>
>       if ((fullBlockReportLeaseId != 0) || forceFullBr) {
>         // Exception occurs here while the NN is restarting
>         cmds = blockReport(fullBlockReportLeaseId);
>         fullBlockReportLeaseId = 0;
>       }
>
>     } catch(RemoteException re) {
>       // handling elided; fullBlockReportLeaseId is not reset here
>     }
>
>   } // while (shouldRun())
> } // offerService
> {code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org