[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779013#comment-16779013
 ] 

He Xiaoqiao commented on HDFS-14314:
------------------------------------

Thanks [~starphin] for your contribution, it seems to be getting close to the 
truth. Please fix the code style issues listed in 
https://builds.apache.org/job/PreCommit-HDFS-Build/26332/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 and check whether the failed unit tests 
(https://builds.apache.org/job/PreCommit-HDFS-Build/26332/testReport/) are 
related to your patch; running them locally may be a good choice. FYI, all the 
links I mentioned come from the Jenkins result shown in the last comment.

> fullBlockReportLeaseId should be reset after registering to NN
> --------------------------------------------------------------
>
>                 Key: HDFS-14314
>                 URL: https://issues.apache.org/jira/browse/HDFS-14314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.4
>         Environment:  
>  
>  
>            Reporter: star
>            Priority: Critical
>             Fix For: 2.8.4
>
>         Attachments: HDFS-14314-trunk.001.patch, HDFS-14314-trunk.001.patch, 
> HDFS-14314-trunk.002.patch, HDFS-14314.0.patch, HDFS-14314.2.patch, 
> HDFS-14314.patch
>
>
>       Since HDFS-7923, to rate-limit DN block reports, a DN asks the active NN 
> for a full block report lease id before sending a full block report, and then 
> sends the report together with that lease id. If the lease id is invalid, the 
> NN rejects the full block report and logs "not in the pending set".
>       Consider the case where the NN is restarted while a DN is doing full 
> block reporting. The DN may later send a full block report with a lease id 
> acquired from the previous NN instance, which is invalid for the new NN 
> instance. Although the DN recognizes the new NN instance by heartbeat and 
> re-registers itself, it does not reset the lease id from the previous instance.
>       The issue may cause DNs to temporarily go dead, making it unsafe to 
> restart the NN, especially in a Hadoop cluster with a large number of DNs. 
> HDFS-12914 reported the issue without any clue as to why it occurred, and it 
> remains unsolved.
>       To make it clear, look at the code below, taken from method offerService 
> of class BPServiceActor. Some code is omitted to focus on the current issue. 
> fullBlockReportLeaseId is a local variable that holds the lease id from the 
> NN. While the NN is restarting, the blockReport call throws an exception, 
> which is caught by the catch block in the while loop, so the assignment that 
> follows the call is skipped and fullBlockReportLeaseId is not set to 0. After 
> the NN has restarted, the DN sends a full block report with the stale id, 
> which the new NN instance rejects. The DN will not send another full block 
> report until the next scheduled full block report, about an hour later.
>       The solution is simple: reset fullBlockReportLeaseId to 0 after any 
> exception or after registering to the NN, so that the DN asks the new NN 
> instance for a valid fullBlockReportLeaseId.
> {code:java}
> private void offerService() throws Exception {
>   long fullBlockReportLeaseId = 0;
>   //
>   // Now loop for a long time....
>   //
>   while (shouldRun()) {
>     try {
>       final long startTime = scheduler.monotonicNow();
>       //
>       // Every so often, send heartbeat or block-report
>       //
>       final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
>       HeartbeatResponse resp = null;
>       if (sendHeartbeat) {
>       
>         boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
>                 scheduler.isBlockReportDue(startTime);
>         scheduler.scheduleNextHeartbeat();
>         if (!dn.areHeartbeatsDisabledForTests()) {
>           resp = sendHeartBeat(requestBlockReportLease);
>           assert resp != null;
>           if (resp.getFullBlockReportLeaseId() != 0) {
>             if (fullBlockReportLeaseId != 0) {
>               LOG.warn(nnAddr + " sent back a full block report lease " +
>                       "ID of 0x" +
>                       Long.toHexString(resp.getFullBlockReportLeaseId()) +
>                       ", but we already have a lease ID of 0x" +
>                       Long.toHexString(fullBlockReportLeaseId) + ". " +
>                       "Overwriting old lease ID.");
>             }
>             fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
>           }
>          
>         }
>       }
>    
>      
>       if ((fullBlockReportLeaseId != 0) || forceFullBr) {
>         //Exception occurred here when NN restarting
>         cmds = blockReport(fullBlockReportLeaseId);
>         fullBlockReportLeaseId = 0;
>       }
>       
>     } catch(RemoteException re) {
>       // handling omitted; fullBlockReportLeaseId is not reset to 0 here
>     }
>   } // while (shouldRun())
> } // offerService
> {code}
>  
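
The scenario in the description can be reproduced with a small toy model. This is only a sketch and not Hadoop code: ToyNameNode, ToyDataNode, reRegister and reportAcceptedAfterRestart are hypothetical names invented for illustration, and the fixed starting lease ids (0x1000, 0x2000) are an assumption made so that a stale id never matches a new one. It shows that a lease id kept across an NN restart is rejected by the new instance, while resetting it to 0 on re-registration lets the DN request a fresh lease.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the full block report lease handshake; not Hadoop code. */
class LeaseResetSketch {

  /** Hypothetical stand-in for one NameNode instance. */
  static class ToyNameNode {
    private final Set<Long> pendingLeases = new HashSet<>();
    private long nextLeaseId;

    ToyNameNode(long firstLeaseId) {
      this.nextLeaseId = firstLeaseId;
    }

    /** Grants a lease and remembers it in the pending set. */
    long requestLease() {
      long id = nextLeaseId++;
      pendingLeases.add(id);
      return id;
    }

    /** Accepts the report only if this instance issued the lease. */
    boolean blockReport(long leaseId) {
      return pendingLeases.remove(leaseId);
    }
  }

  /** Hypothetical stand-in for the DataNode side of the loop. */
  static class ToyDataNode {
    long fullBlockReportLeaseId = 0;
    private final boolean resetOnReregister;

    ToyDataNode(boolean resetOnReregister) {
      this.resetOnReregister = resetOnReregister;
    }

    /** Heartbeat: a lease is requested only when none is held. */
    void heartbeat(ToyNameNode nn) {
      if (fullBlockReportLeaseId == 0) {
        fullBlockReportLeaseId = nn.requestLease();
      }
    }

    /** Called when the DN registers with a new NN instance. */
    void reRegister() {
      if (resetOnReregister) {
        fullBlockReportLeaseId = 0; // the reset proposed in the description
      }
    }

    /** Sends the report and clears the lease id afterwards. */
    boolean sendFullBlockReport(ToyNameNode nn) {
      boolean accepted = nn.blockReport(fullBlockReportLeaseId);
      fullBlockReportLeaseId = 0;
      return accepted;
    }
  }

  /** Returns whether the first report after an NN restart is accepted. */
  static boolean reportAcceptedAfterRestart(boolean resetOnReregister) {
    ToyNameNode oldNn = new ToyNameNode(0x1000L);
    ToyDataNode dn = new ToyDataNode(resetOnReregister);
    dn.heartbeat(oldNn); // lease acquired from the old instance

    // The old NN goes away before the report is sent, so the DN still
    // holds the old lease id when the new instance comes up.
    ToyNameNode newNn = new ToyNameNode(0x2000L);
    dn.reRegister();
    dn.heartbeat(newNn); // requests a new lease only if the old id was cleared
    return dn.sendFullBlockReport(newNn);
  }

  public static void main(String[] args) {
    System.out.println("without reset: accepted=" + reportAcceptedAfterRestart(false));
    System.out.println("with reset:    accepted=" + reportAcceptedAfterRestart(true));
  }
}
```

In this model the run without the reset reports accepted=false, because the new instance never issued the stale id, and the run with the reset reports accepted=true. The real fix would live in BPServiceActor rather than in a separate class like this one.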



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
