[ 
https://issues.apache.org/jira/browse/HDFS-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

star updated HDFS-14314:
------------------------
    Description: 
      Since HDFS-7923, to rate-limit DN block reports, a DN asks the active NN 
for a full block report lease ID before sending a full block report. The DN 
then sends the full block report together with the lease ID. If the lease ID is 
invalid, the NN rejects the full block report and logs "not in the pending set".
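
      The lease check described above can be pictured with a toy model. This is 
only a sketch: the class LeaseCheckSketch, its methods, and the map it uses are 
hypothetical stand-ins, not the real NN classes. It shows why a lease ID issued 
by one NN instance means nothing to a freshly started one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Toy model of NN-side full block report leases; not the real HDFS classes. */
class LeaseCheckSketch {
  /** Lease IDs currently granted, keyed by DN name (hypothetical structure). */
  private final Map<String, Long> pending = new HashMap<>();

  /** Grants a non-zero lease ID to the DN; 0 is reserved for "no lease". */
  long requestLease(String dn) {
    long id;
    do {
      id = ThreadLocalRandom.current().nextLong();
    } while (id == 0);
    pending.put(dn, id);
    return id;
  }

  /** Accepts the report only if the ID matches the lease granted to this DN. */
  boolean acceptFullBlockReport(String dn, long leaseId) {
    Long granted = pending.get(dn);
    if (granted == null || granted != leaseId) {
      // Per the description, the real NN logs "not in the pending set" here.
      return false;
    }
    pending.remove(dn);
    return true;
  }

  public static void main(String[] args) {
    LeaseCheckSketch oldNn = new LeaseCheckSketch();
    long lease = oldNn.requestLease("dn1");
    // A restarted NN starts with no granted leases.
    LeaseCheckSketch newNn = new LeaseCheckSketch();
    System.out.println(newNn.acceptFullBlockReport("dn1", lease)); // false
    System.out.println(oldNn.acceptFullBlockReport("dn1", lease)); // true
  }
}
```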

      Consider the case where the NN is restarted while a DN is doing a full 
block report. The DN may later send a full block report with a lease ID acquired 
from the previous NN instance, which is invalid for the new NN instance. Although 
the DN recognizes the new NN instance via heartbeat and re-registers itself, it 
does not reset the lease ID from the previous instance.

      The issue may cause DNs to temporarily go dead, making it unsafe to 
restart the NN, especially in a Hadoop cluster with a large number of DNs. 
HDFS-12914 reported the issue without any clue as to why it occurred, and it 
remains unresolved.

       To make this clear, look at the code below, taken from the offerService 
method of class BPServiceActor. Some code has been omitted to focus on the 
current issue. fullBlockReportLeaseId is a local variable that holds the lease 
ID from the NN. While the NN is restarting, the blockReport call throws an 
exception, which is caught by the catch block inside the while loop. As a 
result, fullBlockReportLeaseId is not reset to 0. After the NN has restarted, 
the DN sends a full block report with the stale lease ID, which the new NN 
instance rejects. The DN will not send another full block report until the next 
scheduled one, about an hour later.

      The solution is simple: reset fullBlockReportLeaseId to 0 after any 
exception or after registering with the NN. The DN will then ask the new NN 
instance for a valid fullBlockReportLeaseId.
{code:java}
private void offerService() throws Exception {

  long fullBlockReportLeaseId = 0;

  //
  // Now loop for a long time....
  //
  while (shouldRun()) {
    try {
      final long startTime = scheduler.monotonicNow();

      //
      // Every so often, send heartbeat or block-report
      //
      final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
      HeartbeatResponse resp = null;
      if (sendHeartbeat) {
      
        boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
                scheduler.isBlockReportDue(startTime);
        scheduler.scheduleNextHeartbeat();
        if (!dn.areHeartbeatsDisabledForTests()) {
          resp = sendHeartBeat(requestBlockReportLease);
          assert resp != null;
          if (resp.getFullBlockReportLeaseId() != 0) {
            if (fullBlockReportLeaseId != 0) {
              LOG.warn(nnAddr + " sent back a full block report lease " +
                      "ID of 0x" +
                      Long.toHexString(resp.getFullBlockReportLeaseId()) +
                      ", but we already have a lease ID of 0x" +
                      Long.toHexString(fullBlockReportLeaseId) + ". " +
                      "Overwriting old lease ID.");
            }
            fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
          }
         
        }
      }
   
     
      if ((fullBlockReportLeaseId != 0) || forceFullBr) {
        // An exception is thrown here while the NN is restarting
        cmds = blockReport(fullBlockReportLeaseId);
        fullBlockReportLeaseId = 0;
      }

    } catch(RemoteException re) {
      // Handling elided; fullBlockReportLeaseId is not reset to 0 here
    }
  } // while (shouldRun())
} // offerService{code}
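
      The proposed reset can be illustrated with a small self-contained 
simulation. This is a sketch under stated assumptions, not the actual patch: 
FakeNameNode, requestLease, reportAfterRestart and the lease values are 
hypothetical, and only the variable name fullBlockReportLeaseId comes from the 
code above. It compares the outcome with and without resetting the lease ID in 
the catch block.

```java
import java.io.IOException;

/** Toy simulation of the proposed fix; not the real BPServiceActor. */
class LeaseResetSketch {
  /** Stand-in for one NN instance: it only honours the lease it issued. */
  static class FakeNameNode {
    final long issuedLease;
    boolean down;

    FakeNameNode(long issuedLease) {
      this.issuedLease = issuedLease;
    }

    long requestLease() {
      return issuedLease;
    }

    boolean blockReport(long leaseId) throws IOException {
      if (down) {
        throw new IOException("NN is restarting");
      }
      return leaseId == issuedLease;
    }
  }

  /** Returns true if the first report after the restart is accepted. */
  static boolean reportAfterRestart(boolean resetOnException) {
    FakeNameNode nn = new FakeNameNode(0x1111L);
    // Lease obtained from the old NN instance before it went down.
    long fullBlockReportLeaseId = nn.requestLease();
    nn.down = true;
    boolean accepted = false;
    for (int attempt = 0; attempt < 2; attempt++) {
      try {
        if (fullBlockReportLeaseId == 0) {
          fullBlockReportLeaseId = nn.requestLease();
        }
        accepted = nn.blockReport(fullBlockReportLeaseId);
        fullBlockReportLeaseId = 0;
      } catch (IOException e) {
        if (resetOnException) {
          fullBlockReportLeaseId = 0; // the proposed reset
        }
        // The restart completes: a new instance that issues a different lease.
        nn = new FakeNameNode(0x2222L);
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    System.out.println(reportAfterRestart(false)); // false: stale lease rejected
    System.out.println(reportAfterRestart(true));  // true: fresh lease requested
  }
}
```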
 


> fullBlockReportLeaseId should be reset after 
> ---------------------------------------------
>
>                 Key: HDFS-14314
>                 URL: https://issues.apache.org/jira/browse/HDFS-14314
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>         Environment:  
>  
>  
>            Reporter: star
>            Priority: Critical
>


