[ 
https://issues.apache.org/jira/browse/HDFS-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015065#comment-17015065
 ] 

Xiaoqiao He commented on HDFS-15113:
------------------------------------

Thanks all for your comments, [~elgoiri], [~weichiu], [~brahmareddy].
To [~elgoiri]
{quote}In the test should we have the old case and the new one?{quote}
TestBPOfferService#testIBRClearanceForStandbyOnReRegister already covers most 
restart cases, so I only added logic for this corner case. If we need to 
split them into separate tests, I would like to do that later. Thanks.
To [~brahmareddy]
{quote}is this have high chance when "dfs.blockreport.initialDelay" is 
configured with "0"{quote}
Exactly. In my experience, we do not set it and use the default value 0, so 
it is very easy to reproduce.
For the unit test, the issue can be reproduced by reverting the 
{{BPServiceActor}} change and then adding the following fault injector call 
between scheduling the heartbeat and clearing the IBRs.
{code:java}
      DataNodeFaultInjector.get().waitFullBlockReport();
{code}
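To illustrate how such a hook forces the bad ordering, here is a minimal standalone sketch. It does not use Hadoop classes: {{ToyFaultInjector}}, {{ToyActor}}, {{FaultInjectorDemo}} and the block name {{blk_new}} are hypothetical stand-ins, and only the hook name {{waitFullBlockReport}} is taken from the snippet above. The hook is a no-op by default, and the test overrides it to act as if the FBR was sent and a new block arrived inside the window.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a fault injector: a no-op hook that tests can override. */
class ToyFaultInjector {
  private static ToyFaultInjector instance = new ToyFaultInjector();

  static ToyFaultInjector get() { return instance; }
  static void set(ToyFaultInjector injector) { instance = injector; }

  /** Does nothing by default; a test overrides it to act inside the race window. */
  void waitFullBlockReport() {}
}

/** Hypothetical stand-in for the actor; models only the ordering before the fix. */
class ToyActor {
  final List<String> pendingIbrs = new ArrayList<>();
  final List<String> events = new ArrayList<>();

  void reRegister() {
    events.add("scheduleHeartbeat");
    // Injection point between scheduling the heartbeat and clearing the IBRs.
    ToyFaultInjector.get().waitFullBlockReport();
    pendingIbrs.clear();
    events.add("clearIBRs");
  }
}

class FaultInjectorDemo {
  /** Runs reRegister with a hook that simulates "FBR sent, then a new block received". */
  static ToyActor reproduce() {
    ToyActor actor = new ToyActor();
    ToyFaultInjector original = ToyFaultInjector.get();
    ToyFaultInjector.set(new ToyFaultInjector() {
      @Override
      void waitFullBlockReport() {
        actor.events.add("fullBlockReportSent");
        actor.pendingIbrs.add("blk_new"); // block received after the FBR snapshot
      }
    });
    try {
      actor.reRegister();
    } finally {
      ToyFaultInjector.set(original); // always restore the default no-op injector
    }
    return actor;
  }

  public static void main(String[] args) {
    ToyActor actor = reproduce();
    System.out.println(actor.events + " pending=" + actor.pendingIbrs);
  }
}
```

In this toy model the pending list is empty after {{reRegister}} returns, so the block that arrived after the FBR is never reported incrementally.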
Thanks a lot.

> Missing IBR when NameNode restart if open processCommand async feature
> ----------------------------------------------------------------------
>
>                 Key: HDFS-15113
>                 URL: https://issues.apache.org/jira/browse/HDFS-15113
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15113.001.patch, HDFS-15113.002.patch, 
> HDFS-15113.003.patch
>
>
> Recently I hit a case where the NameNode was missing blocks after a restart, 
> which is related to HDFS-14997.
> a. During a NameNode restart, it returns the `DNA_REGISTER` command to a 
> DataNode when it receives an RPC request from that DataNode.
> b. When the DataNode receives the `DNA_REGISTER` command, it runs #reRegister 
> asynchronously.
> {code:java}
>   void reRegister() throws IOException {
>     if (shouldRun()) {
>       // re-retrieve namespace info to make sure that, if the NN
>       // was restarted, we still match its version (HDFS-2120)
>       NamespaceInfo nsInfo = retrieveNamespaceInfo();
>       // and re-register
>       register(nsInfo);
>       scheduler.scheduleHeartbeat();
>       // HDFS-9917,Standby NN IBR can be very huge if standby namenode is down
>       // for sometime.
>       if (state == HAServiceState.STANDBY
>           || state == HAServiceState.OBSERVER) {
>         ibrManager.clearIBRs();
>       }
>     }
>   }
> {code}
> c. As we know, #register triggers a block report (BR) immediately.
> d. Because #reRegister runs asynchronously, there is no guarantee whether 
> sending the FBR or clearing the IBRs happens first. If the IBRs are cleared 
> first, everything is fine. But if the FBR is sent first and the IBRs are 
> cleared afterwards, the blocks received between those two points are not 
> reported until the next FBR.
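
The two orderings in step d can be compared with a small deterministic model. This is only a sketch under simplifying assumptions: {{IbrRaceModel}}, its methods, and the block names are hypothetical and not Hadoop APIs, the FBR is modeled as a snapshot of all blocks on the DataNode, and the IBR queue as a plain list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the FBR/IBR ordering problem; none of these names are Hadoop APIs. */
class IbrRaceModel {
  final Set<String> nameNodeBlocks = new HashSet<>(); // what the toy NameNode knows
  final Set<String> dataNodeBlocks = new HashSet<>(); // what the toy DataNode stores
  final List<String> pendingIbrs = new ArrayList<>(); // IBRs queued but not yet sent

  void receiveBlock(String block) {
    dataNodeBlocks.add(block);
    pendingIbrs.add(block);
  }

  /** The full block report replaces the NameNode's view with a snapshot. */
  void sendFullBlockReport() {
    nameNodeBlocks.clear();
    nameNodeBlocks.addAll(dataNodeBlocks);
  }

  void clearIbrs() {
    pendingIbrs.clear();
  }

  void sendIbrs() {
    nameNodeBlocks.addAll(pendingIbrs);
    pendingIbrs.clear();
  }

  Set<String> missingOnNameNode() {
    Set<String> missing = new HashSet<>(dataNodeBlocks);
    missing.removeAll(nameNodeBlocks);
    return missing;
  }

  /** Returns the blocks the NameNode is missing after one re-register cycle. */
  static Set<String> run(boolean clearBeforeFbr) {
    IbrRaceModel model = new IbrRaceModel();
    model.receiveBlock("blk_1"); // queued before the restart
    if (clearBeforeFbr) {
      model.clearIbrs();           // safe: blk_1 is still covered by the FBR
      model.sendFullBlockReport();
      model.receiveBlock("blk_2"); // stays queued as an IBR
    } else {
      model.sendFullBlockReport();
      model.receiveBlock("blk_2"); // arrives after the FBR snapshot
      model.clearIbrs();           // drops the only report for blk_2
    }
    model.sendIbrs();              // next heartbeat-driven IBR
    return model.missingOnNameNode();
  }

  public static void main(String[] args) {
    System.out.println("clear first: missing=" + run(true));
    System.out.println("FBR first:   missing=" + run(false));
  }
}
```

In this model only the FBR-first ordering leaves a block unknown to the NameNode, and it stays unknown until the next FBR.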


