[ https://issues.apache.org/jira/browse/HDFS-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321320#comment-14321320 ]
Hadoop QA commented on HDFS-7009:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12698889/HDFS-7009-3.patch
  against trunk revision 6804d68.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//console

This message is automatically generated.

> Active NN and standby NN have different live nodes
> --------------------------------------------------
>
>                 Key: HDFS-7009
>                 URL: https://issues.apache.org/jira/browse/HDFS-7009
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7009-2.patch, HDFS-7009-3.patch, HDFS-7009.patch
>
>
> To follow up on https://issues.apache.org/jira/browse/HDFS-6478, in most
> cases, given DN sends HB and BR to NN regularly, if a specific RPC call
> fails, it isn't a big deal.
> However, there are cases where DN fails to register with NN during initial
> handshake due to exceptions not covered by RPC client's connection retry.
> When this happens, the DN won't talk to that NN until the DN restarts.
> {noformat}
> BPServiceActor
>   public void run() {
>     LOG.info(this + " starting to offer service");
>     try {
>       // init stuff
>       try {
>         // setup storage
>         connectToNNAndHandshake();
>       } catch (IOException ioe) {
>         // Initial handshake, storage recovery or registration failed
>         // End BPOfferService thread
>         LOG.fatal("Initialization failed for block pool " + this, ioe);
>         return;
>       }
>       initialized = true; // bp is initialized;
>
>       while (shouldRun()) {
>         try {
>           offerService();
>         } catch (Exception ex) {
>           LOG.error("Exception in BPOfferService for " + this, ex);
>           sleepAndLogInterrupts(5000, "offering service");
>         }
>       }
> ...
> {noformat}
> Here is an example of the call stack.
> {noformat}
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1239)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Response is null.
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
> {noformat}
> This will create discrepancy between active NN and standby NN in terms of
> live nodes.
>
> Here is a possible scenario of missing blocks after failover.
> 1. DN A, B set up handshakes with active NN, but not with standby NN.
> 2. A block is replicated to DN A, B and C.
> 3. From standby NN's point of view, given A and B are dead nodes, the block
> is under replicated.
> 4. DN C is down.
> 5. Before active NN detects DN C is down, it fails over.
> 6. The new active NN considers the block is missing. Even though there are
> two replicas on DN A and B.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
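For readers following the quoted description: the failure mode is that one IOException during the initial handshake ends the actor thread for good. The sketch below shows the general idea of retrying a handshake rather than giving up on the first failure. It is only an illustration and is not the change made in the attached patches. The class HandshakeRetrySketch, the Handshake interface and the handshakeWithRetry method are hypothetical names invented for this example and are not part of Hadoop.

```java
import java.io.IOException;

public class HandshakeRetrySketch {

    /** Hypothetical stand-in for a handshake call such as connectToNNAndHandshake(). */
    public interface Handshake {
        void run() throws IOException;
    }

    /**
     * Tries the handshake up to maxAttempts times, sleeping between failures.
     * Returns the attempt number that succeeded, or -1 if all attempts failed
     * or the thread was interrupted while waiting.
     */
    public static int handshakeWithRetry(Handshake handshake, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handshake.run();
                return attempt; // handshake succeeded
            } catch (IOException ioe) {
                System.err.println("Handshake attempt " + attempt + " failed: " + ioe.getMessage());
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt status
                    return -1;
                }
            }
        }
        return -1; // every attempt failed
    }

    public static void main(String[] args) {
        // Simulated handshake that fails twice, then succeeds.
        int[] calls = {0};
        Handshake flaky = () -> {
            if (++calls[0] < 3) {
                throw new IOException("Response is null.");
            }
        };
        System.out.println("succeeded on attempt " + handshakeWithRetry(flaky, 5, 10L));
    }
}
```

With a one-shot handshake, the simulated failure above would leave the DN permanently unregistered with that NN; with a bounded retry it registers on the third attempt.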
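The six-step failover scenario in the quoted description can also be traced with a toy model in which each NN counts only the replicas held by DNs it regards as live. This is a minimal sketch under that simplifying assumption. LiveNodeViewSketch and liveReplicas are hypothetical names for this example and do not model the real NameNode bookkeeping.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class LiveNodeViewSketch {

    /** Counts the replicas held by DataNodes that a given NameNode regards as live. */
    public static int liveReplicas(Set<String> replicaHolders, Set<String> liveNodes) {
        int count = 0;
        for (String dn : replicaHolders) {
            if (liveNodes.contains(dn)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Step 2: the block is replicated to DN A, B and C.
        Set<String> holders = new TreeSet<>(Arrays.asList("A", "B", "C"));

        // Step 1: A and B registered with the active NN only; C registered with both.
        Set<String> activeView = new TreeSet<>(Arrays.asList("A", "B", "C"));
        Set<String> standbyView = new TreeSet<>(Arrays.asList("C"));

        // Step 3: the standby NN sees a single live replica (under replicated).
        System.out.println("standby live replicas: " + liveReplicas(holders, standbyView));

        // Step 4: DN C goes down and drops out of the standby's live set.
        standbyView.remove("C");

        // Step 5: the old active NN has not noticed yet and still counts all replicas.
        System.out.println("old active live replicas: " + liveReplicas(holders, activeView));

        // Step 6: after failover the new active NN finds no live replica.
        boolean missing = liveReplicas(holders, standbyView) == 0;
        System.out.println("new active considers block missing: " + missing);
    }
}
```

The replicas on A and B still exist throughout; the block is reported missing only because the new active NN never received registrations from those two DNs.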