[ https://issues.apache.org/jira/browse/HDFS-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14334076#comment-14334076 ]
Ming Ma commented on HDFS-7009:
-------------------------------

Thanks, Chris, Arpit and Nicholas.

> Active NN and standby NN have different live nodes
> --------------------------------------------------
>
>                 Key: HDFS-7009
>                 URL: https://issues.apache.org/jira/browse/HDFS-7009
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7009-2.patch, HDFS-7009-3.patch, HDFS-7009-4.patch,
> HDFS-7009.patch
>
>
> To follow up on https://issues.apache.org/jira/browse/HDFS-6478: in most
> cases, given that the DN sends heartbeats (HB) and block reports (BR) to the
> NN regularly, it isn't a big deal if a specific RPC call fails.
> However, there are cases where the DN fails to register with the NN during
> the initial handshake due to exceptions not covered by the RPC client's
> connection retry. When this happens, the DN won't talk to that NN until the
> DN restarts.
> {noformat}
> BPServiceActor
>
>   public void run() {
>     LOG.info(this + " starting to offer service");
>     try {
>       // init stuff
>       try {
>         // setup storage
>         connectToNNAndHandshake();
>       } catch (IOException ioe) {
>         // Initial handshake, storage recovery or registration failed
>         // End BPOfferService thread
>         LOG.fatal("Initialization failed for block pool " + this, ioe);
>         return;
>       }
>       initialized = true; // bp is initialized;
>
>       while (shouldRun()) {
>         try {
>           offerService();
>         } catch (Exception ex) {
>           LOG.error("Exception in BPOfferService for " + this, ex);
>           sleepAndLogInterrupts(5000, "offering service");
>         }
>       }
>   ...
> {noformat}
> Here is an example of the call stack.
> {noformat}
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "xxx"; destination host is: "yyy":8030;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1239)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Response is null.
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
> {noformat}
> This will create a discrepancy between the active NN and the standby NN in
> terms of live nodes.
>
> Here is a possible scenario of missing blocks after failover.
> 1. DN A and B set up handshakes with the active NN, but not with the standby NN.
> 2. A block is replicated to DN A, B and C.
> 3. From the standby NN's point of view, given that A and B are dead nodes,
> the block is under-replicated.
> 4. DN C goes down.
> 5. Before the active NN detects that DN C is down, it fails over.
> 6. The new active NN considers the block missing, even though there are
> two replicas on DN A and B.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
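
The failover scenario in the quoted description can be traced with a small standalone model. This is only an illustrative sketch, not HDFS code: the class `LiveNodeViewSketch`, the method `liveReplicas`, and the per-NN "live node" sets are hypothetical names, and the model assumes that a NameNode counts a replica only when it sits on a DataNode that NameNode believes is live.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the scenario; none of these names exist in Hadoop.
class LiveNodeViewSketch {

    // Counts the replicas that sit on DataNodes this NameNode considers live.
    static int liveReplicas(Set<String> replicaHolders, Set<String> liveNodes) {
        int count = 0;
        for (String dn : replicaHolders) {
            if (liveNodes.contains(dn)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Step 1: A and B completed the handshake only with the active NN;
        // C (assumed here) registered with both.
        Set<String> activeView = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> standbyView = new HashSet<>(Arrays.asList("C"));

        // Step 2: the block is replicated to DN A, B and C.
        Set<String> replicaHolders = new HashSet<>(Arrays.asList("A", "B", "C"));

        // Step 3: the standby NN sees a single live replica (under-replicated).
        System.out.println("active NN live replicas:  "
                + liveReplicas(replicaHolders, activeView));
        System.out.println("standby NN live replicas: "
                + liveReplicas(replicaHolders, standbyView));

        // Steps 4-5: DN C goes down and the standby NN becomes active.
        // Once it notices that C is dead, it drops C from its live set.
        standbyView.remove("C");

        // Step 6: the new active NN counts zero live replicas, so it treats
        // the block as missing although A and B still hold it.
        System.out.println("new active NN live replicas: "
                + liveReplicas(replicaHolders, standbyView));
    }
}
```

Under these assumptions the model reports three live replicas for the original active NN, one for the standby NN, and zero for the new active NN after DN C is lost, which matches steps 3 and 6 of the description.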