[jira] [Commented] (HDFS-7009) Active NN and standby NN have different live nodes

Hadoop QA (JIRA) Sat, 14 Feb 2015 02:11:48 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321320#comment-14321320
 ]


Hadoop QA commented on HDFS-7009:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12698889/HDFS-7009-3.patch
  against trunk revision 6804d68.

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.server.datanode.TestDataNodeRollingUpgrade

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9583//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9583//console

This message is automatically generated.

> Active NN and standby NN have different live nodes
> --------------------------------------------------
>
>                 Key: HDFS-7009
>                 URL: https://issues.apache.org/jira/browse/HDFS-7009
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7009-2.patch, HDFS-7009-3.patch, HDFS-7009.patch
>
>
> To follow up on https://issues.apache.org/jira/browse/HDFS-6478, in most 
> cases, given DN sends HB and BR to NN regularly, if a specific RPC call 
> fails, it isn't a big deal.
> However, there are cases where DN fails to register with NN during initial 
> handshake due to exceptions not covered by RPC client's connection retry. 
> When this happens, the DN won't talk to that NN until the DN restarts.
> {noformat}
> BPServiceActor
>   public void run() {
>     LOG.info(this + " starting to offer service");
>     try {
>       // init stuff
>       try {
>         // setup storage
>         connectToNNAndHandshake();
>       } catch (IOException ioe) {
>         // Initial handshake, storage recovery or registration failed
>         // End BPOfferService thread
>         LOG.fatal("Initialization failed for block pool " + this, ioe);
>         return;
>       }
>       initialized = true; // bp is initialized;
>       
>       while (shouldRun()) {
>         try {
>           offerService();
>         } catch (Exception ex) {
>           LOG.error("Exception in BPOfferService for " + this, ex);
>           sleepAndLogInterrupts(5000, "offering service");
>         }
>       }
> ...
> {noformat}
> Here is an example of the call stack.
> {noformat}
> java.io.IOException: Failed on local exception: java.io.IOException: Response 
> is null.; Host Details : local host is: "xxx"; destination host is: 
> "yyy":8030;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1239)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Response is null.
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
> {noformat}
> This will create discrepancy between active NN and standby NN in terms of 
> live nodes.
>  
> Here is a possible scenario of missing blocks after failover.
> 1. DN A, B set up handshakes with active NN, but not with standby NN.
> 2. A block is replicated to DN A, B and C.
> 3. From standby NN's point of view, given A and B are dead nodes, the block 
> is under replicated.
> 4. DN C is down.
> 5. Before active NN detects DN C is down, it fails over.
> 6. The new active NN considers the block is missing. Even though there are 
> two replicas on DN A and B.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-7009) Active NN and standby NN have different live nodes

Reply via email to