[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-05-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987694#comment-13987694
 ] 

Hudson commented on HDFS-6289:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1774 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1774/])
HDFS-6289. HA failover can fail if there are pending DN messages for DNs which no longer exist. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591413)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingDataNodeMessages.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetUtil.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPendingCorruptDnMessages.java


> HA failover can fail if there are pending DN messages for DNs which no longer 
> exist
> ---
>
> Key: HDFS-6289
> URL: https://issues.apache.org/jira/browse/HDFS-6289
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.4.0
> Reporter: Aaron T. Myers
> Assignee: Aaron T. Myers
> Priority: Critical
> Fix For: 2.5.0
>
> Attachments: HDFS-6289.patch, HDFS-6289.patch
>
>
> In an HA setup, the standby NN may receive messages from DNs for blocks which 
> the standby NN is not yet aware of. It queues up these messages and replays 
> them when it next reads from the edit log or fails over. On a failover, all 
> of these pending DN messages must be processed successfully in order for the 
> failover to succeed. If one of these pending DN messages refers to a DN 
> storageId that no longer exists (because the DN with that transfer address 
> has been reformatted and has re-registered with the same transfer address) 
> then on transition to active the NN will not be able to process this DN 
> message and will suicide with an error like the following:
> {noformat}
> 2014-04-25 14:23:17,922 FATAL namenode.NameNode 
> (NameNode.java:doImmediateShutdown(1525)) - Error encountered requiring NN 
> shutdown. Shutting down immediately.
> java.io.IOException: Cannot mark 
> blk_1073741825_900(stored=blk_1073741825_1001) as corrupt because datanode 
> 127.0.0.1:33324 does not exist
> {noformat}
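
The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions: the class `PendingMessageSketch`, its `Message` type, and all method names are hypothetical and are not taken from the Hadoop source or from the attached patch. It models a standby that queues DN messages keyed by storage ID, replays them on failover, and fails if a message refers to a storage that no longer exists. Discarding queued messages when a storage is removed is one plausible way to avoid the failure; the thread does not show the patch itself, so this is not necessarily how HDFS-6289 was fixed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical, simplified model of a standby NN's pending DN message queue. */
class PendingMessageSketch {

  /** A queued report: which storage sent it and which block it is about. */
  static final class Message {
    final String storageId;
    final String block;

    Message(String storageId, String block) {
      this.storageId = storageId;
      this.block = block;
    }
  }

  private final Queue<Message> pending = new ArrayDeque<>();
  private final Set<String> liveStorages = new HashSet<>();

  void registerStorage(String storageId) {
    liveStorages.add(storageId);
  }

  /** Assumption: removing a storage also discards messages queued on its behalf. */
  void removeStorage(String storageId) {
    liveStorages.remove(storageId);
    pending.removeIf(m -> m.storageId.equals(storageId));
  }

  void enqueue(Message m) {
    pending.add(m);
  }

  /** Replays every pending message; fails if one refers to an unknown storage. */
  List<String> replayAll() {
    List<String> processed = new ArrayList<>();
    Message m;
    while ((m = pending.poll()) != null) {
      if (!liveStorages.contains(m.storageId)) {
        throw new IllegalStateException(
            "storage " + m.storageId + " does not exist");
      }
      processed.add(m.block);
    }
    return processed;
  }
}
```

In this model, if `removeStorage` only updated `liveStorages` and left the queue untouched, a later `replayAll()` would throw, which mirrors the fatal error in the log above.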



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-05-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987689#comment-13987689
 ] 

Hudson commented on HDFS-6289:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1748 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1748/])
HDFS-6289. HA failover can fail if there are pending DN messages for DNs which no longer exist. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591413)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingDataNodeMessages.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetUtil.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPendingCorruptDnMessages.java




[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-05-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987596#comment-13987596
 ] 

Hudson commented on HDFS-6289:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #557 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/557/])
HDFS-6289. HA failover can fail if there are pending DN messages for DNs which no longer exist. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591413)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingDataNodeMessages.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetUtil.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPendingCorruptDnMessages.java




[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986350#comment-13986350
 ] 

Hudson commented on HDFS-6289:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5589 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5589/])
HDFS-6289. HA failover can fail if there are pending DN messages for DNs which no longer exist. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1591413)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingDataNodeMessages.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetUtil.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/MiniDFSCluster.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPendingCorruptDnMessages.java




[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-30 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985795#comment-13985795
 ] 

Aaron T. Myers commented on HDFS-6289:
--

The latest test failure is just because of the following:

{noformat}
java.net.BindException: Port in use: localhost:50070
    at sun.nio.ch.Net.bind(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:853)
    at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:794)
{noformat}

I ran TestBlockRecovery several times on my box and it passes without issue.
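
For readers unfamiliar with this kind of failure, here is a minimal sketch of how a port collision like the one in the trace arises, using only `java.net` from the JDK. The class `PortCollisionSketch` and its method names are hypothetical and are not part of Hadoop. It shows a second listener failing to bind to a port that is already occupied, and binding to port 0 so the OS picks a free port. Whether the test that failed here can be configured to use an ephemeral port is not stated in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical illustration of a port collision; not Hadoop code. */
class PortCollisionSketch {

  /** Binds to port 0 so the OS picks a free port, and returns that port. */
  static int ephemeralPort() {
    try (ServerSocket s =
             new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
      return s.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns true if binding a second listener to an occupied port fails. */
  static boolean secondBindFails() {
    InetAddress loopback = InetAddress.getLoopbackAddress();
    try (ServerSocket first = new ServerSocket(0, 50, loopback);
         ServerSocket second = new ServerSocket()) {
      // Disable address reuse so the collision is not masked by socket options.
      second.setReuseAddress(false);
      try {
        second.bind(new InetSocketAddress(loopback, first.getLocalPort()));
        return false;
      } catch (BindException e) {
        // Same exception class as at the top of the trace above.
        return true;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A fixed port such as 50070 collides whenever another process on the build host already holds it, which is independent of the code under test.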

I'm going to go ahead and commit this momentarily.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985238#comment-13985238
 ] 

Hadoop QA commented on HDFS-6289:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642380/HDFS-6289.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/6771//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6771//console

This message is automatically generated.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-29 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985094#comment-13985094
 ] 

Aaron T. Myers commented on HDFS-6289:
--

Thanks a lot for the review, Todd. The TestDNFencingWithReplication test failed 
with the following error:

{noformat}
java.lang.RuntimeException: Deferred
    at org.apache.hadoop.test.MultithreadedTestUtil$TestContext.checkException(MultithreadedTestUtil.java:130)
    at org.apache.hadoop.test.MultithreadedTestUtil$TestContext.stop(MultithreadedTestUtil.java:166)
    at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress(TestDNFencingWithReplication.java:135)
Caused by: java.io.IOException: Timed out waiting for 2 replicas on path /test-3
    at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication$ReplicationToggler.waitForReplicas(TestDNFencingWithReplication.java:96)
    at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication$ReplicationToggler.doAnAction(TestDNFencingWithReplication.java:78)
    at org.apache.hadoop.test.MultithreadedTestUtil$RepeatingTestThread.doWork(MultithreadedTestUtil.java:222)
    at org.apache.hadoop.test.MultithreadedTestUtil$TestingThread.run(MultithreadedTestUtil.java:189)
{noformat}

I'm fairly confident this was just a one-off flake, especially considering the 
code change in this patch is only triggered by DN restarts which 
TestDNFencingWithReplication doesn't do, but just to be sure I looped 
TestDNFencingWithReplication 50 times on my box and never saw a failure. I've 
also just kicked Jenkins to build this JIRA again, so hopefully it'll pass 
then. If it passes that, I'll go ahead and commit this based on your previous 
+1.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-29 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984611#comment-13984611
 ] 

Todd Lipcon commented on HDFS-6289:
---

Can you double check that this test isn't made more flaky by this patch? I've 
seen this test fail once or twice before in precommits, but given that it's 
very much related to the code touched by this patch, we should probably 
investigate it a bit before committing.

Otherwise +1



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983964#comment-13983964
 ] 

Hadoop QA commented on HDFS-6289:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642380/HDFS-6289.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/6759//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6759//console

This message is automatically generated.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983825#comment-13983825
 ] 

Todd Lipcon commented on HDFS-6289:
---

Is there any test you could write to show that bug? I agree with your logic, 
but I'm surprised that there isn't some bug that it causes. Given that the 
current test isn't a regression test for that bug, maybe we should tackle it 
separately?



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983805#comment-13983805
 ] 

Aaron T. Myers commented on HDFS-6289:
--

Thanks for the review, Todd.

bq. Maybe you should file a separate follow-up JIRA here for this second issue, 
since you aren't fixing it here?

I could also just fix it here. It seems pretty transparently obvious that we 
should make that change. Do you agree? If so, I'll just post a patch fixing 
that as well.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-28 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983802#comment-13983802
 ] 

Todd Lipcon commented on HDFS-6289:
---

{code}
+// TODO(atm): This should be s/storedBlock/block, since we should be
+// postponing the info of the reported block, not the stored block,
+// though that actually exacerbates the bug, doesn't fix it.
{code}

Out of context, this comment won't make much sense -- what's "the bug" it's 
referring to? Maybe you should file a separate follow-up JIRA here for this 
second issue, since you aren't fixing it here?

Otherwise lgtm.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-28 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983679#comment-13983679
 ] 

Aaron T. Myers commented on HDFS-6289:
--

I feel confident that the TestBalancerWithNodeGroup failure is spurious. It 
passes fine on my box, isn't really related to this code, and has been flaky 
off and on for a long time.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981806#comment-13981806
 ] 

Hadoop QA commented on HDFS-6289:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642020/HDFS-6289.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/6742//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6742//console

This message is automatically generated.



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-25 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981791#comment-13981791
 ] 

Yongjun Zhang commented on HDFS-6289:
-

Thanks for answering my questions ATM, that helps!




[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-25 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981773#comment-13981773
 ] 

Aaron T. Myers commented on HDFS-6289:
--

Hi Yongjun,

bq. When the standby NN receives the messages, it's actually aware of what message 
it received, but as time goes by, the messages may become stale due to DN 
reformatting. That's what you meant, right?

Nope, that's not what I was referring to. I was referring to why we queue 
messages in the first place in the standby NN. This happens because the standby 
NN is always a bit behind the active in its knowledge of the namespace and 
set of blocks which exist, so when the active NN allocates new blocks the DNs 
will report things about these blocks to both NNs but the standby won't yet 
know about the existence of the blocks those messages refer to. In some sense 
these messages are from the future from the DN's perspective, so we queue them 
up.
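
The queueing described above can be sketched roughly as follows. This is a minimal illustration, not the actual HDFS code: the class {{PendingMessagesSketch}}, its {{Report}} type, and all method names are hypothetical, and block and storage identifiers are reduced to plain values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical, simplified stand-in for a standby NN's queue of early DN messages. */
class PendingMessagesSketch {

  /** One DN report: the storage that sent it, the block it is about, the reported state. */
  static final class Report {
    final String storageId;
    final long blockId;
    final String reportedState;

    Report(String storageId, long blockId, String reportedState) {
      this.storageId = storageId;
      this.blockId = blockId;
      this.reportedState = reportedState;
    }
  }

  // Blocks the standby has learned about so far (for example by tailing the edit log).
  private final Set<Long> knownBlocks = new HashSet<>();
  // Reports about blocks the standby does not know yet, grouped by block ID.
  private final Map<Long, Queue<Report>> queueByBlock = new HashMap<>();

  /** Returns true if the report can be applied right away, false if it was queued. */
  boolean receive(Report r) {
    if (knownBlocks.contains(r.blockId)) {
      return true;
    }
    queueByBlock.computeIfAbsent(r.blockId, k -> new ArrayDeque<>()).add(r);
    return false;
  }

  /** Marks a block as known and returns the queued reports that should now be replayed. */
  List<Report> blockBecameKnown(long blockId) {
    knownBlocks.add(blockId);
    Queue<Report> queued = queueByBlock.remove(blockId);
    return queued == null ? new ArrayList<>() : new ArrayList<>(queued);
  }

  /** Total number of reports still waiting for their block to become known. */
  int pendingCount() {
    int n = 0;
    for (Queue<Report> q : queueByBlock.values()) {
      n += q.size();
    }
    return n;
  }
}
```

In this sketch a report is only ever held back because its block is unknown; nothing checks whether the sending storage still exists, which is the gap this issue is about.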

bq. Since the DN that the queued messages are associated with is reformatted, 
these queued messages become stale and useless, and can be safely removed, 
right? Or, my real question is: are there any messages that need to be applied even 
after DN reformatting?

No, if the storage ID of the DN is changing, we should assume that the DN no 
longer has the blocks that these messages were previously referring to.
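
As a rough sketch of that rule, and assuming a flat list of queued messages tagged with the storage ID that sent them, dropping the stale entries when a DN comes back under a new storage ID could look like this. The names {{StaleMessagePruner}}, {{Queued}} and {{removeAllFor}} are made up for illustration and are not taken from the patch.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper that discards queued DN messages sent by a storage that is gone. */
class StaleMessagePruner {

  /** A queued message, reduced to the two fields this sketch needs. */
  static final class Queued {
    final String storageId;
    final long blockId;

    Queued(String storageId, long blockId) {
      this.storageId = storageId;
      this.blockId = blockId;
    }
  }

  /**
   * Removes every queued message that was sent by oldStorageId and returns how
   * many were dropped. Messages from other storages are left in place.
   */
  static int removeAllFor(List<Queued> pending, String oldStorageId) {
    int removed = 0;
    for (Iterator<Queued> it = pending.iterator(); it.hasNext();) {
      if (it.next().storageId.equals(oldStorageId)) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }
}
```

Calling something like this when the old storage ID is retired means a later failover never has to replay a message for a storage the NN can no longer look up.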

bq. I found a util function in DatanodeUtil.java...

OK, I'll make this change when I do the next rev of this patch.

bq. I think the testcase you wrote is very nice to demonstrate the problem.

Thanks!



[jira] [Commented] (HDFS-6289) HA failover can fail if there are pending DN messages for DNs which no longer exist

2014-04-25 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981757#comment-13981757
 ] 

Yongjun Zhang commented on HDFS-6289:
-

Hi ATM,

Thanks a lot for discovering this tricky problem and providing the fix. As a 
learning experience, I read through the patch and had a few comments (actually 
mainly questions):

1. "In an HA setup, the standby NN may receive messages from DNs for blocks 
which the standby NN is not yet aware of." When the standby NN receives the 
messages, it's actually aware of what message it received, but as time goes by, 
the messages may become stale due to DN reformatting. That's what you meant, 
right?

2. Since the DN that the queued messages are associated with is reformatted, 
these queued messages become stale and useless, and can be safely removed, 
right? Or, my real question is: are there any messages that need to be applied even 
after DN reformatting?

3. I found a util function in DatanodeUtil.java
{code}
  public static String getMetaName(String blockName, long generationStamp) {
    return blockName + "_" + generationStamp + Block.METADATA_EXTENSION;
  }
{code}
that you might consider using in 
{code}
public static boolean changeGenStampOfBlock(int dnIndex, ExtendedBlock blk,
  long newGenStamp) throws IOException {
{code}

4. I think the testcase you wrote is very nice to demonstrate the problem. 

Thanks.
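
To illustrate point 3, here is a small standalone sketch of what the helper computes. It copies the one-line body quoted above into a hypothetical class {{MetaNameSketch}} and assumes that {{Block.METADATA_EXTENSION}} is the string ".meta"; the block name and generation stamps are taken from the error message in the issue description.

```java
/** Hypothetical standalone copy of the quoted helper, for illustration only. */
class MetaNameSketch {

  // Assumption: stands in for Block.METADATA_EXTENSION.
  static final String METADATA_EXTENSION = ".meta";

  /** Builds the metadata file name for a block name and generation stamp. */
  static String getMetaName(String blockName, long generationStamp) {
    return blockName + "_" + generationStamp + METADATA_EXTENSION;
  }

  public static void main(String[] args) {
    // Changing a block's generation stamp implies a new metadata file name.
    String before = getMetaName("blk_1073741825", 1001L);
    String after = getMetaName("blk_1073741825", 900L);
    System.out.println(before + " -> " + after);
  }
}
```

A test helper that changes a block's generation stamp could build both file names this way rather than concatenating the pieces by hand.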


