[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2016-02-01 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127560#comment-15127560
 ] 

Junping Du commented on HDFS-8995:
--

Move it out of 2.6.4 to 2.6.5 as haven't updated for a while.

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2016-01-03 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080553#comment-15080553
 ] 

Junping Du commented on HDFS-8995:
--

Hi [~hitliuyi] and [~kihwal], as Sangjin's comments earlier, shall we backport 
the fix to branch-2.6?

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-11-20 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019107#comment-15019107
 ] 

Sangjin Lee commented on HDFS-8995:
---

Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725833#comment-14725833
 ] 

Kihwal Lee commented on HDFS-8995:
--

Reran the failed test cases. They pass.
{noformat}
---
 T E S T S
---
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; 
support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.TestEditLog
Tests run: 23, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 71.907 sec - 
in org.apache.hadoop.hdfs.server.namenode.TestEditLog
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; 
support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.TestNameNodeMXBean
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.427 sec - in 
org.apache.hadoop.hdfs.server.namenode.TestNameNodeMXBean
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; 
support was removed in 8.0
Running org.apache.hadoop.hdfs.web.TestWebHDFSOAuth2
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.744 sec - in 
org.apache.hadoop.hdfs.web.TestWebHDFSOAuth2

Results :

Tests run: 30, Failures: 0, Errors: 0, Skipped: 0
{noformat}

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726686#comment-14726686
 ] 

Hudson commented on HDFS-8995:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1066 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1066/])
HDFS-8995. Flaw in registration bookeeping can make DN die on reconnect. 
(Kihwal Lee via yliu) (yliu: rev 5652131d2ea68c408dd3cd8bee31723642a8cdde)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java


> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726602#comment-14726602
 ] 

Hudson commented on HDFS-8995:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2280 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2280/])
HDFS-8995. Flaw in registration bookeeping can make DN die on reconnect. 
(Kihwal Lee via yliu) (yliu: rev 5652131d2ea68c408dd3cd8bee31723642a8cdde)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java


> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726562#comment-14726562
 ] 

Hudson commented on HDFS-8995:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #339 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/339/])
HDFS-8995. Flaw in registration bookeeping can make DN die on reconnect. 
(Kihwal Lee via yliu) (yliu: rev 5652131d2ea68c408dd3cd8bee31723642a8cdde)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726683#comment-14726683
 ] 

Hudson commented on HDFS-8995:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #331 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/331/])
HDFS-8995. Flaw in registration bookeeping can make DN die on reconnect. 
(Kihwal Lee via yliu) (yliu: rev 5652131d2ea68c408dd3cd8bee31723642a8cdde)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java


> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726541#comment-14726541
 ] 

Hudson commented on HDFS-8995:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #8383 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8383/])
HDFS-8995. Flaw in registration bookeeping can make DN die on reconnect. 
(Kihwal Lee via yliu) (yliu: rev 5652131d2ea68c408dd3cd8bee31723642a8cdde)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java


> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726485#comment-14726485
 ] 

Yi Liu commented on HDFS-8995:
--

+1, thanks Kihwal. Will commit it shortly.

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724932#comment-14724932
 ] 

Yi Liu commented on HDFS-8995:
--

Yes, in the case of re-registration failure, datanode will get 
{{UnregisteredNodeException}} from NN while doing further incremental block 
report and heartbeat, and cause BP-xxx service shutdown. And we can see the 
exception log. 

{quote}
The fix is not saving the registration until the NN updates it
{quote}
Agree.

My comment is the change in  {{BPOfferService#registrationSucceeded}} and 
{{DataNode#bpRegistrationSucceeded}} is necessary? Since re-registration 
failure will throw exception, and only successful registration will go to that 
logic and update the variables if they are not null. But I think it's also OK 
to update them every time when registration or re-registration.

So +1 pending Jenkins.

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725240#comment-14725240
 ] 

Hadoop QA commented on HDFS-8995:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 41s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 49s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  1s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 22s | The applied patch generated  1 
new checkstyle issues (total was 188, now 188). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 26s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native |   3m 10s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests | 161m 22s | Tests failed in hadoop-hdfs. |
| | | 206m 15s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.server.namenode.TestEditLog |
|   | hadoop.hdfs.web.TestWebHDFSOAuth2 |
|   | hadoop.hdfs.server.namenode.TestNameNodeMXBean |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12753300/HDFS-8995.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 7ad3556 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-HDFS-Build/12228/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt
 |
| hadoop-hdfs test log | 
https://builds.apache.org/job/PreCommit-HDFS-Build/12228/artifact/patchprocess/testrun_hadoop-hdfs.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/12228/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/12228/console |


This message was automatically generated.

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726730#comment-14726730
 ] 

Hudson commented on HDFS-8995:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2261 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2261/])
HDFS-8995. Flaw in registration bookeeping can make DN die on reconnect. 
(Kihwal Lee via yliu) (yliu: rev 5652131d2ea68c408dd3cd8bee31723642a8cdde)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-09-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14726732#comment-14726732
 ] 

Hudson commented on HDFS-8995:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #322 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/322/])
HDFS-8995. Flaw in registration bookeeping can make DN die on reconnect. 
(Kihwal Lee via yliu) (yliu: rev 5652131d2ea68c408dd3cd8bee31723642a8cdde)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8995.patch
>
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-08-31 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723497#comment-14723497
 ] 

Kihwal Lee commented on HDFS-8995:
--

{noformat}
2018-09-26 12:15:08,497 WARN datanode.DataNode: Block pool BP-xxx (Datanode 
Uuid xxx) service to the-namenode.elephantland.gov/
10.2.3.4:8020 is shutting down
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnregisteredNodeException):
 Data node DatanodeRegistration(0.0.0.0, datanodeUuid=xxx, infoPort=, 
infoSecurePort=, ipcPort=, storageInfo=lv=-56;cid=CID-1xxx;c=xxx)
 is attempting to report storage ID abc. Node 10.100.100.100:100 (actual ip 
addr) is expected to serve this storage.
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getDatanode(DatanodeManager.java:483)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processIncrementalBlockReport(BlockManager.java:3094)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processIncrementalBlockReport(FSNamesystem.java:6406)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReceivedAndDeleted(NameNodeRpcServer.java:1200)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolServerSideTranslatorPB.java:215)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26632)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)

at org.apache.hadoop.ipc.Client.call(Client.java:1451)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy14.blockReceivedAndDeleted(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:240)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reportReceivedDeletedBlocks(BPServiceActor.java:289)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:692)
at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:834)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The namenode is saying the incremental block report came with a DN registration 
containing address 0.0.0.0. This is what DN does on registration, not for block 
report. During registration, NN sends back with the address it saw and the DN 
uses the registration from the NN from that point on. So subsequent calls 
contain the address of its external interface.  This exception trace suggests 
there is a bug in exception handling and re-registration.

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Critical
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8995) Flaw in registration bookeeping can make DN die on reconnect

2015-08-31 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723500#comment-14723500
 ] 

Kihwal Lee commented on HDFS-8995:
--

[~daryn] did further analysis:

{panel}
It's a bug during re-registration. The DN is supposed to create a registration 
object which contains the 0.0.0.0 addr, pass it to the NN which updates the 
addr and returns it, then the DN saves the updated registration for future 
calls.

The problem is the DN saves off the initial registration with 0.0.0.0 before it 
receives the NN's response. When the DN encounters an exception contacting the 
NN, it is left with the invalid registration containing 0.0.0.0.

The fix is not saving the registration until the NN updates it. There's a 
couple places where the DN isn't updating all references to a new registration.
{panel}

> Flaw in registration bookeeping can make DN die on reconnect
> 
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Critical
>
> Normally data nodes re-register with the namenode when it was unreachable for 
> more than the heartbeat expiration and becomes reachable again. Datanodes 
> keep retrying the last rpc call such as incremental block report and 
> heartbeat and when it finally gets through the namenode tells it to 
> re-register.
> We have observed that some of datanodes stay dead in such scenarios. Further 
> investigation has revealed that those were told to shutdown by the namenode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)