[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909080#comment-14909080
 ] 

Hudson commented on HDFS-9107:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #419 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/419/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp 
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Prevent NN's unrecoverable death spiral after full GC
> -
>
> Key: HDFS-9107
> URL: https://issues.apache.org/jira/browse/HDFS-9107
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: HDFS-9107.patch, HDFS-9107.patch
>
>
> A full GC pause in the NN that exceeds the dead node interval can lead to an 
> infinite cycle of full GCs.  The most common situation that precipitates an 
> unrecoverable state is a network issue that temporarily cuts off multiple 
> racks.
> The NN wakes up and falsely starts marking nodes dead. This bloats the 
> replication queues which increases memory pressure. The replications create a 
> flurry of incremental block reports and a glut of over-replicated blocks.
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration 
> which requires a full block report - more memory pressure. The NN now has to 
> invalidate all the over-replicated blocks. The extra blocks are added to 
> invalidation queues, tracked in an excess blocks map, etc - much more memory 
> pressure.
> All the memory pressure can push the NN into another full GC which repeats 
> the entire cycle.
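The failure mode described above can be illustrated with a minimal Java sketch. The class and method names (`FalseDeadDemo`, `countExpired`) and the timing values are hypothetical and chosen for illustration only; this is not the NameNode's code. It shows that a naive expiry check, run right after a pause longer than the dead node interval, flags every node as dead even though each one was heartbeating normally before the pause.

```java
import java.util.Arrays;
import java.util.List;

final class FalseDeadDemo {
    // Hypothetical dead node interval: 10 minutes, in milliseconds.
    static final long DEAD_INTERVAL_MS = 10 * 60 * 1000L;

    // Naive check: a node counts as dead when its last heartbeat
    // is older than the dead node interval.
    static int countExpired(List<Long> lastHeartbeatsMs, long nowMs) {
        int expired = 0;
        for (long last : lastHeartbeatsMs) {
            if (nowMs - last > DEAD_INTERVAL_MS) {
                expired++;
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // Three nodes heartbeated within the 3 seconds before t = 10_000 ms.
        List<Long> lastHeartbeats = Arrays.asList(7_000L, 8_000L, 9_000L);
        long beforePause = 10_000L;
        // An 11-minute stop-the-world pause: no heartbeats are processed meanwhile.
        long afterPause = beforePause + 11 * 60 * 1000L;

        System.out.println("expired before pause: " + countExpired(lastHeartbeats, beforePause));
        System.out.println("expired after pause:  " + countExpired(lastHeartbeats, afterPause));
    }
}
```

Nothing changed on the datanodes between the two calls; only the observer's clock jumped, which is why the "dead" verdicts are false.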



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909042#comment-14909042
 ] 

Hudson commented on HDFS-9107:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2386 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2386/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp 
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909019#comment-14909019
 ] 

Hudson commented on HDFS-9107:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2359 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2359/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp 
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909011#comment-14909011
 ] 

Hudson commented on HDFS-9107:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #1181 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1181/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp 
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908941#comment-14908941
 ] 

Hudson commented on HDFS-9107:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #448 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/448/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp 
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908887#comment-14908887
 ] 

Hudson commented on HDFS-9107:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8521 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8521/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp 
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908831#comment-14908831
 ] 

Hudson commented on HDFS-9107:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #441 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/441/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp 
via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 
878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-25 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908763#comment-14908763
 ] 

Colin Patrick McCabe commented on HDFS-9107:


+1.  Test failures not related.  We can do cleanups in a follow-on.  Thanks, 
[~daryn]



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-22 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903077#comment-14903077
 ] 

Colin Patrick McCabe commented on HDFS-9107:


Also (although I don't feel strongly about this), I don't think we need to 
optimize by checking at the end of the entire scan whether to skip the next 
scan.  Long GCs are rare enough that we don't need to optimize the code path... 
just keep it simple.



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-22 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903048#comment-14903048
 ] 

Colin Patrick McCabe commented on HDFS-9107:


I guess if we want to be 100% correct, we have to do the stopwatch check right 
after getting back a "true" result from {{DatanodeManager#isDatanodeDead}}, 
right?  Otherwise we could always have a TOCTOU where we have a long GC pause 
right before calling that function.  What do you think?
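
One way to picture the check being proposed here is the following Java sketch. It is not the HDFS-9107 patch: `PauseAwareScan`, `scan`, the injected `LongSupplier` clock, and the thresholds are all hypothetical. The idea is to read a monotonic stopwatch immediately after a node is judged dead; if more time has passed since the previous scan finished than one sleep plus one scan could plausibly take, the verdict is treated as an artifact of a pause and the whole scan is discarded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

final class PauseAwareScan {
    private final LongSupplier clockMs;   // monotonic clock, injected so a fake can be used in tests
    private final long deadIntervalMs;    // age after which a heartbeat counts as expired
    private final long maxGapMs;          // longest believable gap since the previous scan ended
    private long stopwatchStartMs;

    PauseAwareScan(LongSupplier clockMs, long deadIntervalMs, long maxGapMs) {
        this.clockMs = clockMs;
        this.deadIntervalMs = deadIntervalMs;
        this.maxGapMs = maxGapMs;
        this.stopwatchStartMs = clockMs.getAsLong();
    }

    /** Returns indexes of nodes judged dead, or an empty list if a long pause was detected. */
    List<Integer> scan(long[] lastHeartbeatMs) {
        List<Integer> dead = new ArrayList<>();
        for (int i = 0; i < lastHeartbeatMs.length; i++) {
            long now = clockMs.getAsLong();
            if (now - lastHeartbeatMs[i] > deadIntervalMs) {
                // Check the stopwatch AFTER the "dead" verdict, so a pause that
                // happened just before the verdict cannot slip through.
                if (clockMs.getAsLong() - stopwatchStartMs > maxGapMs) {
                    dead.clear();   // distrust every verdict from this scan
                    break;
                }
                dead.add(i);
            }
        }
        stopwatchStartMs = clockMs.getAsLong();  // restart the stopwatch for the next cycle
        return dead;
    }
}
```

Discarding the scan gives the datanodes one more cycle to deliver heartbeats, so a later scan only reports nodes that are still silent.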



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901673#comment-14901673
 ] 

Hadoop QA commented on HDFS-9107:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 54s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac |   8m 21s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |  10m 25s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 22s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install |   1m 41s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 31s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native |   3m 16s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests | 198m  6s | Tests failed in hadoop-hdfs. |
| | | 244m 36s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestReplaceDatanodeOnFailure |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12761485/HDFS-9107.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b00392d |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12574/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/12574/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/12574/console |


This message was automatically generated.



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-21 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901165#comment-14901165
 ] 

Colin Patrick McCabe commented on HDFS-9107:


bq. I don't trust monotonicNow if the thread can suspend between calls; cores 
on different sockets may give different answers, though it's not something I've 
seen in the field.

Oracle's blog here [ 
https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks ] says:

bq. If you are interested in measuring/calculating elapsed time, then always 
use System.nanoTime(). On most systems it will give a resolution on the order 
of microseconds. Be aware though, this call can also take microseconds to 
execute on some platforms.

Of course, {{System#nanoTime}} is just a very thin wrapper around the operating 
system's monotonic clock.  In x86-land, the monotonic clock generally comes 
from one of two sources: the TSC (timestamp counter) or the HPET (high 
precision event timer).

In the 2000s, the TSC started becoming less useful because multi-core systems 
started becoming more common, and at that time, TSC wasn't synchronized across 
cores.  This has since changed (at least for Intel systems), and the TSC is now 
synchronized across cores.  So the alarm you are raising is about 5 years too 
late.  Anyway, if you have a "bad" TSC, you can still get {{System#nanoTime}} 
to behave correctly by switching your operating system's clock source to the 
HPET.  It's slower, but more reliable.

If you want to read more about this, check out 
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/332570

tl;dr
1. Operating systems implement various tricks to work around TSC bad behaviors
2. TSC bad behaviors are becoming less common in modern CPUs
3. You don't have to use the TSC if you don't want to!

Let's let the hardware and OS people do their job and just do ours.

I agree with [~hitliuyi]... +1 for the patch.  Would be even better if we could 
close that small window of a GC happening at a time other than during the 
{{Thread#sleep}}.
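
For readers less familiar with the distinction drawn in this comment, here is a minimal Java sketch of measuring elapsed time with `System.nanoTime()` rather than the wall clock. `ElapsedTimer` is a hypothetical helper written for this illustration, not Hadoop's API. Per the `System.nanoTime` Javadoc, its values are only meaningful as differences between two readings, so the sketch subtracts readings instead of comparing them directly.

```java
import java.util.concurrent.TimeUnit;

final class ElapsedTimer {
    // Monotonic reading taken at construction or at the last reset.
    private long startNanos = System.nanoTime();

    /** Restarts the timer from the current monotonic reading. */
    void reset() {
        startNanos = System.nanoTime();
    }

    /** Elapsed milliseconds since construction or the last reset. */
    long elapsedMillis() {
        // Subtracting two readings stays correct even if the raw counter wraps.
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    public static void main(String[] args) throws InterruptedException {
        ElapsedTimer timer = new ElapsedTimer();
        Thread.sleep(50);
        System.out.println("slept for about " + timer.elapsedMillis() + " ms");
    }
}
```

Unlike `System.currentTimeMillis()`, this measurement is not affected by wall-clock adjustments such as NTP corrections, which is the property a pause detector needs.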



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-21 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900820#comment-14900820
 ] 

Daryn Sharp commented on HDFS-9107:
---

[~hitliuyi], good points.

# Trust me, a ~10 min full GC is more than possible with a big heap.  
We've even bumped the recheck up on the largest clusters.  I should mention 
these big clusters go through 2-4 full GCs at startup while loading...  The 
overhead of artificially losing nodes doesn't help.  This patch won't stop a 
full GC during image load, or the first full GC in safemode, but should reduce 
the probability of additional full GCs.
# I thought of the exact same thing this weekend.  I'll post a revised and 
equally small patch that addresses the issue more thoroughly.



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900352#comment-14900352
 ] 

Yi Liu commented on HDFS-9107:
--

Sorry, I only just saw Steve's comments. 
{quote}
cores on different sockets may give different answers
{quote}
About {{nanoTime}}: yes, I have also seen similar points and discussions, but 
they seem to be incorrect and {{nanoTime}} is safe. See more discussion in 
http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless.
  (It has some links to Oracle articles.)



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900335#comment-14900335
 ] 

Yi Liu commented on HDFS-9107:
--

Thanks [~daryn], this issue looks critical.
I have a few comments:
*1.* The default heartbeat recheck interval is 5 minutes if not configured. Is 
a full GC longer than 5 minutes really possible? I have seen full GCs last tens 
of seconds, but never that long; of course, it depends on the heap size (old 
generation). In fact, the datanode dead (heartbeat expiry) interval is 2x the 
heartbeat recheck interval, so the full GC would have to last 10 minutes.

*2.* The patch assumes the full GC happens during the {{sleep}}. That is the 
most likely case, but if it happens after {{long now = ..}} or after setting 
{{lastHeatbeatCheck}} to {{now}}, the issue still exists, albeit with small 
probability. 

But I would like to give +1 for the patch, since it solves the issue when it 
really happens and doesn't affect the existing logic.
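
To make the two points above concrete, here is a rough sketch. It is not the code from the attached patch; the class name, method names, and the threshold parameter are assumptions for illustration. It shows the 2x arithmetic from point 1 and one way a checker could notice that its own {{sleep}} overshot by a suspicious amount.

```java
// Hypothetical sketch, not the HDFS-9107 patch.
public class HeartbeatCheckSketch {

    /** Default recheck interval mentioned in the comment: 5 minutes. */
    public static final long RECHECK_INTERVAL_MS = 5 * 60 * 1000L;

    /** Expiry interval as described in the comment: twice the recheck interval. */
    public static long expireIntervalMs(long recheckIntervalMs) {
        return 2 * recheckIntervalMs;
    }

    /**
     * Returns true when the time measured around a sleep exceeds the requested
     * sleep by more than thresholdMs, which suggests the checking thread was
     * stalled (for example by a full GC) and should not declare nodes dead
     * in this round.
     */
    public static boolean pausedTooLong(long beforeSleepMs, long afterSleepMs,
                                        long sleepMs, long thresholdMs) {
        long overshoot = (afterSleepMs - beforeSleepMs) - sleepMs;
        return overshoot > thresholdMs;
    }

    public static void main(String[] args) {
        System.out.println("expiry ms: " + expireIntervalMs(RECHECK_INTERVAL_MS));
        System.out.println("paused: " + pausedTooLong(0L, 605_000L, 5_000L, RECHECK_INTERVAL_MS));
    }
}
```

A check like this only covers a stall that falls between the two timestamps. A pause that lands after the second timestamp is taken is not detected, which is the residual window described in point 2.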



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877072#comment-14877072
 ] 

Steve Loughran commented on HDFS-9107:
--

# I don't trust monotonicNow if the thread can suspend between calls; cores on 
different sockets *may* give different answers, though it's not something I've 
seen in the field.
# kicking off a new test run; this looks like an mvn dependency failure



[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876346#comment-14876346
 ] 

Hadoop QA commented on HDFS-9107:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 50s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear to include any new or modified tests.  Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac |   7m 56s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |  10m 18s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 22s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install |   1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 31s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native |   3m 12s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests |  42m 47s | Tests failed in hadoop-hdfs. |
| | |  88m 26s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestFileStatus |
|   | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
|   | hadoop.hdfs.server.namenode.TestINodeFile |
|   | hadoop.fs.contract.hdfs.TestHDFSContractOpen |
|   | hadoop.hdfs.server.datanode.TestFsDatasetCache |
|   | hadoop.hdfs.server.datanode.fsdataset.impl.TestDatanodeRestart |
|   | hadoop.hdfs.TestFileCreationDelete |
|   | hadoop.hdfs.server.namenode.ha.TestHASafeMode |
|   | hadoop.hdfs.TestEncryptionZonesWithHA |
|   | hadoop.fs.contract.hdfs.TestHDFSContractMkdir |
|   | hadoop.hdfs.server.namenode.TestNameNodeXAttr |
|   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
|   | hadoop.hdfs.shortcircuit.TestShortCircuitCache |
|   | hadoop.hdfs.TestDecommission |
|   | hadoop.hdfs.server.namenode.TestFSEditLogLoader |
|   | hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA |
|   | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes |
|   | hadoop.hdfs.server.blockmanagement.TestNameNodePrunesMissingStorages |
|   | hadoop.hdfs.server.datanode.TestCachingStrategy |
|   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
|   | hadoop.cli.TestXAttrCLI |
|   | hadoop.hdfs.server.namenode.TestDeleteRace |
|   | hadoop.hdfs.server.namenode.TestParallelImageWrite |
|   | hadoop.hdfs.server.namenode.TestNameNodeRespectsBindHostKeys |
|   | hadoop.hdfs.server.namenode.TestNNStorageRetentionFunctional |
|   | hadoop.hdfs.server.namenode.TestSaveNamespace |
|   | hadoop.hdfs.TestDFSRename |
|   | hadoop.hdfs.util.TestDiff |
|   | hadoop.hdfs.server.datanode.TestDataNodeFSDataSetSink |
|   | hadoop.hdfs.server.namenode.TestFsck |
|   | hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA |
|   | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureToleration |
|   | hadoop.hdfs.server.datanode.TestDeleteBlockPool |
|   | hadoop.hdfs.TestRemoteBlockReader2 |
|   | hadoop.hdfs.server.namenode.TestStorageRestore |
|   | hadoop.hdfs.server.namenode.TestFileLimit |
|   | hadoop.hdfs.server.blockmanagement.TestNodeCount |
|   | hadoop.fs.contract.hdfs.TestHDFSContractSetTimes |
|   | hadoop.hdfs.server.namenode.snapshot.TestCheckpointsWithSnapshots |
|   | hadoop.hdfs.qjournal.TestNNWithQJM |
|   | hadoop.hdfs.server.namenode.TestFSPermissionChecker |
|   | hadoop.hdfs.server.namenode.TestSecureNameNode |
|   | hadoop.hdfs.server.namenode.TestFileContextAcl |
|   | hadoop.hdfs.server.datanode.TestDataXceiverLazyPersistHint |
|   | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
|   | hadoop.hdfs.TestDataTransferProtocol |
|   | hadoop.fs.viewfs.TestViewFsWithXAttrs |
|   | hadoop.hdfs.server.datanode.TestDiskError |
|   | hadoop.hdfs.TestFsShellPermission |
|   | hadoop.hdfs.server.namenode.TestMalformedURLs |
|   | hadoop.hdfs.TestReadWhileWriting |
|   | hadoop.fs.TestSWebHdfsFileContextMainOperations |
|   | hadoop.hdfs.TestIsMethodSupported |
|   | hadoop.hdfs.TestParallelShortCircuitReadNoChecksum |
|   | hadoop.hdfs.server.blockmanagement.TestAvailableSpaceBlockPlacementPolicy |
|   | hadoop.hdfs.TestFileCreationClient |
|   | hadoop.cli.TestAclCLI |
|   | hadoop.hdfs.TestFS

[jira] [Commented] (HDFS-9107) Prevent NN's unrecoverable death spiral after full GC

2015-09-18 Thread Esteban Gutierrez (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876097#comment-14876097
 ] 

Esteban Gutierrez commented on HDFS-9107:
-

Perhaps use the JvmPauseMonitor to monitor for a large pause and delay the 
expiration if the pause is above a threshold?
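
As a rough illustration of that idea, here is a sketch of a tracker that records when a short sleep overshoots by more than a threshold, so that an expiration check can ask whether a large pause happened recently. This is not Hadoop's JvmPauseMonitor API; the class, its methods, and the grace-period parameter are all hypothetical.

```java
// Hypothetical sketch, not Hadoop's JvmPauseMonitor.
public class PauseTrackerSketch {
    private final long sleepMs;      // how long the monitoring thread intends to sleep
    private final long thresholdMs;  // overshoot beyond which a pause counts as "large"
    private volatile boolean pauseSeen = false;
    private volatile long lastPauseEndMs = 0L;

    public PauseTrackerSketch(long sleepMs, long thresholdMs) {
        this.sleepMs = sleepMs;
        this.thresholdMs = thresholdMs;
    }

    /** Called by a monitoring thread with monotonic timestamps taken around each sleep. */
    public void recordSleep(long startMs, long endMs) {
        long extra = (endMs - startMs) - sleepMs;
        if (extra > thresholdMs) {
            lastPauseEndMs = endMs;
            pauseSeen = true;
        }
    }

    /** True if a large pause ended within the last graceMs, so expiration should wait. */
    public boolean shouldDelayExpiration(long nowMs, long graceMs) {
        return pauseSeen && (nowMs - lastPauseEndMs) <= graceMs;
    }

    public static void main(String[] args) {
        PauseTrackerSketch tracker = new PauseTrackerSketch(500L, 10_000L);
        tracker.recordSleep(1_000L, 31_500L); // a 500 ms sleep that took 30.5 s
        System.out.println(tracker.shouldDelayExpiration(32_000L, 60_000L));
    }
}
```

The grace period would need to be long enough for stalled heartbeats to arrive after the pause ends; choosing it is a tuning decision that this sketch does not settle.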
