[ https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908887#comment-14908887 ]
Hudson commented on HDFS-9107:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #8521 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8521/])
HDFS-9107. Prevent NN's unrecoverable death spiral after full GC (Daryn Sharp via Colin P. McCabe) (cmccabe: rev 4e7c6a653f108d44589f84d78a03d92ee0e8a3c3)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestHeartbeatHandling.java
Add HDFS-9107 to CHANGES.txt (cmccabe: rev 878504dcaacdc1bea42ad571ad5f4e537c1d7167)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> Prevent NN's unrecoverable death spiral after full GC
> -----------------------------------------------------
>
>                 Key: HDFS-9107
>                 URL: https://issues.apache.org/jira/browse/HDFS-9107
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: HDFS-9107.patch, HDFS-9107.patch
>
>
> A full GC pause in the NN that exceeds the dead node interval can lead to an
> infinite cycle of full GCs. The most common situation that precipitates an
> unrecoverable state is a network issue that temporarily cuts off multiple
> racks.
> The NN wakes up and falsely starts marking nodes dead. This bloats the
> replication queues which increases memory pressure. The replications create a
> flurry of incremental block reports and a glut of over-replicated blocks.
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration
> which requires a full block report - more memory pressure. The NN now has to
> invalidate all the over-replicated blocks. The extra blocks are added to
> invalidation queues, tracked in an excess blocks map, etc - much more memory
> pressure.
> All the memory pressure can push the NN into another full GC which repeats
> the entire cycle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
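
For readers following the description above, one way to break the cycle is to have the heartbeat monitor notice that it was itself stalled and skip the dead-node scan for that round, so live nodes get a chance to heartbeat first. The Java below is only a rough sketch of that idea and is not taken from the attached patch: the class name `PauseAwareHeartbeatCheck`, the method `findDeadNodes`, the tolerance parameter, and all interval values are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the HeartbeatManager code from the patch. */
class PauseAwareHeartbeatCheck {
    private final long recheckIntervalMs; // how often the monitor is expected to run
    private final long toleranceMs;       // scheduling slack before a gap counts as a stall
    private final long deadIntervalMs;    // silence after which a node is considered dead
    private long lastCheckMs;             // when the monitor last ran

    PauseAwareHeartbeatCheck(long recheckIntervalMs, long toleranceMs,
                             long deadIntervalMs, long nowMs) {
        this.recheckIntervalMs = recheckIntervalMs;
        this.toleranceMs = toleranceMs;
        this.deadIntervalMs = deadIntervalMs;
        this.lastCheckMs = nowMs;
    }

    /**
     * Returns the nodes to mark dead at nowMs. If the gap since the previous
     * check is much longer than expected (for example after a full GC), the
     * recorded heartbeat times are stale through no fault of the nodes, so
     * the scan is skipped and an empty list is returned.
     */
    List<String> findDeadNodes(long nowMs, Map<String, Long> lastHeartbeatMs) {
        long sinceLastCheck = nowMs - lastCheckMs;
        lastCheckMs = nowMs;
        List<String> dead = new ArrayList<>();
        if (sinceLastCheck > recheckIntervalMs + toleranceMs) {
            return dead; // the monitor was stalled; do not trust the timestamps
        }
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > deadIntervalMs) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        // Illustrative values only: 5 s recheck, 5 s tolerance, 630 s dead interval.
        PauseAwareHeartbeatCheck check =
            new PauseAwareHeartbeatCheck(5_000, 5_000, 630_000, 0);
        Map<String, Long> heartbeats = new HashMap<>();
        heartbeats.put("dn1", 0L);
        heartbeats.put("dn2", 0L);

        // The monitor wakes up 700 s late: the scan is skipped.
        System.out.println(check.findDeadNodes(700_000, heartbeats));

        // dn1 heartbeats shortly after; dn2 really is gone.
        heartbeats.put("dn1", 701_000L);
        System.out.println(check.findDeadNodes(705_000, heartbeats));
    }
}
```

With these made-up numbers, the first call returns an empty list even though both nodes look overdue, and the second call reports only `dn2`. The trade-off in a scheme like this is that genuinely dead nodes are detected one recheck interval later after a stall.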