[ https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HDFS-9107:
---------------------------------
    Status: Open  (was: Patch Available)

> Prevent NN's unrecoverable death spiral after full GC
> -----------------------------------------------------
>
>                 Key: HDFS-9107
>                 URL: https://issues.apache.org/jira/browse/HDFS-9107
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-9107.patch
>
>
> A full GC pause in the NN that exceeds the dead node interval can lead to an
> infinite cycle of full GCs. The most common situation that precipitates an
> unrecoverable state is a network issue that temporarily cuts off multiple
> racks.
> The NN wakes up and falsely starts marking nodes dead. This bloats the
> replication queues which increases memory pressure. The replications create a
> flurry of incremental block reports and a glut of over-replicated blocks.
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration
> which requires a full block report - more memory pressure. The NN now has to
> invalidate all the over-replicated blocks. The extra blocks are added to
> invalidation queues, tracked in an excess blocks map, etc - much more memory
> pressure.
> All the memory pressure can push the NN into another full GC which repeats
> the entire cycle.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
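
The failure mode described above starts when the NN, waking from a long pause, compares stale heartbeat timestamps against the dead node interval. As a rough illustration only, the sketch below shows one way a liveness check could notice that it was itself stalled and skip a round rather than declare nodes dead. This is not taken from HDFS-9107.patch or from the HDFS source: the class name, method names, and the MAX_EXPECTED_GAP_MS threshold are hypothetical. The interval arithmetic assumes the stock defaults of dfs.heartbeat.interval (3 s) and dfs.namenode.heartbeat.recheck-interval (5 min), which give a dead node interval of 2 * 5 min + 10 * 3 s = 630 s.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not code from HDFS or from HDFS-9107.patch. */
final class PauseAwareHeartbeatCheck {
    // Assumed defaults: dfs.heartbeat.interval = 3 s,
    // dfs.namenode.heartbeat.recheck-interval = 5 min.
    static final long HEARTBEAT_INTERVAL_MS = 3_000L;
    static final long RECHECK_INTERVAL_MS = 300_000L;
    // Hypothetical threshold: a gap between checks larger than this is
    // treated as evidence that the checker itself was paused (e.g. by GC).
    static final long MAX_EXPECTED_GAP_MS = 2 * RECHECK_INTERVAL_MS;

    /** Dead node interval under the assumed defaults: 630,000 ms. */
    static long deadNodeIntervalMs() {
        return 2 * RECHECK_INTERVAL_MS + 10 * HEARTBEAT_INTERVAL_MS;
    }

    private final Map<String, Long> lastHeartbeatMs = new LinkedHashMap<>();
    private long lastCheckMs;

    PauseAwareHeartbeatCheck(long nowMs) {
        this.lastCheckMs = nowMs;
    }

    /** Records a heartbeat from the given node at the given time. */
    void heartbeat(String node, long nowMs) {
        lastHeartbeatMs.put(node, nowMs);
    }

    /** Returns the nodes considered dead, or nothing if this round is skipped. */
    List<String> findDeadNodes(long nowMs) {
        long gapMs = nowMs - lastCheckMs;
        lastCheckMs = nowMs;
        List<String> dead = new ArrayList<>();
        if (gapMs > MAX_EXPECTED_GAP_MS) {
            // The checker was stalled, so the timestamps are stale on our
            // side. Skip this round and let nodes heartbeat again first.
            return dead;
        }
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > deadNodeIntervalMs()) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        PauseAwareHeartbeatCheck check = new PauseAwareHeartbeatCheck(0L);
        check.heartbeat("dn1", 0L);
        check.heartbeat("dn2", 0L);
        // Simulated 700 s stall: both nodes look expired, but the round is skipped.
        System.out.println(check.findDeadNodes(700_000L));   // prints []
        // dn1 heartbeats "within seconds"; dn2 stays silent.
        check.heartbeat("dn1", 703_000L);
        System.out.println(check.findDeadNodes(1_000_000L)); // prints [dn2]
    }
}
```

With a check like this, a node is only marked dead on a round where the checker has been running on schedule, so a single long pause does not by itself feed the replication and invalidation queues. Whether that is sufficient to break the cycle in a real NN is not something this sketch establishes.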