Daryn Sharp created HDFS-9107:
---------------------------------

             Summary: Prevent NN's unrecoverable death spiral after full GC
                 Key: HDFS-9107
                 URL: https://issues.apache.org/jira/browse/HDFS-9107
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.0.0-alpha
            Reporter: Daryn Sharp
            Assignee: Daryn Sharp
            Priority: Critical


A full GC pause in the NN that exceeds the dead node interval can lead to an 
infinite cycle of full GCs.  The most common situation that precipitates an 
unrecoverable state is a network issue that temporarily cuts off multiple racks.

The NN wakes up and starts falsely marking nodes dead, because it processed no 
heartbeats during the pause. This bloats the replication queues, which 
increases memory pressure. The replications create a flurry of incremental 
block reports and a glut of over-replicated blocks.
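
To illustrate why healthy nodes look dead after the pause, here is a minimal 
sketch, not actual NameNode code. It assumes the stock defaults (3 second 
heartbeat, 5 minute recheck) and the usual expiry window of 2 * recheck + 
10 * heartbeat, i.e. 10.5 minutes. The class and method names 
(StaleHeartbeatDemo, findDead) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not NameNode code. */
public class StaleHeartbeatDemo {
    // Assumed stock defaults: heartbeat every 3 s, recheck every 5 min.
    static final long HEARTBEAT_INTERVAL_MS = 3_000L;
    static final long RECHECK_INTERVAL_MS = 300_000L;
    // Expiry window derived from those two settings: 630,000 ms (10.5 min).
    static final long EXPIRE_INTERVAL_MS =
        2 * RECHECK_INTERVAL_MS + 10 * HEARTBEAT_INTERVAL_MS;

    /** Returns nodes whose last recorded heartbeat is older than the expiry window. */
    static List<String> findDead(Map<String, Long> lastHeartbeatMs, long nowMs) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > EXPIRE_INTERVAL_MS) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        Map<String, Long> lastHeartbeatMs = new LinkedHashMap<>();
        long t0 = 1_000_000L;
        lastHeartbeatMs.put("dn1", t0 - 1_000L); // heartbeated 1 s ago
        lastHeartbeatMs.put("dn2", t0 - 2_000L); // heartbeated 2 s ago
        System.out.println("before pause: " + findDead(lastHeartbeatMs, t0));

        // An 11 minute stop-the-world pause: the nodes kept sending heartbeats,
        // but none were recorded, so every timestamp is now stale.
        long afterPause = t0 + 11 * 60_000L;
        System.out.println("after pause:  " + findDead(lastHeartbeatMs, afterPause));
    }
}
```

With these numbers the first scan reports no dead nodes and the second reports 
both, even though neither node ever stopped heartbeating.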

The "dead" nodes heartbeat within seconds. The NN forces a re-registration, 
which requires a full block report - more memory pressure. The NN now has to 
invalidate all the over-replicated blocks. The extra blocks are added to 
invalidation queues, tracked in an excess blocks map, etc. - much more memory 
pressure.

All the memory pressure can push the NN into another full GC which repeats the 
entire cycle.
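
One possible way to break the cycle, sketched here under assumptions and not 
taken from any patch: have the heartbeat monitor notice that far more time 
passed between its rounds than it scheduled, and skip the dead-node scan for 
that round so the datanodes get a chance to heartbeat first. The class name, 
method name, and tolerance parameter below are hypothetical.

```java
/** Hypothetical pause-aware guard for a periodic dead-node scan. */
public class PauseAwareMonitor {
    private final long recheckIntervalMs; // how often the scan is scheduled
    private final long toleranceMs;       // hypothetical slack for normal jitter
    private long lastCheckMs;             // time of the previous round

    public PauseAwareMonitor(long recheckIntervalMs, long toleranceMs, long startMs) {
        this.recheckIntervalMs = recheckIntervalMs;
        this.toleranceMs = toleranceMs;
        this.lastCheckMs = startMs;
    }

    /**
     * Returns true if the dead-node scan should run this round. If the gap
     * since the previous round is much longer than scheduled, the process was
     * probably paused (for example by a full GC), so heartbeat timestamps are
     * stale and the scan is skipped once.
     */
    public boolean shouldScan(long nowMs) {
        long elapsedMs = nowMs - lastCheckMs;
        lastCheckMs = nowMs;
        return elapsedMs <= recheckIntervalMs + toleranceMs;
    }
}
```

A caller would pass a monotonic timestamp (for example one derived from 
System.nanoTime()) on each round and only mark nodes dead when shouldScan 
returns true. Because the guard resets its reference time on every call, the 
next on-schedule round scans normally.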



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
