[ https://issues.apache.org/jira/browse/HDFS-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900352#comment-14900352 ]
Yi Liu commented on HDFS-9107:
------------------------------

Sorry, I only just saw Steve's comments.
{quote}
cores on different sockets may give different answers
{quote}
About {{nanoTime}}: yes, I have also seen similar claims and discussions like this before, but it seems they are not correct and {{nanoTime}} is safe. See more discussion in http://stackoverflow.com/questions/510462/is-system-nanotime-completely-useless (there are some links to Oracle articles).

> Prevent NN's unrecoverable death spiral after full GC
> -----------------------------------------------------
>
>                 Key: HDFS-9107
>                 URL: https://issues.apache.org/jira/browse/HDFS-9107
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-9107.patch
>
>
> A full GC pause in the NN that exceeds the dead node interval can lead to an
> infinite cycle of full GCs. The most common situation that precipitates an
> unrecoverable state is a network issue that temporarily cuts off multiple
> racks.
> The NN wakes up and falsely starts marking nodes dead. This bloats the
> replication queues which increases memory pressure. The replications create a
> flurry of incremental block reports and a glut of over-replicated blocks.
> The "dead" nodes heartbeat within seconds. The NN forces a re-registration
> which requires a full block report - more memory pressure. The NN now has to
> invalidate all the over-replicated blocks. The extra blocks are added to
> invalidation queues, tracked in an excess blocks map, etc - much more memory
> pressure.
> All the memory pressure can push the NN into another full GC which repeats
> the entire cycle.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
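
To illustrate the failure mode in the issue description, here is a minimal Java sketch of one way a heartbeat monitor could avoid falsely marking nodes dead after its own long pause. It is not taken from HDFS-9107.patch or from the Hadoop code base. The class {{PauseAwareHeartbeatMonitor}}, its methods, and the {{maxStallNanos}} threshold are hypothetical names chosen for this example. Timestamps are passed in as parameters so the logic is deterministic; a real caller would presumably supply {{System.nanoTime()}}, which is why elapsed time is always computed by subtraction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch only: not the HDFS-9107 patch and not a Hadoop API.
 * All timestamps are monotonic readings in nanoseconds (e.g. System.nanoTime()).
 */
final class PauseAwareHeartbeatMonitor {
    private final long deadIntervalNanos; // silence longer than this means "dead"
    private final long maxStallNanos;     // assumed threshold for "the monitor itself stalled"
    private final Map<String, Long> lastHeartbeatNanos = new HashMap<>();
    private long lastCheckNanos;

    PauseAwareHeartbeatMonitor(long deadIntervalNanos, long maxStallNanos, long startNanos) {
        this.deadIntervalNanos = deadIntervalNanos;
        this.maxStallNanos = maxStallNanos;
        this.lastCheckNanos = startNanos;
    }

    /** Records a heartbeat from the given node at the given time. */
    void heartbeat(String node, long nowNanos) {
        lastHeartbeatNanos.put(node, nowNanos);
    }

    /**
     * Returns the nodes considered dead at nowNanos. If the gap since the
     * previous check exceeds maxStallNanos, the monitor assumes it was paused
     * itself (for example by a full GC) and marks nothing dead this round.
     */
    List<String> check(long nowNanos) {
        long sinceLastCheck = nowNanos - lastCheckNanos; // subtraction tolerates nanoTime wraparound
        lastCheckNanos = nowNanos;
        List<String> dead = new ArrayList<>();
        if (sinceLastCheck > maxStallNanos) {
            return dead; // skip this round so live nodes get a chance to heartbeat
        }
        for (Map.Entry<String, Long> e : lastHeartbeatNanos.entrySet()) {
            if (nowNanos - e.getValue() > deadIntervalNanos) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

Skipping a single round relies on the observation in the description that the "dead" nodes heartbeat within seconds: by the next regular check their timestamps are fresh again, so no replication work is queued for them.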