Aaron T. Myers created HDFS-5060:
------------------------------------

             Summary: NN should proactively perform a saveNamespace if it has a 
huge number of outstanding uncheckpointed transactions
                 Key: HDFS-5060
                 URL: https://issues.apache.org/jira/browse/HDFS-5060
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.1.0-beta
            Reporter: Aaron T. Myers
            Assignee: Aaron T. Myers


In a properly functioning HDFS system, checkpoints are triggered regularly by 
either the secondary NN or the standby NN, by default every hour or every 
1 million outstanding edit transactions, whichever comes first. However, if 
this second node is down for an extended period of time, the number of 
outstanding transactions can grow so large that a restart takes an 
inordinately long time.
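
To make the default trigger rule above concrete, here is a minimal sketch. 
The class and method names are hypothetical and are not taken from the HDFS 
code base; only the two default values (one hour, 1 million transactions) 
come from the description.

```java
// Hypothetical illustration of the "every hour or every 1 million
// transactions, whichever comes first" rule. Not actual HDFS code.
final class CheckpointTrigger {
    // Default values as stated in the description.
    static final long DEFAULT_PERIOD_SECONDS = 3600L;
    static final long DEFAULT_TXN_THRESHOLD = 1_000_000L;

    private CheckpointTrigger() {}

    // True as soon as either limit is reached.
    static boolean shouldCheckpoint(long secondsSinceLastCheckpoint,
                                    long uncheckpointedTxns) {
        return secondsSinceLastCheckpoint >= DEFAULT_PERIOD_SECONDS
            || uncheckpointedTxns >= DEFAULT_TXN_THRESHOLD;
    }
}
```

When the second node is down, nothing evaluates this rule, so the number of 
uncheckpointed transactions keeps growing without bound.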

This JIRA proposes to make the active NN monitor its number of outstanding 
transactions and perform a proactive local saveNamespace if that number grows 
beyond a configurable threshold. I'm envisioning something like 10x the 
configured transaction count that, in a properly functioning cluster, would 
trigger a checkpoint from the second NN. Though the saveNamespace would be 
disruptive to clients while it's taking place, likely for a few minutes, this 
seems better than the alternative of a subsequent multi-hour restart, and it 
should never actually occur in a properly functioning cluster.
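
A rough sketch of the proposed check, assuming the threshold is expressed as 
a multiplier on the configured checkpoint transaction count. All names here 
(ProactiveSaveMonitor, the multiplier parameter, the transaction ID 
arguments) are hypothetical and only illustrate the idea; this is not a 
proposed patch.

```java
// Hypothetical sketch of the proposed monitor. Not actual HDFS code.
final class ProactiveSaveMonitor {
    private final long checkpointTxns; // txn count that normally triggers a checkpoint
    private final long multiplier;     // assumed configurable factor, e.g. 10

    ProactiveSaveMonitor(long checkpointTxns, long multiplier) {
        if (checkpointTxns <= 0 || multiplier <= 0) {
            throw new IllegalArgumentException("values must be positive");
        }
        this.checkpointTxns = checkpointTxns;
        this.multiplier = multiplier;
    }

    // Number of outstanding transactions beyond which the active NN
    // would save its namespace locally.
    long threshold() {
        return Math.multiplyExact(checkpointTxns, multiplier);
    }

    // True only when the outstanding count has grown beyond the threshold.
    boolean shouldSaveNamespace(long lastWrittenTxId, long lastCheckpointTxId) {
        long outstanding = lastWrittenTxId - lastCheckpointTxId;
        return outstanding > threshold();
    }
}
```

With the default of 1 million transactions and a multiplier of 10, the 
threshold comes to 10 million outstanding transactions, far above anything a 
cluster with a healthy second NN should reach.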
