[ https://issues.apache.org/jira/browse/HDFS-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240366#comment-14240366 ]
Lei (Eddy) Xu commented on HDFS-6440:
-------------------------------------

[~jesse_yates] Thanks for working on this cool feature. We have read your design doc and came up with only a few questions:

# What is the procedure for adding or replacing NNs? Could it support dynamically adding NNs without downtime?
# It seems that whether to upload an fsimage is mostly determined by the SNN (e.g., after it finishes a checkpoint). Would it be possible to prevent multiple SNNs from uploading fsimages with trivial deltas within a short time? E.g., let the ANN reject upload requests if {{lastUploadTime > now - quiet period && num of edits < N}}?
# It seems that QJM inherits the current ANN/SNN behavior of purging edit logs after *_one_* SNN uploads an fsimage. Could this behavior cause other SNNs to miss edit logs? E.g., what happens if an SNN crashes and comes back online after the edit logs it still needs have been purged?
# Does this work support rolling upgrade?
# Would it make client failover more complicated?

And some minor concerns:

# What would be the impact on the DN side?
# What are the changes to the test resource files (hadoop-*-reserved.tgz)?

Thanks again for this awesome work!

> Support more than 2 NameNodes
> -----------------------------
>
>                 Key: HDFS-6440
>                 URL: https://issues.apache.org/jira/browse/HDFS-6440
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: auto-failover, ha, namenode
>    Affects Versions: 2.4.0
>            Reporter: Jesse Yates
>            Assignee: Jesse Yates
>         Attachments: Multiple-Standby-NameNodes_V1.pdf, hdfs-6440-cdh-4.5-full.patch, hdfs-multiple-snn-trunk-v0.patch
>
>
> Most of the work is already done to support more than 2 NameNodes (one active, one standby). This would be the last bit to support running multiple _standby_ NameNodes; one of the standbys should be available for fail-over.
> Mostly, this is a matter of updating how we parse configurations, some complexity around managing the checkpointing, and updating a whole lot of tests.
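The upload-throttling rule suggested in the comment's second question ({{lastUploadTime > now - quiet period && num of edits < N}}) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not HDFS code: the class {{CheckpointUploadGate}}, its parameters, and the choice to count "num of edits" as the transaction-ID distance from the last accepted image are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointUploadGate:
    """Hypothetical gate an active NameNode could consult before accepting
    an fsimage upload from one of several standby NameNodes."""

    quiet_period_s: float  # hypothetical: minimum seconds between accepted uploads
    min_new_edits: int     # hypothetical: the "N" from the comment
    last_upload_time: Optional[float] = None
    last_upload_txid: int = 0  # assumption: edits counted by txid distance

    def should_reject(self, image_txid: int, now: Optional[float] = None) -> bool:
        """Reject if the last upload was recent AND the new image adds few edits."""
        now = time.time() if now is None else now
        if self.last_upload_time is None:
            return False  # nothing accepted yet, so take the first image
        within_quiet_period = self.last_upload_time > now - self.quiet_period_s
        few_new_edits = (image_txid - self.last_upload_txid) < self.min_new_edits
        return within_quiet_period and few_new_edits

    def record_upload(self, image_txid: int, now: Optional[float] = None) -> None:
        """Remember when, and at which txid, an image was last accepted."""
        self.last_upload_time = time.time() if now is None else now
        self.last_upload_txid = image_txid


if __name__ == "__main__":
    gate = CheckpointUploadGate(quiet_period_s=60.0, min_new_edits=1000)
    gate.record_upload(image_txid=500, now=0.0)
    # 30 s after the last upload and only 100 new edits: rejected.
    print(gate.should_reject(image_txid=600, now=30.0))   # True
    # Same time, but 1500 new edits: accepted.
    print(gate.should_reject(image_txid=2000, now=30.0))  # False
```

Both conditions must hold to reject, so a standby carrying a large delta is never blocked, and once the quiet period has elapsed any image is accepted again.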