[ https://issues.apache.org/jira/browse/HADOOP-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558173#action_12558173 ]
Konstantin Shvachko commented on HADOOP-2585:
---------------------------------------------

We had a real example of such a failure on one of our clusters. We were able to reconstruct the namespace image from the secondary node using the following manual procedure, which might be useful for those who find themselves in the same kind of trouble.

h4. Manual recovery procedure from the secondary image.
# Stop the cluster to make sure all data-nodes and *-trackers are down.
# Select a node where you will run the new name-node, and set it up as usual for a name-node.
# Format the new name-node.
# cd <dfs.name.dir>/current
# You will see the file VERSION in there. Set namespaceID in it to the namespaceID of the old cluster. The old namespaceID can be obtained from any of the data-nodes: copy it from the namespaceID field of <dfs.data.dir>/current/VERSION.
# rm <dfs.name.dir>/current/fsimage
# scp <secondary-node>:<fs.checkpoint.dir>/destimage.tmp ./fsimage
# Start the cluster. An upgrade is recommended, so that you can roll back if something goes wrong.
# Run fsck, and remove files with missing blocks, if any.

h4. Automatic recovery proposal.
The proposal has 2 parts.
# The secondary node should store the latest checkpointed image file in compliance with the name-node storage directory structure. It is best if the secondary node uses the Storage class (or FSImage, if code re-use makes sense here) to maintain the checkpoint directory. This should ensure that the checkpointed image is always ready to be read by a name-node if the directory is listed in its "dfs.name.dir" list.
# The name-node should consider the configuration variable "fs.checkpoint.dir" as a possible location of an image available for read-only access during startup. This means that if the name-node finds all directories listed in "dfs.name.dir" unavailable, or finds their images corrupted, it should turn to the "fs.checkpoint.dir" directory and try to fetch the image from there.
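The fallback order in part 2 of the proposal can be illustrated with a small shell sketch. This is not name-node code: the function names {{pick_image_dir}} and {{has_usable_image}} are hypothetical, a "usable" image is approximated as a readable, non-empty current/fsimage file, and the sketch returns the first usable directory rather than the most up-to-date one.

```shell
#!/bin/sh
# Hypothetical sketch of the proposed fallback order; not actual name-node logic.

# Assumption: an image counts as usable if <dir>/current/fsimage is readable
# and non-empty. A real name-node would validate the image contents.
has_usable_image() {
    [ -r "$1/current/fsimage" ] && [ -s "$1/current/fsimage" ]
}

# $1: comma-separated directory list, as in "dfs.name.dir"
# $2: checkpoint directory, as in "fs.checkpoint.dir"
# Prints the chosen directory; returns 1 if no usable image exists anywhere.
pick_image_dir() {
    name_dirs=$1
    checkpoint_dir=$2
    old_ifs=$IFS
    IFS=,
    for dir in $name_dirs; do
        if has_usable_image "$dir"; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    # All "dfs.name.dir" entries are unavailable or unusable:
    # fall back to the checkpoint directory, which is only ever read.
    if has_usable_image "$checkpoint_dir"; then
        printf '%s\n' "$checkpoint_dir"
        return 0
    fi
    return 1
}
```

A caller would pass the two configured values, for example {{pick_image_dir "$NAME_DIRS" "$CHECKPOINT_DIR"}}, where both variable names are placeholders, and treat a non-zero status as "no image to start from".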
I think this fallback should not be the default behavior but rather be triggered by a name-node startup option, something like:
{code}
hadoop namenode -fromCheckpoint
{code}
So the name-node can start with the secondary image as long as the secondary node drive is mounted. And the name-node will never attempt to write anything to this drive.

h4. Added bonuses provided by this approach
- One can choose to restart a failed name-node directly on the node where the secondary node ran. This brings us a step closer to the hot standby.
- Replication of the image to NFS can be delegated to the secondary name-node if we support multiple entries in "fs.checkpoint.dir". This applies, of course, only if the administrator chooses to accept outdated images in order to boost name-node performance.

> Automatic namespace recovery from the secondary image.
> ------------------------------------------------------
>
> Key: HADOOP-2585
> URL: https://issues.apache.org/jira/browse/HADOOP-2585
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Affects Versions: 0.15.0
> Reporter: Konstantin Shvachko
>
> Hadoop has a three-way (configuration controlled) protection from losing the namespace image.
> # image can be replicated on different hard-drives of the same node;
> # image can be replicated on an NFS-mounted drive on an independent node;
> # a stale replica of the image is created during periodic checkpointing and stored on the secondary name-node.
> Currently during startup the name-node examines all configured storage directories, selects the
> most up-to-date image, reads it, merges it with the corresponding edits, and writes the new image back
> into all storage directories. Everything is done automatically.
> If, due to multiple hardware failures, none of those images on mounted hard drives (local or remote)
> are available, the secondary image, although stale (up to one hour old by default), can still be
> used to recover the majority of the file system data.
> Currently one can reconstruct a valid name-node image from the secondary one manually.
> It would be nice to support an automatic recovery.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.