[ https://issues.apache.org/jira/browse/HADOOP-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558173#action_12558173 ]

Konstantin Shvachko commented on HADOOP-2585:
---------------------------------------------

We had a real example of such a failure on one of our clusters, and we were able to reconstruct the namespace image from the secondary node using the following manual procedure, which may be useful for anyone who finds themselves in the same type of trouble.

h4. Manual recovery procedure from the secondary image.
# Stop the cluster to make sure all data-nodes and *-trackers are down.
# Select a node where you will run the new name-node, and set it up as usual for a name-node.
# Format the new name-node.
# cd <dfs.name.dir>/current
# You will see a file named VERSION in there. You will need to set the namespaceID of the old cluster in it. The old namespaceID can be obtained from any of the data-nodes: just copy the namespaceID value from its <dfs.data.dir>/current/VERSION file.
# rm <dfs.name.dir>/current/fsimage
# scp <secondary-node>:<fs.checkpoint.dir>/destimage.tmp ./fsimage
# Start the cluster. Starting it as an upgrade is recommended, so that you can roll back if something goes wrong.
# Run fsck and remove files with missing blocks, if any.
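The VERSION file mentioned in step 5 is a small Java properties file. Its contents look roughly like the example below; the values shown are illustrative only and the exact set of fields depends on the release, but namespaceID is the field that has to match the data-nodes.
{code}
#Thu Jan 10 12:00:00 PST 2008
namespaceID=1234567890
cTime=0
storageType=NAME_NODE
layoutVersion=-11
{code}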
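Consolidated, the whole procedure could look roughly like the following shell session. The paths and host names (/data/dfs/name, /data/dfs/checkpoint, secondary-host) are placeholders for your own dfs.name.dir, fs.checkpoint.dir and secondary node; substitute accordingly.
{code}
# 1. stop everything: data-nodes, job-tracker, task-trackers
bin/stop-all.sh

# 2-3. on the node chosen for the new name-node: format it
bin/hadoop namenode -format

# 4-5. fix the namespaceID in the freshly created VERSION file
cd /data/dfs/name/current
vi VERSION        # set namespaceID to the value found in a data-node's
                  # <dfs.data.dir>/current/VERSION

# 6-7. replace the empty image with the secondary checkpoint
rm fsimage
scp secondary-host:/data/dfs/checkpoint/destimage.tmp ./fsimage

# 8. bring the cluster back up (optionally as an upgrade, so you can roll back)
bin/start-all.sh

# 9. check the namespace and clean up files with missing blocks
bin/hadoop fsck /
{code}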

h4. Automatic recovery proposal.
The proposal consists of two parts.
# The secondary node should store the latest checkpointed image file in compliance with the name-node storage directory structure. It is best if the secondary node uses the Storage class (or FSImage, if code re-use makes sense here) in order to maintain the checkpoint directory. This should ensure that the checkpointed image is always ready to be read by a name-node if the directory is listed in its "dfs.name.dir" list.
# The name-node should consider the configuration variable "fs.checkpoint.dir" a possible location of an image available for read-only access during startup. This means that if the name-node finds all directories listed in "dfs.name.dir" unavailable, or finds their images corrupted, it should turn to the "fs.checkpoint.dir" directory and try to fetch the image from there. I think this should not be the default behavior but rather be triggered by a name-node startup option, something like:
{code}
hadoop namenode -fromCheckpoint
{code}
So the name-node can start from the secondary image as long as the secondary node's drive is mounted, and the name-node will never attempt to write anything to that drive.
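To illustrate part 1: laid out in compliance with the name-node storage structure, the checkpoint directory would contain roughly the following (the root path is a placeholder, and the exact set of auxiliary files maintained by Storage/FSImage may differ between releases):
{code}
<fs.checkpoint.dir>/
  current/
    VERSION     # storageType, layoutVersion, namespaceID, cTime
    fsimage     # latest checkpointed namespace image
    edits       # (empty) edits log
    fstime      # time of the last checkpoint
{code}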
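And to illustrate part 2, the relevant configuration could look something like this; the directory paths below are placeholders, while "dfs.name.dir" and "fs.checkpoint.dir" are the configuration variables referenced above:
{code}
<!-- hadoop-site.xml: placeholder paths -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/name,/mnt/nfs/dfs/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/mnt/secondary/dfs/checkpoint</value>
</property>
{code}
With such a configuration, starting the name-node with the proposed option would fall back to /mnt/secondary/dfs/checkpoint only when both "dfs.name.dir" entries are unusable.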

h4. Added bonuses provided by this approach
- One can choose to restart a failed name-node directly on the node where the secondary node ran. This brings us a step closer to a hot standby.
- Replication of the image to NFS can be delegated to the secondary name-node if we support multiple entries in "fs.checkpoint.dir" (a sketch of such a configuration follows below). This is, of course, only if the administrator chooses to accept outdated images in order to boost name-node performance.
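The multi-entry form of "fs.checkpoint.dir" is part of the proposal, not something that exists today; if it were supported, the configuration might look like this (placeholder paths, one local directory and one NFS mount):
{code}
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/dfs/checkpoint,/mnt/nfs/dfs/checkpoint</value>
</property>
{code}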


> Automatic namespace recovery from the secondary image.
> ------------------------------------------------------
>
>                 Key: HADOOP-2585
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2585
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.15.0
>            Reporter: Konstantin Shvachko
>
> Hadoop has a three-way (configuration controlled) protection from losing the 
> namespace image.
> # the image can be replicated on different hard drives of the same node;
> # the image can be replicated on an NFS-mounted drive on an independent node;
> # a stale replica of the image is created during periodic checkpointing and 
> stored on the secondary name-node.
> Currently during startup the name-node examines all configured storage 
> directories, selects the most up-to-date image, reads it, merges it with the 
> corresponding edits, and writes the new image back into all storage 
> directories. Everything is done automatically.
> If, due to multiple hardware failures, none of the images on mounted hard 
> drives (local or remote) are available, the secondary image, although stale 
> (up to one hour old by default), can still be used to recover the majority 
> of the file system data.
> Currently one can reconstruct a valid name-node image from the secondary one 
> manually.
> It would be nice to support an automatic recovery.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
