Please comment on following proposal. Proposal for Replication of DFS Namespace Images and Transaction Logs
Currently, when the namenode starts, it reads the namespace image from dfs.name.dir and from that, initializes the namespace data structures. If the transaction log exists, it merges the transaction logs with the in-memory namespace, and writes out the merged namespace image. It then reinitializes the transaction log file.
As namespace modifications occur, these modifications are logged into the transaction log file. The transaction log file is never flushed, and is closed only if the namenode shuts down normally. In case of error or forced shutdown of namenode, the last few buffered (but not yet written to the disk) transactions may get lost. In addition if the namespace image file is corrupted or is accidentally deleted, or if the disk holding that image file crashes, there is no way to recover the state of DFS.
This proposal suggests a design for replication the DFS namespace image as well as transaction logs, so that even in case of a catastrophic failure, the DFS state can almost always be recovered.
We suggest a two-pronged approach. First, allow multiple copies of image and transaction log on different volumes of namenode. Secondly, have backup "read-only" namenodes, that would allow continuous functioning of DFS even in case of namenode failure.
We propose that dfs.name.dir configuration parameter be allowed to have a comma-separated list of different locations within the namenode where DFS image and logs would be replicated. This allows for a disk failure to not hinder recoverability of DFS state. Each time the image file is updated, as well as each time a transaction is logged, it is written to all the locations specified in dfs.name.dir. The list of locations in dfs.name.dir could include all local disks of the namenode as well as NFS-mounted drives, thus providing a remote backup of DFS state. If the NFS-mounted drive is RAIDed, this itself provides the reliability required.
Currently, the transaction log file is always kept open in write- mode. Thus in case of the namenode failure, or forcibly shuttting down namenode can cause the last few transactions that have been buffered in memory to get lost. The number of transactions lost will depend on the buffer-size. We propose that the DFS administrator control this parameter. Configuration will include a parameter "dfs.namenode.edits.buffer" to specify number of transactions upon which the transaction log will be closed (thus flushing all the buffered transactions to disk), and reopened in append-mode.
In order to determine which image and log files are the snapshot of the latest state, these files should indicate a positive 4-byte "generation number". This can be achieved without even having to modify the image and transaction log file format. The filename can contain the generation number. Each time the namenode restarts, the generation number of both the image file as well as transaction log is incremented to reflect this. Upon startup, the namenode scans all the locations in dfs.name.dir to determine which location contains the latest image and corresponding logs according to the generation number, and loads the latest image and log (from possibly different locations). If in case the sizes of the transaction logs with the same name do not match, one with the larger size is chosen.
Second proposal (which can be in addition to the first multiple- volumes proposal) suggests having multiple backup namenodes. These backup name nodes are started on different machines with an additional command-line parameter "-backup" to the namenode.
The backup namenode functions in approximately the same way as the namenode in safe mode (i.e. read-only), except that upon startup, it connects to the main namenode specified in "fs.default.name", supplies the current generation of its image and transaction log and asks for the latest FSimage and transaction log, stores them on the disk locations in "dfs.name.dir", and accordingly also modifies its internal namesystem data structures. The backup name nodes do not listen to blockreports or heartbeats from datanodes. Their sole task is to keep a backup of DFS state. When the main namenode fails, any of these backup namenodes can be restarted by DFS administrator in normal mode, and DFS can continue functioning.
Later, the backup namenode can also be allowed to entertain read-only requests from DFS clients, thus making DFS more performant and scalable.