Proposal for replicating namenode state and transaction logs

Milind Bhandarkar Wed, 20 Sep 2006 15:05:57 -0700

Please comment on the following proposal.

Proposal for Replication of DFS Namespace Images and Transaction Logs

Currently, when the namenode starts, it reads the namespace image from dfs.name.dir and from that, initializes the namespace data structures. If the transaction log exists, it merges the transaction log with the in-memory namespace, and writes out the merged namespace image. It then reinitializes the transaction log file.

As namespace modifications occur, they are logged into the transaction log file. The transaction log file is never flushed, and is closed only if the namenode shuts down normally. In case of an error or a forced shutdown of the namenode, the last few buffered (but not yet written to disk) transactions may be lost. In addition, if the namespace image file is corrupted or accidentally deleted, or if the disk holding that image file crashes, there is no way to recover the state of DFS.

This proposal suggests a design for replicating the DFS namespace image as well as the transaction logs, so that even in case of a catastrophic failure, the DFS state can almost always be recovered.

We suggest a two-pronged approach. First, allow multiple copies of the image and transaction log on different volumes of the namenode. Second, have backup "read-only" namenodes that would allow DFS to keep functioning even if the namenode fails.

We propose that the dfs.name.dir configuration parameter be allowed to hold a comma-separated list of different locations on the namenode where the DFS image and logs would be replicated. This way, a single disk failure does not prevent recovery of DFS state. Each time the image file is updated, and each time a transaction is logged, the change is written to all the locations specified in dfs.name.dir. The list of locations in dfs.name.dir could include all local disks of the namenode as well as NFS-mounted drives, thus providing a remote backup of DFS state. If the NFS-mounted drive is RAIDed, this by itself provides the reliability required.
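To make the intent concrete, here is a minimal sketch (in Python, purely for illustration; it is not namenode code) of splitting a comma-separated dfs.name.dir value and appending each logged transaction to every location. The function names and the log file name "edits" are placeholders I made up for this example.

```python
import os
import tempfile


def parse_name_dirs(value):
    """Split a comma-separated dfs.name.dir value into a list of directories."""
    return [d.strip() for d in value.split(",") if d.strip()]


def log_transaction(name_dirs, record, log_name="edits"):
    """Append one record to the log in every directory.

    Returns the list of directories where the write failed, so the caller
    can decide whether the remaining copies are enough to continue.
    """
    failed = []
    for d in name_dirs:
        try:
            os.makedirs(d, exist_ok=True)
            with open(os.path.join(d, log_name), "ab") as f:
                f.write(record)
        except OSError:
            failed.append(d)
    return failed


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    dirs = parse_name_dirs(f"{base}/disk1, {base}/disk2")
    print(log_transaction(dirs, b"mkdir /user/foo\n"))
```

Whether a failed location should abort the operation or merely be reported is a policy question this sketch leaves open.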

Currently, the transaction log file is always kept open in write mode. Thus a namenode failure or a forced shutdown of the namenode can cause the last few transactions that have been buffered in memory to be lost. The number of transactions lost will depend on the buffer size. We propose that the DFS administrator control this parameter. The configuration will include a parameter "dfs.namenode.edits.buffer" specifying the number of transactions after which the transaction log will be closed (thus flushing all the buffered transactions to disk) and reopened in append mode.
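The close-and-reopen behaviour could look roughly like the sketch below. It is an illustration in Python under my own assumptions: the class name and the max_buffered argument (standing in for the value of "dfs.namenode.edits.buffer") are hypothetical.

```python
import os
import tempfile


class BufferedEditLog:
    """Transaction log that is closed and reopened every max_buffered records."""

    def __init__(self, path, max_buffered):
        self.path = path
        self.max_buffered = max_buffered  # stands in for dfs.namenode.edits.buffer
        self.pending = 0                  # records written since the last close
        self.f = open(path, "ab")

    def log(self, record):
        self.f.write(record)
        self.pending += 1
        if self.pending >= self.max_buffered:
            # Closing flushes the buffered records; reopen in append mode.
            self.f.close()
            self.f = open(self.path, "ab")
            self.pending = 0

    def close(self):
        self.f.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "edits")
    log = BufferedEditLog(path, max_buffered=3)
    for i in range(3):
        log.log(b"txn %d\n" % i)
    print(os.path.getsize(path))
    log.close()
```

With this scheme at most max_buffered - 1 transactions are buffered in the process at any time. Note that closing a file hands the data to the operating system; a real implementation that needs the data on the physical disk would also have to sync the file.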

In order to determine which image and log files are the snapshot of the latest state, these files should carry a positive 4-byte "generation number". This can be achieved without even having to modify the image and transaction log file formats: the filename can contain the generation number. Each time the namenode restarts, the generation number of both the image file and the transaction log is incremented to reflect this. Upon startup, the namenode scans all the locations in dfs.name.dir to determine which location contains the latest image and corresponding logs according to the generation number, and loads the latest image and log (from possibly different locations). If the sizes of transaction logs with the same name do not match, the one with the larger size is chosen.
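The startup scan might be sketched as follows. This is Python for illustration only, and the naming scheme "<prefix>.<generation>" (for example "edits.7") is an assumption of mine; the proposal only says that the filename can contain the generation number.

```python
import os
import re
import tempfile


def find_latest(name_dirs, prefix):
    """Return the path of the newest '<prefix>.<generation>' file, or None.

    The highest generation wins; among files with the same generation
    (and therefore the same name), the larger file wins.
    """
    pattern = re.compile(re.escape(prefix) + r"\.(\d+)")
    best_key, best_path = None, None
    for d in name_dirs:
        if not os.path.isdir(d):
            continue  # a failed or missing volume is simply skipped
        for name in os.listdir(d):
            m = pattern.fullmatch(name)
            if m is None:
                continue
            path = os.path.join(d, name)
            key = (int(m.group(1)), os.path.getsize(path))
            if best_key is None or key > best_key:
                best_key, best_path = key, path
    return best_path


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    dirs = [os.path.join(base, "disk1"), os.path.join(base, "disk2")]
    for d in dirs:
        os.makedirs(d)
    with open(os.path.join(dirs[0], "edits.2"), "wb") as f:
        f.write(b"short")
    with open(os.path.join(dirs[1], "edits.2"), "wb") as f:
        f.write(b"longer copy")
    print(find_latest(dirs, "edits"))
```

The image and the log are looked up independently (once with each prefix), which is what allows them to come from different locations.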

The second proposal (which can be adopted in addition to the first, multiple-volumes proposal) suggests having multiple backup namenodes. These backup namenodes are started on different machines with an additional command-line parameter "-backup" to the namenode.

The backup namenode functions in approximately the same way as the namenode in safe mode (i.e. read-only), except that upon startup, it connects to the main namenode specified in "fs.default.name", supplies the current generation of its image and transaction log, asks for the latest FSimage and transaction log, stores them on the disk locations in "dfs.name.dir", and updates its internal namesystem data structures accordingly. The backup namenodes do not listen to block reports or heartbeats from datanodes. Their sole task is to keep a backup of DFS state. When the main namenode fails, any of these backup namenodes can be restarted by the DFS administrator in normal mode, and DFS can continue functioning.
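A very rough in-memory sketch of that exchange is given below, in Python and under several assumptions of mine: the class and method names are hypothetical, the main namenode resends the image only when the backup's generation differs, and the writes to "dfs.name.dir" and the network transport are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class NameState:
    generation: int = 0
    image: bytes = b""
    edits: list = field(default_factory=list)


class MainNamenode:
    def __init__(self, state):
        self.state = state

    def fetch_latest(self, backup_generation):
        """Return (generation, image or None, edits) for a backup.

        The image is sent only if the backup's generation is out of date.
        """
        s = self.state
        image = s.image if backup_generation != s.generation else None
        return s.generation, image, list(s.edits)


class BackupNamenode:
    def __init__(self):
        self.state = NameState()

    def sync(self, main):
        gen, image, edits = main.fetch_latest(self.state.generation)
        if image is not None:
            self.state.image = image
        self.state.generation = gen
        self.state.edits = edits  # replace the local copy of the log


if __name__ == "__main__":
    main = MainNamenode(NameState(3, b"fsimage-bytes", ["mkdir /a"]))
    backup = BackupNamenode()
    backup.sync(main)
    print(backup.state == main.state)
```

How often a backup repeats the request after startup is not specified above, so the sketch leaves the scheduling of sync() to the caller.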

Later, the backup namenode can also be allowed to entertain read-only requests from DFS clients, thus making DFS more performant and scalable.
