[ http://issues.apache.org/jira/browse/HADOOP-306?page=all ]
Konstantin Shvachko updated HADOOP-306:
---------------------------------------

    Attachment:     (was: FSImageSaveDNInfo.patch)

> Safe mode and name node startup procedures
> ------------------------------------------
>
>                 Key: HADOOP-306
>                 URL: http://issues.apache.org/jira/browse/HADOOP-306
>             Project: Hadoop
>          Issue Type: New Feature
>    Affects Versions: 0.3.2
>            Reporter: Konstantin Shvachko
>         Assigned To: Konstantin Shvachko
>             Fix For: 0.6.0
>
>
> This is a proposal to improve the DFS cluster startup process.
> The data node startup procedures were described and implemented in
> HADOOP-124. I'm trying to extend them to the name node here.
>
> The main idea is to introduce a safe mode, which can be entered:
> manually, for administration purposes;
> automatically, when the number of active data nodes falls below a
> configurable threshold;
> or at startup, when the name node stays in safe mode until the minimal
> limit of active nodes is reached.
>
> These are high-level requirements intended to improve name node and
> cluster reliability.
>
> = Safe mode means that the name node is not changing the state of the
>   file system. Metadata is read-only, and block replication and removal
>   do not take place.
> = In safe mode the name node accepts data node registrations and
>   processes their block reports.
> = The name node always starts in safe mode and stays in it until the
>   majority (a configurable parameter: safemode.threshold) of data nodes
>   (or blocks?) has reported.
> = The name node can also fall into safe mode when the number of
>   non-active data nodes (those whose heartbeats have stopped coming in)
>   becomes critical.
> = The startup "silent period", during which the name node is in safe
>   mode and does not issue any block requests to the data nodes, is
>   initially set to a configurable value, safemode.timeout.increment.
>   At the end of the timeout the name node checks safemode.threshold and
>   decides whether to switch to normal mode or to stay in safe mode.
>   If the normal mode criteria are not met, the silent period is extended
>   by adding another increment to the safe mode timeout.
> = The name node stays in safe mode no longer than a configurable value,
>   safemode.timeout.max. If that limit is reached, it logs the missing
>   data nodes and shuts itself down.
> = When the name node switches to normal mode it checks whether all
>   required data nodes have actually registered, based on the list of
>   active data storages from the last session. It then logs the missing
>   nodes, if any, and starts replicating and/or deleting blocks as
>   required.
> = A historical list of the data storages (nodes) ever registered with
>   the cluster is stored persistently in the image and log files.
>   The list is used in two ways:
>   a) at startup, to verify whether all nodes have registered and to
>      report missing nodes;
>   b) at runtime, when a data node registers with a new storage id, the
>      name node verifies that no new blocks are reported from that
>      storage. This prevents data nodes from a different cluster from
>      being connected accidentally.
> = The name node should have an option to run in safe mode. Starting it
>   with that option means it never leaves safe mode.
>   This is useful for testing the cluster.
> = Data nodes that cannot connect to the name node for a long
>   (configurable) time should shut themselves down.
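
The startup silent-period rules above (check safemode.threshold at the end of each timeout, extend by safemode.timeout.increment, give up at safemode.timeout.max) can be sketched as a small simulation. This is only an illustration of the proposal, not code from any patch: the names SafeModeConfig and run_startup_safe_mode are hypothetical, times are simulated seconds rather than a real clock, and it assumes that the threshold is a fraction of the data nodes known from the last session and that reaching it exactly is enough to leave safe mode.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class SafeModeConfig:
    # Hypothetical holder for the three parameters named in the proposal.
    threshold: float          # safemode.threshold (assumed: fraction of nodes)
    timeout_increment: float  # safemode.timeout.increment, in seconds
    timeout_max: float        # safemode.timeout.max, in seconds


def run_startup_safe_mode(
    config: SafeModeConfig,
    expected_nodes: Set[str],
    reported_at: Dict[str, float],
) -> Tuple[str, float, List[str]]:
    """Simulate the startup silent period.

    expected_nodes: storages known from the last session.
    reported_at: node id -> simulated time of its first block report.
    Returns (outcome, elapsed_seconds, missing_nodes), where outcome is
    "normal" or "shutdown".
    """
    deadline = config.timeout_increment
    while True:
        # Nodes that have reported by the end of the current silent period.
        reported = {
            node for node, t in reported_at.items()
            if node in expected_nodes and t <= deadline
        }
        missing = sorted(expected_nodes - reported)
        fraction = len(reported) / len(expected_nodes) if expected_nodes else 1.0

        if fraction >= config.threshold:
            # Criteria met: leave safe mode; missing nodes would be logged.
            return ("normal", deadline, missing)
        if deadline >= config.timeout_max:
            # Limit reached: log the missing nodes and shut down.
            return ("shutdown", deadline, missing)
        # Criteria not met: extend the silent period by one increment.
        deadline = min(deadline + config.timeout_increment, config.timeout_max)
```

For example, with a threshold of 0.75, an increment of 30 and a maximum of 120, four expected nodes of which three report at 5, 20 and 45 seconds would leave safe mode after the second period (60 seconds) with one node logged as missing. If only one node ever reports, the name node would shut down at 120 seconds.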