[ http://issues.apache.org/jira/browse/HADOOP-681?page=comments#action_12447543 ]

Konstantin Shvachko commented on HADOOP-681:
--------------------------------------------

1. I'd replace the 3 new methods in ClientProtocol with one method that takes a three-valued parameter.
2. You do not need a new member for counting already re-replicated blocks. You can count blocks in the DatanodeDescriptor. When the set of blocks is empty, replication is finished.
3. The name-node can then send the shutdown command to the data-node (in reply to the heartbeat). The command is implemented but has not been used yet.
4. Inside the name-node, DatanodeDescriptor should be used instead of DatanodeInfo.

> Adminstrative hook to pull live nodes out of a HDFS cluster
> -----------------------------------------------------------
>
> Key: HADOOP-681
> URL: http://issues.apache.org/jira/browse/HADOOP-681
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Affects Versions: 0.8.0
> Reporter: dhruba borthakur
> Assigned To: dhruba borthakur
>
> Introduction
> ------------
> An administrator sometimes needs to bring down a datanode for scheduled
> maintenance. It would be nice if HDFS could be informed about this event. On
> receipt of this event, HDFS can take steps so that HDFS data is not lost when
> the node goes down at a later time.
> Architecture
> ------------
> In the existing architecture, a datanode can be in one of two states: dead or
> alive. A datanode is alive if its heartbeats are being processed by the
> namenode. Otherwise that datanode is in dead state. We extend the
> architecture to introduce the concept of a tranquil state for a datanode.
> A datanode is in tranquil state if:
> - it cannot be a target for replicating any blocks
> - any block replica that it currently contains does not count towards the
>   target-replication-factor of that block
> Thus, a node that is in tranquil state can be brought down without impacting
> the guarantees provided by HDFS.
> The tranquil state is not persisted across namenode restarts. If the namenode
> restarts then that datanode will go back to being in the dead or alive state.
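
The state rules described so far can be sketched in Java roughly as follows. This is only an illustration of the two tranquil-state conditions and of the non-persistence across namenode restarts. The names NodeState, canBeReplicationTarget, replicasCountTowardTarget and afterNamenodeRestart are hypothetical and appear neither in the proposal nor in Hadoop.

```java
/**
 * Hypothetical model of the three datanode states named in the proposal.
 * Not Hadoop code; all names here are invented for illustration.
 */
enum NodeState {
    DEAD, ALIVE, TRANQUIL;

    /** A tranquil node cannot be a target for replicating any blocks. */
    boolean canBeReplicationTarget() {
        // Assumption: a dead node is not a usable target either.
        return this == ALIVE;
    }

    /** Replicas on a tranquil node do not count toward the target replication factor. */
    boolean replicasCountTowardTarget() {
        // Assumption: replicas on a dead node do not count either.
        return this == ALIVE;
    }

    /**
     * The tranquil state is not persisted, so after a namenode restart a node
     * is simply alive (heartbeats are being processed) or dead.
     */
    NodeState afterNamenodeRestart(boolean heartbeatsProcessed) {
        return heartbeatsProcessed ? ALIVE : DEAD;
    }
}
```

With this model, a tranquil node that is still heartbeating when the namenode restarts comes back as ALIVE, which is why an administrator would have to re-issue the request after a restart.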
> The datanode is completely unaware of the fact that it has been labeled
> as being in tranquil state. It can continue to heartbeat and serve read
> requests for datablocks.
> DFSShell Design
> ---------------
> We extend the DFS Shell utility to specify a list of nodes to the namenode.
>   hadoop dfs -tranquil {set|clear|get} datanodename1 [,datanodename2]
> The DFSShell utility sends this list to the namenode. This DFSShell command
> invoked with the "set" option completes when the list is transferred to the
> namenode. This command is non-blocking; it returns before the datanode is
> actually in the tranquil state. The client can then query the state by
> re-issuing the command with the "get" option. This option will indicate
> whether the datanode is in tranquil state or is "being tranquiled". The
> "clear" option is used to transition a tranquil datanode to the alive state.
> The "clear" option is a no-op if the datanode is not in the "tranquil" state.
> ClientProtocol Design
> ---------------------
> The ClientProtocol is the protocol exported by the namenode for its client.
> This protocol is extended to incorporate three new methods:
>   ClientProtocol.setTranquil(String[] datanodes)
>   ClientProtocol.getTranquil(String datanode)
>   ClientProtocol.clearTranquil(String[] datanodes)
> The ProtocolVersion is incremented to prevent conversations between
> incompatible clients and servers. An old DFSShell cannot talk to the new
> NameNode and vice-versa.
> NameNode Design
> ---------------
> The namenode does the bulk of the work for supporting this new feature.
> The DatanodeInfo object has a new private member named "state".
> It also has three new member functions:
>   datanodeInfo.tranquilStarted(): start the process of tranquilization
>   datanodeInfo.tranquilCompleted(): node is now in tranquil state
>   datanodeInfo.clearTranquil(): remove tranquilization from node
> The namenode exposes a new API to set and clear tranquil states for a
> datanode. On receipt of a "set tranquil" command, it invokes
> datanodeInfo.tranquilStarted().
> The FSNamesystem.chooseTarget() method skips over datanodes that are marked
> as being in the "tranquil" state. This ensures that tranquil datanodes are
> never chosen as targets of replication. The namenode does *not* record
> this operation in either the FsImage or the EditLogs.
> The namenode puts all the blocks from a being-tranquiled node into the
> neededReplication data structure. Necessary code changes are made to ensure
> that these blocks get replicated by the regular replication method. As of
> now, the regular replication code does not distinguish between these blocks
> and the blocks that are replication candidates because some other datanode
> might have died. It might be prudent to give a different (lower?) priority to
> this type of replication request, but that exercise is deferred to a later
> date. In this design, replication requests generated because of a node going
> to a tranquil state are not distinguished from replication requests generated
> by a datanode going to the dead state.
> The DatanodeInfo object has another new private member named
> "pendingTranquilCount". This field stores the number of blocks that
> remain to be replicated. This field is valid only if the node is in the
> being-tranquiled state. On receipt of every 'n' heartbeats from the
> being-tranquiled datanode, the namenode calculates the amount of data that is
> still remaining to be replicated and updates the "pendingTranquilCount" in
> the DatanodeInfo. When all the replications complete, the datanode is marked
> as tranquiled.
> The number 'n' is selected in such a way that the average
> heartbeat processing time does not increase appreciably.
> It is possible that the namenode might stop receiving heartbeats from a
> datanode that is being-tranquiled. In this case, the tranquil flag of the
> datanode gets cleared. It transitions to the dead state and the normal
> processing for the alive-to-dead transition occurs.
> Web Interface
> -------------
> The dfshealth.jsp displays the live nodes, dead nodes, being-tranquiled and
> tranquil nodes. For nodes in the being-tranquiled state, it displays the
> percentage of tranquilization completed so far.
> Issues
> ------
> 1. If a request for tranquilization starts getting processed and there isn't
> enough space available in DFS to complete the necessary replication, then
> that node might remain in the being-tranquiled state for a long time.
> This is not necessarily a bad thing but is there a better option?
> 2. We have opted for not storing cluster configuration information in the
> persistent image of the file system. (The tranquil state of a datanode may be
> lost if the namenode restarts.)
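
Points 1 and 2 of the comment at the top of this message could look roughly like the Java sketch below: one entry point taking a three-valued parameter instead of three separate methods, and completion detected from an empty block set rather than from a separate counter. TranquilAction and NodeRecord are hypothetical names. NodeRecord is a simplified stand-in and not the real DatanodeDescriptor, and treating the set as "blocks still awaiting re-replication" is an assumption about what the comment intends.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical three-valued parameter replacing setTranquil/getTranquil/clearTranquil. */
enum TranquilAction { SET, CLEAR, GET }

/** Simplified stand-in for the namenode's per-datanode record (not the real DatanodeDescriptor). */
class NodeRecord {
    // Assumption: blocks on this node that still need another replica elsewhere.
    private final Set<Long> blocksAwaitingReplication = new HashSet<Long>();
    private boolean tranquilRequested = false;

    void addBlock(long blockId) { blocksAwaitingReplication.add(blockId); }

    void blockReplicated(long blockId) { blocksAwaitingReplication.remove(blockId); }

    /** Single entry point: applies the action and reports the resulting state. */
    String tranquil(TranquilAction action) {
        switch (action) {
            case SET:   tranquilRequested = true;  break;
            case CLEAR: tranquilRequested = false; break;
            case GET:   break; // query only, no state change
        }
        return status();
    }

    /** No pending counter: an empty set means replication has finished. */
    private String status() {
        if (!tranquilRequested) {
            return "alive";
        }
        return blocksAwaitingReplication.isEmpty() ? "tranquil" : "being tranquiled";
    }
}
```

In this sketch a node reports "being tranquiled" after SET until every block has been removed from the set, at which point GET reports "tranquil". That is also the moment at which, per point 3 of the comment, the name-node could reply to the next heartbeat with the shutdown command.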