    [ http://issues.apache.org/jira/browse/HADOOP-681?page=comments#action_12454736 ]

dhruba borthakur commented on HADOOP-681:
-----------------------------------------
Hi Konstantin, thanks for your comments. My responses are marked with <****>.

1. Index: src/java/org/apache/hadoop/dfs/ClientProtocol.java
   public static final long versionID = 4L; you should write a comment.
   <****> Done.

2. Index: src/java/org/apache/hadoop/dfs/DFSClient.java
   Maybe it is better to use DatanodeID instead of a mere String.
   <****> The user can specify either a name or a name:port. The code matches
   both formats, which is why the user input is accepted as a String rather
   than a DatanodeID (see the sketch below).

3. Index: src/java/org/apache/hadoop/dfs/DFSAdmin.java
   decommission() does not document the return value; boolean mode is never
   used; final String safeModeUsage is never used; the decommission usage
   string does not specify the data-node parameters.
   <****> Done.

4. Index: src/java/org/apache/hadoop/dfs/FSNamesystem.java
   decommissionInProgress() and replicationInProgress() should start with
   is***(); in startDecommission() and stopDecommission() it is better to call
   the public method getName(); Block decommissionblocks[] should be
   decommissionBlocks.
   <****> Done.

5. Index: src/java/org/apache/hadoop/dfs/DatanodeDescriptor.java
   decommissioned() should start with is***(). I don't think constants such as
   public static final int NORMAL = 0; are used anywhere in the code, and they
   could be confused with the enum values having the same names.
   <****> Done.

6. I propose to rename AdminStates to DecommissionState and eliminate the
   NORMAL state, replacing it by null where applicable, with a clear (imo)
   semantics: no decommission, no state.
   <****> I have, on purpose, kept the API a little more generic. In the
   future there could be more administrative states for datanodes, e.g.
   read-only datanodes (?); a rough sketch of that shape is included below.

7. Index: src/java/org/apache/hadoop/dfs/DatanodeReport.java
   I don't think this class should be introduced at all. DatanodeReport
   effectively returns an entire DatanodeDescriptor.
   <****> There is one subtle difference between DatanodeReport and
   DatanodeInfo: their serialization methods. The Namenode serializes
   DatanodeInfo while writing it to the FsImage, and I do not want any FsImage
   format change at present; persistence of adminState will be bundled in with
   other FsImage changes at a later date. The DatanodeReport class has to
   report the adminState to the UI, so it has to serialize the adminState too.

Please let me know if my counter-proposals sound ok.
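For illustration, the matching in item 2 has roughly the following shape. This
is a simplified sketch, not the actual patch code; the class and method names
are made up for the example.

  // The user may pass either "host" or "host:port"; both forms are matched
  // against the registered datanodes.
  import java.util.List;

  class DatanodeNameMatcher {

    /** True if the user-supplied name identifies a node registered as host:port. */
    static boolean matches(String userSupplied, String registeredHost, int registeredPort) {
      if (userSupplied.indexOf(':') >= 0) {
        // "host:port" form: require an exact match on both parts
        return userSupplied.equals(registeredHost + ":" + registeredPort);
      }
      // bare "host" form: match on the host name alone
      return userSupplied.equals(registeredHost);
    }

    /** Returns the first registered "host:port" entry matching the name, or null. */
    static String findMatch(String userSupplied, List<String> registeredNodes) {
      for (String node : registeredNodes) {
        int idx = node.lastIndexOf(':');
        String host = node.substring(0, idx);
        int port = Integer.parseInt(node.substring(idx + 1));
        if (matches(userSupplied, host, port)) {
          return node;
        }
      }
      return null;
    }
  }

With something like this, both "host1" and "host1:1234" select the same
registered datanode.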
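Similarly, for item 6, the more generic shape I have in mind looks roughly
like the following. Again a simplified sketch: only the decommission-related
values are implied by the patch, and anything beyond them (such as a read-only
state) is hypothetical.

  // Keeping the administrative state as a small enum leaves room for future
  // states without changing the surrounding API.
  class DatanodeAdminStateSketch {

    enum AdminStates {
      NORMAL,                   // ordinary, fully participating datanode
      DECOMMISSION_INPROGRESS,  // blocks are being re-replicated elsewhere
      DECOMMISSIONED            // all blocks re-replicated; safe to take down
      // a future READ_ONLY state could be added here
    }

    private AdminStates adminState = AdminStates.NORMAL;

    boolean isDecommissionInProgress() {
      return adminState == AdminStates.DECOMMISSION_INPROGRESS;
    }

    boolean isDecommissioned() {
      return adminState == AdminStates.DECOMMISSIONED;
    }

    void startDecommission() { adminState = AdminStates.DECOMMISSION_INPROGRESS; }

    void stopDecommission()  { adminState = AdminStates.NORMAL; }
  }

Renaming the enum to DecommissionState would tie it to decommissioning alone,
which is what I would like to avoid.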
> Administrative hook to pull live nodes out of a HDFS cluster
> ------------------------------------------------------------
>
>                 Key: HADOOP-681
>                 URL: http://issues.apache.org/jira/browse/HADOOP-681
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>    Affects Versions: 0.8.0
>            Reporter: dhruba borthakur
>         Assigned To: dhruba borthakur
>         Attachments: nodedecommission2.patch
>
>
> Introduction
> ------------
> An administrator sometimes needs to bring down a datanode for scheduled
> maintenance. It would be nice if HDFS could be informed about this event so
> that, on receipt of it, HDFS can take steps to ensure that no data is lost
> when the node goes down at a later time.
>
> Architecture
> ------------
> In the existing architecture, a datanode can be in one of two states: dead
> or alive. A datanode is alive if its heartbeats are being processed by the
> namenode; otherwise it is in the dead state. We extend the architecture to
> introduce the concept of a tranquil state for a datanode.
> A datanode is in the tranquil state if:
> - it cannot be a target for replicating any blocks, and
> - any block replica that it currently contains does not count towards the
>   target replication factor of that block.
> Thus, a node that is in the tranquil state can be brought down without
> impacting the guarantees provided by HDFS.
> The tranquil state is not persisted across namenode restarts. If the
> namenode restarts, the datanode goes back to being in the dead or alive
> state. The datanode itself is completely oblivious to the fact that it has
> been labeled as tranquil: it can continue to heartbeat and serve read
> requests for data blocks.
>
> DFSShell Design
> ---------------
> We extend the DFSShell utility to pass a list of nodes to the namenode:
>
>   hadoop dfs -tranquil {set|clear|get} datanodename1 [,datanodename2]
>
> The DFSShell utility sends this list to the namenode. Invoked with the "set"
> option, the command completes when the list has been transferred to the
> namenode. The command is non-blocking: it returns before the datanode is
> actually in the tranquil state. The client can then query the state by
> re-issuing the command with the "get" option, which indicates whether the
> datanode is in the tranquil state or is "being tranquiled". The "clear"
> option is used to transition a tranquil datanode back to the alive state; it
> is a no-op if the datanode is not in the tranquil state.
>
> ClientProtocol Design
> ---------------------
> The ClientProtocol is the protocol exported by the namenode to its clients.
> This protocol is extended with three new methods:
>
>   ClientProtocol.setTranquil(String[] datanodes)
>   ClientProtocol.getTranquil(String datanode)
>   ClientProtocol.clearTranquil(String[] datanodes)
>
> The protocol version is incremented to prevent conversations between
> incompatible clients and servers: an old DFSShell cannot talk to the new
> NameNode and vice versa.
>
> NameNode Design
> ---------------
> The namenode does the bulk of the work for supporting this new feature.
> The DatanodeInfo object has a new private member named "state". It also has
> three new member functions:
>
>   datanodeInfo.tranquilStarted()   : start the process of tranquilization
>   datanodeInfo.tranquilCompleted() : the node is now in the tranquil state
>   datanodeInfo.clearTranquil()     : remove tranquilization from the node
>
> The namenode exposes a new API to set and clear tranquil states for a
> datanode. On receipt of a "set tranquil" command, it invokes
> datanodeInfo.tranquilStarted().
> The FSNamesystem.chooseTarget() method skips over datanodes that are marked
> as being in the tranquil state. This ensures that tranquil datanodes are
> never chosen as targets of replication. The namenode does *not* record this
> operation in either the FsImage or the EditLogs.
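> For concreteness, a rough sketch of the skipping behaviour just described.
> The interface and method names below are simplified assumptions for the
> illustration, not the actual FSNamesystem code.
>
>   // Illustrative only: tranquil (decommissioning) nodes are never chosen
>   // as replication targets; full nodes are skipped as well.
>   import java.util.ArrayList;
>   import java.util.List;
>
>   class TargetChooserSketch {
>
>     interface NodeView {
>       boolean isTranquil();       // tranquil or being-tranquiled
>       long getRemainingBytes();   // free space on the node
>     }
>
>     /** Picks up to 'needed' targets, skipping tranquil or full nodes. */
>     static List<NodeView> chooseTargets(List<NodeView> candidates,
>                                         int needed, long blockSize) {
>       List<NodeView> chosen = new ArrayList<NodeView>();
>       for (NodeView node : candidates) {
>         if (chosen.size() >= needed) {
>           break;                                // enough targets found
>         }
>         if (node.isTranquil()) {
>           continue;                             // never target a tranquil node
>         }
>         if (node.getRemainingBytes() < blockSize) {
>           continue;                             // not enough room for the block
>         }
>         chosen.add(node);
>       }
>       return chosen;
>     }
>   }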
> The namenode puts all the blocks from a being-tranquiled node into the
> neededReplication data structure, and the necessary code changes are made to
> ensure that these blocks get replicated by the regular replication method.
> As of now, the regular replication code does not distinguish between these
> blocks and blocks that are replication candidates because some other
> datanode has died. It might be prudent to give a different (lower?)
> weightage to this type of replication request, but that exercise is deferred
> to a later date. In this design, replication requests generated because a
> node is going to the tranquil state are not distinguished from replication
> requests generated by a datanode going to the dead state.
> The DatanodeInfo object has another new private member named
> "pendingTranquilCount". This field stores the number of blocks that still
> remain to be replicated, and it is valid only while the node is in the
> being-tranquiled state. On receipt of every 'n' heartbeats from the
> being-tranquiled datanode, the namenode recalculates the amount of data that
> still remains to be replicated and updates "pendingTranquilCount" in the
> DatanodeInfo. When all the replications complete, the datanode is marked as
> tranquiled. The number 'n' is selected so that the average heartbeat
> processing time does not increase appreciably. (A rough sketch of this check
> appears at the end of this description.)
> It is possible that the namenode stops receiving heartbeats from a datanode
> that is being-tranquiled. In this case, the tranquil flag of the datanode is
> cleared; the node transitions to the dead state and the normal processing
> for the alive-to-dead transition takes place.
>
> Web Interface
> -------------
> The dfshealth.jsp page displays the live nodes, dead nodes, being-tranquiled
> nodes and tranquil nodes. For nodes in the being-tranquiled state, it
> displays the percentage of tranquilization completed so far.
>
> Issues
> ------
> 1. If a request for tranquilization starts getting processed and there is
>    not enough space available in DFS to complete the necessary replication,
>    that node might remain in the being-tranquiled state for a very long
>    time. This is not necessarily a bad thing, but is there a better option?
> 2. We have opted not to store cluster configuration information in the
>    persistent image of the file system. (The tranquil state of a datanode
>    may be lost if the namenode restarts.)
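> For concreteness, a rough sketch of the heartbeat-driven progress check
> described in the NameNode Design section above. The names and the recheck
> interval are assumptions for the illustration, not the actual patch code.
>
>   // Every n-th heartbeat from a being-tranquiled node, recount the blocks
>   // that still need replication; the node is marked tranquil once the
>   // count reaches zero.
>   class TranquilProgressSketch {
>
>     private static final int RECHECK_INTERVAL = 20;  // 'n' (assumed value)
>
>     private int heartbeatsSinceCheck = 0;
>     private int pendingTranquilCount = Integer.MAX_VALUE;  // unknown until first recheck
>
>     interface BlockCounter {
>       /** Counts blocks on this node not yet sufficiently replicated elsewhere. */
>       int countPendingBlocks();
>     }
>
>     /** Called for each heartbeat from a being-tranquiled node; true once tranquil. */
>     boolean onHeartbeat(BlockCounter counter) {
>       heartbeatsSinceCheck++;
>       if (heartbeatsSinceCheck < RECHECK_INTERVAL) {
>         return false;                           // skip the expensive recount this time
>       }
>       heartbeatsSinceCheck = 0;
>       pendingTranquilCount = counter.countPendingBlocks();
>       return pendingTranquilCount == 0;         // node can now be marked tranquil
>     }
>   }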