[
http://issues.apache.org/jira/browse/HADOOP-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
dhruba borthakur updated HADOOP-681:
------------------------------------
Attachment: (was: nodedecommission3.patch)
> Administrative hook to pull live nodes out of an HDFS cluster
> -------------------------------------------------------------
>
> Key: HADOOP-681
> URL: http://issues.apache.org/jira/browse/HADOOP-681
> Project: Hadoop
> Issue Type: New Feature
> Components: dfs
> Affects Versions: 0.8.0
> Reporter: dhruba borthakur
> Assigned To: dhruba borthakur
> Attachments: nodedecommission5.patch
>
>
> Introduction
> ------------
> An administrator sometimes needs to bring down a datanode for scheduled
> maintenance. It would be useful if HDFS could be informed of this event in
> advance. On receipt of this event, HDFS can take steps so that no HDFS data
> is lost when the node goes down at a later time.
> Architecture
> -----------
> In the existing architecture, a datanode can be in one of two states: dead or
> alive. A datanode is alive if its heartbeats are being processed by the
> namenode. Otherwise that datanode is in dead state. We extend the
> architecture to introduce the concept of a tranquil state for a datanode.
> A datanode is in tranquil state if:
> - it cannot be a target for replicating any blocks
> - any block replica that it currently contains does not count towards the
> target-replication-factor of that block
> Thus, a node that is in tranquil state can be brought down without impacting
> the guarantees provided by HDFS.
> The tranquil state is not persisted across namenode restarts. If the namenode
> restarts, a tranquil datanode goes back to being in the dead or alive state.
> The datanode itself is completely unaware that it has been labeled as being
> in tranquil state. It can continue to heartbeat and serve read requests for
> datablocks.
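The two rules that define the tranquil state can be sketched as a small model. This is only an illustration: the class name, the enum, and the decision to treat a being-tranquiled node like a tranquil one for counting purposes are assumptions, not part of the proposal.

```java
import java.util.List;

/** Hypothetical model of the proposed datanode states; all names are illustrative. */
final class TranquilStateSketch {
    enum NodeState { ALIVE, BEING_TRANQUILED, TRANQUIL, DEAD }

    /** A node may receive new replicas only while it is plainly alive. */
    static boolean canBeReplicationTarget(NodeState state) {
        return state == NodeState.ALIVE;
    }

    /** Tranquil nodes keep heartbeating and serving reads; only dead nodes cannot. */
    static boolean canServeReads(NodeState state) {
        return state != NodeState.DEAD;
    }

    /** Counts only the replicas that contribute to the target replication factor. */
    static int effectiveReplicas(List<NodeState> replicaHolders) {
        int count = 0;
        for (NodeState state : replicaHolders) {
            // Assumption: replicas on being-tranquiled nodes are already discounted.
            if (state == NodeState.ALIVE) {
                count++;
            }
        }
        return count;
    }

    /** A block needs more replicas when its effective count is below the target. */
    static boolean needsReplication(List<NodeState> replicaHolders, int targetReplication) {
        return effectiveReplicas(replicaHolders) < targetReplication;
    }
}
```

With three replicas of which one sits on a tranquil node, the effective count is two, so a block with a target replication factor of three would be scheduled for one more copy.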
> DFSShell Design
> -----------------------
> We extend the DFS Shell utility to specify a list of nodes to the namenode.
> hadoop dfs -tranquil {set|clear|get} datanodename1 [,datanodename2]
> The DFSShell utility sends this list to the namenode. This DFSShell command
> invoked with the "set" option completes when the list is transferred to the
> namenode. This command is non-blocking; it returns before the datanode is
> actually in the tranquil state. The client can then query the state by
> re-issuing the command with the "get" option. This option will indicate
> whether the datanode is in tranquil state or is "being tranquiled". The
> "clear" option is used to transition a tranquil datanode to the alive state.
> The "clear" option is a no-op if the datanode is not in the "tranquil" state.
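As a rough sketch of how the shell side might parse the proposed command line, assuming the arguments arrive as a plain string array and that datanode names may be separated by commas, whitespace, or both. The class TranquilCommandSketch and its fields are hypothetical and do not exist in DFSShell.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical parser for "-tranquil {set|clear|get} datanode1[,datanode2]". */
final class TranquilCommandSketch {
    enum Action { SET, CLEAR, GET }

    final Action action;
    final List<String> datanodes;

    private TranquilCommandSketch(Action action, List<String> datanodes) {
        this.action = action;
        this.datanodes = datanodes;
    }

    static TranquilCommandSketch parse(String[] args) {
        if (args.length < 3 || !"-tranquil".equals(args[0])) {
            throw new IllegalArgumentException(
                "usage: -tranquil {set|clear|get} datanode1[,datanode2]");
        }
        Action action;
        try {
            action = Action.valueOf(args[1].toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown option: " + args[1], e);
        }
        // Accept names separated by commas within an argument or spread over arguments.
        List<String> nodes = new ArrayList<>();
        for (int i = 2; i < args.length; i++) {
            for (String name : args[i].split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    nodes.add(trimmed);
                }
            }
        }
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one datanode name is required");
        }
        return new TranquilCommandSketch(action, Collections.unmodifiableList(nodes));
    }
}
```

The parsed list would then be handed to the namenode in a single call, which matches the non-blocking behaviour described above: the command returns once the list is delivered, not once the nodes are tranquil.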
> ClientProtocol Design
> --------------------
> The ClientProtocol is the protocol exported by the namenode for its client.
> This protocol is extended to incorporate three new methods:
> ClientProtocol.setTranquil(String[] datanodes)
> ClientProtocol.getTranquil(String datanode)
> ClientProtocol.clearTranquil(String[] datanodes)
> The ProtocolVersion is incremented to prevent conversations between
> incompatible clients and servers. An old DFSShell cannot talk to the new
> NameNode and vice versa.
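The three methods and the version check could look roughly like the following. The proposal gives only method names and parameter types, so the return types, the TranquilStatus enum, the exception declarations, and the version numbers below are all assumptions made for illustration.

```java
import java.io.IOException;

/** Hypothetical shape of the proposed ClientProtocol additions. */
final class ClientProtocolSketch {
    /** Assumed result type for getTranquil(); the proposal does not specify one. */
    enum TranquilStatus { NOT_TRANQUIL, BEING_TRANQUILED, TRANQUIL }

    /** The three proposed methods; return types and exceptions are assumptions. */
    interface TranquilOperations {
        void setTranquil(String[] datanodes) throws IOException;

        TranquilStatus getTranquil(String datanode) throws IOException;

        void clearTranquil(String[] datanodes) throws IOException;
    }

    /** Illustrative version numbers only; not the real protocol versions. */
    static final long OLD_VERSION = 1L;
    static final long NEW_VERSION = 2L;

    /** Strict equality: a mismatch in either direction refuses the conversation. */
    static boolean versionsCompatible(long clientVersion, long serverVersion) {
        return clientVersion == serverVersion;
    }
}
```

Requiring exact equality, rather than "client version at most server version", is what makes the restriction symmetric: an old shell is rejected by a new namenode, and a new shell is rejected by an old one.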
> NameNode Design
> -------------------------
> The namenode does the bulk of the work for supporting this new feature.
> The DatanodeInfo object has a new private member named "state". It also has
> three new member functions:
> datanodeInfo.tranquilStarted(): start the process of tranquilization
> datanodeInfo.tranquilCompleted(): node is now in tranquil state
> datanodeInfo.clearTranquil(): remove tranquilization from node
> The namenode exposes a new API to set and clear tranquil states for a
> datanode. On receipt of a "set tranquil" command, it invokes
> datanodeInfo.tranquilStarted().
> The FSNamesystem.chooseTarget() method skips over datanodes that are marked
> as being in the "tranquil" state. This ensures that tranquil datanodes are
> never chosen as targets of replication. The namenode does *not* record
> the tranquil state in either the FsImage or the EditLogs.
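The filtering step in target selection might be sketched as below. This is not the real FSNamesystem.chooseTarget(), which weighs other factors as well; the Node class and the in-order selection are simplifications assumed here to isolate the "skip tranquil nodes" rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified target selection that skips tranquil datanodes. */
final class ChooseTargetSketch {
    static final class Node {
        final String name;
        final boolean tranquil; // true for tranquil or being-tranquiled nodes (assumption)

        Node(String name, boolean tranquil) {
            this.name = name;
            this.tranquil = tranquil;
        }
    }

    /** Picks up to replicasNeeded node names in list order, never a tranquil node. */
    static List<String> chooseTargets(List<Node> liveNodes, int replicasNeeded) {
        List<String> targets = new ArrayList<>();
        for (Node node : liveNodes) {
            if (targets.size() >= replicasNeeded) {
                break;
            }
            if (node.tranquil) {
                continue; // tranquil nodes are never replication targets
            }
            targets.add(node.name);
        }
        return targets;
    }
}
```

If fewer eligible nodes exist than replicas requested, the returned list is simply shorter, and the caller has to decide whether to retry later.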
> The namenode puts all the blocks from a being-tranquiled node into the
> neededReplication data structure. Necessary code changes are made to ensure
> that these blocks get replicated by the regular replication method. As of
> now, the regular replication code does not distinguish between these blocks
> and the blocks that are replication candidates because some other datanode
> has died. It might be prudent to give a different (lower?) priority to
> this type of replication request, but that exercise is deferred to a later
> date.
> The DatanodeInfo object has another new private member named
> "pendingTranquilCount". This field stores the number of blocks that still
> remain to be replicated. This field is valid only if the node is in the
> being-tranquiled state. On receipt of every 'n' heartbeats from the
> being-tranquiled datanode, the namenode calculates the amount of data that is
> still remaining to be replicated and updates the "pendingTranquilCount" in
> the DatanodeInfo. When all the replications complete, the datanode is marked
> as tranquiled. The number 'n' is selected in such a way that the average
> heartbeat processing time does not increase appreciably.
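The "recompute only every n heartbeats" idea can be sketched as follows. The class TranquilProgressTracker and the IntSupplier callback standing in for the namenode's block scan are assumptions; the proposal names only the pendingTranquilCount field and the interval 'n'.

```java
import java.util.function.IntSupplier;

/** Hypothetical per-datanode tracker for the being-tranquiled state. */
final class TranquilProgressTracker {
    private final int checkInterval; // the 'n' from the description
    private int heartbeatsSinceCheck;
    private int pendingTranquilCount;
    private boolean tranquil;

    TranquilProgressTracker(int checkInterval, int initialPendingBlocks) {
        if (checkInterval < 1) {
            throw new IllegalArgumentException("checkInterval must be at least 1");
        }
        this.checkInterval = checkInterval;
        this.pendingTranquilCount = initialPendingBlocks;
    }

    /**
     * Called for each heartbeat. The potentially expensive pendingBlockCounter
     * is invoked only on every n-th heartbeat to keep heartbeat handling cheap.
     */
    void onHeartbeat(IntSupplier pendingBlockCounter) {
        if (tranquil) {
            return;
        }
        heartbeatsSinceCheck++;
        if (heartbeatsSinceCheck < checkInterval) {
            return;
        }
        heartbeatsSinceCheck = 0;
        pendingTranquilCount = pendingBlockCounter.getAsInt();
        if (pendingTranquilCount == 0) {
            tranquil = true; // all replications complete
        }
    }

    int getPendingTranquilCount() {
        return pendingTranquilCount;
    }

    boolean isTranquil() {
        return tranquil;
    }
}
```

A larger interval lowers the per-heartbeat cost but delays the moment at which the node is reported as tranquil by up to n heartbeat periods.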
> It is possible that the namenode might stop receiving heartbeats from a
> datanode that is being-tranquiled. In this case, the tranquil flag of the
> datanode gets cleared. It transitions to the dead state and the normal
> processing for an alive-to-dead transition occurs.
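Putting the DatanodeInfo methods and the heartbeat-loss rule together, the state transitions might be modelled like this. The enum, the heartbeatLost() method, and the choice to ignore calls made in the wrong state are assumptions for illustration; only tranquilStarted(), tranquilCompleted(), and clearTranquil() are named in the proposal.

```java
/** Hypothetical state machine for the "state" member proposed for DatanodeInfo. */
final class DatanodeStateSketch {
    enum State { ALIVE, BEING_TRANQUILED, TRANQUIL, DEAD }

    private State state = State.ALIVE;

    State getState() {
        return state;
    }

    /** Begins tranquilization; ignored unless the node is alive (assumption). */
    void tranquilStarted() {
        if (state == State.ALIVE) {
            state = State.BEING_TRANQUILED;
        }
    }

    /** Marks the node tranquil once all its blocks have been re-replicated. */
    void tranquilCompleted() {
        if (state == State.BEING_TRANQUILED) {
            state = State.TRANQUIL;
        }
    }

    /** Returns a tranquil node to alive; a no-op in any other state. */
    void clearTranquil() {
        if (state == State.TRANQUIL) {
            state = State.ALIVE;
        }
    }

    /** Heartbeats stopped: the tranquil flag is dropped and the node is dead. */
    void heartbeatLost() {
        state = State.DEAD;
    }
}
```

Because the dead state carries no tranquil flag, a node that later rejoins would start as an ordinary alive node and would need a fresh "set" request to be tranquiled again.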
> Web Interface
> -------------------
> The dfshealth.jsp displays the live nodes, dead nodes, being-tranquiled nodes
> and tranquil nodes. For nodes in the being-tranquiled state, it displays the
> percentage of tranquilization completed so far.
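One plausible way to derive the displayed percentage, assuming it is based on block counts: the total number of blocks on the node when tranquilization started and the current pendingTranquilCount. The proposal does not say how the figure is computed, so the helper below is purely hypothetical.

```java
/** Hypothetical helper for the progress figure shown on the health page. */
final class TranquilPercentSketch {
    /**
     * Returns the completed share as a whole percentage, rounded down.
     * A node with no blocks to move is treated as 100% complete.
     */
    static int percentComplete(long totalBlocks, long pendingBlocks) {
        if (totalBlocks <= 0) {
            return 100;
        }
        // Clamp so stale or inconsistent counts never yield a value outside 0..100.
        long clamped = Math.max(0L, Math.min(pendingBlocks, totalBlocks));
        return (int) ((totalBlocks - clamped) * 100L / totalBlocks);
    }
}
```

For example, a node that started with 200 blocks and still has 50 pending would be shown as 75 percent complete.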
> Issues
> --------
> 1. If a request for tranquilization starts getting processed and there isn't
> enough space available in DFS to complete the necessary replication, then
> that node might remain in the being-tranquiled state for a very long time.
> This is not necessarily a bad thing, but is there a better option?
> 2. We have opted not to store cluster configuration information in the
> persistent image of the file system. (The tranquil state of a datanode may be
> lost if the namenode restarts.)
>