sounds good.
On Nov 6, 2006, at 3:10 PM, dhruba borthakur (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-681?page=comments#action_12447576 ]
dhruba borthakur commented on HADOOP-681:
-----------------------------------------
Thanks a bunch: Eric, Yoram & Kons for your comments. Here are my
take-aways:
1. I will change the name of this new state from tranquil to
decommission.
2. I will make the "decommission" a persistent state. This will be
done in a follow-on patch submission (after the disk format upgrade
patch is incorporated).
3. I will create a new command called dfsadmin. It will be invoked as
bin/hadoop --config conf dfsadmin -decommission
4. I will incorporate most of Konstantin's comments. However, the
namenode will not send shutdown command to the datanode after all
replications are completed. In fact, we endeavour to keep that
decommissioned datanode alive and use it (if needed) to serve read
requests.
5. I will create a separate defect to move "dfs -report" and
"dfs -safemode" to the dfsadmin command. I will create another
separate defect to support "dfsadmin -removenode" to completely
remove a datanode's information from the namenode.
Administrative hook to pull live nodes out of a HDFS cluster
-----------------------------------------------------------
Key: HADOOP-681
URL: http://issues.apache.org/jira/browse/HADOOP-681
Project: Hadoop
Issue Type: New Feature
Components: dfs
Affects Versions: 0.8.0
Reporter: dhruba borthakur
Assigned To: dhruba borthakur
Introduction
------------
An administrator sometimes needs to bring down a datanode for
scheduled maintenance. It would be nice if HDFS could be informed
about this event. On receipt of this event, HDFS can take steps so
that HDFS data is not lost when the node goes down at a later time.
Architecture
-----------
In the existing architecture, a datanode can be in one of two
states: dead or alive. A datanode is alive if its heartbeats are
being processed by the namenode. Otherwise that datanode is in
dead state. We extend the architecture to introduce the concept of
a tranquil state for a datanode.
A datanode is in tranquil state if:
- it cannot be a target for replicating any blocks
- any block replica that it currently contains does not count
towards the target-replication-factor of that block
Thus, a node that is in tranquil state can be brought down without
impacting the guarantees provided by HDFS.
The tranquil state is not persisted across namenode restarts. If
the namenode restarts then that datanode will go back to being in
the dead or alive state.
The datanode is completely unaware that it has been labeled as
being in tranquil state. It can continue to heartbeat and serve
read requests for datablocks.
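The rules above can be read as predicates on a node's state. Here is
a minimal sketch, not code from the patch: NodeState and
TranquilStateSketch are made-up names, and treating a being-tranquiled
node the same as a tranquil one for replica counting is my assumption.

```java
import java.util.List;

// Hypothetical sketch: NodeState and TranquilStateSketch are illustrative
// names, not classes from the Hadoop source tree.
enum NodeState {
    ALIVE, DEAD, BEING_TRANQUILED, TRANQUIL;

    // Only a plain alive node may receive new block replicas.
    boolean canBeReplicationTarget() {
        return this == ALIVE;
    }

    // Replicas on tranquil nodes do not count towards the target
    // replication factor. Assumption: the same holds while a node is
    // still being tranquiled.
    boolean countsTowardReplication() {
        return this == ALIVE;
    }

    // Every node that still heartbeats can keep serving reads.
    boolean canServeReads() {
        return this != DEAD;
    }
}

final class TranquilStateSketch {
    // Number of replicas that count towards the replication factor,
    // given the state of each node holding a replica of one block.
    static int countedReplicas(List<NodeState> holders) {
        int counted = 0;
        for (NodeState state : holders) {
            if (state.countsTowardReplication()) {
                counted++;
            }
        }
        return counted;
    }

    // True if the block needs more replicas to reach its target.
    static boolean needsReplication(List<NodeState> holders, int targetReplication) {
        return countedReplicas(holders) < targetReplication;
    }
}
```

With a target of 3, a block held by two alive nodes and one tranquil
node still needs one more replica, which is why the tranquil node can
later be shut down safely.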
DFSShell Design
-----------------------
We extend the DFS Shell utility to specify a list of nodes to the
namenode.
hadoop dfs -tranquil {set|clear|get} datanodename1[,datanodename2]
The DFSShell utility sends this list to the namenode. This
DFSShell command invoked with the "set" option completes when the
list is transferred to the namenode. This command is non-blocking;
it returns before the datanode is actually in the tranquil state.
The client can then query the state by re-issuing the command with
the "get" option. This option will indicate whether the datanode
is in tranquil state or is "being tranquiled". The "clear" option
is used to transition a tranquil datanode to the alive state. The
"clear" option is a no-op if the datanode is not in the "tranquil"
state.
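To make the command syntax concrete, here is a rough sketch of how
the arguments after "-tranquil" might be parsed. TranquilCommandSketch
is a hypothetical class, not part of DFSShell, and the exact error
handling is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the arguments that follow "-tranquil".
final class TranquilCommandSketch {
    enum Action { SET, CLEAR, GET }

    final Action action;
    final List<String> datanodes;

    private TranquilCommandSketch(Action action, List<String> datanodes) {
        this.action = action;
        this.datanodes = datanodes;
    }

    // Expects exactly two arguments: the option and a comma-separated
    // list of datanode names, e.g. {"set", "node1,node2"}.
    static TranquilCommandSketch parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "usage: -tranquil {set|clear|get} datanodename1[,datanodename2]");
        }
        Action action;
        switch (args[0]) {
            case "set":   action = Action.SET;   break;
            case "clear": action = Action.CLEAR; break;
            case "get":   action = Action.GET;   break;
            default:
                throw new IllegalArgumentException("unknown option: " + args[0]);
        }
        List<String> nodes = new ArrayList<>();
        for (String name : args[1].split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                nodes.add(trimmed);
            }
        }
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no datanode names given");
        }
        return new TranquilCommandSketch(action, nodes);
    }
}
```

The parsed list would then be handed to the namenode in a single
call, which matches the non-blocking behaviour described above.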
ClientProtocol Design
--------------------
The ClientProtocol is the protocol exported by the namenode for
its client.
This protocol is extended to incorporate three new methods:
ClientProtocol.setTranquil(String[] datanodes)
ClientProtocol.getTranquil(String datanode)
ClientProtocol.clearTranquil(String[] datanodes)
The ProtocolVersion is incremented to prevent conversations
between incompatible clients and servers. An old DFSShell cannot
talk to the new NameNode and vice-versa.
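As a sketch only, the three methods could look like the interface
below. The proposal does not give return types or exceptions, so the
TranquilStatus enum, the void returns and the IOException clauses are
all assumptions. The in-memory class is a toy stand-in for the
namenode, just to show the intended set/get/clear semantics.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the three new methods; not the real ClientProtocol.
interface TranquilClientProtocolSketch {
    // Assumed result type for getTranquil().
    enum TranquilStatus { NOT_TRANQUIL, BEING_TRANQUILED, TRANQUIL }

    void setTranquil(String[] datanodes) throws IOException;

    TranquilStatus getTranquil(String datanode) throws IOException;

    void clearTranquil(String[] datanodes) throws IOException;
}

// Toy stand-in for the namenode side, keeping state in a map.
final class InMemoryTranquilNamenode implements TranquilClientProtocolSketch {
    private final Map<String, TranquilStatus> states = new HashMap<>();

    // Returns immediately: the node is only marked as being-tranquiled.
    @Override
    public void setTranquil(String[] datanodes) {
        for (String datanode : datanodes) {
            states.putIfAbsent(datanode, TranquilStatus.BEING_TRANQUILED);
        }
    }

    @Override
    public TranquilStatus getTranquil(String datanode) {
        return states.getOrDefault(datanode, TranquilStatus.NOT_TRANQUIL);
    }

    // A no-op for nodes that were never marked. Assumption: clearing a
    // node that is still being tranquiled is also allowed.
    @Override
    public void clearTranquil(String[] datanodes) {
        for (String datanode : datanodes) {
            states.remove(datanode);
        }
    }
}
```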
NameNode Design
-------------------------
The namenode does the bulk of the work for supporting this new
feature.
The DatanodeInfo object has a new private member named "state". It
also has three new member functions:
datanodeInfo.tranquilStarted(): start the process of
tranquilization
datanodeInfo.tranquilCompleted(): node is now in tranquil state
datanodeInfo.clearTranquil(): remove tranquilization from node
The namenode exposes a new API to set and clear tranquil states
for a datanode. On receipt of a "set tranquil" command, it invokes
datanodeInfo.tranquilStarted().
The FSNamesystem.chooseTarget() method skips over datanodes that
are marked as being in the "tranquil" state. This ensures that
tranquil-datanodes are never chosen as targets of replication. The
namenode does *not* record this operation in either the FsImage or
the EditLogs.
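The skip in chooseTarget() amounts to a filter over candidate nodes.
The sketch below shows only that filter. ChooseTargetSketch and its
Node class are hypothetical, and the real FSNamesystem.chooseTarget()
weighs other factors that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified target selection; not FSNamesystem code.
final class ChooseTargetSketch {
    static final class Node {
        final String name;
        final boolean tranquil; // true for tranquil or being-tranquiled nodes

        Node(String name, boolean tranquil) {
            this.name = name;
            this.tranquil = tranquil;
        }
    }

    // Returns up to `wanted` node names in candidate order, never
    // picking a node that is marked tranquil.
    static List<String> chooseTargets(List<Node> candidates, int wanted) {
        List<String> chosen = new ArrayList<>();
        for (Node node : candidates) {
            if (chosen.size() >= wanted) {
                break;
            }
            if (node.tranquil) {
                continue; // skip: must not become a replication target
            }
            chosen.add(node.name);
        }
        return chosen;
    }
}
```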
The namenode puts all the blocks from a being-tranquiled node into
the neededReplication data structure. Necessary code changes are
made to ensure that these blocks get replicated by the regular
replication method. As of now, the regular replication code does
not distinguish between these blocks and the blocks that are
replication candidates because some other datanode might have
died. It might be prudent to give a different (lower?) priority to
this type of replication request, but that exercise is deferred
to a later date. In this design, replication requests generated
because of a node going to a tranquil state are not distinguished
from replication requests generated by a datanode going to the
dead state.
The DatanodeInfo object has another new private member named
"pendingTranquilCount". This field stores the number of blocks
that still remain to be replicated. This field is valid only if
the node is in the being-tranquiled state. On receipt of every
'n' heartbeats from the being-tranquiled datanode, the namenode
calculates the amount of data that is still remaining to be
replicated and updates the "pendingTranquilCount" in the
DatanodeInfo. When all the replications complete, the datanode is
marked as tranquiled. The number 'n' is selected in such a way
that the average heartbeat processing time does not increase
appreciably.
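The every-'n'-heartbeats recount could be sketched roughly as
follows. TranquilProgressSketch is a made-up class, the IntSupplier
stands in for whatever scan the namenode does to count the remaining
blocks, and taking an initial count in the constructor is my
assumption.

```java
import java.util.function.IntSupplier;

// Hypothetical per-datanode progress tracker; not DatanodeInfo itself.
final class TranquilProgressSketch {
    private final int recountInterval;       // the 'n' from the text
    private final IntSupplier pendingBlocks; // stand-in for the namenode's block scan
    private int heartbeatsSinceRecount = 0;
    private int pendingTranquilCount;
    private boolean tranquilCompleted = false;

    TranquilProgressSketch(int recountInterval, IntSupplier pendingBlocks) {
        if (recountInterval <= 0) {
            throw new IllegalArgumentException("recountInterval must be positive");
        }
        this.recountInterval = recountInterval;
        this.pendingBlocks = pendingBlocks;
        this.pendingTranquilCount = pendingBlocks.getAsInt();
    }

    // Called for each heartbeat from the being-tranquiled datanode.
    // Most calls only bump a counter, so heartbeat cost stays low.
    void onHeartbeat() {
        if (tranquilCompleted) {
            return;
        }
        heartbeatsSinceRecount++;
        if (heartbeatsSinceRecount < recountInterval) {
            return;
        }
        heartbeatsSinceRecount = 0;
        pendingTranquilCount = pendingBlocks.getAsInt();
        if (pendingTranquilCount == 0) {
            tranquilCompleted = true; // all replications are done
        }
    }

    int pendingTranquilCount() {
        return pendingTranquilCount;
    }

    boolean isTranquilCompleted() {
        return tranquilCompleted;
    }
}
```

A larger 'n' makes the stored count staler but keeps the expensive
recount off the common heartbeat path.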
It is possible that the namenode might stop receiving heartbeats
from a datanode that is being-tranquiled. In this case, the
tranquil flag of the datanode gets cleared. It transitions to the
dead state and the normal processing for alive-to-dead transition
occurs here.
Web Interface
-------------------
The dfshealth.jsp displays the live, dead, being-tranquiled and
tranquil nodes. For nodes in the being-tranquiled state, it
displays the percentage of tranquilization completed so far.
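One way the percentage might be derived is sketched below. It
assumes the namenode also remembers how many blocks the node held
when tranquilization started; the proposal does not mention such a
field, so totalBlocks and the class name are hypothetical.

```java
// Hypothetical helper for the web page; not part of dfshealth.jsp.
final class TranquilPercentSketch {
    // totalBlocks: blocks that needed replication when tranquilization
    // started (assumed to be tracked). pendingBlocks: blocks still waiting.
    static int percentComplete(long totalBlocks, long pendingBlocks) {
        if (totalBlocks < 0 || pendingBlocks < 0 || pendingBlocks > totalBlocks) {
            throw new IllegalArgumentException("inconsistent block counts");
        }
        if (totalBlocks == 0) {
            return 100; // nothing to replicate, so the node is done
        }
        return (int) ((totalBlocks - pendingBlocks) * 100 / totalBlocks);
    }
}
```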
Issues
--------
1. If a request for tranquilization starts getting processed and
there isn't enough space available in DFS to complete the
necessary replication, then that node might remain in the
being-tranquiled state for a very long time. This is not
necessarily a bad thing, but is there a better option?
2. We have opted for not storing cluster configuration information
in the persistent image of the file system. (The tranquil state of
a datanode may be lost if the namenode restarts).
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira