Re: [jira] Commented: (HADOOP-681) Adminstrative hook to pull live nodes out of a HDFS cluster

Eric Baldeschwieler Mon, 06 Nov 2006 19:35:04 -0800

sounds good.


On Nov 6, 2006, at 3:10 PM, dhruba borthakur (JIRA) wrote:

[ http://issues.apache.org/jira/browse/HADOOP-681?page=comments#action_12447576 ]
dhruba borthakur commented on HADOOP-681:
-----------------------------------------
Thanks a bunch: Eric, Yoram & Kons for your comments. Here are my take-aways:
1. I will change the name of this new state from tranquil to decommission.
2. I will make "decommission" a persistent state. This will be done in a follow-on patch submission (after the disk format upgrade patch is incorporated).
3. I will create a new command called dfsadmin. It will be invoked as
               bin/hadoop --config conf dfsadmin -decommision
4. I will incorporate most of Konstantin's comments. However, the namenode will not send a shutdown command to the datanode after all replications are completed. In fact, we endeavour to keep that decommissioned datanode alive and use it (if needed) to serve read requests.
5. I will create a separate defect to move "dfs -report" and "dfs -safemode" to the dfsadmin command. I will create another separate defect to support "dfsadmin -removenode" to completely remove a datanode's information from the namenode.

Adminstrative hook to pull live nodes out of a HDFS cluster
-----------------------------------------------------------

                Key: HADOOP-681
                URL: http://issues.apache.org/jira/browse/HADOOP-681
            Project: Hadoop
         Issue Type: New Feature
         Components: dfs
   Affects Versions: 0.8.0
           Reporter: dhruba borthakur
        Assigned To: dhruba borthakur

Introduction
------------
An administrator sometimes needs to bring down a datanode for scheduled maintenance. It would be nice if HDFS could be informed about this event. On receipt of this event, HDFS can take steps so that HDFS data is not lost when the node goes down at a later time.
Architecture
-----------
In the existing architecture, a datanode can be in one of two states: dead or alive. A datanode is alive if its heartbeats are being processed by the namenode. Otherwise that datanode is in the dead state. We extend the architecture to introduce the concept of a tranquil state for a datanode.
A datanode is in tranquil state if:
    - it cannot be a target for replicating any blocks
    - any block replica that it currently contains does not count towards the target-replication-factor of that block

Thus, a node that is in tranquil state can be brought down without impacting the guarantees provided by HDFS.

The tranquil state is not persisted across namenode restarts. If the namenode restarts then that datanode will go back to being in the dead or alive state.

The datanode is completely unaware that it has been labeled as being in tranquil state. It can continue to heartbeat and serve read requests for datablocks.
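The replica-counting rule above can be sketched as follows. This is only an illustration under assumptions: the NodeState enum and the ReplicaCounter class are hypothetical names, not part of the proposed patch or of the existing namenode code.

```java
import java.util.List;

/** Hypothetical node states used only for this illustration. */
enum NodeState { ALIVE, DEAD, TRANQUIL }

final class ReplicaCounter {
    /** Counts the replicas that count towards the target replication factor. */
    static int countedReplicas(List<NodeState> replicaHolders) {
        int count = 0;
        for (NodeState state : replicaHolders) {
            // A replica on a tranquil node stays readable but is not counted;
            // a replica on a dead node is not available at all.
            if (state == NodeState.ALIVE) {
                count++;
            }
        }
        return count;
    }

    /** True if the block has fewer counted replicas than its target. */
    static boolean needsReplication(List<NodeState> replicaHolders, int targetReplication) {
        return countedReplicas(replicaHolders) < targetReplication;
    }
}
```

With this rule, a block whose replicas sit on two alive nodes and one tranquil node is treated as having two replicas, so the tranquil node can be removed without dropping the block below what HDFS already accounts for.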
DFSShell Design
-----------------------
We extend the DFS Shell utility to specify a list of nodes to the namenode.

    hadoop dfs -tranquil {set|clear|get} datanodename1[,datanodename2]

The DFSShell utility sends this list to the namenode. This DFSShell command invoked with the "set" option completes when the list is transferred to the namenode. This command is non-blocking; it returns before the datanode is actually in the tranquil state. The client can then query the state by re-issuing the command with the "get" option. This option will indicate whether the datanode is in tranquil state or is "being tranquiled". The "clear" option is used to transition a tranquil datanode to the alive state. The "clear" option is a no-op if the datanode is not in the "tranquil" state.
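As a rough sketch of how the shell might split such a command line into an action and a node list, assuming the argument layout shown above. The TranquilCommand class is hypothetical and is not the actual DFSShell code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical parser for "-tranquil {set|clear|get} node1[,node2]". */
final class TranquilCommand {
    enum Action { SET, CLEAR, GET }

    final Action action;
    final List<String> datanodes;

    private TranquilCommand(Action action, List<String> datanodes) {
        this.action = action;
        this.datanodes = datanodes;
    }

    static TranquilCommand parse(String[] args) {
        if (args.length != 3 || !"-tranquil".equals(args[0])) {
            throw new IllegalArgumentException(
                "usage: -tranquil {set|clear|get} datanodename1[,datanodename2]");
        }
        // valueOf throws IllegalArgumentException for an unknown action.
        Action action = Action.valueOf(args[1].toUpperCase(Locale.ROOT));
        // The node list is a single comma-separated argument.
        List<String> nodes = Arrays.asList(args[2].split(","));
        return new TranquilCommand(action, nodes);
    }
}
```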
ClientProtocol Design
--------------------
The ClientProtocol is the protocol exported by the namenode for its client.
This protocol is extended to incorporate three new methods:
   ClientProtocol.setTranquil(String[] datanodes)
   ClientProtocol.getTranquil(String datanode)
   ClientProtocol.clearTranquil(String[] datanodes)
The ProtocolVersion is incremented to prevent conversations between incompatible clients and servers. An old DFSShell cannot talk to the new NameNode and vice-versa.
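The three methods might look roughly like this in Java. The design gives only the method names and parameters, so the return types, the TranquilStatus enum, the TranquilProtocol interface name, and the in-memory implementation (including its markCompleted helper) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical result type for the "get" query. */
enum TranquilStatus { NOT_TRANQUIL, BEING_TRANQUILED, TRANQUIL }

/** Sketch of the three new methods; return types are assumed. */
interface TranquilProtocol {
    void setTranquil(String[] datanodes);
    TranquilStatus getTranquil(String datanode);
    void clearTranquil(String[] datanodes);
}

/** Toy in-memory stand-in for the namenode side, for illustration only. */
final class InMemoryTranquilProtocol implements TranquilProtocol {
    private final Map<String, TranquilStatus> status = new HashMap<>();

    @Override
    public void setTranquil(String[] datanodes) {
        for (String node : datanodes) {
            // Non-blocking: only records that tranquilization has started.
            status.putIfAbsent(node, TranquilStatus.BEING_TRANQUILED);
        }
    }

    @Override
    public TranquilStatus getTranquil(String datanode) {
        return status.getOrDefault(datanode, TranquilStatus.NOT_TRANQUIL);
    }

    @Override
    public void clearTranquil(String[] datanodes) {
        for (String node : datanodes) {
            // Literal reading of the design: a no-op unless the node is tranquil.
            if (status.get(node) == TranquilStatus.TRANQUIL) {
                status.remove(node);
            }
        }
    }

    /** Hypothetical hook standing in for "all replications completed". */
    void markCompleted(String datanode) {
        status.computeIfPresent(datanode, (k, v) -> TranquilStatus.TRANQUIL);
    }
}
```

A client would call setTranquil once and then poll getTranquil until it reports TRANQUIL, which matches the non-blocking behaviour described for the shell command.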
NameNode Design
-------------------------
The namenode does the bulk of the work for supporting this new feature.

The DatanodeInfo object has a new private member named "state". It also has three new member functions:
    datanodeInfo.tranquilStarted(): start the process of tranquilization
    datanodeInfo.tranquilCompleted(): node is now in tranquil state
    datanodeInfo.clearTranquil() : remove tranquilization from node
The namenode exposes a new API to set and clear tranquil states for a datanode. On receipt of a "set tranquil" command, it invokes datanodeInfo.tranquilStarted().

The FSNamesystem.chooseTarget() method skips over datanodes that are marked as being in the "tranquil" state. This ensures that tranquil-datanodes are never chosen as targets of replication. The namenode does *not* record this operation in either the FsImage or the EditLogs.
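The skipping behaviour can be illustrated with a simplified target chooser. This is not the real FSNamesystem.chooseTarget(), which weighs other factors as well; the TargetChooser and Node types and the markedTranquil flag are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TargetChooser {
    /** Minimal stand-in for a datanode descriptor. */
    static final class Node {
        final String name;
        final boolean markedTranquil;

        Node(String name, boolean markedTranquil) {
            this.name = name;
            this.markedTranquil = markedTranquil;
        }
    }

    /** Picks up to 'wanted' targets, never returning a node marked tranquil. */
    static List<Node> chooseTargets(List<Node> candidates, int wanted) {
        List<Node> chosen = new ArrayList<>();
        for (Node node : candidates) {
            if (chosen.size() == wanted) {
                break;
            }
            if (node.markedTranquil) {
                continue; // tranquil nodes are never replication targets
            }
            chosen.add(node);
        }
        return chosen;
    }
}
```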
The namenode puts all the blocks from a being-tranquiled node into the neededReplication data structure. Necessary code changes are made to ensure that these blocks get replicated by the regular replication method. As of now, the regular replication code does not distinguish between these blocks and the blocks that are replication candidates because some other datanode might have died. It might be prudent to give a different (lower?) weight to this type of replication request, but that exercise is deferred to a later date. In this design, replication requests generated because of a node going to a tranquil state are not distinguished from replication requests generated by a datanode going to the dead state.

The DatanodeInfo object has another new private member named "pendingTranquilCount". This field stores the number of blocks that still remain to be replicated. This field is valid only if the node is in the being-tranquiled state. On receipt of every 'n' heartbeats from the being-tranquiled datanode, the namenode calculates the amount of data that is still remaining to be replicated and updates the "pendingTranquilCount" in the DatanodeInfo. When all the replications complete, the datanode is marked as tranquiled. The number 'n' is selected in such a way that the average heartbeat processing time does not increase appreciably.

It is possible that the namenode might stop receiving heartbeats from a datanode that is being-tranquiled. In this case, the tranquil flag of the datanode gets cleared. It transitions to the dead state and the normal processing for alive-to-dead transition occurs.
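The heartbeat-driven bookkeeping described above could be sketched like this. The TranquilProgress class, its State enum, and the remainingBlocks callback are hypothetical; only the "pendingTranquilCount" field name and the every-'n'-heartbeats rule come from the design.

```java
import java.util.function.IntSupplier;

final class TranquilProgress {
    enum State { BEING_TRANQUILED, TRANQUIL, DEAD }

    private final int heartbeatInterval;        // the 'n' from the design
    private final IntSupplier remainingBlocks;  // hypothetical recount callback
    private State state = State.BEING_TRANQUILED;
    private int heartbeatsSeen = 0;
    private int pendingTranquilCount;

    TranquilProgress(int heartbeatInterval, IntSupplier remainingBlocks) {
        if (heartbeatInterval <= 0) {
            throw new IllegalArgumentException("heartbeatInterval must be positive");
        }
        this.heartbeatInterval = heartbeatInterval;
        this.remainingBlocks = remainingBlocks;
        this.pendingTranquilCount = remainingBlocks.getAsInt();
    }

    /** Called for each heartbeat; recounts only on every n-th one. */
    void onHeartbeat() {
        if (state != State.BEING_TRANQUILED) {
            return;
        }
        heartbeatsSeen++;
        if (heartbeatsSeen % heartbeatInterval != 0) {
            return; // keep the common heartbeat path cheap
        }
        pendingTranquilCount = remainingBlocks.getAsInt();
        if (pendingTranquilCount == 0) {
            state = State.TRANQUIL;
        }
    }

    /** Heartbeats stopped: the tranquil flag is cleared and the node is dead. */
    void onHeartbeatsLost() {
        if (state == State.BEING_TRANQUILED) {
            state = State.DEAD;
        }
    }

    State state() { return state; }

    int pendingTranquilCount() { return pendingTranquilCount; }
}
```

A larger interval means fewer recounts per heartbeat at the cost of a later transition to the tranquil state, which is the trade-off behind choosing 'n'.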
Web Interface
-------------------
The dfshealth.jsp displays the live nodes, dead nodes, being-tranquiled nodes and tranquil nodes. For nodes in the being-tranquiled state, it displays the percentage of tranquilization completed so far.
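The design does not say how the percentage is derived. One plausible reading, shown here purely as an assumption, is the share of the node's blocks that no longer await replication; TranquilPercent and its parameters are hypothetical.

```java
final class TranquilPercent {
    /**
     * Percentage of tranquilization completed, assuming it is measured as
     * (totalBlocks - pendingBlocks) / totalBlocks. A node with no blocks
     * is reported as 100 percent complete.
     */
    static int percentComplete(long totalBlocks, long pendingBlocks) {
        if (totalBlocks <= 0) {
            return 100;
        }
        long done = Math.max(0, totalBlocks - pendingBlocks);
        return (int) Math.min(100, done * 100 / totalBlocks);
    }
}
```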
Issues
--------
1. If a request for tranquilization starts getting processed and there isn't enough space available in DFS to complete the necessary replication, then that node might remain in the being-tranquiled state for a long, long time. This is not necessarily a bad thing but is there a better option?
2. We have opted for not storing cluster configuration information in the persistent image of the file system. (The tranquil state of a datanode may be lost if the namenode restarts).
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
