[jira] [Commented] (HDFS-6507) Improve DFSAdmin to support HA cluster better

Vinayakumar B (JIRA) Thu, 12 Jun 2014 02:18:27 -0700

    [ https://issues.apache.org/jira/browse/HDFS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028957#comment-14028957 ]


Vinayakumar B commented on HDFS-6507:
-------------------------------------

{quote}Make the default behavior correct and natural: a. Commands that should 
take effect on the ANN take effect on the ANN by default; b. Commands that 
should take effect on both the ANN and the SNN take effect on both by default.
Improve usability: a. Users do not need to care which NN is active or what the 
NN's host:port is, they just run the command; b. Commands that should take 
effect on both the ANN and the SNN need not be run twice, once per 
host:port.{quote}

Yes. Specific commands can be executed on all namenodes by default, by 
iterating over every configured namenode.

The only concern is how to handle failures when a command succeeds on one 
namenode and fails on another: should we just log the errors, or roll back?

If manual steps are required before running a command, then it is the user's 
responsibility to make sure the configuration has been updated on both 
namenodes first.

For commands such as "safemode enter" or "safemode leave", a failure on one of 
the namenodes could produce unwanted results. For example, if only the Active 
leaves safemode, a later failover puts the cluster back into safemode.

So this change needs better failure handling.
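
To make the "log or rollback" question concrete, here is a minimal sketch of the fan-out itself. It is not DFSAdmin code: NamenodeCommand, FanOutResult and runOnAll are hypothetical names, and the namenode addresses are plain strings standing in for whatever the HA configuration would supply. It only shows one possible policy, which is to try every namenode, record each failure, and let the caller see that the result was partial.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HaFanOutSketch {
    /** Hypothetical stand-in for one admin RPC sent to a single namenode. */
    public interface NamenodeCommand {
        void runOn(String namenodeAddress) throws Exception;
    }

    /** Outcome of running one command against every namenode. */
    public static final class FanOutResult {
        public final List<String> succeeded = new ArrayList<>();
        public final Map<String, Exception> failed = new LinkedHashMap<>();

        public boolean allSucceeded() {
            return failed.isEmpty();
        }

        /** True when at least one namenode applied the command and one did not. */
        public boolean partial() {
            return !succeeded.isEmpty() && !failed.isEmpty();
        }
    }

    /**
     * Runs cmd on every namenode in order. A failure on one namenode does not
     * stop the loop; it is recorded so the caller can log it, retry it, or
     * attempt a compensating command on the namenodes that succeeded.
     */
    public static FanOutResult runOnAll(List<String> namenodes, NamenodeCommand cmd) {
        FanOutResult result = new FanOutResult();
        for (String nn : namenodes) {
            try {
                cmd.runOn(nn);
                result.succeeded.add(nn);
            } catch (Exception e) {
                result.failed.put(nn, e);
            }
        }
        return result;
    }
}
```

With something like this, the "safemode leave" case above shows up as partial() == true, and the tool can at least report which namenode is still in safemode instead of silently exiting.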

> Improve DFSAdmin to support HA cluster better
> ---------------------------------------------
>
>                 Key: HDFS-6507
>                 URL: https://issues.apache.org/jira/browse/HDFS-6507
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: tools
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>
> Currently, the commands supported in DFSAdmin can be classified into three 
> categories according to the protocol used:
> 1. ClientProtocol
> Commands in this category are generally implemented by calling the 
> corresponding method of the DFSClient class, which in turn calls the 
> corresponding remote implementation on the NN side. On the NN side, all these 
> operations are classified into five categories: UNCHECKED, READ, WRITE, 
> CHECKPOINT, JOURNAL. The Active NN allows all operations, and the Standby NN 
> allows only UNCHECKED operations. In the current implementation, DFSClient 
> connects to one NN first; if that NN is not Active and the operation is not 
> allowed, it fails over to the second NN. Here is the problem: some of the 
> commands (setSafeMode, saveNameSpace, restoreFailedStorage, refreshNodes, 
> setBalancerBandwidth, metaSave) in DFSAdmin are classified as UNCHECKED 
> operations, so when they are run from the DFSAdmin command line they are sent 
> to a single predetermined NN, regardless of whether it is Active or Standby. 
> This may result in two problems: 
> a. If the first NN tried is Standby, the operation takes effect only on the 
> Standby NN, which is not the expected result.
> b. If the operation needs to take effect on both NNs, it takes effect on 
> only one. When an NN failover happens later, this may cause problems.
> Here I propose the following improvements:
> a. If the command can be classified as one of READ/WRITE/CHECKPOINT/JOURNAL 
> operations, we should classify it clearly.
> b. If the command cannot be classified as one of the above four operations, 
> or if the command needs to take effect on both NNs, we should send the 
> request to both the Active and Standby NNs.
> 2. Refresh protocols: RefreshAuthorizationPolicyProtocol, 
> RefreshUserMappingsProtocol, RefreshCallQueueProtocol
> Commands in this category, including refreshServiceAcl, 
> refreshUserToGroupMapping, refreshSuperUserGroupsConfiguration and 
> refreshCallQueue, are implemented by creating a corresponding RPC proxy and 
> sending the request to the remote NN. In the current implementation, these 
> requests are sent to a single predetermined NN, regardless of whether it is 
> Active or Standby. Here I propose that we send these requests to both NNs.
> 3. ClientDatanodeProtocol
> Commands in this category are already handled correctly and need no 
> improvement.
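
The routing rule proposed in the description can be sketched as below. This is only an illustration under stated assumptions: the enum values mirror the five category names listed in the description, while AdminTargetSketch, HaState, Target, isAllowed and chooseTarget are hypothetical names that do not come from the Hadoop code base.

```java
public class AdminTargetSketch {
    /** The five operation categories named in the issue description. */
    public enum OperationCategory { UNCHECKED, READ, WRITE, CHECKPOINT, JOURNAL }

    public enum HaState { ACTIVE, STANDBY }

    /** Where an admin command should be sent (hypothetical). */
    public enum Target { ACTIVE_ONLY, ALL_NAMENODES }

    /** The Active NN allows everything; the Standby NN allows only UNCHECKED. */
    public static boolean isAllowed(HaState state, OperationCategory op) {
        return state == HaState.ACTIVE || op == OperationCategory.UNCHECKED;
    }

    /**
     * Proposal (a): a command with a real category is rejected by the Standby,
     * so the client's normal failover delivers it to the Active.
     * Proposal (b): a command that stays UNCHECKED, or that must take effect
     * on both NNs, is sent to every namenode.
     */
    public static Target chooseTarget(OperationCategory op, boolean mustApplyToAll) {
        if (mustApplyToAll || op == OperationCategory.UNCHECKED) {
            return Target.ALL_NAMENODES;
        }
        return Target.ACTIVE_ONLY;
    }
}
```

Under this sketch the refresh commands in category 2 would be called with mustApplyToAll set to true, which is the same "send to both NNs" behavior proposed there.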



--
This message was sent by Atlassian JIRA
(v6.2#6252)
