[jira] [Commented] (HDFS-7008) xlator should be closed upon exit from DFSAdmin#genericRefresh()
[ https://issues.apache.org/jira/browse/HDFS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208739#comment-14208739 ]

Chris Li commented on HDFS-7008:
--------------------------------

Linking issue

> xlator should be closed upon exit from DFSAdmin#genericRefresh()
> ----------------------------------------------------------------
>
>                 Key: HDFS-7008
>                 URL: https://issues.apache.org/jira/browse/HDFS-7008
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Tsuyoshi OZAWA
>            Priority: Minor
>         Attachments: HDFS-7008.1.patch
>
> {code}
> GenericRefreshProtocol xlator =
>     new GenericRefreshProtocolClientSideTranslatorPB(proxy);
>
> // Refresh
> Collection<RefreshResponse> responses = xlator.refresh(identifier, args);
> {code}
> GenericRefreshProtocolClientSideTranslatorPB#close() should be called on xlator before return.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
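The leak described above is the classic "resource not closed on every exit path" pattern, and try-with-resources is the idiomatic Java fix. Below is a minimal, self-contained sketch: `FakeTranslator` is a hypothetical stand-in (the real `GenericRefreshProtocolClientSideTranslatorPB` wraps an RPC proxy inside Hadoop), so this only illustrates the shape of the fix, not the actual patch.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.Collection;

public class RefreshCloseSketch {
    // Hypothetical stand-in for GenericRefreshProtocolClientSideTranslatorPB.
    static class FakeTranslator implements Closeable {
        Collection<String> refresh(String identifier, String[] args) {
            return Arrays.asList(identifier + ": ok");
        }
        @Override
        public void close() {
            // The real close() would shut down the underlying RPC proxy.
        }
    }

    // try-with-resources guarantees close() runs on every exit path,
    // including early returns and exceptions thrown by refresh().
    static Collection<String> genericRefresh(String identifier, String[] args) {
        try (FakeTranslator xlator = new FakeTranslator()) {
            return xlator.refresh(identifier, args);
        }
    }

    public static void main(String[] args) {
        System.out.println(genericRefresh("refreshCallQueue", new String[0]));
    }
}
```

A plain try/finally calling `xlator.close()` would work equally well and matches the style of older Hadoop code that predates try-with-resources.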
[jira] [Commented] (HDFS-6507) Improve DFSAdmin to support HA cluster better
[ https://issues.apache.org/jira/browse/HDFS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028758#comment-14028758 ]

Chris Li commented on HDFS-6507:
--------------------------------

Yea, that's something we discussed: how to handle HA. Current refresh protos go only to the active NN, and HADOOP-10376 requires the user to manually specify each NN to target, so they could refresh one NN or X number of NNs by running X number of refresh commands.

To make it more convenient to refresh the NN/RM in an HA configuration, we could add a special option to do so, maybe like `dfsadmin -refresh allNamenodes key [arg1..argn]`, or maybe just `namenode` (with "all" implicit under HA).

As for the old refresh protocols, it seems like a good idea; it would be bad to have the standby NN take over with outdated configs.

> Improve DFSAdmin to support HA cluster better
> ---------------------------------------------
>
>                 Key: HDFS-6507
>                 URL: https://issues.apache.org/jira/browse/HDFS-6507
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: tools
>    Affects Versions: 2.4.0
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>
> Currently, the commands supported in DFSAdmin can be classified into three categories according to the protocol used:
> 1. ClientProtocol
> Commands in this category are generally implemented by calling the corresponding function of the DFSClient class, which in turn calls the corresponding remote implementation on the NN side. At the NN side, all these operations are classified into five categories: UNCHECKED, READ, WRITE, CHECKPOINT, JOURNAL. The Active NN allows all operations, while the Standby NN only allows UNCHECKED operations. In the current implementation, DFSClient connects to one NN first; if that NN is not Active and the operation is not allowed there, it fails over to the second NN.
> Here comes the problem: some of the commands (setSafeMode, saveNameSpace, restoreFailedStorage, refreshNodes, setBalancerBandwidth, metaSave) in DFSAdmin are classified as UNCHECKED operations, so when executed from the DFSAdmin command line they are sent to one definite NN, whether it is Active or Standby. This may result in two problems:
> a. If the first NN tried is the Standby, the operation takes effect only on the Standby NN, which is not the expected result.
> b. If the operation needs to take effect on both NNs but takes effect on only one, there may be problems after a future NN failover.
> Here I propose the following improvements:
> a. If a command can be classified as one of the READ/WRITE/CHECKPOINT/JOURNAL operations, we should classify it clearly.
> b. If a command cannot be classified as one of the above four operations, or if it needs to take effect on both NNs, we should send the request to both the Active and Standby NNs.
> 2. Refresh protocols: RefreshAuthorizationPolicyProtocol, RefreshUserMappingsProtocol, RefreshCallQueueProtocol
> Commands in this category, including refreshServiceAcl, refreshUserToGroupMapping, refreshSuperUserGroupsConfiguration and refreshCallQueue, are implemented by creating a corresponding RPC proxy and sending the request to the remote NN. In the current implementation, these requests are sent to one definite NN, whether it is Active or Standby. Here I propose that we send these requests to both NNs.
> 3. ClientDatanodeProtocol
> Commands in this category are handled correctly; no improvement is needed.
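The proposal to fan refresh requests out to both NameNodes, rather than stopping at whichever NN a proxy happens to reach, can be sketched as a simple loop that collects a per-NN outcome. Everything here is hypothetical illustration (the `RefreshAction` interface and the addresses are invented for the sketch); the real code would build an RPC proxy per NameNode address from the HA configuration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RefreshAllNamenodes {
    // Hypothetical per-NN refresh action; in real code this would create an
    // RPC proxy for the given NameNode address and invoke the refresh protocol.
    interface RefreshAction {
        String refresh(String nnAddress) throws Exception;
    }

    // Send the refresh to every NameNode (Active and Standby alike) and
    // record per-NN outcomes instead of aborting on the first failure, so
    // one unreachable NN does not leave the other with stale config.
    static Map<String, String> refreshAll(List<String> nnAddresses, RefreshAction action) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String nn : nnAddresses) {
            try {
                results.put(nn, action.refresh(nn));
            } catch (Exception e) {
                results.put(nn, "FAILED: " + e.getMessage());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(refreshAll(List.of("nn1:8020", "nn2:8020"), nn -> "refreshed"));
    }
}
```

Reporting partial failure (rather than all-or-nothing) matters here: an admin needs to know which NN still holds the old ACLs or call-queue settings.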
[jira] [Commented] (HDFS-3544) Ability to use SimpleRegeratingCode to fix missing blocks
[ https://issues.apache.org/jira/browse/HDFS-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871357#comment-13871357 ]

Chris Li commented on HDFS-3544:
--------------------------------

Okay, we're very interested in SRC in our clusters and we will work on this feature if we can get some momentum. Either way I'll keep you posted.

> Ability to use SimpleRegeratingCode to fix missing blocks
> ----------------------------------------------------------
>
>                 Key: HDFS-3544
>                 URL: https://issues.apache.org/jira/browse/HDFS-3544
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: contrib/raid
>            Reporter: dhruba borthakur
>
> Reed-Solomon encoding (n, k) has n storage nodes and can tolerate n-k failures. Regenerating a block needs to access k blocks, which is a problem when n and k are large. Instead, we can use simple regenerating codes (n, k, f) that first do Reed-Solomon (n, k) and then XOR with stripe size f. Then a single disk failure needs to access only f nodes, and f can be very small.
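The repair-cost argument can be made concrete with just the XOR layer of the scheme. A minimal sketch, using invented toy blocks: with f blocks per XOR stripe plus one parity block, recovering a single lost block reads only the other f blocks of that stripe, instead of the k blocks a plain Reed-Solomon repair would read.

```java
public class XorRepairSketch {
    // XOR a set of equal-length blocks together, byte by byte.
    static byte[] xor(byte[]... blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] b : blocks)
            for (int i = 0; i < out.length; i++)
                out[i] ^= b[i];
        return out;
    }

    public static void main(String[] args) {
        // One XOR stripe with f = 3 data blocks (toy values) and its parity.
        byte[] b0 = {1, 2, 3}, b1 = {4, 5, 6}, b2 = {7, 8, 9};
        byte[] parity = xor(b0, b1, b2);
        // Losing b1: XOR the surviving f blocks of this stripe to rebuild it,
        // touching f nodes rather than k as in plain Reed-Solomon repair.
        byte[] recovered = xor(b0, b2, parity);
        System.out.println(java.util.Arrays.equals(recovered, b1)); // true
    }
}
```

This is only the cheap local-repair layer; the Reed-Solomon (n, k) layer underneath is what still provides tolerance of up to n-k failures.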
[jira] [Commented] (HDFS-3544) Ability to use SimpleRegeratingCode to fix missing blocks
[ https://issues.apache.org/jira/browse/HDFS-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866170#comment-13866170 ]

Chris Li commented on HDFS-3544:
--------------------------------

Any updates on this issue? We're interested in trying this out to save space on our cold files.

> Ability to use SimpleRegeratingCode to fix missing blocks
> ----------------------------------------------------------
>
>                 Key: HDFS-3544
>                 URL: https://issues.apache.org/jira/browse/HDFS-3544
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: contrib/raid
>            Reporter: dhruba borthakur
>            Assignee: Weiyan Wang
>
> Reed-Solomon encoding (n, k) has n storage nodes and can tolerate n-k failures. Regenerating a block needs to access k blocks, which is a problem when n and k are large. Instead, we can use simple regenerating codes (n, k, f) that first do Reed-Solomon (n, k) and then XOR with stripe size f. Then a single disk failure needs to access only f nodes, and f can be very small.
[jira] [Commented] (HDFS-5639) rpc scheduler abstraction
[ https://issues.apache.org/jira/browse/HDFS-5639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843606#comment-13843606 ]

Chris Li commented on HDFS-5639:
--------------------------------

Something like this will be needed down the road if HADOOP-9640 is adopted; I'll open separate jiras for these enhancements when we're ready.

> rpc scheduler abstraction
> -------------------------
>
>                 Key: HDFS-5639
>                 URL: https://issues.apache.org/jira/browse/HDFS-5639
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>         Attachments: HDFS-5639-2.patch, HDFS-5639.patch
>
> We have run into various issues in namenode and hbase w.r.t. rpc handling in multi-tenant clusters. Examples are:
> https://issues.apache.org/jira/i#browse/HADOOP-9640
> https://issues.apache.org/jira/i#browse/HBASE-8836
> There are different ideas on how to prioritize rpc requests. It could be based on user id, or on whether it is a read request or a write request, or it could use a specific rule, such as treating a datanode's RPC as more important than a client RPC. We want to enable people to implement and experiment with different rpc schedulers.
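The abstraction being asked for is small: a pluggable policy that maps an incoming call to a priority level, leaving the queueing machinery unchanged. A minimal sketch under invented names (`RpcScheduler`, `UserCountScheduler`, and its threshold policy are all hypothetical, not the interface ultimately committed to Hadoop):

```java
import java.util.HashMap;
import java.util.Map;

public class SchedulerSketch {
    // Hypothetical pluggable contract: map a call to a priority level,
    // 0 being highest. Implementations could key off user id, read vs.
    // write, or caller role (e.g. DataNode RPCs outrank client RPCs).
    interface RpcScheduler {
        int getPriorityLevel(String user, boolean isWrite);
    }

    // One possible policy: users who exceed a call-count threshold are
    // demoted one level, and writes cost one level more than reads.
    static class UserCountScheduler implements RpcScheduler {
        private final Map<String, Integer> callCounts = new HashMap<>();
        private final int threshold;

        UserCountScheduler(int threshold) {
            this.threshold = threshold;
        }

        public int getPriorityLevel(String user, boolean isWrite) {
            int n = callCounts.merge(user, 1, Integer::sum);  // running per-user count
            int level = n > threshold ? 1 : 0;
            return isWrite ? level + 1 : level;
        }
    }

    public static void main(String[] args) {
        RpcScheduler s = new UserCountScheduler(2);
        System.out.println(s.getPriorityLevel("alice", false)); // 0
        System.out.println(s.getPriorityLevel("alice", false)); // 0
        System.out.println(s.getPriorityLevel("alice", false)); // 1 (over threshold)
    }
}
```

Keeping the policy behind an interface like this is what lets the HADOOP-9640-style history-based scheduler and simpler read/write or role-based rules be swapped without touching the RPC server.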