[jira] [Commented] (HDFS-5723) Append-failed FINALIZED replica should not be accepted as valid when that block is under construction
[ https://issues.apache.org/jira/browse/HDFS-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026152#comment-14026152 ] Vinayakumar B commented on HDFS-5723: - Hi Stanley, if the issue is different, then you can file a separate Jira for it. Append-failed FINALIZED replica should not be accepted as valid when that block is under construction Key: HDFS-5723 URL: https://issues.apache.org/jira/browse/HDFS-5723 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0 Reporter: Vinayakumar B Assignee: Vinayakumar B Attachments: HDFS-5723.patch, HDFS-5723.patch Scenario: 1. A 3-node cluster with dfs.client.block.write.replace-datanode-on-failure.enable set to false. 2. A file is written with 3 replicas, blk_id_gs1. 3. One of the datanodes, DN1, goes down. 4. The file is opened for append, more data is added and synced (to only the 2 live nodes, DN2 and DN3) -- blk_id_gs2. 5. DN1 is restarted. 6. In its block report, DN1 reports the FINALIZED block blk_id_gs1, which should be marked corrupt; but since the NN has the appended block in the UnderConstruction state, it does not detect the replica as corrupt at this time and adds it to the valid block locations. As long as the NameNode stays alive, this datanode is considered a valid replica, and reads/appends that hit it will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
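The fix being argued for amounts to a generation-stamp comparison during block report processing. A minimal sketch, with illustrative names (this is not the actual BlockManager code):
{code}
// Illustrative sketch only -- not the actual BlockManager logic.
// An append bumps the block's generation stamp, so a FINALIZED replica
// carrying an older stamp (blk_id_gs1) than the NameNode's current stamp
// (blk_id_gs2) can only be a leftover from before the failed append and
// should be treated as corrupt, even while the block is under construction.
public class StaleReplicaCheck {
  static boolean isStaleFinalizedReplica(long reportedGenStamp,
                                         long currentGenStamp,
                                         boolean reportedFinalized) {
    return reportedFinalized && reportedGenStamp < currentGenStamp;
  }
}
{code}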
[jira] [Updated] (HDFS-6494) In some cases, a hedged read can lead to an infinite client wait.
[ https://issues.apache.org/jira/browse/HDFS-6494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] LiuLei updated HDFS-6494: - Attachment: hedged-read-test-case.patch One test case based on the HDFS-6231 patch. In some cases, a hedged read can lead to an infinite client wait. -- Key: HDFS-6494 URL: https://issues.apache.org/jira/browse/HDFS-6494 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.4.0 Reporter: LiuLei Assignee: Liang Xie Attachments: hedged-read-bug.patch, hedged-read-test-case.patch When I use hedged reads, if there is only one live datanode and reading from it throws a TimeoutException and a ChecksumException, the client will wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
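For context, hedged reads are switched on purely through client configuration; a minimal sketch using the standard hedged-read keys (the values here are arbitrary):
{code}
import org.apache.hadoop.conf.Configuration;

public class HedgedReadConfig {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // A positive thread pool size enables hedged reads in the DFS client.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 5);
    // Start a second (hedged) read if the first datanode has not
    // responded within this many milliseconds.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 500);
    return conf;
  }
}
{code}
With a single live datanode, both the original and the hedged read necessarily go to the same node, which is the situation the reported hang involves.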
[jira] [Assigned] (HDFS-6506) Newly moved block replicas are invalidated and deleted
[ https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang reassigned HDFS-6506: --- Assignee: Binglin Chang Newly moved block replicas are invalidated and deleted -- Key: HDFS-6506 URL: https://issues.apache.org/jira/browse/HDFS-6506 Project: Hadoop HDFS Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang TestBalancerWithNodeGroup#testBalancerWithNodeGroup has been failing recently: https://builds.apache.org/job/PreCommit-HDFS-Build/7045//testReport/ From the error log, the cause seems to be that newly moved block replicas are invalidated and deleted, so some of the balancer's work is reversed.
{noformat}
2014-06-06 18:15:51,681 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741834_1010 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,683 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741833_1009 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,683 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741830_1006 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,683 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741831_1007 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,682 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741832_1008 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:54,702 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741827_1003 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:54,702 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741828_1004 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:54,701 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741829_1005 with size=100 fr
2014-06-06 18:15:54,706 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741833_1009) is added to invalidated blocks set
2014-06-06 18:15:54,709 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741834_1010) is added to invalidated blocks set
2014-06-06 18:15:56,421 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741833_1009, blk_1073741834_1010]
2014-06-06 18:15:57,717 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741832_1008) is added to invalidated blocks set
2014-06-06 18:15:57,720 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741827_1003) is added to invalidated blocks set
2014-06-06 18:15:57,721 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741830_1006) is added to invalidated blocks set
2014-06-06 18:15:57,722 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741831_1007) is added to invalidated blocks set
2014-06-06 18:15:57,723 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741829_1005) is added to invalidated blocks set
2014-06-06 18:15:59,422 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741827_1003, blk_1073741829_1005, blk_1073741830_1006, blk_1073741831_1007, blk_1073741832_1008]
2014-06-06 18:16:02,423 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741845_1021]
{noformat}
Normally this should not happen: when moving a block from src to dest, the replica on src should be invalidated, not the one on dest, so there appears to be a bug in the related logic. I don't think TestBalancerWithNodeGroup#testBalancerWithNodeGroup itself caused this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6475) WebHdfs clients fail without retry because of incorrect handling of StandbyException
[ https://issues.apache.org/jira/browse/HDFS-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-6475: Attachment: HDFS-6475.002.patch WebHdfs clients fail without retry because of incorrect handling of StandbyException - Key: HDFS-6475 URL: https://issues.apache.org/jira/browse/HDFS-6475 Project: Hadoop HDFS Issue Type: Bug Components: ha, webhdfs Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-6475.001.patch, HDFS-6475.002.patch With WebHdfs clients connected to an HA HDFS service, the delegation token is initialized beforehand with the active NN. When a client issues a request, the NNs it may contact are stored in a map returned by DFSUtil.getNNServiceRpcAddresses(conf), and the client contacts them in order, so the first one it runs into is likely the Standby NN. If the Standby NN doesn't have the updated client credential, it throws a SecurityException that wraps a StandbyException. The client is expected to retry the other NN, but due to the insufficient handling of SecurityException described above, it fails. Example message:
{code}
{RemoteException={message=Failed to obtain user group information: org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException, javaClassName=java.lang.SecurityException, exception=SecurityException}}
org.apache.hadoop.ipc.RemoteException(java.lang.SecurityException): Failed to obtain user group information: org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException
        at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:159)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:325)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$700(WebHdfsFileSystem.java:107)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.getResponse(WebHdfsFileSystem.java:635)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:542)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.run(WebHdfsFileSystem.java:431)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:685)
        at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:696)
        at kclient1.kclient$1.run(kclient.java:64)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:356)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
        at kclient1.kclient.main(kclient.java:58)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6506) Newly moved block replicas are invalidated and deleted
[ https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026188#comment-14026188 ] Binglin Chang commented on HDFS-6506: - Looking at the log and code more thoroughly, the reason some block replicas are invalidated is: 1. Balancer round 1: move blk0 from dn0 to dn1; at this point the block map hasn't been updated yet (so dn0 still has blk0). 2. Balancer round 2 starts and tries to move blk0 from dn0 to dn2. 3. dn2 copies the data from dn0. 4. dn0 heartbeats and gets the command to delete blk0. 5. When the NN processes the move of blk0 from dn0 to dn2, it cannot find the replica on dn0, but it still has to delete one replica, so it deletes the one on dn1. To prevent this, the balancer needs to wait long enough to make sure the block movements of the last round are fully committed; otherwise those movements may be invalidated. Newly moved block replicas are invalidated and deleted -- Key: HDFS-6506 URL: https://issues.apache.org/jira/browse/HDFS-6506 Project: Hadoop HDFS Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6507) Improve DFSAdmin to support HA clusters better
Zesheng Wu created HDFS-6507: Summary: Improve DFSAdmin to support HA clusters better Key: HDFS-6507 URL: https://issues.apache.org/jira/browse/HDFS-6507 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 2.4.0 Reporter: Zesheng Wu Assignee: Zesheng Wu Currently, the commands supported in DFSAdmin can be classified into three categories according to the protocol used: 1. ClientProtocol. Commands in this category are generally implemented by calling the corresponding function of the DFSClient class, which ultimately calls the corresponding remote implementation on the NN side. On the NN side, all these operations are classified into five categories: UNCHECKED, READ, WRITE, CHECKPOINT, JOURNAL. The Active NN allows all operations, while the Standby NN only allows UNCHECKED operations. In the current implementation of DFSClient, it connects to one NN first; if that NN is not Active and the operation is not allowed, it fails over to the second NN. So here comes the problem: some of the commands (setSafeMode, saveNameSpace, restoreFailedStorage, refreshNodes, setBalancerBandwidth, metaSave) in DFSAdmin are classified as UNCHECKED operations, and when executed from the DFSAdmin command line they are sent to one definite NN, regardless of whether it is Active or Standby. This may result in two problems: a. If the first NN tried is the Standby, the operation takes effect only on the Standby NN, which is not the expected result. b. If the operation needs to take effect on both NNs, it takes effect on only one; later, when an NN failover happens, there may be problems. Here I propose the following improvements: a. If the command can be classified as one of the READ/WRITE/CHECKPOINT/JOURNAL operations, we should classify it clearly. b. If the command cannot be classified as one of the above four operations, or if the command needs to take effect on both NNs, we should send the request to both the Active and Standby NNs. 2. Refresh protocols: RefreshAuthorizationPolicyProtocol, RefreshUserMappingsProtocol, RefreshCallQueueProtocol. Commands in this category, including refreshServiceAcl, refreshUserToGroupMapping, refreshSuperUserGroupsConfiguration and refreshCallQueue, are implemented by creating the corresponding RPC proxy and sending the request to the remote NN. In the current implementation, these requests are sent to one definite NN, regardless of whether it is Active or Standby. Here I propose that we send these requests to both NNs. 3. ClientDatanodeProtocol. Commands in this category are handled correctly and need no improvement. -- This message was sent by Atlassian JIRA (v6.2#6252)
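A rough sketch of the "send to both NNs" idea for the refresh protocols; the address iteration below uses DFSUtil.getNNServiceRpcAddresses, while the proxy creation is left as a placeholder since the actual mechanism would come from the eventual patch:
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSUtil;

public class RefreshAllNameNodes {
  // Hypothetical sketch: resolve every configured NN service RPC address
  // (Active and Standby alike) and issue the refresh against each one,
  // instead of sending it to a single, arbitrarily chosen NN.
  public static void refreshAll(Configuration conf) throws IOException {
    Map<String, Map<String, InetSocketAddress>> addressMap =
        DFSUtil.getNNServiceRpcAddresses(conf);
    for (Map<String, InetSocketAddress> nsAddrs : addressMap.values()) {
      for (InetSocketAddress nnAddr : nsAddrs.values()) {
        // Placeholder: build the per-protocol RPC proxy for nnAddr the
        // same way DFSAdmin already does for a single NN, then call e.g.
        // proxy.refreshServiceAcl();
      }
    }
  }
}
{code}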
[jira] [Commented] (HDFS-6475) WebHdfs clients fail without retry because of incorrect handling of StandbyException
[ https://issues.apache.org/jira/browse/HDFS-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026198#comment-14026198 ] Yongjun Zhang commented on HDFS-6475: - Hello [~jingzhao], Thanks for your earlier suggestion, and sorry for the slight delay in taking care of it. I just uploaded a patch. I was able to verify on a real cluster where I saw the problem that the patch fixes the issue. However, I have not been able to create a test case for it. Since this new patch reuses the method getTrueCause() in Server.java, the remaining thing to be checked by a unit test would be the change I made in ExceptionHandler. That change is: for ContainerException and SecurityException, call getTrueCause() to find the real exception by following the cause chain of the ContainerException/SecurityException. The original code in ExceptionHandler only does one level of cause-seeking, and only for ContainerException. Would you please take a look at the patch and advise whether it can be committed without a unit test case, or whether you have any other suggestions? Thanks a lot. WebHdfs clients fail without retry because of incorrect handling of StandbyException - Key: HDFS-6475 URL: https://issues.apache.org/jira/browse/HDFS-6475 Project: Hadoop HDFS Issue Type: Bug Components: ha, webhdfs Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-6475.001.patch, HDFS-6475.002.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
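The gist of the ExceptionHandler change being discussed, as a hedged sketch; the loop condition is a paraphrase of what getTrueCause() does, not the actual Server.java code:
{code}
import com.sun.jersey.api.container.ContainerException;

public class CauseUnwrapper {
  // Walk the cause chain, peeling off ContainerException/SecurityException
  // wrappers until the underlying exception (e.g. a StandbyException
  // wrapped inside an InvalidToken) surfaces.
  static Throwable getTrueCause(Throwable t) {
    while (t.getCause() != null
        && (t instanceof ContainerException || t instanceof SecurityException)) {
      t = t.getCause();
    }
    return t;
  }
}
{code}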
[jira] [Commented] (HDFS-6506) Newly moved block replicas are invalidated and deleted
[ https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026215#comment-14026215 ] Binglin Chang commented on HDFS-6506: - The Balancer already sleeps 2*DFS_HEARTBEAT_INTERVAL seconds between rounds, but TestBalancer.java sets:
{code}
conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 1L);
{code}
Replica state update speed also depends on DFS_NAMENODE_REPLICATION_INTERVAL, which is 3 by default. TestBalancer only changes the heartbeat interval (which changes both the heartbeat interval and the balancer iteration sleep time) but not the ReplicationMonitor check interval, so the sleep time is too short for the movements of the last round to get committed. Also, 2*DFS_HEARTBEAT_INTERVAL by itself still seems a little dangerous; maybe change it to 2*DFS_HEARTBEAT_INTERVAL + DFS_NAMENODE_REPLICATION_INTERVAL. Newly moved block replicas are invalidated and deleted -- Key: HDFS-6506 URL: https://issues.apache.org/jira/browse/HDFS-6506 Project: Hadoop HDFS Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang -- This message was sent by Atlassian JIRA (v6.2#6252)
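A sketch of the proposed wait between balancer iterations, assuming the usual DFSConfigKeys constants; where exactly this lands in Balancer is not shown:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;

public class BalancerSleep {
  // Make the inter-iteration sleep long enough for the NameNode's
  // ReplicationMonitor to commit the previous round's block movements:
  // 2 * heartbeat interval + replication check interval (both in seconds).
  public static long sleepMillis(Configuration conf) {
    long heartbeatSec = conf.getLong(
        DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3L);
    long replicationSec = conf.getLong(
        DFSConfigKeys.DFS_NAMENODE_REPLICATION_INTERVAL_KEY, 3L);
    return (2 * heartbeatSec + replicationSec) * 1000L;
  }
}
{code}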
[jira] [Updated] (HDFS-6506) Newly moved block replicas are invalidated and deleted
[ https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang updated HDFS-6506: Attachment: HDFS-6506.v1.patch Newly moved block replicas are invalidated and deleted -- Key: HDFS-6506 URL: https://issues.apache.org/jira/browse/HDFS-6506 Project: Hadoop HDFS Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: HDFS-6506.v1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6506) Newly moved block replicas are invalidated and deleted
[ https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Binglin Chang updated HDFS-6506: Status: Patch Available (was: Open) Newly moved block replicas are invalidated and deleted -- Key: HDFS-6506 URL: https://issues.apache.org/jira/browse/HDFS-6506 Project: Hadoop HDFS Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: HDFS-6506.v1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6507) Improve DFSAdmin to support HA clusters better
[ https://issues.apache.org/jira/browse/HDFS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zesheng Wu updated HDFS-6507: - Description: reformatted to JIRA wiki list markup; the content is otherwise unchanged from the issue description above. Improve DFSAdmin to support HA clusters better
[jira] [Updated] (HDFS-6507) Improve DFSAdmin to support HA clusters better
[ https://issues.apache.org/jira/browse/HDFS-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zesheng Wu updated HDFS-6507: - Description: reverted to plain numbered-list formatting; the content is otherwise unchanged from the issue description above. Improve DFSAdmin to support HA clusters better
[jira] [Commented] (HDFS-6475) WebHdfs clients fail without retry because of incorrect handling of StandbyException
[ https://issues.apache.org/jira/browse/HDFS-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026288#comment-14026288 ] Hadoop QA commented on HDFS-6475: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649538/HDFS-6475.002.patch against trunk revision. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.web.TestWebHDFS {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7071//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7071//console This message is automatically generated. WebHdfs clients fail without retry because of incorrect handling of StandbyException - Key: HDFS-6475 URL: https://issues.apache.org/jira/browse/HDFS-6475 Project: Hadoop HDFS Issue Type: Bug Components: ha, webhdfs Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-6475.001.patch, HDFS-6475.002.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6506) Newly moved block replicas are invalidated and deleted
[ https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026314#comment-14026314 ] Hadoop QA commented on HDFS-6506: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649548/HDFS-6506.v1.patch against trunk revision. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.datanode.TestBPOfferService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7072//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7072//console This message is automatically generated. Newly moved block replicas are invalidated and deleted -- Key: HDFS-6506 URL: https://issues.apache.org/jira/browse/HDFS-6506 Project: Hadoop HDFS Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: HDFS-6506.v1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6257) TestCacheDirectives#testExceedsCapacity fails occasionally
[ https://issues.apache.org/jira/browse/HDFS-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026340#comment-14026340 ] Hudson commented on HDFS-6257: -- FAILURE: Integrated in Hadoop-Yarn-trunk #579 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/579/]) HDFS-6257. TestCacheDirectives#testExceedsCapacity fails occasionally (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601473) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java TestCacheDirectives#testExceedsCapacity fails occasionally -- Key: HDFS-6257 URL: https://issues.apache.org/jira/browse/HDFS-6257 Project: Hadoop HDFS Issue Type: Test Components: caching Affects Versions: 2.4.0 Reporter: Ted Yu Assignee: Colin Patrick McCabe Priority: Minor Fix For: 2.5.0 Attachments: HDFS-6257.001.patch From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1736/ : REGRESSION: org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity
{code}
Error Message:
Namenode should not send extra CACHE commands expected:<0> but was:<2>
Stack Trace:
java.lang.AssertionError: Namenode should not send extra CACHE commands expected:<0> but was:<2>
        at org.junit.Assert.fail(Assert.java:93)
        at org.junit.Assert.failNotEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:128)
        at org.junit.Assert.assertEquals(Assert.java:472)
        at org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity(TestCacheDirectives.java:1419)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6399) Add note about setfacl in HDFS permissions guide
[ https://issues.apache.org/jira/browse/HDFS-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026338#comment-14026338 ] Hudson commented on HDFS-6399: -- FAILURE: Integrated in Hadoop-Yarn-trunk #579 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/579/]) HDFS-6399. Add note about setfacl in HDFS permissions guide. Contributed by Chris Nauroth. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601476) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm Add note about setfacl in HDFS permissions guide Key: HDFS-6399 URL: https://issues.apache.org/jira/browse/HDFS-6399 Project: Hadoop HDFS Issue Type: Bug Components: documentation, namenode Affects Versions: 2.4.0 Reporter: Charles Lamb Assignee: Chris Nauroth Priority: Minor Fix For: 2.5.0 Attachments: HDFS-6399.1.patch, HDFS-6399.2.patch The ACL operations in FSNamesystem don't currently check isPermissionEnabled before calling checkOwner(). This patch corrects that. -- This message was sent by Atlassian JIRA (v6.2#6252)
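The shape of the fix described there, as a hedged sketch; FSNamesystem's actual field and helper names may differ from these:
{code}
// Illustrative only: gate the owner check for ACL operations behind the
// permissions switch, the same pattern other FSNamesystem operations use.
private void checkAclOperationAllowed(FSPermissionChecker pc, String src)
    throws AccessControlException {
  if (isPermissionEnabled) {
    checkOwner(pc, src);
  }
}
{code}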
[jira] [Commented] (HDFS-6460) Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance
[ https://issues.apache.org/jira/browse/HDFS-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026339#comment-14026339 ] Hudson commented on HDFS-6460: -- FAILURE: Integrated in Hadoop-Yarn-trunk #579 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/579/]) HDFS-6460. Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance. Contributed by Yongjun Zhang. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1601535) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopologyWithNodeGroup.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/TestNetworkTopologyWithNodeGroup.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/net/TestNetworkTopology.java Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance --- Key: HDFS-6460 URL: https://issues.apache.org/jira/browse/HDFS-6460 Project: Hadoop HDFS Issue Type: Improvement Reporter: Yongjun Zhang Assignee: Yongjun Zhang Priority: Minor Attachments: HDFS-6460-branch2.001.patch, HDFS-6460.001.patch, HDFS-6460.002.patch Per discussion in HDFS-6268, filing this jira as a follow-up, so that we can improve the sorting result and save a bit runtime. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6508) Add an XAttr to specify the cipher mode
Charles Lamb created HDFS-6508: -- Summary: Add an XAttr to specify the cipher mode Key: HDFS-6508 URL: https://issues.apache.org/jira/browse/HDFS-6508 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode, security Reporter: Charles Lamb Assignee: Charles Lamb We should specify the cipher mode in the xattrs for compatibility's sake. Crypto changes over time and we need to prepare for that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (HDFS-6509) distcp vs Data At Rest Encryption
[ https://issues.apache.org/jira/browse/HDFS-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb reassigned HDFS-6509: -- Assignee: Charles Lamb distcp vs Data At Rest Encryption - Key: HDFS-6509 URL: https://issues.apache.org/jira/browse/HDFS-6509 Project: Hadoop HDFS Issue Type: Sub-task Components: security Reporter: Charles Lamb Assignee: Charles Lamb distcp needs to work with Data At Rest Encryption -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6509) distcp vs Data At Rest Encryption
Charles Lamb created HDFS-6509: -- Summary: distcp vs Data At Rest Encryption Key: HDFS-6509 URL: https://issues.apache.org/jira/browse/HDFS-6509 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Charles Lamb distcp needs to work with Data At Rest Encryption -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6475) WebHdfs clients fail without retry because incorrect handling of StandbyException
[ https://issues.apache.org/jira/browse/HDFS-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026440#comment-14026440 ] Daryn Sharp commented on HDFS-6475: --- I had to hack around this problem in HDFS-6222... I'm a bit uneasy about tying the webhdfs servlet to the IPC server. I'd rather see the logic contained within webhdfs. I think {{UserProvider}} should throw a different exception that its {{ExceptionHandler}} specially knows to unwrap. WebHdfs clients fail without retry because incorrect handling of StandbyException - Key: HDFS-6475 URL: https://issues.apache.org/jira/browse/HDFS-6475 Project: Hadoop HDFS Issue Type: Bug Components: ha, webhdfs Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-6475.001.patch, HDFS-6475.002.patch With WebHdfs clients connected to an HA HDFS service, the delegation token is initially obtained from the active NN. When a client issues a request, the NNs it can contact are stored in a map returned by DFSUtil.getNNServiceRpcAddresses(conf), and the client contacts them in order, so the first one it runs into is likely the standby NN. If the standby NN doesn't have the updated client credential, it throws a SecurityException that wraps a StandbyException. The client is expected to retry the other NN, but due to the insufficient handling of the SecurityException mentioned above, it fails. Example message:
{code}
{RemoteException={message=Failed to obtain user group information: org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException, javaClassName=java.lang.SecurityException, exception=SecurityException}}
org.apache.hadoop.ipc.RemoteException(java.lang.SecurityException): Failed to obtain user group information: org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException
	at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:159)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:325)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$700(WebHdfsFileSystem.java:107)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.getResponse(WebHdfsFileSystem.java:635)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:542)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.run(WebHdfsFileSystem.java:431)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:685)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:696)
	at kclient1.kclient$1.run(kclient.java:64)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:356)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
	at kclient1.kclient.main(kclient.java:58)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
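One possible shape of Daryn's suggestion, as a hedged sketch only: the marker exception class and the unwrap helper below are illustrative names, not taken from any attached patch.
{code}
// Hypothetical marker exception that UserProvider could throw instead of a
// plain SecurityException when the secret manager reports a StandbyException.
class RetriableSecurityException extends RuntimeException {
  RetriableSecurityException(Throwable cause) { super(cause); }
}

// Inside the webhdfs ExceptionHandler: unwrap the marker so the response
// carries the original StandbyException, which clients already know to retry.
Exception unwrap(Exception e) {
  if (e instanceof RetriableSecurityException
      && e.getCause() instanceof Exception) {
    return (Exception) e.getCause();
  }
  return e;
}
{code}
This keeps the retry decision inside webhdfs, which is the containment Daryn asks for above.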
[jira] [Commented] (HDFS-6222) Remove background token renewer from webhdfs
[ https://issues.apache.org/jira/browse/HDFS-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026441#comment-14026441 ] Daryn Sharp commented on HDFS-6222: --- The test failure is unrelated. Remove background token renewer from webhdfs Key: HDFS-6222 URL: https://issues.apache.org/jira/browse/HDFS-6222 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Attachments: HDFS-6222.branch-2.patch, HDFS-6222.branch-2.patch, HDFS-6222.trunk.patch, HDFS-6222.trunk.patch The background token renewer is a source of problems for long-running daemons. Webhdfs should lazy fetch a new token when it receives an InvalidToken exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6460) Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance
[ https://issues.apache.org/jira/browse/HDFS-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026456#comment-14026456 ] Hudson commented on HDFS-6460: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1770 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1770/]) HDFS-6460. Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance. Contributed by Yongjun Zhang. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1601535) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopologyWithNodeGroup.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/TestNetworkTopologyWithNodeGroup.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/net/TestNetworkTopology.java Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance --- Key: HDFS-6460 URL: https://issues.apache.org/jira/browse/HDFS-6460 Project: Hadoop HDFS Issue Type: Improvement Reporter: Yongjun Zhang Assignee: Yongjun Zhang Priority: Minor Attachments: HDFS-6460-branch2.001.patch, HDFS-6460.001.patch, HDFS-6460.002.patch Per discussion in HDFS-6268, filing this jira as a follow-up, so that we can improve the sorting result and save a bit runtime. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6399) Add note about setfacl in HDFS permissions guide
[ https://issues.apache.org/jira/browse/HDFS-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026455#comment-14026455 ] Hudson commented on HDFS-6399: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1770 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1770/]) HDFS-6399. Add note about setfacl in HDFS permissions guide. Contributed by Chris Nauroth. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1601476) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm Add note about setfacl in HDFS permissions guide Key: HDFS-6399 URL: https://issues.apache.org/jira/browse/HDFS-6399 Project: Hadoop HDFS Issue Type: Bug Components: documentation, namenode Affects Versions: 2.4.0 Reporter: Charles Lamb Assignee: Chris Nauroth Priority: Minor Fix For: 2.5.0 Attachments: HDFS-6399.1.patch, HDFS-6399.2.patch The ACL operations in FSNamesystem don't currently check isPermissionEnabled before calling checkOwner(). This patch corrects that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6257) TestCacheDirectives#testExceedsCapacity fails occasionally
[ https://issues.apache.org/jira/browse/HDFS-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026457#comment-14026457 ] Hudson commented on HDFS-6257: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1770 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1770/]) HDFS-6257. TestCacheDirectives#testExceedsCapacity fails occasionally (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601473) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java TestCacheDirectives#testExceedsCapacity fails occasionally -- Key: HDFS-6257 URL: https://issues.apache.org/jira/browse/HDFS-6257 Project: Hadoop HDFS Issue Type: Test Components: caching Affects Versions: 2.4.0 Reporter: Ted Yu Assignee: Colin Patrick McCabe Priority: Minor Fix For: 2.5.0 Attachments: HDFS-6257.001.patch From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1736/ : REGRESSION: org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity
{code}
Error Message:
Namenode should not send extra CACHE commands expected:<0> but was:<2>

Stack Trace:
java.lang.AssertionError: Namenode should not send extra CACHE commands expected:<0> but was:<2>
	at org.junit.Assert.fail(Assert.java:93)
	at org.junit.Assert.failNotEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:128)
	at org.junit.Assert.assertEquals(Assert.java:472)
	at org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity(TestCacheDirectives.java:1419)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6506) Newly moved block replica been invalidated and deleted
[ https://issues.apache.org/jira/browse/HDFS-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026512#comment-14026512 ] Binglin Chang commented on HDFS-6506: - The failed test is unrelated and is tracked in HDFS-3930; a recent build also failed because of it. https://builds.apache.org/job/Hadoop-Hdfs-trunk/1770/consoleText Newly moved block replica been invalidated and deleted -- Key: HDFS-6506 URL: https://issues.apache.org/jira/browse/HDFS-6506 Project: Hadoop HDFS Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: HDFS-6506.v1.patch TestBalancerWithNodeGroup#testBalancerWithNodeGroup has been failing recently https://builds.apache.org/job/PreCommit-HDFS-Build/7045//testReport/ From the error log, the reason seems to be that newly moved block replicas were invalidated and deleted, so some of the balancer's work was reversed.
{noformat}
2014-06-06 18:15:51,681 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741834_1010 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,683 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741833_1009 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,683 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741830_1006 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,683 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741831_1007 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:51,682 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741832_1008 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:54,702 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741827_1003 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:54,702 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741828_1004 with size=100 from 127.0.0.1:49159 to 127.0.0.1:55468 through 127.0.0.1:49159
2014-06-06 18:15:54,701 INFO balancer.Balancer (Balancer.java:dispatch(370)) - Successfully moved blk_1073741829_1005 with size=100 fr
2014-06-06 18:15:54,706 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741833_1009) is added to invalidated blocks set
2014-06-06 18:15:54,709 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741834_1010) is added to invalidated blocks set
2014-06-06 18:15:56,421 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741833_1009, blk_1073741834_1010]
2014-06-06 18:15:57,717 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741832_1008) is added to invalidated blocks set
2014-06-06 18:15:57,720 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741827_1003) is added to invalidated blocks set
2014-06-06 18:15:57,721 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741830_1006) is added to invalidated blocks set
2014-06-06 18:15:57,722 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741831_1007) is added to invalidated blocks set
2014-06-06 18:15:57,723 INFO BlockStateChange (BlockManager.java:chooseExcessReplicates(2711)) - BLOCK* chooseExcessReplicates: (127.0.0.1:55468, blk_1073741829_1005) is added to invalidated blocks set
2014-06-06 18:15:59,422 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741827_1003, blk_1073741829_1005, blk_1073741830_1006, blk_1073741831_1007, blk_1073741832_1008]
2014-06-06 18:16:02,423 INFO BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3242)) - BLOCK* BlockManager: ask 127.0.0.1:55468 to delete [blk_1073741845_1021]
{noformat}
Normally this should not happen: when moving a block from src to dest, the replica on src should be invalidated, not the one on dest, so there must be a bug in the related logic. I don't think
[jira] [Commented] (HDFS-6493) Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond
[ https://issues.apache.org/jira/browse/HDFS-6493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026522#comment-14026522 ] Juan Yu commented on HDFS-6493: --- Thanks for the suggestion. Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond -- Key: HDFS-6493 URL: https://issues.apache.org/jira/browse/HDFS-6493 Project: Hadoop HDFS Issue Type: Bug Reporter: Juan Yu Assignee: Juan Yu Priority: Trivial Based on the discussion in https://issues.apache.org/jira/browse/HDFS-6186, the delay will be at least 30 minutes or even hours. It's not very user friendly to use milliseconds when the delay is likely measured in hours. I suggest making the following changes:
1. Change the unit of this config to seconds.
2. Rename the config key from dfs.namenode.startup.delay.block.deletion.ms to dfs.namenode.startup.delay.block.deletion.sec.
3. Add the default value to hdfs-default.xml. What's a reasonable value, 30 minutes, one hour?
-- This message was sent by Atlassian JIRA (v6.2#6252)
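For illustration, the consuming code could read the proposed seconds-based key and convert once at startup; this is a sketch, the key name follows the proposal above and the one-hour default is a placeholder, not a decided value.
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

// Sketch: read the delay in seconds, keep internal arithmetic in millis.
long getStartupDelayMs(Configuration conf) {
  long delaySec = conf.getLong(
      "dfs.namenode.startup.delay.block.deletion.sec", 3600L /* placeholder */);
  return TimeUnit.SECONDS.toMillis(delaySec);
}
{code}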
[jira] [Commented] (HDFS-6460) Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance
[ https://issues.apache.org/jira/browse/HDFS-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026558#comment-14026558 ] Hudson commented on HDFS-6460: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1797 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1797/]) HDFS-6460. Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance. Contributed by Yongjun Zhang. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1601535) * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopologyWithNodeGroup.java * /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/TestNetworkTopologyWithNodeGroup.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LocatedBlock.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/net/TestNetworkTopology.java Ignore stale and decommissioned nodes in NetworkTopology#sortByDistance --- Key: HDFS-6460 URL: https://issues.apache.org/jira/browse/HDFS-6460 Project: Hadoop HDFS Issue Type: Improvement Reporter: Yongjun Zhang Assignee: Yongjun Zhang Priority: Minor Attachments: HDFS-6460-branch2.001.patch, HDFS-6460.001.patch, HDFS-6460.002.patch Per discussion in HDFS-6268, filing this jira as a follow-up, so that we can improve the sorting result and save a bit runtime. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6493) Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond
[ https://issues.apache.org/jira/browse/HDFS-6493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juan Yu updated HDFS-6493: -- Attachment: HDFS-6493.001.patch Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond -- Key: HDFS-6493 URL: https://issues.apache.org/jira/browse/HDFS-6493 Project: Hadoop HDFS Issue Type: Bug Reporter: Juan Yu Assignee: Juan Yu Priority: Trivial Attachments: HDFS-6493.001.patch Based on the discussion in https://issues.apache.org/jira/browse/HDFS-6186, the delay will be at least 30 minutes or even hours. It's not very user friendly to use milliseconds when the delay is likely measured in hours. I suggest making the following changes:
1. Change the unit of this config to seconds.
2. Rename the config key from dfs.namenode.startup.delay.block.deletion.ms to dfs.namenode.startup.delay.block.deletion.sec.
3. Add the default value to hdfs-default.xml. What's a reasonable value, 30 minutes, one hour?
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6399) Add note about setfacl in HDFS permissions guide
[ https://issues.apache.org/jira/browse/HDFS-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026557#comment-14026557 ] Hudson commented on HDFS-6399: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1797 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1797/]) HDFS-6399. Add note about setfacl in HDFS permissions guide. Contributed by Chris Nauroth. (wang: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1601476) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm Add note about setfacl in HDFS permissions guide Key: HDFS-6399 URL: https://issues.apache.org/jira/browse/HDFS-6399 Project: Hadoop HDFS Issue Type: Bug Components: documentation, namenode Affects Versions: 2.4.0 Reporter: Charles Lamb Assignee: Chris Nauroth Priority: Minor Fix For: 2.5.0 Attachments: HDFS-6399.1.patch, HDFS-6399.2.patch The ACL operations in FSNamesystem don't currently check isPermissionEnabled before calling checkOwner(). This patch corrects that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6257) TestCacheDirectives#testExceedsCapacity fails occasionally
[ https://issues.apache.org/jira/browse/HDFS-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026559#comment-14026559 ] Hudson commented on HDFS-6257: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1797 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1797/]) HDFS-6257. TestCacheDirectives#testExceedsCapacity fails occasionally (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1601473) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java TestCacheDirectives#testExceedsCapacity fails occasionally -- Key: HDFS-6257 URL: https://issues.apache.org/jira/browse/HDFS-6257 Project: Hadoop HDFS Issue Type: Test Components: caching Affects Versions: 2.4.0 Reporter: Ted Yu Assignee: Colin Patrick McCabe Priority: Minor Fix For: 2.5.0 Attachments: HDFS-6257.001.patch From https://builds.apache.org/job/Hadoop-Hdfs-trunk/1736/ : REGRESSION: org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity
{code}
Error Message:
Namenode should not send extra CACHE commands expected:<0> but was:<2>

Stack Trace:
java.lang.AssertionError: Namenode should not send extra CACHE commands expected:<0> but was:<2>
	at org.junit.Assert.fail(Assert.java:93)
	at org.junit.Assert.failNotEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:128)
	at org.junit.Assert.assertEquals(Assert.java:472)
	at org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity(TestCacheDirectives.java:1419)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6508) Add an XAttr to specify the cipher mode
[ https://issues.apache.org/jira/browse/HDFS-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026554#comment-14026554 ] Charles Lamb commented on HDFS-6508: [~tucu00] says: Our current work is implementing standard AES-CTR streaming; this is pluggable to support multiple implementations (pure Java -current-, OpenSSL backed -based on Diceros-, etc). We should store the encryption mode enum, for now simply AES-CTR. To support future impls and backward/forward compatibility we should do something like:
* On create/open RPC request, the client sends the set of supported encryption modes to the NN.
* On create RPC, if the NN does support any of the modes specified by the client, EXCEPTION.
* On open RPC, if the NN determines the client does not support the encryption mode used in the file on encryption, EXCEPTION.
* On create RPC response, the NN sends back encryption initialization data (i.e. key, IV) plus the encryption mode the client must use.
* On open RPC response, the NN sends back encryption initialization data (i.e. key, IV) plus the encryption mode the client must use (the data has been encrypted with this mode).
* The client would have a switch/case statement on the encryption mode to wrap the data streams with the right impl. At the moment there is only one choice.
Note that the implementation we use for a given encryption mode (first paragraph of this email) is independent of the encryption mode selection logic just described. Add an XAttr to specify the cipher mode --- Key: HDFS-6508 URL: https://issues.apache.org/jira/browse/HDFS-6508 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode, security Reporter: Charles Lamb Assignee: Charles Lamb We should specify the cipher mode in the xattrs for compatibility's sake. Crypto changes over time and we need to prepare for that. -- This message was sent by Atlassian JIRA (v6.2#6252)
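A minimal sketch of the client-side dispatch described in the last bullet above; the {{CipherMode}} enum and the CTR stream class are placeholder names for whatever the patches actually introduce, not the real API.
{code}
import java.io.IOException;
import java.io.InputStream;

// Sketch: switch on the mode the NN returned in the open/create response.
// New modes become new cases without breaking files written with old ones.
InputStream wrapDecryption(CipherMode mode, InputStream in,
    byte[] key, byte[] iv) throws IOException {
  switch (mode) {
    case AES_CTR:
      return new AesCtrCryptoInputStream(in, key, iv); // placeholder impl
    default: // a mode added by a future release that this client lacks
      throw new IOException("Unsupported encryption mode: " + mode);
  }
}
{code}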
[jira] [Updated] (HDFS-6493) Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond
[ https://issues.apache.org/jira/browse/HDFS-6493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juan Yu updated HDFS-6493: -- Status: Patch Available (was: In Progress) Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond -- Key: HDFS-6493 URL: https://issues.apache.org/jira/browse/HDFS-6493 Project: Hadoop HDFS Issue Type: Bug Reporter: Juan Yu Assignee: Juan Yu Priority: Trivial Attachments: HDFS-6493.001.patch Based on the discussion in https://issues.apache.org/jira/browse/HDFS-6186, the delay will be at least 30 minutes or even hours. It's not very user friendly to use milliseconds when the delay is likely measured in hours. I suggest making the following changes:
1. Change the unit of this config to seconds.
2. Rename the config key from dfs.namenode.startup.delay.block.deletion.ms to dfs.namenode.startup.delay.block.deletion.sec.
3. Add the default value to hdfs-default.xml. What's a reasonable value, 30 minutes, one hour?
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-742) A down DataNode makes Balancer to hang on repeatingly asking NameNode its partial block list
[ https://issues.apache.org/jira/browse/HDFS-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated HDFS-742: --- Attachment: HDFS-742.patch A down DataNode makes Balancer to hang on repeatingly asking NameNode its partial block list Key: HDFS-742 URL: https://issues.apache.org/jira/browse/HDFS-742 Project: Hadoop HDFS Issue Type: Bug Components: balancer Reporter: Hairong Kuang Assignee: Mit Desai Attachments: HDFS-742.patch We had a balancer that had not made any progress for a long time. It turned out it was repeatedly asking the NameNode for a partial block list of one datanode, which was down while the balancer was running. The NameNode should notify the Balancer that the datanode is not available, and the Balancer should stop asking for that datanode's block list. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-2006) ability to support storing extended attributes per file
[ https://issues.apache.org/jira/browse/HDFS-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uma Maheswara Rao G updated HDFS-2006: -- Attachment: HDFS-2006-Branch-2-Merge.patch Sure. Thanks a lot, Andrew and Chris, for your opinions. I have created a branch-2 merge patch and attached it here for reference. Andrew, please use this patch if you plan to run jenkins on it. I am running tests locally on it and will do some basic testing. Please note that this patch contains only the HDFS-2006 subtasks. The remaining top-level jiras will be merged in their respective jiras so they are tracked correctly. ability to support storing extended attributes per file --- Key: HDFS-2006 URL: https://issues.apache.org/jira/browse/HDFS-2006 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: HDFS XAttrs (HDFS-2006) Reporter: dhruba borthakur Assignee: Yi Liu Fix For: 3.0.0 Attachments: ExtendedAttributes.html, HDFS-2006-Branch-2-Merge.patch, HDFS-2006-Merge-1.patch, HDFS-2006-Merge-2.patch, HDFS-XAttrs-Design-1.pdf, HDFS-XAttrs-Design-2.pdf, HDFS-XAttrs-Design-3.pdf, Test-Plan-for-Extended-Attributes-1.pdf, xattrs.1.patch, xattrs.patch It would be nice if HDFS provides a feature to store extended attributes for files, similar to the one described here: http://en.wikipedia.org/wiki/Extended_file_attributes. The challenge is that it has to be done in such a way that a site not using this feature does not waste precious memory resources in the namenode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6508) Add an XAttr to specify the cipher mode
[ https://issues.apache.org/jira/browse/HDFS-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026592#comment-14026592 ] Charles Lamb commented on HDFS-6508: bq. On create RPC, if the NN does support any of the modes specified by the client, EXCEPTION. s/does support/does not support/ Add an XAttr to specify the cipher mode --- Key: HDFS-6508 URL: https://issues.apache.org/jira/browse/HDFS-6508 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode, security Reporter: Charles Lamb Assignee: Charles Lamb We should specify the cipher mode in the xattrs for compatibility's sake. Crypto changes over time and we need to prepare for that. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-2006) ability to support storing extended attributes per file
[ https://issues.apache.org/jira/browse/HDFS-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026606#comment-14026606 ] Hadoop QA commented on HDFS-2006: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649598/HDFS-2006-Branch-2-Merge.patch against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7074//console
This message is automatically generated. ability to support storing extended attributes per file --- Key: HDFS-2006 URL: https://issues.apache.org/jira/browse/HDFS-2006 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: HDFS XAttrs (HDFS-2006) Reporter: dhruba borthakur Assignee: Yi Liu Fix For: 3.0.0 Attachments: ExtendedAttributes.html, HDFS-2006-Branch-2-Merge.patch, HDFS-2006-Merge-1.patch, HDFS-2006-Merge-2.patch, HDFS-XAttrs-Design-1.pdf, HDFS-XAttrs-Design-2.pdf, HDFS-XAttrs-Design-3.pdf, Test-Plan-for-Extended-Attributes-1.pdf, xattrs.1.patch, xattrs.patch It would be nice if HDFS provides a feature to store extended attributes for files, similar to the one described here: http://en.wikipedia.org/wiki/Extended_file_attributes. The challenge is that it has to be done in such a way that a site not using this feature does not waste precious memory resources in the namenode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-742) A down DataNode makes Balancer to hang on repeatingly asking NameNode its partial block list
[ https://issues.apache.org/jira/browse/HDFS-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026612#comment-14026612 ] Mit Desai commented on HDFS-742: Attaching the patch. Unfortunately I do not have a way to reproduce the issue, so I'm unable to add a test to verify the change. Here is an explanation of the part of the Balancer code that makes it hang forever. In the following while loop in Balancer.java, when the Balancer figures out that it should fetch more blocks, it gets the block list and decrements blocksToReceive by that many blocks. It then starts again from the top of the loop.
{code}
while (!isTimeUp && getScheduledSize() > 0 &&
    (!srcBlockList.isEmpty() || blocksToReceive > 0)) {
  ## SOME LINES OMITTED ##
  filterMovedBlocks(); // filter already moved blocks
  if (shouldFetchMoreBlocks()) {
    // fetch new blocks
    try {
      blocksToReceive -= getBlockList();
      continue;
    } catch (IOException e) {
  ## SOME LINES OMITTED ##
  // check if time is up or not
  if (Time.now() - startTime > MAX_ITERATION_TIME) {
    isTimeUp = true;
    continue;
  }
  ## SOME LINES OMITTED ##
}
{code}
The problem here is that if the datanode is decommissioned, the {{getBlockList()}} method will not return anything and {{blocksToReceive}} will not change. The loop keeps doing this indefinitely, since {{blocksToReceive}} always stays greater than 0, and {{isTimeUp}} is never set to true because that part of the code is never reached. In the submitted patch, the time-up condition is moved to the top of the loop, so the loop checks whether {{isTimeUp}} is set and proceeds only if time is not up. A down DataNode makes Balancer to hang on repeatingly asking NameNode its partial block list Key: HDFS-742 URL: https://issues.apache.org/jira/browse/HDFS-742 Project: Hadoop HDFS Issue Type: Bug Components: balancer Reporter: Hairong Kuang Assignee: Mit Desai Attachments: HDFS-742.patch We had a balancer that had not made any progress for a long time. It turned out it was repeatedly asking the NameNode for a partial block list of one datanode, which was down while the balancer was running. The NameNode should notify the Balancer that the datanode is not available, and the Balancer should stop asking for that datanode's block list. -- This message was sent by Atlassian JIRA (v6.2#6252)
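For readers following along, a rough sketch of the reordering Mit describes, assuming the loop otherwise keeps its shape; this is not the literal patch.
{code}
// Sketch: test the elapsed time before deciding to fetch, so an empty
// getBlockList() response can no longer keep the iteration alive forever.
while (getScheduledSize() > 0 &&
    (!srcBlockList.isEmpty() || blocksToReceive > 0)) {
  if (Time.now() - startTime > MAX_ITERATION_TIME) {
    isTimeUp = true; // give up this iteration; the next one starts fresh
    break;
  }
  filterMovedBlocks(); // filter already moved blocks
  if (shouldFetchMoreBlocks()) {
    blocksToReceive -= getBlockList(); // may subtract 0 for a down DN
    continue;
  }
  // ... dispatch the scheduled moves ...
}
{code}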
[jira] [Commented] (HDFS-6475) WebHdfs clients fail without retry because incorrect handling of StandbyException
[ https://issues.apache.org/jira/browse/HDFS-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026642#comment-14026642 ] Yongjun Zhang commented on HDFS-6475: - Hi [~daryn], Thanks a lot for the review and comments. I attempted earlier to have the UserProvider class throw a different exception, but found that it inherits from classes in the jersey package, whose interface spec we aren't able to change. I could just make a duplicate copy of the IPC server code in the ExceptionHandler area so the two are not tied to each other. I'm open to that. WebHdfs clients fail without retry because incorrect handling of StandbyException - Key: HDFS-6475 URL: https://issues.apache.org/jira/browse/HDFS-6475 Project: Hadoop HDFS Issue Type: Bug Components: ha, webhdfs Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-6475.001.patch, HDFS-6475.002.patch With WebHdfs clients connected to an HA HDFS service, the delegation token is initially obtained from the active NN. When a client issues a request, the NNs it can contact are stored in a map returned by DFSUtil.getNNServiceRpcAddresses(conf), and the client contacts them in order, so the first one it runs into is likely the standby NN. If the standby NN doesn't have the updated client credential, it throws a SecurityException that wraps a StandbyException. The client is expected to retry the other NN, but due to the insufficient handling of the SecurityException mentioned above, it fails. Example message:
{code}
{RemoteException={message=Failed to obtain user group information: org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException, javaClassName=java.lang.SecurityException, exception=SecurityException}}
org.apache.hadoop.ipc.RemoteException(java.lang.SecurityException): Failed to obtain user group information: org.apache.hadoop.security.token.SecretManager$InvalidToken: StandbyException
	at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:159)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:325)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$700(WebHdfsFileSystem.java:107)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.getResponse(WebHdfsFileSystem.java:635)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:542)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.run(WebHdfsFileSystem.java:431)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:685)
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:696)
	at kclient1.kclient$1.run(kclient.java:64)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:356)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528)
	at kclient1.kclient.main(kclient.java:58)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6510) WebHdfs clients clears the delegation token on retry (for HA), thus failing retry requests
Yongjun Zhang created HDFS-6510: --- Summary: WebHdfs clients clears the delegation token on retry (for HA), thus failing retry requests Key: HDFS-6510 URL: https://issues.apache.org/jira/browse/HDFS-6510 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang In WebHdfs clients connected to a HA HDFS service, when a failure (that is inferred as a failover) happens and retry is attempted, the delegation token previously initialized is cleared. For token based auth, this causes all the subsequent retries to fail due to auth errors. See delegationToken = null in method WebHdfsFileSystem.resetStateToFailOver. This issue would not only show up on failover, but happens more commonly when the first NN specified in the config is not reachable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6502) incorrect description in distcp2 document
[ https://issues.apache.org/jira/browse/HDFS-6502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026658#comment-14026658 ] Yongjun Zhang commented on HDFS-6502: - Hi [~ajisakaa], thanks a lot for the quick patch. I will try to review the changes as soon as I can. incorrect description in distcp2 document - Key: HDFS-6502 URL: https://issues.apache.org/jira/browse/HDFS-6502 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 1.2.1, 2.5.0 Reporter: Yongjun Zhang Assignee: Akira AJISAKA Labels: newbie Attachments: HDFS-6502.patch In http://hadoop.apache.org/docs/r1.2.1/distcp2.html#UpdateAndOverwrite The first statement of the Update and Overwrite section says:
{quote}
-update is used to copy files from source that don't exist at the target, or have different contents.
-overwrite overwrites target-files even if they exist at the source, or have the same contents.
{quote}
The Command Line Options table says:
{quote}
-overwrite: Overwrite destination
-update: Overwrite if src size different from dst size
{quote}
Based on the implementation, making the following modifications would be more accurate. The first statement of the Update and Overwrite section:
{code}
-update is used to copy files from source that don't exist at the target, or have different contents.
-overwrite overwrites target-files if they exist at the target.
{code}
The Command Line Options table:
{code}
-overwrite: Overwrite destination
-update: Overwrite destination if source and destination have different contents
{code}
Thanks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6511) BlockManager#computeInvalidateWork() could do nothing
Juan Yu created HDFS-6511: - Summary: BlockManager#computeInvalidateWork() could do nothing Key: HDFS-6511 URL: https://issues.apache.org/jira/browse/HDFS-6511 Project: Hadoop HDFS Issue Type: Improvement Reporter: Juan Yu Assignee: Juan Yu Priority: Minor BlockManager#computeInvalidateWork() uses a for loop to check a certain number of DNs for invalidation work, but it's possible that a DN has nothing to invalidate. computeInvalidateWork() should loop until it has actually issued invalidation work to that many DNs. -- This message was sent by Atlassian JIRA (v6.2#6252)
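A hedged sketch of the improvement described above; the method shape and the {{invalidateWorkForOneNode}} helper are assumptions for illustration, not the posted change.
{code}
import java.util.List;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Sketch: keep scanning datanodes until 'nodesToProcess' of them have
// actually been handed blocks to invalidate; a DN with an empty
// invalidation list no longer consumes one of the loop's trips.
int computeInvalidateWork(List<DatanodeInfo> candidates, int nodesToProcess) {
  int processedNodes = 0;
  int blockCnt = 0;
  for (DatanodeInfo dn : candidates) {
    if (processedNodes >= nodesToProcess) {
      break;
    }
    int invalidated = invalidateWorkForOneNode(dn); // assumed: 0 if nothing pending
    if (invalidated > 0) {
      processedNodes++;
      blockCnt += invalidated;
    }
  }
  return blockCnt;
}
{code}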
[jira] [Commented] (HDFS-6395) Assorted improvements to xattr limit checking
[ https://issues.apache.org/jira/browse/HDFS-6395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026678#comment-14026678 ] Chris Nauroth commented on HDFS-6395: - {code} if (xAttr.getNameSpace() == XAttr.NameSpace.USER || xAttr.getNameSpace() == XAttr.NameSpace.TRUSTED) { {code} Minor nit-pick: this piece of logic is duplicated. It might be worthwhile to put this in a helper method or possibly add an {{isUserVisible}} method on the {{NameSpace}} enum to document the intent. I agree with Andrew's feedback about removing these prints. If it's too tricky to log warnings during loading right now, then let's proceed with the other valuable changes in this patch. We can always revisit the logging later if needed. Thank you for working on this, Yi. Assorted improvements to xattr limit checking - Key: HDFS-6395 URL: https://issues.apache.org/jira/browse/HDFS-6395 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.0.0 Reporter: Andrew Wang Assignee: Yi Liu Attachments: HDFS-6395.patch It'd be nice to print messages during fsimage and editlog loading if we hit either the # of xattrs per inode or the xattr size limits. We should also consider making the # of xattrs limit only apply to the user namespace, or to each namespace separately, to prevent users from locking out access to other namespaces. -- This message was sent by Atlassian JIRA (v6.2#6252)
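Chris's nit could be addressed roughly as follows; putting the method on the enum is his suggestion, while the enum values and body here are a sketch, not the patch.
{code}
public enum NameSpace {
  USER, TRUSTED, SECURITY, SYSTEM;

  /** True for namespaces whose xattrs are user-visible and should count
   *  against user-facing limits. */
  public boolean isUserVisible() {
    return this == USER || this == TRUSTED;
  }
}

// The duplicated condition at the call sites then collapses to:
//   if (xAttr.getNameSpace().isUserVisible()) { ... }
{code}
This also documents the intent in one place, so a future namespace only has to decide once whether it is user-visible.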
[jira] [Commented] (HDFS-6510) WebHdfs clients clears the delegation token on retry (for HA), thus failing retry requests
[ https://issues.apache.org/jira/browse/HDFS-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026694#comment-14026694 ] Haohui Mai commented on HDFS-6510: -- This should be a duplicate of HDFS-6312. WebHdfs clients clears the delegation token on retry (for HA), thus failing retry requests -- Key: HDFS-6510 URL: https://issues.apache.org/jira/browse/HDFS-6510 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang In WebHdfs clients connected to a HA HDFS service, when a failure (that is inferred as a failover) happens and retry is attempted, the delegation token previously initialized is cleared. For token based auth, this causes all the subsequent retries to fail due to auth errors. See delegationToken = null in method WebHdfsFileSystem.resetStateToFailOver. This issue would not only show up on failover, but happens more commonly when the first NN specified in the config is not reachable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6488) HDFS superuser unable to access user's Trash files using NFSv3 mount
[ https://issues.apache.org/jira/browse/HDFS-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026716#comment-14026716 ] Colin Patrick McCabe commented on HDFS-6488: I think the issue is that on your local Linux, the {{hdfs}} user doesn't have any special permissions attached to it. So Linux sees a file owned by a different user ({{schu}}) with mode {{0700}} and thinks that you just don't have permission to read it. I don't know if there is a good resolution for this, since Linux's behavior probably can't be changed. You're basically asking {{schu}} to behave like root inside the NFS mount, but not elsewhere, and that would require kernel changes to implement. Maybe I'm missing something, but I don't see how we can implement that... What's the behavior with the actual {{root}} user? Do we implement {{root_squash}}? HDFS superuser unable to access user's Trash files using NFSv3 mount Key: HDFS-6488 URL: https://issues.apache.org/jira/browse/HDFS-6488 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.3.0 Reporter: Stephen Chu As the hdfs superuser on the NFS mount, I cannot cd or ls the /user/schu/.Trash directory:
{code}
bash-4.1$ cd .Trash/
bash: cd: .Trash/: Permission denied
bash-4.1$ ls -la
total 2
drwxr-xr-x 4 schu 2584148964 128 Jan  7 10:42 .
drwxr-xr-x 4 hdfs 2584148964 128 Jan  6 16:59 ..
drwx------ 2 schu 2584148964  64 Jan  7 10:45 .Trash
drwxr-xr-x 2 hdfs hdfs         64 Jan  7 10:42 tt
bash-4.1$ ls .Trash
ls: cannot open directory .Trash: Permission denied
bash-4.1$
{code}
When using FsShell as the hdfs superuser, I have superuser permissions to schu's .Trash contents:
{code}
bash-4.1$ hdfs dfs -ls -R /user/schu/.Trash
drwx------   - schu supergroup          0 2014-01-07 10:48 /user/schu/.Trash/Current
drwx------   - schu supergroup          0 2014-01-07 10:48 /user/schu/.Trash/Current/user
drwx------   - schu supergroup          0 2014-01-07 10:48 /user/schu/.Trash/Current/user/schu
-rw-r--r--   1 schu supergroup          4 2014-01-07 10:48 /user/schu/.Trash/Current/user/schu/tf1
{code}
The NFSv3 logs don't produce any error when the superuser tries to access schu's Trash contents. However, for other permission errors (e.g. schu tries to delete a directory owned by hdfs), there will be a permission error in the logs. Perhaps this is not specific to the .Trash directory. I created a /user/schu/dir1 which has the same permissions as .Trash (700). When I try cd'ing into the directory from the NFSv3 mount as the hdfs superuser, I get the same permission denied:
{code}
[schu@hdfs-nfs ~]$ hdfs dfs -ls
Found 4 items
drwx------   - schu supergroup          0 2014-01-07 10:57 .Trash
drwx------   - schu supergroup          0 2014-01-07 11:05 dir1
-rw-r--r--   1 schu supergroup          4 2014-01-07 11:05 tf1
drwxr-xr-x   - hdfs hdfs                0 2014-01-07 10:42 tt
bash-4.1$ whoami
hdfs
bash-4.1$ pwd
/hdfs_nfs_mount/user/schu
bash-4.1$ cd dir1
bash: cd: dir1: Permission denied
bash-4.1$
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6391) Get the Key/IV from the NameNode for encrypted files in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026721#comment-14026721 ] Charles Lamb commented on HDFS-6391: This has been subsumed by HDFS-6386. I will close this Jira when HDFS-6386 gets committed. Get the Key/IV from the NameNode for encrypted files in DFSClient - Key: HDFS-6391 URL: https://issues.apache.org/jira/browse/HDFS-6391 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode, security Reporter: Alejandro Abdelnur Assignee: Charles Lamb Attachments: HDFS-6391.1.patch When creating/opening and encrypted file, the DFSClient should get the encryption key material and the IV for the file in the create/open RPC call. HDFS admin users would never get key material/IV on encrypted files create/open. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6474) Namenode needs to get the actual keys and iv from the KeyProvider
[ https://issues.apache.org/jira/browse/HDFS-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026722#comment-14026722 ] Charles Lamb commented on HDFS-6474: This will be subsumed by HDFS-6386. I will close this when HDFS-6386 is committed. Namenode needs to get the actual keys and iv from the KeyProvider - Key: HDFS-6474 URL: https://issues.apache.org/jira/browse/HDFS-6474 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode, security Reporter: Charles Lamb Assignee: Charles Lamb The Namenode has code to connect to the KeyProvider, but it needs to actually get the keys and return them to the client at the right time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6492) Support create-time xattrs and atomically setting multiple xattrs
[ https://issues.apache.org/jira/browse/HDFS-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026736#comment-14026736 ] Colin Patrick McCabe commented on HDFS-6492: Technically it could still be atomic if we create all the separate edit log ops under the FSN lock. The only cases I can see that might be problematic are where the edit log got truncated (how?) or there was corruption. But it's simple enough to add this to the ops, and perhaps it will be helpful when making the FSN lock finer grained... I agree that we don't have to do any client-side API work here, since this is intended for encryption at the moment. Support create-time xattrs and atomically setting multiple xattrs - Key: HDFS-6492 URL: https://issues.apache.org/jira/browse/HDFS-6492 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.0.0 Reporter: Andrew Wang Assignee: Andrew Wang Ongoing work in HDFS-6134 requires being able to set system namespace extended attributes at create and mkdir time, as well as being able to atomically set multiple xattrs at once. There's currently no need to expose this functionality in the client API, so let's not unless we have to. -- This message was sent by Atlassian JIRA (v6.2#6252)
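To illustrate the atomicity argument in Colin's comment: as long as every per-xattr op is applied and logged while the namesystem write lock is held, readers can never observe a partial batch. The method and helper names below are illustrative, not the actual patch.
{code}
// Sketch: apply and log a batch of xattrs under one FSN write lock.
void setXAttrsAtomically(String src, List<XAttr> xAttrs) throws IOException {
  writeLock();
  try {
    for (XAttr a : xAttrs) {
      unprotectedSetXAttr(src, a);      // hypothetical: mutate the namespace
      getEditLog().logSetXAttr(src, a); // one op per xattr, same lock scope
    }
  } finally {
    writeUnlock();
  }
  getEditLog().logSync(); // sync outside the lock
}
{code}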
[jira] [Updated] (HDFS-6439) NFS should not reject NFS requests to the NULL procedure whether port monitoring is enabled or not
[ https://issues.apache.org/jira/browse/HDFS-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brandon Li updated HDFS-6439: - Attachment: HDFS-6439.003.patch NFS should not reject NFS requests to the NULL procedure whether port monitoring is enabled or not -- Key: HDFS-6439 URL: https://issues.apache.org/jira/browse/HDFS-6439 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.4.0 Reporter: Brandon Li Assignee: Aaron T. Myers Attachments: HDFS-6439.003.patch, HDFS-6439.patch, HDFS-6439.patch, linux-nfs-disallow-request-from-nonsecure-port.pcapng, mount-nfs-requests.pcapng As discussed in HDFS-6406, this JIRA is to track the following updates:
1. Port monitoring is the feature name used by traditional NFS servers, and we may want to name the config property (along with the related variable allowInsecurePorts) something like dfs.nfs.port.monitoring.
2. According to RFC 2623 (http://www.rfc-editor.org/rfc/rfc2623.txt):
{quote}
Whether port monitoring is enabled or not, NFS servers SHOULD NOT reject NFS requests to the NULL procedure (procedure number 0). See subsection 2.3.1, NULL procedure for a complete explanation.
{quote}
I do notice that NFS clients (most of the time) send the mount NULL and nfs NULL calls from a non-privileged port. If we deny the NULL call in mountd or the nfs server, the client can't mount the export even as user root.
3. It would be nice to have the user guide updated for the port monitoring feature.
-- This message was sent by Atlassian JIRA (v6.2#6252)
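The RFC 2623 requirement could be honored with a gate along these lines; this is a sketch, and the field and method names are assumptions rather than the attached patch.
{code}
// Sketch: NULL (procedure 0) is always allowed, everything else must come
// from a privileged port (< 1024) when port monitoring is enabled.
boolean allowRequest(int procedure, int remotePort) {
  if (procedure == 0) { // NFSPROC3_NULL and the mount protocol's NULL
    return true;
  }
  return !portMonitoringEnabled || remotePort < 1024;
}
{code}
This matches the observation above: clients commonly send NULL probes from non-privileged ports, so rejecting them breaks mounts even for root.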
[jira] [Commented] (HDFS-6312) WebHdfs HA failover is broken on secure clusters
[ https://issues.apache.org/jira/browse/HDFS-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026764#comment-14026764 ] Yongjun Zhang commented on HDFS-6312: - Hi [~daryn], I filed HDFS-6510 for the same issue and [~wheat9] sent me here (thanks Haohui). I wonder if we can dedicate this jira to this particular issue instead of bundling it with other fixes? Thanks. WebHdfs HA failover is broken on secure clusters Key: HDFS-6312 URL: https://issues.apache.org/jira/browse/HDFS-6312 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Affects Versions: 3.0.0, 2.4.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker When webhdfs does a failover, it blanks out the delegation token. This will cause subsequent operations against the other NN to acquire a new token. Tasks cannot acquire a token (no kerberos credentials) so jobs will fail. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6510) WebHdfs clients clears the delegation token on retry (for HA), thus failing retry requests
[ https://issues.apache.org/jira/browse/HDFS-6510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026763#comment-14026763 ] Yongjun Zhang commented on HDFS-6510: - Thanks [~wheat9]! WebHdfs clients clears the delegation token on retry (for HA), thus failing retry requests -- Key: HDFS-6510 URL: https://issues.apache.org/jira/browse/HDFS-6510 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang In WebHdfs clients connected to a HA HDFS service, when a failure (that is inferred as a failover) happens and retry is attempted, the delegation token previously initialized is cleared. For token based auth, this causes all the subsequent retries to fail due to auth errors. See delegationToken = null in method WebHdfsFileSystem.resetStateToFailOver. This issue would not only show up on failover, but happens more commonly when the first NN specified in the config is not reachable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6482) Use block ID-based block layout on datanodes
[ https://issues.apache.org/jira/browse/HDFS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Thomas updated HDFS-6482: --- Attachment: HDFS-6482.3.patch Made changes suggested by Arpit. I don't think that deletion of empty directories is necessary -- it was not done in the previous scheme and the benefit in terms of faster directory listings and lookups seems marginal (and there is some chance that the directory will be recreated at a later time). I have added a third subdir level (with the 25th to 32nd bits of the block ID) to further reduce the likelihood of directory blowup in large clusters. For a cluster with N blocks (to clarify, this means that N blocks have been created over the lifetime of the cluster, but some may have been deleted), the upper bound on the number of files in any DN directory is now N/2^24, so even for clusters with 2^30 (~1 billion) blocks created over their lifetimes we should have fairly small directories. I don't think there's any need to implement further logic to prevent a directory from exceeding 256 entries, since this can't happen anyway with clusters with fewer than 2^32 blocks created, and even then the probability is very small. Use block ID-based block layout on datanodes Key: HDFS-6482 URL: https://issues.apache.org/jira/browse/HDFS-6482 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.0 Reporter: James Thomas Assignee: James Thomas Attachments: HDFS-6482.1.patch, HDFS-6482.2.patch, HDFS-6482.3.patch, HDFS-6482.patch Right now blocks are placed into directories that are split into many subdirectories when capacity is reached. Instead we can use a block's ID to determine the path it should go in. This eliminates the need for the LDir data structure that facilitates the splitting of directories when they reach capacity as well as fields in ReplicaInfo that keep track of a replica's location. An extension of the work in HDFS-3290. -- This message was sent by Atlassian JIRA (v6.2#6252)
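To make the arithmetic in James's comment concrete, here is a hedged sketch of ID-based placement with three 8-bit levels matching the bit ranges he describes; the method name, directory naming, and exact bit layout of the real patch may differ.
{code}
import java.io.File;

// Sketch: bits 9-16, 17-24, and 25-32 of the block ID each select one of
// 256 subdirectories, so a replica's path is a pure function of its ID and
// no directory-splitting bookkeeping (LDir) is needed.
static File idToBlockDir(File finalizedDir, long blockId) {
  int d1 = (int) ((blockId >> 8)  & 0xFF);
  int d2 = (int) ((blockId >> 16) & 0xFF);
  int d3 = (int) ((blockId >> 24) & 0xFF);
  return new File(finalizedDir,
      "subdir" + d1 + File.separator + "subdir" + d2
      + File.separator + "subdir" + d3);
}
{code}
Under this layout two IDs share a leaf directory only if bits 9 through 32 match, which gives the N/2^24 expected files per directory cited above for a cluster that has created N blocks.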
[jira] [Commented] (HDFS-6493) Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond
[ https://issues.apache.org/jira/browse/HDFS-6493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026774#comment-14026774 ] Hadoop QA commented on HDFS-6493: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649593/HDFS-6493.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7073//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7073//console This message is automatically generated. Propose to change dfs.namenode.startup.delay.block.deletion to second instead of millisecond -- Key: HDFS-6493 URL: https://issues.apache.org/jira/browse/HDFS-6493 Project: Hadoop HDFS Issue Type: Bug Reporter: Juan Yu Assignee: Juan Yu Priority: Trivial Attachments: HDFS-6493.001.patch Based on the discussion in https://issues.apache.org/jira/browse/HDFS-6186, the delay will be at least 30 minutes or even hours. It's not very user-friendly to use milliseconds when the delay is likely measured in hours. I suggest making the following changes: 1. change the unit of this config to seconds 2. rename the config key from dfs.namenode.startup.delay.block.deletion.ms to dfs.namenode.startup.delay.block.deletion.sec 3. add the default value to hdfs-default.xml; what's a reasonable value, 30 minutes, one hour? -- This message was sent by Atlassian JIRA (v6.2#6252)
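For illustration, the proposed hdfs-default.xml entry could look like the following; the key name is the one suggested above, while the one-hour default is just one of the values floated in the discussion, not a decided one:
{code}
<property>
  <name>dfs.namenode.startup.delay.block.deletion.sec</name>
  <value>3600</value>
  <description>Delay, in seconds, after NameNode startup before pending
  block deletions are processed. Hypothetical default of one hour.</description>
</property>
{code}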
[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL
[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026772#comment-14026772 ] Colin Patrick McCabe commented on HDFS-6382: bq. You mean that we scan the whole namespace at first and then split it into 5 pieces according to hash of the path, why do we just complete the work during the first scanning process? If I misunderstand your meaning, please point out. You need to make one RPC for each file or directory you delete. In contrast, when listing a directory you make only one RPC for every {{dfs.ls.limit}} elements (by default 1000). So if you have 5 workers all listing all directories, but only calling delete on some of the files, you still might come out ahead in terms of number of RPCs, provided you had a high ratio of files to directories. There are other ways to partition the namespace which are smarter, but they rely on some knowledge of what is in it, which you'd have to keep track of. A single-node design will work for now, though, considering that you probably want rate-limiting anyway. bq. For the simplicity purpose, in the initial version, we will use logs to record which file/directory is deleted by TTL, and errors during the deleting process. Even if it's not implemented at first, we should think about the configuration required here. I think we want the ability to email the admins when things go wrong. Possibly the notifier could be pluggable or have several policies. There was nothing in the doc about configuration in general, which I think we need to fix. For example, how is rate limiting configurable? How do we notify admins that the rate is too slow to finish in the time given? bq. It doesn't need to be an administrator command, user only can setTtl on file/directory that they have write permission, and can getTtl on file/directory that they have read permission. You can't delete a file in HDFS unless you have write permission on the containing directory. Whether you have write permission on the file itself is not relevant. So I would expect the same semantics here (probably enforced by setfacl itself). HDFS File/Directory TTL --- Key: HDFS-6382 URL: https://issues.apache.org/jira/browse/HDFS-6382 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client, namenode Affects Versions: 2.4.0 Reporter: Zesheng Wu Assignee: Zesheng Wu Attachments: HDFS-TTL-Design.pdf In production environments we often have a scenario like this: we want to back up files on hdfs for some time and then delete them automatically. For example, we keep only 1 day's logs on local disk due to limited disk space, but we need to keep about 1 month's logs in order to debug program bugs, so we keep all the logs on hdfs and delete logs that are older than 1 month. This is a typical scenario for HDFS TTL. So here we propose that hdfs support TTL. Following are some details of this proposal: 1. HDFS can support TTL on a specified file or directory 2. If a TTL is set on a file, the file will be deleted automatically after the TTL has expired 3. If a TTL is set on a directory, the child files and directories will be deleted automatically after the TTL has expired 4. The child file/directory's TTL configuration should override its parent directory's 5. A global configuration is needed to control whether the deleted files/directories should go to the trash or not 6. A global configuration is needed to control whether a directory with a TTL should be deleted when it is emptied by the TTL mechanism or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
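To illustrate the RPC accounting in the comment above, a sketch of a single scanner using only the stock FileSystem API; how the TTL itself is read (an xattr in the proposal) is stubbed out as a fixed expiry, and rate limiting is left as a comment:
{code}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TtlScanner {
  // Listing costs one getListing RPC per dfs.ls.limit (default 1000) entries,
  // while every expired file costs its own delete RPC -- hence the ratio
  // argument above.
  public static void scan(FileSystem fs, Path dir, long ttlMs) throws Exception {
    long now = System.currentTimeMillis();
    for (FileStatus st : fs.listStatus(dir)) {
      if (st.isDirectory()) {
        scan(fs, st.getPath(), ttlMs);           // depth-first over the tree
      } else if (now - st.getModificationTime() > ttlMs) {
        fs.delete(st.getPath(), false);          // candidate for rate limiting
      }
    }
  }
}
{code}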
[jira] [Updated] (HDFS-6364) Incorrect check for unknown datanode in Balancer
[ https://issues.apache.org/jira/browse/HDFS-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated HDFS-6364: --- Attachment: HDFS-6364.patch Attaching the newer patch. Incorrect check for unknown datanode in Balancer Key: HDFS-6364 URL: https://issues.apache.org/jira/browse/HDFS-6364 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.4.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: HDFS-6364-6441.patch, HDFS-6364.patch, HDFS-6364.patch The Balancer makes a check to see if a block's location is a known datanode, but the variable it uses to check is wrong. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6488) HDFS superuser unable to access user's Trash files using NFSv3 mount
[ https://issues.apache.org/jira/browse/HDFS-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026781#comment-14026781 ] Colin Patrick McCabe commented on HDFS-6488: HDFS-6498 might offer a way to do this HDFS superuser unable to access user's Trash files using NFSv3 mount Key: HDFS-6488 URL: https://issues.apache.org/jira/browse/HDFS-6488 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.3.0 Reporter: Stephen Chu As the hdfs superuser on the NFS mount, I cannot cd or ls the /user/schu/.Trash directory: {code} bash-4.1$ cd .Trash/ bash: cd: .Trash/: Permission denied bash-4.1$ ls -la total 2 drwxr-xr-x 4 schu 2584148964 128 Jan 7 10:42 . drwxr-xr-x 4 hdfs 2584148964 128 Jan 6 16:59 .. drwx------ 2 schu 2584148964 64 Jan 7 10:45 .Trash drwxr-xr-x 2 hdfs hdfs 64 Jan 7 10:42 tt bash-4.1$ ls .Trash ls: cannot open directory .Trash: Permission denied bash-4.1$ {code} When using FsShell as the hdfs superuser, I have superuser permissions to schu's .Trash contents: {code} bash-4.1$ hdfs dfs -ls -R /user/schu/.Trash drwx------ - schu supergroup 0 2014-01-07 10:48 /user/schu/.Trash/Current drwx------ - schu supergroup 0 2014-01-07 10:48 /user/schu/.Trash/Current/user drwx------ - schu supergroup 0 2014-01-07 10:48 /user/schu/.Trash/Current/user/schu -rw-r--r-- 1 schu supergroup 4 2014-01-07 10:48 /user/schu/.Trash/Current/user/schu/tf1 {code} The NFSv3 logs don't produce any error when the superuser tries to access schu's Trash contents. However, for other permission errors (e.g. schu tries to delete a directory owned by hdfs), there will be a permission error in the logs. I think this is perhaps not specific to the .Trash directory. I created a /user/schu/dir1 which has the same permissions as .Trash (700). When I try cd'ing into the directory from the NFSv3 mount as the hdfs superuser, I get the same permission denied. {code} [schu@hdfs-nfs ~]$ hdfs dfs -ls Found 4 items drwx------ - schu supergroup 0 2014-01-07 10:57 .Trash drwx------ - schu supergroup 0 2014-01-07 11:05 dir1 -rw-r--r-- 1 schu supergroup 4 2014-01-07 11:05 tf1 drwxr-xr-x - hdfs hdfs 0 2014-01-07 10:42 tt bash-4.1$ whoami hdfs bash-4.1$ pwd /hdfs_nfs_mount/user/schu bash-4.1$ cd dir1 bash: cd: dir1: Permission denied bash-4.1$ {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6439) NFS should not reject NFS requests to the NULL procedure whether port monitoring is enabled or not
[ https://issues.apache.org/jira/browse/HDFS-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026784#comment-14026784 ] Brandon Li commented on HDFS-6439: -- Thank you, [~atm]. I've rebased your patch and uploaded a new one. Basically, the new patch does the port monitoring in each NFS handler. If we deny the request at the RPC level, some NFS clients might keep sending the same NFS request (e.g., GETATTR). For mountd, it only does the check for the MNT request, since some utilities (e.g., showmount) send the EXPORT request from a non-privileged port, which we don't want to fail. I also used the opportunity to do a cleanup of the NFS3Interface. Port monitoring is disabled by default to make the gateway easier to use, especially for Windows/MacOS NFS clients and developers. Please review. NFS should not reject NFS requests to the NULL procedure whether port monitoring is enabled or not -- Key: HDFS-6439 URL: https://issues.apache.org/jira/browse/HDFS-6439 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.4.0 Reporter: Brandon Li Assignee: Aaron T. Myers Attachments: HDFS-6439.003.patch, HDFS-6439.patch, HDFS-6439.patch, linux-nfs-disallow-request-from-nonsecure-port.pcapng, mount-nfs-requests.pcapng As discussed in HDFS-6406, this JIRA is to track the following updates: 1. Port monitoring is the feature name used by traditional NFS servers, and we may want to make the config property (along with the related variable allowInsecurePorts) something like dfs.nfs.port.monitoring. 2. According to RFC2623 (http://www.rfc-editor.org/rfc/rfc2623.txt): {quote}Whether port monitoring is enabled or not, NFS servers SHOULD NOT reject NFS requests to the NULL procedure (procedure number 0). See subsection 2.3.1, NULL procedure for a complete explanation. {quote} I do notice that NFS clients (most of the time) send mount NULL and nfs NULL from a non-privileged port. If we deny the NULL call in mountd or the nfs server, the client can't mount the export even as user root. 3. It would be nice to have the user guide updated for the port monitoring feature. -- This message was sent by Atlassian JIRA (v6.2#6252)
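The handler-level behavior described above can be pictured as follows; the names are made up for the sketch (the real patch touches the individual NFS3 and mountd handlers), but the decision logic matches the RFC 2623 guidance quoted in the description:
{code}
// NULL (procedure 0) is always answered; other procedures are rejected only
// when port monitoring is on and the request came from a non-privileged port.
boolean shouldAllow(int procedure, int remotePort, boolean portMonitoring) {
  if (procedure == 0) {   // NFS/mount NULL: RFC 2623 says never reject
    return true;
  }
  return !portMonitoring || remotePort < 1024;
}
{code}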
[jira] [Commented] (HDFS-6364) Incorrect check for unknown datanode in Balancer
[ https://issues.apache.org/jira/browse/HDFS-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026788#comment-14026788 ] Arpit Agarwal commented on HDFS-6364: - +1 pending Jenkins. Per discussion with Benoy, he will include the test case in a separate Jira due to the dependence on HDFS-6441. Incorrect check for unknown datanode in Balancer Key: HDFS-6364 URL: https://issues.apache.org/jira/browse/HDFS-6364 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.4.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: HDFS-6364-6441.patch, HDFS-6364.patch, HDFS-6364.patch The Balancer makes a check to see if a block's location is a known datanode, but the variable it uses to check is wrong. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (HDFS-6349) Bound checks missing in FSEditLogOp.AclEditLogUtil
[ https://issues.apache.org/jira/browse/HDFS-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-6349. - Resolution: Duplicate Hi, [~kihwal]. This same issue is tracked in HDFS-5995, so I'm resolving HDFS-6349 as a duplicate. There is a lot of discussion and a proposed patch on the other issue, but no concrete resolution yet. Bound checks missing in FSEditLogOp.AclEditLogUtil -- Key: HDFS-6349 URL: https://issues.apache.org/jira/browse/HDFS-6349 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.5.0 Reporter: Kihwal Lee AclEditLogUtil.read() can throw an OutOfMemoryError when it encounters a certain corrupt entry. This is because it reads the size and blindly tries to allocate an ArrayList of that size. Because of this, a test case always dumps the heap. The test doesn't fail since edit log loading catches and handles the error. {panel} Running org.apache.hadoop.hdfs.server.namenode.TestNameNodeRecovery java.lang.OutOfMemoryError: Java heap space Dumping heap to java_pid18667.hprof ... Heap dump file created \[43870820 bytes in 0.154 secs\] {panel} -- This message was sent by Atlassian JIRA (v6.2#6252)
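The kind of bound check being asked for is small; a sketch (names illustrative, not the actual AclEditLogUtil code):
{code}
int size = in.readInt();                  // entry count read from the edit log
if (size < 0 || size > MAX_ACL_ENTRIES) { // MAX_ACL_ENTRIES: some sane cap
  throw new IOException("Corrupt edit log: bad ACL entry count " + size);
}
List<AclEntry> entries = new ArrayList<AclEntry>(size); // now safe to allocate
{code}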
[jira] [Commented] (HDFS-6315) Decouple recording edit logs from FSDirectory
[ https://issues.apache.org/jira/browse/HDFS-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026835#comment-14026835 ] Haohui Mai commented on HDFS-6315: -- [~daryn], any updates? The patch is ready to go in. There are several patches depending on this one, so I'd like to keep moving forward. Thanks. Decouple recording edit logs from FSDirectory - Key: HDFS-6315 URL: https://issues.apache.org/jira/browse/HDFS-6315 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Haohui Mai Assignee: Haohui Mai Attachments: HDFS-6315.000.patch, HDFS-6315.001.patch, HDFS-6315.002.patch, HDFS-6315.003.patch, HDFS-6315.004.patch, HDFS-6315.005.patch Currently both FSNamesystem and FSDirectory record edit logs. This design requires FSNamesystem and FSDirectory to be tightly coupled together to implement a durable namespace. This jira proposes to separate the responsibility of implementing the namespace and providing durability with edit logs. Specifically, FSDirectory implements the namespace (which should have no edit log operations), and FSNamesystem implements durability by recording the edit logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
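The shape of the proposed split, sketched with a mkdir-like operation (the method names here are plausible but illustrative, not lifted from the patch):
{code}
// FSNamesystem: sequences the namespace mutation and its durability.
void mkdirsInternal(String src, PermissionStatus perms) throws IOException {
  // FSDirectory now only mutates the in-memory tree and logs nothing:
  INodeDirectory added = dir.unprotectedMkdir(src, perms);
  // Recording the edit happens at this layer instead:
  getEditLog().logMkDir(src, added);
}
{code}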
[jira] [Created] (HDFS-6512) Unnecessary synchronization on collection used for test
Benoy Antony created HDFS-6512: -- Summary: Unnecessary synchronization on collection used for test Key: HDFS-6512 URL: https://issues.apache.org/jira/browse/HDFS-6512 Project: Hadoop HDFS Issue Type: Bug Components: security Reporter: Benoy Antony Assignee: Benoy Antony Priority: Minor The function _SecurityUtil.getKerberosInfo()_ is used during authentication and authorization. It has two synchronized blocks, and one of them is on testProviders. This is an unnecessary lock, given that testProviders is empty in real scenarios. -- This message was sent by Atlassian JIRA (v6.2#6252)
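The shape of the hot path being described, sketched from the report (bodies elided; illustrative only):
{code}
// Called on every authenticated RPC.
static KerberosInfo getKerberosInfo(Class<?> protocol, Configuration conf) {
  synchronized (testProviders) {          // taken on every call, yet the
    // scan testProviders ...             // list is empty outside unit tests
  }
  synchronized (securityInfoProviders) {  // the lookup that matters in prod
    // scan securityInfoProviders ...
  }
  return null;
}
{code}
One cheap direction for a fix is to skip the first block when the list is empty, or to back both lists with a CopyOnWriteArrayList so iteration needs no lock at all.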
[jira] [Commented] (HDFS-6323) Compress files in transit for fs -get and -put operations
[ https://issues.apache.org/jira/browse/HDFS-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026863#comment-14026863 ] Tom Panning commented on HDFS-6323: --- Hi Andrew, It could be related to those two issues, depending on how they are implemented. If files are compressed transparently, and they remain compressed until the -get and -put commands uncompress them on the local machine, that would solve my problem. But if the files are transparently uncompressed as they are read off the HDFS disk, then that wouldn't. Allowing webhdfs to use compression would also solve my problem. Compress files in transit for fs -get and -put operations - Key: HDFS-6323 URL: https://issues.apache.org/jira/browse/HDFS-6323 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client Reporter: Tom Panning Priority: Minor For the {{hadoop fs -get}} and {{hadoop fs -put}} commands, it would be nice if there were an option to compress the file(s) in transit. For some people, the Hadoop cluster is far away (in terms of the network) or must be accessed through a VPN, and many files that are put on or retrieved from the cluster are very large compared to the available bandwidth. -- This message was sent by Atlassian JIRA (v6.2#6252)
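What the request would automate can already be approximated in user code, at the cost of the file landing (and staying) gzip-compressed on HDFS rather than being handled transparently; a sketch:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class CompressedPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    try (InputStream in = new FileInputStream(args[0]);
         OutputStream out = codec.createOutputStream(
             fs.create(new Path(args[1])))) {   // e.g. /logs/big.log.gz
      IOUtils.copyBytes(in, out, conf, false);  // only gzip bytes hit the wire
    }
  }
}
{code}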
[jira] [Updated] (HDFS-6482) Use block ID-based block layout on datanodes
[ https://issues.apache.org/jira/browse/HDFS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Thomas updated HDFS-6482: --- Release Note: The directory structure for finalized replicas on DNs has been changed. Now, the directory that a finalized replica goes in is determined uniquely by its ID. Specifically, we use a three-level directory structure, with the 32nd through 25th bits of a block ID identifying the correct directory at the first level, the 24th through 17th bits identifying the correct directory at the second level, and the 16th through 9th bits identifying the correct directory at the third level. Status: Patch Available (was: Open) Use block ID-based block layout on datanodes Key: HDFS-6482 URL: https://issues.apache.org/jira/browse/HDFS-6482 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.0 Reporter: James Thomas Assignee: James Thomas Attachments: HDFS-6482.1.patch, HDFS-6482.2.patch, HDFS-6482.3.patch, HDFS-6482.patch Right now blocks are placed into directories that are split into many subdirectories when capacity is reached. Instead we can use a block's ID to determine the path it should go in. This eliminates the need for the LDir data structure that facilitates the splitting of directories when they reach capacity as well as fields in ReplicaInfo that keep track of a replica's location. An extension of the work in HDFS-3290. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6364) Incorrect check for unknown datanode in Balancer
[ https://issues.apache.org/jira/browse/HDFS-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026985#comment-14026985 ] Hadoop QA commented on HDFS-6364: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649629/HDFS-6364.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7076//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7076//console This message is automatically generated. Incorrect check for unknown datanode in Balancer Key: HDFS-6364 URL: https://issues.apache.org/jira/browse/HDFS-6364 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.4.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: HDFS-6364-6441.patch, HDFS-6364.patch, HDFS-6364.patch The Balancer makes a check to see if a block's location is a known datanode, but the variable it uses to check is wrong. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6439) NFS should not reject NFS requests to the NULL procedure whether port monitoring is enabled or not
[ https://issues.apache.org/jira/browse/HDFS-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027018#comment-14027018 ] Hadoop QA commented on HDFS-6439: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649625/HDFS-6439.003.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-nfs hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-nfs: org.apache.hadoop.fs.TestHdfsNativeCodeLoader The following test timeouts occurred in hadoop-common-project/hadoop-nfs hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-nfs: org.apache.hadoop.oncrpc.TestFrameDecoder {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7075//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7075//console This message is automatically generated. NFS should not reject NFS requests to the NULL procedure whether port monitoring is enabled or not -- Key: HDFS-6439 URL: https://issues.apache.org/jira/browse/HDFS-6439 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.4.0 Reporter: Brandon Li Assignee: Aaron T. Myers Attachments: HDFS-6439.003.patch, HDFS-6439.patch, HDFS-6439.patch, linux-nfs-disallow-request-from-nonsecure-port.pcapng, mount-nfs-requests.pcapng As discussed in HDFS-6406, this JIRA is to track the following updates: 1. Port monitoring is the feature name used by traditional NFS servers, and we may want to make the config property (along with the related variable allowInsecurePorts) something like dfs.nfs.port.monitoring. 2. According to RFC2623 (http://www.rfc-editor.org/rfc/rfc2623.txt): {quote}Whether port monitoring is enabled or not, NFS servers SHOULD NOT reject NFS requests to the NULL procedure (procedure number 0). See subsection 2.3.1, NULL procedure for a complete explanation. {quote} I do notice that NFS clients (most of the time) send mount NULL and nfs NULL from a non-privileged port. If we deny the NULL call in mountd or the nfs server, the client can't mount the export even as user root. 3. It would be nice to have the user guide updated for the port monitoring feature. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6364) Incorrect check for unknown datanode in Balancer
[ https://issues.apache.org/jira/browse/HDFS-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-6364: Resolution: Fixed Fix Version/s: 2.5.0 3.0.0 Target Version/s: 2.5.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I committed this to trunk and branch-2. Thanks for finding and fixing this [~benoyantony]! Incorrect check for unknown datanode in Balancer Key: HDFS-6364 URL: https://issues.apache.org/jira/browse/HDFS-6364 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.4.0 Reporter: Benoy Antony Assignee: Benoy Antony Fix For: 3.0.0, 2.5.0 Attachments: HDFS-6364-6441.patch, HDFS-6364.patch, HDFS-6364.patch The Balancer makes a check to see if a block's location is a known datanode, but the variable it uses to check is wrong. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6364) Incorrect check for unknown datanode in Balancer
[ https://issues.apache.org/jira/browse/HDFS-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027037#comment-14027037 ] Hudson commented on HDFS-6364: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5678 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5678/]) HDFS-6364. Incorrect check for unknown datanode in Balancer. (Contributed by Benoy Antony) (arp: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1601771) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java Incorrect check for unknown datanode in Balancer Key: HDFS-6364 URL: https://issues.apache.org/jira/browse/HDFS-6364 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.4.0 Reporter: Benoy Antony Assignee: Benoy Antony Fix For: 3.0.0, 2.5.0 Attachments: HDFS-6364-6441.patch, HDFS-6364.patch, HDFS-6364.patch The Balancer makes a check to see if a block's location is a known datanode, but the variable it uses to check is wrong. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-5995) TestFSEditLogLoader#testValidateEditLogWithCorruptBody gets OutOfMemoryError and dumps heap.
[ https://issues.apache.org/jira/browse/HDFS-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027038#comment-14027038 ] Hadoop QA commented on HDFS-5995: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12630406/HDFS-5995.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7077//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7077//console This message is automatically generated. TestFSEditLogLoader#testValidateEditLogWithCorruptBody gets OutOfMemoryError and dumps heap. Key: HDFS-5995 URL: https://issues.apache.org/jira/browse/HDFS-5995 Project: Hadoop HDFS Issue Type: Test Components: namenode, test Affects Versions: 2.4.0 Reporter: Chris Nauroth Assignee: Chris Nauroth Priority: Minor Attachments: HDFS-5995.1.patch {{TestFSEditLogLoader#testValidateEditLogWithCorruptBody}} is experiencing {{OutOfMemoryError}} and dumping heap since the merge of HDFS-4685. This doesn't actually cause the test to fail, because it's a failure test that corrupts an edit log intentionally. Still, this might cause confusion if someone reviews the build logs and thinks this is a more serious problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6315) Decouple recording edit logs from FSDirectory
[ https://issues.apache.org/jira/browse/HDFS-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027172#comment-14027172 ] Jing Zhao commented on HDFS-6315: - +1 for the latest patch. Any other comments [~daryn]? Decouple recording edit logs from FSDirectory - Key: HDFS-6315 URL: https://issues.apache.org/jira/browse/HDFS-6315 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Haohui Mai Assignee: Haohui Mai Attachments: HDFS-6315.000.patch, HDFS-6315.001.patch, HDFS-6315.002.patch, HDFS-6315.003.patch, HDFS-6315.004.patch, HDFS-6315.005.patch Currently both FSNamesystem and FSDirectory record edit logs. This design requires FSNamesystem and FSDirectory to be tightly coupled together to implement a durable namespace. This jira proposes to separate the responsibility of implementing the namespace and providing durability with edit logs. Specifically, FSDirectory implements the namespace (which should have no edit log operations), and FSNamesystem implements durability by recording the edit logs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6469) Coordinated replication of the namespace using ConsensusNode
[ https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027179#comment-14027179 ] Konstantin Shvachko commented on HDFS-6469: --- h4. Coordinated reads The motivation for a configurable design based on file names is to make coordinated reads available to other applications without changing them. An alternative is to provide a new option (parameter) for read operations specifying whether the read should be coordinated or not. Then the application developers rather than administrators will be in full control of what they coordinate. * Journals everywhere* If I follow your logic correctly Coordinated replication of the namespace using ConsensusNode Key: HDFS-6469 URL: https://issues.apache.org/jira/browse/HDFS-6469 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Affects Versions: 3.0.0 Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko Attachments: CNodeDesign.pdf This is a proposal to introduce ConsensusNode - an evolution of the NameNode, which enables replication of the namespace on multiple nodes of an HDFS cluster by means of a Coordination Engine. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (HDFS-6469) Coordinated replication of the namespace using ConsensusNode
[ https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027179#comment-14027179 ] Konstantin Shvachko edited comment on HDFS-6469 at 6/10/14 11:04 PM: - Sorry hit the submit button too early. Will repost shortly. was (Author: shv): h4. Coordinated reads The motivation for a configurable design based on file names is to make coordinated reads available to other applications without changing them. An alternative is to provide a new option (parameter) for read operations specifying whether the read should be coordinated or not. Then the application developers rather than administrators will be in full control of what they coordinate. * Journals everywhere* If I follow your logic correctly Coordinated replication of the namespace using ConsensusNode Key: HDFS-6469 URL: https://issues.apache.org/jira/browse/HDFS-6469 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Affects Versions: 3.0.0 Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko Attachments: CNodeDesign.pdf This is a proposal to introduce ConsensusNode - an evolution of the NameNode, which enables replication of the namespace on multiple nodes of an HDFS cluster by means of a Coordination Engine. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6386) HDFS Encryption Zones
[ https://issues.apache.org/jira/browse/HDFS-6386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-6386: --- Status: Patch Available (was: Reopened) The .4 patch implements encryption zones on the server side. Included in this are (1) setting the xattr for an EZ, validating that an EZ being created or deleted is empty, exists, and is the root of an EZ, (2) setting the appropriate xattr for any files created within an EZ, (3) on the client side, determining if a file refers to an encrypted file and if so, setting up the right Crypto{Input,Output}Streams for encrypting/decrypting the data, (4) removing the earlier (temporary) KEY and IV constants, and (5) adding several unit tests for the above. This patch allows us to demonstrate end-to-end encryption. HDFS Encryption Zones - Key: HDFS-6386 URL: https://issues.apache.org/jira/browse/HDFS-6386 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode, security Reporter: Alejandro Abdelnur Assignee: Charles Lamb Fix For: fs-encryption (HADOOP-10150 and HDFS-6134) Define the required security xAttributes for directories and files within an encryption zone and how they propagate to children. Implement the logic to create/delete encryption zones. -- This message was sent by Atlassian JIRA (v6.2#6252)
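The client-side idea in (3) can be illustrated with the stock JCE streams; this is not the patch's Crypto{Input,Output}Stream classes, just a minimal stand-in showing the wrapping:
{code}
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.OutputStream;

// Wrap the raw stream so bytes are AES/CTR-encrypted with the zone key
// before they leave the client.
OutputStream encrypting(OutputStream raw, byte[] key, byte[] iv)
    throws Exception {
  Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
      new IvParameterSpec(iv));
  return new CipherOutputStream(raw, cipher);
}
{code}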
[jira] [Updated] (HDFS-6379) HTTPFS - Implement ACLs support
[ https://issues.apache.org/jira/browse/HDFS-6379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Yoder updated HDFS-6379: - Status: In Progress (was: Patch Available) HTTPFS - Implement ACLs support --- Key: HDFS-6379 URL: https://issues.apache.org/jira/browse/HDFS-6379 Project: Hadoop HDFS Issue Type: Bug Reporter: Alejandro Abdelnur Assignee: Mike Yoder Fix For: 2.4.0 Attachments: jira-HDFS-6379.patch HDFS-4685 added ACLs support to WebHDFS but missed adding them to HttpFS. This JIRA is for such. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6379) HTTPFS - Implement ACLs support
[ https://issues.apache.org/jira/browse/HDFS-6379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Yoder updated HDFS-6379: - Attachment: (was: jira-HDFS-6379.patch) HTTPFS - Implement ACLs support --- Key: HDFS-6379 URL: https://issues.apache.org/jira/browse/HDFS-6379 Project: Hadoop HDFS Issue Type: Bug Reporter: Alejandro Abdelnur Assignee: Mike Yoder Fix For: 2.4.0 Attachments: jira-HDFS-6379.patch HDFS-4685 added ACLs support to WebHDFS but missed adding them to HttpFS. This JIRA is for such. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6379) HTTPFS - Implement ACLs support
[ https://issues.apache.org/jira/browse/HDFS-6379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Yoder updated HDFS-6379: - Attachment: jira-HDFS-6379.patch HTTPFS - Implement ACLs support --- Key: HDFS-6379 URL: https://issues.apache.org/jira/browse/HDFS-6379 Project: Hadoop HDFS Issue Type: Bug Reporter: Alejandro Abdelnur Assignee: Mike Yoder Fix For: 2.4.0 Attachments: jira-HDFS-6379.patch HDFS-4685 added ACLs support to WebHDFS but missed adding them to HttpFS. This JIRA is for such. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6508) Add an XAttr to specify the cipher mode
[ https://issues.apache.org/jira/browse/HDFS-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027213#comment-14027213 ] Charles Lamb commented on HDFS-6508: [~tucu00] also says: The IV length may depend on the algorithm being used (AES-GCM allows arbitrary lengths starting at 16 bytes; if 16, it uses the same logic as AES-CTR mode (uses the first 8 bytes as the counter); if greater than 16, it does a funny hash computation of it to get 16 bytes and then applies the counter logic). Building on my previous email about the enum for the encryption mode, we could put the length of the IV in the encryption-mode enum itself. Then we can remove it from the CryptoCodec and use the constant itself. Add an XAttr to specify the cipher mode --- Key: HDFS-6508 URL: https://issues.apache.org/jira/browse/HDFS-6508 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode, security Reporter: Charles Lamb Assignee: Charles Lamb We should specify the cipher mode in the xattrs for compatibility's sake. Crypto changes over time and we need to prepare for that. -- This message was sent by Atlassian JIRA (v6.2#6252)
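A sketch of the enum idea floated above (the names and the GCM handling are illustrative):
{code}
public enum CipherMode {
  AES_CTR_NOPADDING(16),
  AES_GCM_NOPADDING(16);  // GCM accepts longer IVs, hashed down to 16 bytes

  private final int ivLengthBytes;

  CipherMode(int ivLengthBytes) { this.ivLengthBytes = ivLengthBytes; }

  // CryptoCodec would then ask the mode for its IV length instead of
  // carrying a separate constant.
  public int getIvLength() { return ivLengthBytes; }
}
{code}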
[jira] [Updated] (HDFS-6379) HTTPFS - Implement ACLs support
[ https://issues.apache.org/jira/browse/HDFS-6379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Yoder updated HDFS-6379: - Attachment: jira-HDFS-6379.patch HTTPFS - Implement ACLs support --- Key: HDFS-6379 URL: https://issues.apache.org/jira/browse/HDFS-6379 Project: Hadoop HDFS Issue Type: Bug Reporter: Alejandro Abdelnur Assignee: Mike Yoder Fix For: 2.4.0 Attachments: jira-HDFS-6379.patch, jira-HDFS-6379.patch HDFS-4685 added ACLs support to WebHDFS but missed adding them to HttpFS. This JIRA is for such. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6379) HTTPFS - Implement ACLs support
[ https://issues.apache.org/jira/browse/HDFS-6379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Yoder updated HDFS-6379: - Status: Patch Available (was: In Progress) Updated patch with test case for ACLs turned off. HTTPFS - Implement ACLs support --- Key: HDFS-6379 URL: https://issues.apache.org/jira/browse/HDFS-6379 Project: Hadoop HDFS Issue Type: Bug Reporter: Alejandro Abdelnur Assignee: Mike Yoder Fix For: 2.4.0 Attachments: jira-HDFS-6379.patch, jira-HDFS-6379.patch HDFS-4685 added ACLs support to WebHDFS but missed adding them to HttpFS. This JIRA is for such. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6482) Use block ID-based block layout on datanodes
[ https://issues.apache.org/jira/browse/HDFS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027226#comment-14027226 ] Hadoop QA commented on HDFS-6482: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649627/HDFS-6482.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestBlockMissingException org.apache.hadoop.hdfs.TestCrcCorruption org.apache.hadoop.hdfs.TestMissingBlocksAlert org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure org.apache.hadoop.hdfs.TestBlockReaderLocalLegacy org.apache.hadoop.hdfs.server.namenode.ha.TestPendingCorruptDnMessages org.apache.hadoop.hdfs.protocol.TestLayoutVersion org.apache.hadoop.hdfs.server.namenode.TestProcessCorruptBlocks org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestDatanodeRestart org.apache.hadoop.hdfs.server.namenode.TestXAttrConfigFlag org.apache.hadoop.hdfs.server.namenode.TestFSEditLogLoader org.apache.hadoop.hdfs.TestFileCorruption org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks org.apache.hadoop.hdfs.TestBlockReaderLocal org.apache.hadoop.hdfs.server.namenode.TestListCorruptFileBlocks org.apache.hadoop.hdfs.server.datanode.TestCachingStrategy org.apache.hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead org.apache.hadoop.hdfs.TestDFSClientRetries org.apache.hadoop.hdfs.server.blockmanagement.TestOverReplicatedBlocks The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.datanode.TestDeleteBlockPool org.apache.hadoop.hdfs.server.namenode.TestFsck org.apache.hadoop.hdfs.TestReplication org.apache.hadoop.hdfs.TestDatanodeBlockScanner {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7078//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7078//console This message is automatically generated. Use block ID-based block layout on datanodes Key: HDFS-6482 URL: https://issues.apache.org/jira/browse/HDFS-6482 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.5.0 Reporter: James Thomas Assignee: James Thomas Attachments: HDFS-6482.1.patch, HDFS-6482.2.patch, HDFS-6482.3.patch, HDFS-6482.patch Right now blocks are placed into directories that are split into many subdirectories when capacity is reached. Instead we can use a block's ID to determine the path it should go in. 
This eliminates the need for the LDir data structure that facilitates the splitting of directories when they reach capacity as well as fields in ReplicaInfo that keep track of a replica's location. An extension of the work in HDFS-3290. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-3493) Replication does not happen for the block (which is recovered and finalized) to the Datanode which has the same block with an old generation timestamp in RBW
[ https://issues.apache.org/jira/browse/HDFS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027252#comment-14027252 ] Andrew Wang commented on HDFS-3493: --- Thanks for taking this on, Juan, the fix looks good. Just a few nitty comments: * some whitespace-only changes in BlockManager * some lines longer than 80 chars * Maybe fold {{minReplicationSatisfied}} into {{corruptedDuringWrite}} like how you assign {{hasMoreCorruptReplicas}}, for parity. * In the test, do we need that sleep(1)? I'm always wary of sleeps, since they lead to test flakiness. * I think the comment should also read something like DNs will detect new dummy blocks on restart. Would also be good to drop a comment about what you're doing with creating dummy blocks. * I like to put nice conservative timeouts on my tests, e.g. {{@Test(timeout=12}}. +1 pending these though. [~vinayrpet], maybe you'd like to take a look too? Replication does not happen for the block (which is recovered and finalized) to the Datanode which has the same block with an old generation timestamp in RBW - Key: HDFS-3493 URL: https://issues.apache.org/jira/browse/HDFS-3493 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.0.0-alpha, 2.0.5-alpha Reporter: J.Andreina Assignee: Juan Yu Attachments: HDFS-3493.002.patch, HDFS-3493.003.patch, HDFS-3493.patch replication factor = 3, block report interval = 1min; start NN and 3 DNs Step 1: Write a file without closing it and do hflush (DN1, DN2, DN3 have blk_ts1) Step 2: Stopped DN3 Step 3: recovery happens and the timestamp is updated (blk_ts2) Step 4: close the file Step 5: blk_ts2 is finalized and available on DN1 and DN2 Step 6: now restarted DN3 (which has got blk_ts1 in RBW) From the NN side there is no cmd issued to DN3 to delete blk_ts1, but DN3 is asked to mark the block as corrupt. Replication of blk_ts2 to DN3 does not happen. NN logs: {noformat} INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match COMPLETE block's genstamp in block map 1008 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from DatanodeRegistration(XX.XX.XX.XX, storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, ipcPort=50277, storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0), blocks: 2, processing time: 1 msecs INFO org.apache.hadoop.hdfs.StateChange: BLOCK* Removing block blk_3927215081484173742_1008 from neededReplications as it has enough replicas.
INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match COMPLETE block's genstamp in block map 1008 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from DatanodeRegistration(XX.XX.XX.XX, storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, ipcPort=50277, storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0), blocks: 2, processing time: 1 msecs WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1 For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy {noformat} fsck Report === {noformat} /file21: Under replicated BP-1008469586-XX.XX.XX.XX-1336829603103:blk_3927215081484173742_1008. Target Replicas is 3 but found 2 replica(s). .Status: HEALTHY Total size: 495 B Total dirs: 1 Total files: 3 Total blocks (validated):3 (avg. block size 165 B) Minimally replicated blocks: 3 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 1 (33.32 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 1 Average block replication: 2.0 Corrupt blocks: 0 Missing replicas:1 (14.285714 %) Number of data-nodes:3 Number of racks: 1 FSCK ended at Sun May 13 09:49:05 IST 2012 in 9 milliseconds The filesystem under path '/' is HEALTHY {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6379) HTTPFS - Implement ACLs support
[ https://issues.apache.org/jira/browse/HDFS-6379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027265#comment-14027265 ] Hadoop QA commented on HDFS-6379: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649701/jira-HDFS-6379.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-httpfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7079//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7079//console This message is automatically generated. HTTPFS - Implement ACLs support --- Key: HDFS-6379 URL: https://issues.apache.org/jira/browse/HDFS-6379 Project: Hadoop HDFS Issue Type: Bug Reporter: Alejandro Abdelnur Assignee: Mike Yoder Fix For: 2.4.0 Attachments: jira-HDFS-6379.patch, jira-HDFS-6379.patch HDFS-4685 added ACLs support to WebHDFS but missed adding them to HttpFS. This JIRA is for such. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (HDFS-6469) Coordinated replication of the namespace using ConsensusNode
[ https://issues.apache.org/jira/browse/HDFS-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027179#comment-14027179 ] Konstantin Shvachko edited comment on HDFS-6469 at 6/11/14 12:32 AM: - Todd, interesting points, and you actually answered most of them yourself in your comments. h4. Coordinated reads The motivation for a configurable design based on file names is to make coordinated reads available to other applications without changing them. An alternative is to provide a new option (parameter) for read operations specifying whether the read should be coordinated or not. Then the application developers rather than the administrators will be in full control of what they coordinate. h4. Journals everywhere If I follow your logic correctly: QJM, being Paxos-based, uses a journal by itself, so we are not increasing journaling here. When you look at the bigger picture we see more journals around. HBase uses a WAL along with NN edits, which are themselves persisted in ext4, a journaling file system. As you said, if you need to separate them, one can use different drives, or SSDs. Besides, as I said in a previous comment, one can choose to eliminate CNode edits completely. h4. Determinism Determinism is not as hard as it may seem. Not harder than multi-grain locking. Say, you need to watch that incremental counters like genStamp are incremented only in agreements and never in proposals. Similar to how you watch that object locks are acquired in the right order. Most of that is already in the NameNode, thanks to the StandbyNode implementation. h4. AA vs AS HA There are several advantages of the AA approach over the evolution of the current one outlined in your comment. All CNodes are writable, while in the current approach only the active NN is. * If GC hits the active NN, service is interrupted. CNodes can continue, as other nodes can process writes. * Reads from an SBN are always stale. And in order to write, everybody should go to the active. So they will be writing to the namespace from the future, so to speak. I think with mixed workloads all clients will end up working with the active NN only, and there won't be enough load balancing. As you mentioned, your design will be addressing the same problems as ConsensusNode. The question is what you would rather have in the end: active-active or active-read-only-standby. h4. why this design makes it any easier to implement a distributed namespace? I meant the advantage of introducing a coordination engine in general, not the CNode itself. I meant that making coordinated updates involving different parts of a distributed namespace is rather trivial with a coordination engine. Like an atomic rename, that is, a move of a file from one parent to another when the parents are on different partitions. h6. locking ~This should really belong to a different issue. I just want to mention here that I don't see that one contradicts the other. On the contrary, they can be very much in sync. I remember we talked about an optimization scheme for a coordinated namespace where different parts of the namespace are coordinated independently (and in parallel under different state machines).~ was (Author: shv): Sorry hit the submit button too early. Will repost shortly. 
Coordinated replication of the namespace using ConsensusNode Key: HDFS-6469 URL: https://issues.apache.org/jira/browse/HDFS-6469 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Affects Versions: 3.0.0 Reporter: Konstantin Shvachko Assignee: Konstantin Shvachko Attachments: CNodeDesign.pdf This is a proposal to introduce ConsensusNode - an evolution of the NameNode, which enables replication of the namespace on multiple nodes of an HDFS cluster by means of a Coordination Engine. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6382) HDFS File/Directory TTL
[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027325#comment-14027325 ] Zesheng Wu commented on HDFS-6382: -- bq. Even if it's not implemented at first, we should think about the configuration required here. I think we want the ability to email the admins when things go wrong. Possibly the notifier could be pluggable or have several policies. There was nothing in the doc about configuration in general, which I think we need to fix. For example, how is rate limiting configurable? How do we notify admins that the rate is too slow to finish in the time given? OK, I will update the document and post a new version soon. bq. You can't delete a file in HDFS unless you have write permission on the containing directory. Whether you have write permission on the file itself is not relevant. So I would expect the same semantics here (probably enforced by setfacl itself). That's reasonable; I'll spell it out clearly in the document. HDFS File/Directory TTL --- Key: HDFS-6382 URL: https://issues.apache.org/jira/browse/HDFS-6382 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client, namenode Affects Versions: 2.4.0 Reporter: Zesheng Wu Assignee: Zesheng Wu Attachments: HDFS-TTL-Design.pdf In production environments we often have a scenario like this: we want to back up files on hdfs for some time and then delete them automatically. For example, we keep only 1 day's logs on local disk due to limited disk space, but we need to keep about 1 month's logs in order to debug program bugs, so we keep all the logs on hdfs and delete logs that are older than 1 month. This is a typical scenario for HDFS TTL. So here we propose that hdfs support TTL. Following are some details of this proposal: 1. HDFS can support TTL on a specified file or directory 2. If a TTL is set on a file, the file will be deleted automatically after the TTL has expired 3. If a TTL is set on a directory, the child files and directories will be deleted automatically after the TTL has expired 4. The child file/directory's TTL configuration should override its parent directory's 5. A global configuration is needed to control whether the deleted files/directories should go to the trash or not 6. A global configuration is needed to control whether a directory with a TTL should be deleted when it is emptied by the TTL mechanism or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6382) HDFS File/Directory TTL
[ https://issues.apache.org/jira/browse/HDFS-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zesheng Wu updated HDFS-6382: - Attachment: HDFS-TTL-Design -2.pdf Updated the document to address Colin's suggestions. Thanks Colin for your valuable suggestions :) HDFS File/Directory TTL --- Key: HDFS-6382 URL: https://issues.apache.org/jira/browse/HDFS-6382 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client, namenode Affects Versions: 2.4.0 Reporter: Zesheng Wu Assignee: Zesheng Wu Attachments: HDFS-TTL-Design -2.pdf, HDFS-TTL-Design.pdf In production environments we often have a scenario like this: we want to back up files on hdfs for some time and then delete them automatically. For example, we keep only 1 day's logs on local disk due to limited disk space, but we need to keep about 1 month's logs in order to debug program bugs, so we keep all the logs on hdfs and delete logs that are older than 1 month. This is a typical scenario for HDFS TTL. So here we propose that hdfs support TTL. Following are some details of this proposal: 1. HDFS can support TTL on a specified file or directory 2. If a TTL is set on a file, the file will be deleted automatically after the TTL has expired 3. If a TTL is set on a directory, the child files and directories will be deleted automatically after the TTL has expired 4. The child file/directory's TTL configuration should override its parent directory's 5. A global configuration is needed to control whether the deleted files/directories should go to the trash or not 6. A global configuration is needed to control whether a directory with a TTL should be deleted when it is emptied by the TTL mechanism or not. -- This message was sent by Atlassian JIRA (v6.2#6252)