[jira] [Commented] (HDFS-7240) Object store in HDFS
[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173423#comment-14173423 ] Edward Bortnikov commented on HDFS-7240: Very interested to follow. How is this related to the previous jira and design on Block-Management-as-a-Service (HDFS-5477)? Object store in HDFS Key: HDFS-7240 URL: https://issues.apache.org/jira/browse/HDFS-7240 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jitendra Nath Pandey Assignee: Jitendra Nath Pandey This jira proposes to add object store capabilities into HDFS. As part of the federation work (HDFS-1052) we separated block storage as a generic storage layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the storage layer i.e. datanodes. In this jira I will explore building an object store using the datanode storage, but independent of namespace metadata. I will soon update with a detailed design document. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7252) small refinement to the use of isInAnEZ in FSNamesystem
[ https://issues.apache.org/jira/browse/HDFS-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173530#comment-14173530 ] Hadoop QA commented on HDFS-7252: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675225/HDFS-7252.002.patch against trunk revision 2894433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8439//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8439//console This message is automatically generated. small refinement to the use of isInAnEZ in FSNamesystem --- Key: HDFS-7252 URL: https://issues.apache.org/jira/browse/HDFS-7252 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Trivial Attachments: HDFS-7252.001.patch, HDFS-7252.002.patch
In {{FSN#startFileInt}}, _EncryptionZoneManager#getEncryptionZoneForPath_ is invoked 3 times (_dir.isInAnEZ(iip)_, _dir.getEZForPath(iip)_, _dir.getKeyName(iip)_) in the following code; we actually need only one call.
{code}
if (dir.isInAnEZ(iip)) {
  EncryptionZone zone = dir.getEZForPath(iip);
  protocolVersion = chooseProtocolVersion(zone, supportedVersions);
  suite = zone.getSuite();
  ezKeyName = dir.getKeyName(iip);
  Preconditions.checkNotNull(protocolVersion);
  Preconditions.checkNotNull(suite);
  Preconditions.checkArgument(!suite.equals(CipherSuite.UNKNOWN),
      "Chose an UNKNOWN CipherSuite!");
  Preconditions.checkNotNull(ezKeyName);
}
{code}
It is also invoked twice in the following code, where one call would suffice:
{code}
if (dir.isInAnEZ(iip)) {
  // The path is now within an EZ, but we're missing encryption parameters
  if (suite == null || edek == null) {
    throw new RetryStartFileException();
  }
  // Path is within an EZ and we have provided encryption parameters.
  // Make sure that the generated EDEK matches the settings of the EZ.
  String ezKeyName = dir.getKeyName(iip);
  if (!ezKeyName.equals(edek.getEncryptionKeyName())) {
    throw new RetryStartFileException();
  }
  feInfo = new FileEncryptionInfo(suite, version,
      edek.getEncryptedKeyVersion().getMaterial(),
      edek.getEncryptedKeyIv(),
      ezKeyName, edek.getEncryptionKeyVersionName());
  Preconditions.checkNotNull(feInfo);
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
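For reference, a minimal sketch of the single-lookup refactoring being proposed (illustrative only; it assumes {{dir.getEZForPath(iip)}} returns null outside an encryption zone and that {{EncryptionZone}} exposes the key name, which may differ from the actual patch):
{code}
// Resolve the encryption zone once and derive everything else from it.
EncryptionZone zone = dir.getEZForPath(iip);
if (zone != null) {
  protocolVersion = chooseProtocolVersion(zone, supportedVersions);
  suite = zone.getSuite();
  ezKeyName = zone.getKeyName(); // no second or third EZ lookup needed
  Preconditions.checkNotNull(protocolVersion);
  Preconditions.checkNotNull(suite);
  Preconditions.checkArgument(!suite.equals(CipherSuite.UNKNOWN),
      "Chose an UNKNOWN CipherSuite!");
  Preconditions.checkNotNull(ezKeyName);
}
{code}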
[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails
[ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173639#comment-14173639 ] Hudson commented on HDFS-7208: -- FAILURE: Integrated in Hadoop-Yarn-trunk #713 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/713/]) HDFS-7208. NN doesn't schedule replication when a DN storage fails. Contributed by Ming Ma (szetszwo: rev 41980c56d3c01d7a0ddc7deea2d89b7f28026722)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeStorageInfo.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeStorage.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
NN doesn't schedule replication when a DN storage fails --- Key: HDFS-7208 URL: https://issues.apache.org/jira/browse/HDFS-7208 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Ming Ma Assignee: Ming Ma Fix For: 2.6.0 Attachments: HDFS-7208-2.patch, HDFS-7208-3.patch, HDFS-7208.patch
We found the following problem: when a storage device on a DN fails, the NN continues to believe the replicas of the blocks on that storage are valid and doesn't schedule replication. A DN has 12 storage disks, so there is one block report per storage. When a disk fails, the number of block reports from that DN is reduced from 12 to 11. Given that dfs.datanode.failed.volumes.tolerated is configured to be 0, the NN still considers that DN healthy.
1. A disk failed. All blocks of that disk are removed from the DN dataset.
{noformat}
2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current
{noformat}
2. The NN receives DatanodeProtocol.DISK_ERROR, but that isn't enough for the NN to remove the DN and the replicas from the BlocksMap. In addition, the block report doesn't provide the diff, given that it is done per storage.
{noformat}
2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current
{noformat}
3. Run fsck on the file and confirm the NN's BlocksMap still has that replica.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
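For context, a rough sketch of the direction such a fix can take (illustrative only, not the committed patch; {{removeBlocksAssociatedTo(DatanodeStorageInfo)}} is a hypothetical helper name): once a heartbeat reports a storage as failed, the NN prunes that storage's replicas so re-replication gets scheduled.
{code}
// Check per-storage state on each heartbeat rather than only DN liveness.
for (DatanodeStorageInfo storage : node.getStorageInfos()) {
  if (storage.getState() == DatanodeStorage.State.FAILED) {
    // Drop the replicas that lived on the failed storage; the block
    // manager then sees those blocks as under-replicated and schedules
    // new copies on other nodes.
    blockManager.removeBlocksAssociatedTo(storage);
  }
}
{code}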
[jira] [Commented] (HDFS-7185) The active NameNode will not accept an fsimage sent from the standby during rolling upgrade
[ https://issues.apache.org/jira/browse/HDFS-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173641#comment-14173641 ] Hudson commented on HDFS-7185: -- FAILURE: Integrated in Hadoop-Yarn-trunk #713 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/713/]) HDFS-7185. The active NameNode will not accept an fsimage sent from the standby during rolling upgrade. Contributed by Jing Zhao. (jing9: rev 18620649f96d9e378fb7ea40de216284a9d525c7)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSImageWithSnapshot.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormatProtobuf.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
The active NameNode will not accept an fsimage sent from the standby during rolling upgrade --- Key: HDFS-7185 URL: https://issues.apache.org/jira/browse/HDFS-7185 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jing Zhao Fix For: 2.6.0 Attachments: HDFS-7185.000.patch, HDFS-7185.001.patch, HDFS-7185.002.patch, HDFS-7185.003.patch, HDFS-7185.004.patch
The active NameNode will not accept an fsimage sent from the standby during rolling upgrade. The active fails with the exception:
{code}
18:25:07,620 WARN ImageServlet:198 - Received an invalid request file transfer request from a secondary with storage info -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6
18:25:07,620 WARN log:76 - Committed before 410 PutImage failed. java.io.IOException: This namenode has storage info -55:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 but the secondary expected -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6
 at org.apache.hadoop.hdfs.server.namenode.ImageServlet.validateRequest(ImageServlet.java:200)
 at org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:443)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:730)
{code}
On the standby, the exception is:
{code}
java.io.IOException: Exception during image upload: org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: This namenode has storage info -55:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 but the secondary expected -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6
 at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:218)
 at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62)
{code}
This seems to be a consequence of the fact that the VERSION file still is at -55 (the old version) even after the rolling upgrade has started. When the rolling upgrade is finalized with {{hdfs dfsadmin -rollingUpgrade finalize}}, both VERSION files get set to the new version, and the problem goes away.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
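For readers unfamiliar with the check that fails here: the token being compared is the colon-separated storage info, layoutVersion:namespaceID:cTime:clusterID. A simplified sketch of the validation (hedged; the exact code in ImageServlet may differ):
{code}
// During a rolling upgrade the standby already runs the new software
// (layout -59) while the active's VERSION file still records -55, so a
// strict string-equality check rejects the standby's image upload.
String myStorageInfoString = storage.toColonSeparatedString(); // "-55:65195028:0:CID-..."
if (!myStorageInfoString.equals(theirStorageInfoString)) {     // "-59:65195028:0:CID-..."
  throw new IOException("This namenode has storage info " + myStorageInfoString
      + " but the secondary expected " + theirStorageInfoString);
}
{code}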
[jira] [Commented] (HDFS-5089) When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION.
[ https://issues.apache.org/jira/browse/HDFS-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173642#comment-14173642 ] Hudson commented on HDFS-5089: -- FAILURE: Integrated in Hadoop-Yarn-trunk #713 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/713/]) HDFS-5089. When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION. (szetszwo: rev 289442a242259af53dc73a156aa523e3e6c7) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/protocol/TestLayoutVersion.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION. - Key: HDFS-5089 URL: https://issues.apache.org/jira/browse/HDFS-5089 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Fix For: 2.6.0 Attachments: h5089_20130813.patch, h5089_20140325.patch The SNAPSHOT layout requires FSIMAGE_NAME_OPTIMIZATION as a prerequisite. However, RESERVED_REL1_3_0 supports SNAPSHOT but not FSIMAGE_NAME_OPTIMIZATION. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
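A sketch of the invariant this fix enforces (illustrative; the real assertions live in TestLayoutVersion and the feature wiring in LayoutVersion.Feature):
{code}
// Inside a JUnit test (assertTrue from org.junit.Assert): every layout
// version that supports SNAPSHOT must also support
// FSIMAGE_NAME_OPTIMIZATION, since the snapshot fsimage format builds on
// the optimized file-name encoding.
for (LayoutVersion.Feature f : LayoutVersion.Feature.values()) {
  int lv = f.getInfo().getLayoutVersion();
  if (LayoutVersion.supports(LayoutVersion.Feature.SNAPSHOT, lv)) {
    assertTrue(LayoutVersion.supports(
        LayoutVersion.Feature.FSIMAGE_NAME_OPTIMIZATION, lv));
  }
}
{code}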
[jira] [Updated] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone
[ https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7243: --- Attachment: HDFS-7243.002.patch Thanks [~hitliuyi]. You're right. I've posted the updated patch. HDFS concat operation should not be allowed in Encryption Zone -- Key: HDFS-7243 URL: https://issues.apache.org/jira/browse/HDFS-7243 Project: Hadoop HDFS Issue Type: Bug Components: encryption, namenode Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Charles Lamb Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch, HDFS-7243.002.patch For HDFS encryption at rest, files in an encryption zone are using different data encryption keys, so concat should be disallowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
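A minimal sketch of the kind of guard such a patch adds (hedged: {{trgIip}} and the exact exception/message are illustrative, not necessarily what the attached patch does):
{code}
// In the NameNode's concat path: refuse to concatenate files inside an
// encryption zone, since each file was encrypted with its own EDEK and
// the concatenated bytes could not be decrypted as a single stream.
if (dir.isInAnEZ(trgIip)) {
  throw new HadoopIllegalArgumentException(
      "concat can not be called for files in an encryption zone.");
}
{code}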
[jira] [Updated] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone
[ https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7243: --- Attachment: (was: HDFS-7243.002.patch) HDFS concat operation should not be allowed in Encryption Zone -- Key: HDFS-7243 URL: https://issues.apache.org/jira/browse/HDFS-7243 Project: Hadoop HDFS Issue Type: Bug Components: encryption, namenode Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Charles Lamb Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch For HDFS encryption at rest, files in an encryption zone are using different data encryption keys, so concat should be disallowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone
[ https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7243: --- Attachment: HDFS-7243.003.patch HDFS concat operation should not be allowed in Encryption Zone -- Key: HDFS-7243 URL: https://issues.apache.org/jira/browse/HDFS-7243 Project: Hadoop HDFS Issue Type: Bug Components: encryption, namenode Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Charles Lamb Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch, HDFS-7243.003.patch For HDFS encryption at rest, files in an encryption zone are using different data encryption keys, so concat should be disallowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone
[ https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173690#comment-14173690 ] Yi Liu commented on HDFS-7243: -- Thanks Charles for updating the patch, it looks good to me. HDFS concat operation should not be allowed in Encryption Zone -- Key: HDFS-7243 URL: https://issues.apache.org/jira/browse/HDFS-7243 Project: Hadoop HDFS Issue Type: Bug Components: encryption, namenode Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Charles Lamb Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch, HDFS-7243.003.patch For HDFS encryption at rest, files in an encryption zone are using different data encryption keys, so concat should be disallowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone
[ https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173695#comment-14173695 ] Charles Lamb commented on HDFS-7243: Thanks for reviewing Yi. HDFS concat operation should not be allowed in Encryption Zone -- Key: HDFS-7243 URL: https://issues.apache.org/jira/browse/HDFS-7243 Project: Hadoop HDFS Issue Type: Bug Components: encryption, namenode Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Charles Lamb Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch, HDFS-7243.003.patch For HDFS encryption at rest, files in an encryption zone are using different data encryption keys, so concat should be disallowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7185) The active NameNode will not accept an fsimage sent from the standby during rolling upgrade
[ https://issues.apache.org/jira/browse/HDFS-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173761#comment-14173761 ] Hudson commented on HDFS-7185: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1903 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1903/]) HDFS-7185. The active NameNode will not accept an fsimage sent from the standby during rolling upgrade. Contributed by Jing Zhao. (jing9: rev 18620649f96d9e378fb7ea40de216284a9d525c7) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormatProtobuf.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSImageWithSnapshot.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java The active NameNode will not accept an fsimage sent from the standby during rolling upgrade --- Key: HDFS-7185 URL: https://issues.apache.org/jira/browse/HDFS-7185 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jing Zhao Fix For: 2.6.0 Attachments: HDFS-7185.000.patch, HDFS-7185.001.patch, HDFS-7185.002.patch, HDFS-7185.003.patch, HDFS-7185.004.patch The active NameNode will not accept an fsimage sent from the standby during rolling upgrade. The active fails with the exception: {code} 18:25:07,620 WARN ImageServlet:198 - Received an invalid request file transfer request from a secondary with storage info -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 18:25:07,620 WARN log:76 - Committed before 410 PutImage failed. java.io.IOException: This namenode has storage info -55:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 but the secondary expected -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d- 0a6e431987f6 at org.apache.hadoop.hdfs.server.namenode.ImageServlet.validateRequest(ImageServlet.java:200) at org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:443) at javax.servlet.http.HttpServlet.service(HttpServlet.java:730) {code} On the standby, the exception is: {code} java.io.IOException: Exception during image upload: org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: This namenode has storage info -55:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 but the secondary expected -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:218) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62) {code} This seems to be a consequence of the fact that the VERSION file still is at -55 (the old version) even after the rolling upgrade has started. When the rolling upgrade is finalized with {{hdfs dfsadmin -rollingUpgrade finalize}}, both VERSION files get set to the new version, and the problem goes away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails
[ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173759#comment-14173759 ] Hudson commented on HDFS-7208: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1903 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1903/]) HDFS-7208. NN doesn't schedule replication when a DN storage fails. Contributed by Ming Ma (szetszwo: rev 41980c56d3c01d7a0ddc7deea2d89b7f28026722) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeStorageInfo.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeStorage.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java NN doesn't schedule replication when a DN storage fails --- Key: HDFS-7208 URL: https://issues.apache.org/jira/browse/HDFS-7208 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Ming Ma Assignee: Ming Ma Fix For: 2.6.0 Attachments: HDFS-7208-2.patch, HDFS-7208-3.patch, HDFS-7208.patch We found the following problem. When a storage device on a DN fails, NN continues to believe replicas of those blocks on that storage are valid and doesn't schedule replication. A DN has 12 storage disks. So there is one blockReport for each storage. When a disk fails, # of blockReport from that DN is reduced from 12 to 11. Given dfs.datanode.failed.volumes.tolerated is configured to be 0, NN still considers that DN healthy. 1. A disk failed. All blocks of that disk are removed from DN dataset. {noformat} 2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current {noformat} 2. NN receives DatanodeProtocol.DISK_ERROR. But that isn't enough to have NN remove the DN and the replicas from the BlocksMap. In addition, blockReport doesn't provide the diff given that is done per storage. {noformat} 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current {noformat} 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-5089) When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION.
[ https://issues.apache.org/jira/browse/HDFS-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173762#comment-14173762 ] Hudson commented on HDFS-5089: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1903 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1903/]) HDFS-5089. When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION. (szetszwo: rev 289442a242259af53dc73a156aa523e3e6c7) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/protocol/TestLayoutVersion.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION. - Key: HDFS-5089 URL: https://issues.apache.org/jira/browse/HDFS-5089 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Fix For: 2.6.0 Attachments: h5089_20130813.patch, h5089_20140325.patch The SNAPSHOT layout requires FSIMAGE_NAME_OPTIMIZATION as a prerequisite. However, RESERVED_REL1_3_0 supports SNAPSHOT but not FSIMAGE_NAME_OPTIMIZATION. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-5089) When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION.
[ https://issues.apache.org/jira/browse/HDFS-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173772#comment-14173772 ] Hudson commented on HDFS-5089: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1928 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1928/]) HDFS-5089. When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION. (szetszwo: rev 289442a242259af53dc73a156aa523e3e6c7) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/LayoutVersion.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/protocol/TestLayoutVersion.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt When a LayoutVersion support SNAPSHOT, it must support FSIMAGE_NAME_OPTIMIZATION. - Key: HDFS-5089 URL: https://issues.apache.org/jira/browse/HDFS-5089 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Fix For: 2.6.0 Attachments: h5089_20130813.patch, h5089_20140325.patch The SNAPSHOT layout requires FSIMAGE_NAME_OPTIMIZATION as a prerequisite. However, RESERVED_REL1_3_0 supports SNAPSHOT but not FSIMAGE_NAME_OPTIMIZATION. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7208) NN doesn't schedule replication when a DN storage fails
[ https://issues.apache.org/jira/browse/HDFS-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173769#comment-14173769 ] Hudson commented on HDFS-7208: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1928 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1928/]) HDFS-7208. NN doesn't schedule replication when a DN storage fails. Contributed by Ming Ma (szetszwo: rev 41980c56d3c01d7a0ddc7deea2d89b7f28026722) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/DatanodeStorage.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeStorageInfo.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java NN doesn't schedule replication when a DN storage fails --- Key: HDFS-7208 URL: https://issues.apache.org/jira/browse/HDFS-7208 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Ming Ma Assignee: Ming Ma Fix For: 2.6.0 Attachments: HDFS-7208-2.patch, HDFS-7208-3.patch, HDFS-7208.patch We found the following problem. When a storage device on a DN fails, NN continues to believe replicas of those blocks on that storage are valid and doesn't schedule replication. A DN has 12 storage disks. So there is one blockReport for each storage. When a disk fails, # of blockReport from that DN is reduced from 12 to 11. Given dfs.datanode.failed.volumes.tolerated is configured to be 0, NN still considers that DN healthy. 1. A disk failed. All blocks of that disk are removed from DN dataset. {noformat} 2014-10-04 02:11:12,626 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-1748500278-xx.xx.xx.xxx-1377803467793:1121568886 on failed volume /data/disk6/dfs/current {noformat} 2. NN receives DatanodeProtocol.DISK_ERROR. But that isn't enough to have NN remove the DN and the replicas from the BlocksMap. In addition, blockReport doesn't provide the diff given that is done per storage. {noformat} 2014-10-04 02:11:12,681 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Disk error on DatanodeRegistration(xx.xx.xx.xxx, datanodeUuid=f3b8a30b-e715-40d6-8348-3c766f9ba9ab, infoPort=50075, ipcPort=50020, storageInfo=lv=-55;cid=CID-e3c38355-fde5-4e3a-b7ce-edacebdfa7a1;nsid=420527250;c=1410283484939): DataNode failed volumes:/data/disk6/dfs/current {noformat} 3. Run fsck on the file and confirm the NN's BlocksMap still has that replica. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7185) The active NameNode will not accept an fsimage sent from the standby during rolling upgrade
[ https://issues.apache.org/jira/browse/HDFS-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173771#comment-14173771 ] Hudson commented on HDFS-7185: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1928 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1928/]) HDFS-7185. The active NameNode will not accept an fsimage sent from the standby during rolling upgrade. Contributed by Jing Zhao. (jing9: rev 18620649f96d9e378fb7ea40de216284a9d525c7) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSImageWithSnapshot.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormatProtobuf.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImageFormat.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt The active NameNode will not accept an fsimage sent from the standby during rolling upgrade --- Key: HDFS-7185 URL: https://issues.apache.org/jira/browse/HDFS-7185 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Colin Patrick McCabe Assignee: Jing Zhao Fix For: 2.6.0 Attachments: HDFS-7185.000.patch, HDFS-7185.001.patch, HDFS-7185.002.patch, HDFS-7185.003.patch, HDFS-7185.004.patch The active NameNode will not accept an fsimage sent from the standby during rolling upgrade. The active fails with the exception: {code} 18:25:07,620 WARN ImageServlet:198 - Received an invalid request file transfer request from a secondary with storage info -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 18:25:07,620 WARN log:76 - Committed before 410 PutImage failed. java.io.IOException: This namenode has storage info -55:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 but the secondary expected -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d- 0a6e431987f6 at org.apache.hadoop.hdfs.server.namenode.ImageServlet.validateRequest(ImageServlet.java:200) at org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:443) at javax.servlet.http.HttpServlet.service(HttpServlet.java:730) {code} On the standby, the exception is: {code} java.io.IOException: Exception during image upload: org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpPutFailedException: This namenode has storage info -55:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 but the secondary expected -59:65195028:0:CID-385de4d7-64e4-4dde-9f5d-0a6e431987f6 at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:218) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1400(StandbyCheckpointer.java:62) {code} This seems to be a consequence of the fact that the VERSION file still is at -55 (the old version) even after the rolling upgrade has started. When the rolling upgrade is finalized with {{hdfs dfsadmin -rollingUpgrade finalize}}, both VERSION files get set to the new version, and the problem goes away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7243) HDFS concat operation should not be allowed in Encryption Zone
[ https://issues.apache.org/jira/browse/HDFS-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173862#comment-14173862 ] Hadoop QA commented on HDFS-7243: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675272/HDFS-7243.003.patch against trunk revision 2894433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8440//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8440//console This message is automatically generated. HDFS concat operation should not be allowed in Encryption Zone -- Key: HDFS-7243 URL: https://issues.apache.org/jira/browse/HDFS-7243 Project: Hadoop HDFS Issue Type: Bug Components: encryption, namenode Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Charles Lamb Attachments: HDFS-7243.001.patch, HDFS-7243.002.patch, HDFS-7243.003.patch For HDFS encryption at rest, files in an encryption zone are using different data encryption keys, so concat should be disallowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test
[ https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-7226: Attachment: HDFS-7226.001.patch
TestDNFencing.testQueueingWithAppend failed often in latest test Key: HDFS-7226 URL: https://issues.apache.org/jira/browse/HDFS-7226 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7226.001.patch
Using the tool from HADOOP-11045, I got the following report:
{code}
[yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j PreCommit-HDFS-Build -n 1
Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build
THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as listed below:
..
Among 9 runs examined, all failed tests #failedRuns: testName:
7: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
6: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
3: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching
..
{code}
TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. Creating this jira for TestDNFencing.testQueueingWithAppend. Symptom:
{code}
Failed
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
Failing for the past 1 build (Since Failed#8390 ) Took 2.9 sec.
Error Message
expected:<18> but was:<12>
Stacktrace
java.lang.AssertionError: expected:<18> but was:<12>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:555)
 at org.junit.Assert.assertEquals(Assert.java:542)
 at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7255) Customize Java Heap min/max settings for individual processes
Mark Tse created HDFS-7255: -- Summary: Customize Java Heap min/max settings for individual processes Key: HDFS-7255 URL: https://issues.apache.org/jira/browse/HDFS-7255 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, journal-node, namenode Affects Versions: 2.5.1, 2.4.1 Reporter: Mark Tse The NameNode and JournalNode (and ZKFC) can all run on the same machine. However, they get their heap settings from HADOOP_HEAPSIZE/JAVA_HEAP_MAX. There are scenarios where the NameNode process should have different Java memory requirements than the JournalNode and ZKFC (e.g. if the machine has 10 GB of RAM, and I want the NameNode process to have 8 GB max). HADOOP_(.*)_OPTS variables exist for these processes and can be used to add the -Xms and -Xmx flags, but because of how the default for JAVA_HEAP_MAX is set, '-Xmx1000m' is always added to the final call that starts the NameNode/JournalNode/ZKFC process, resulting in two different Java heap settings (e.g. both -Xmx1000m and -Xmx8g are used when starting the NameNode). Note: HADOOP_HEAPSIZE is deprecated according to [HADOOP-10950] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
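A rough illustration of the problem (hedged: the variable names follow the stock hadoop-env.sh / bin/hdfs hooks, the heap sizes are made up):
{noformat}
# hadoop-env.sh: per-process heap via the *_OPTS hooks
export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g ${HADOOP_NAMENODE_OPTS}"
export HADOOP_JOURNALNODE_OPTS="-Xmx1g ${HADOOP_JOURNALNODE_OPTS}"

# Because JAVA_HEAP_MAX defaults to -Xmx1000m and is always appended,
# the NameNode's final java command line ends up carrying both
# -Xmx1000m and -Xmx8g.
{noformat}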
[jira] [Updated] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test
[ https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-7226: Status: Patch Available (was: Open) Submitted patch 001. Hi [~jingzhao] and [~kihwal], I found that the failure reported in this jira is also introduced by the HDFS-7217 fix, but the issue took me some time to understand. Basically, because of the HDFS-7217 change, reporting of ReceivingBlock to the NN is delayed; in the reported test case, these reports are replaced by ReceivedBlock later (see the comment I put in the patch). Thanks Jing for the help on HDFS-7236. Would either of you please help take a look at the patch? Thanks a lot.
TestDNFencing.testQueueingWithAppend failed often in latest test Key: HDFS-7226 URL: https://issues.apache.org/jira/browse/HDFS-7226 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7226.001.patch
Using the tool from HADOOP-11045, I got the following report:
{code}
[yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j PreCommit-HDFS-Build -n 1
Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build
THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as listed below:
..
Among 9 runs examined, all failed tests #failedRuns: testName:
7: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
6: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
3: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching
..
{code}
TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. Creating this jira for TestDNFencing.testQueueingWithAppend. Symptom:
{code}
Failed
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
Failing for the past 1 build (Since Failed#8390 ) Took 2.9 sec.
Error Message
expected:<18> but was:<12>
Stacktrace
java.lang.AssertionError: expected:<18> but was:<12>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:555)
 at org.junit.Assert.assertEquals(Assert.java:542)
 at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-5928) show namespace and namenode ID on NN dfshealth page
[ https://issues.apache.org/jira/browse/HDFS-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated HDFS-5928: -- Issue Type: Sub-task (was: Improvement) Parent: HDFS-6751 show namespace and namenode ID on NN dfshealth page --- Key: HDFS-5928 URL: https://issues.apache.org/jira/browse/HDFS-5928 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Siqi Li Assignee: Siqi Li Attachments: HDFS-5928.v2.patch, HDFS-5928.v3.patch, HDFS-5928.v1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6744) Improve decommissioning nodes and dead nodes access on the new NN webUI
[ https://issues.apache.org/jira/browse/HDFS-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174072#comment-14174072 ] Siqi Li commented on HDFS-6744: --- Hi [~wheat9], can you take a look at this patch? Improve decommissioning nodes and dead nodes access on the new NN webUI --- Key: HDFS-6744 URL: https://issues.apache.org/jira/browse/HDFS-6744 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Ming Ma Assignee: Siqi Li Attachments: HDFS-6744.v1.patch The new NN webUI lists live nodes at the top of the page, followed by dead nodes and decommissioning nodes. From the admin's point of view: 1. Decommissioning nodes and dead nodes are more interesting. It is better to move decommissioning nodes to the top of the page, followed by dead nodes and live nodes. 2. To find decommissioning nodes or dead nodes, the whole page, which includes all nodes, needs to be loaded. That could take some time for big clusters. The legacy web UI filters the node types dynamically, which seems to work well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174155#comment-14174155 ] Yongjun Zhang commented on HDFS-7221: - Hi [~clamb], I think I found the root cause here. With the HDFS-7128 fix, the dfs.namenode.replication.max-streams-hard-limit property is better enforced, and this caused the testFencingStress() failure reported here, because the test is a stress test. I added the following one-line change and see the test passing consistently:
{code}
harness.conf.setInt(
    DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_KEY, 16);
{code}
Thanks [~mingma] for fixing HDFS-7128, and [~kihwal], [~cnauroth] for the discussion there. I was thinking about whether the soft and hard settings of this property are ideal, and I noticed that you had some discussion there. It sounds like this property could even be set on a per-node basis according to the hardware a node is equipped with, but that might complicate the software. I guess for now we just need to keep in mind that this property is enforced. Thanks Charles again for reporting this long-outstanding failure in recent jenkins jobs.
TestDNFencingWithReplication fails consistently --- Key: HDFS-7221 URL: https://issues.apache.org/jira/browse/HDFS-7221 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.6.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch
TestDNFencingWithReplication consistently fails with a timeout, both in jenkins runs and on my local machine.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test
[ https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174280#comment-14174280 ] Hadoop QA commented on HDFS-7226: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675317/HDFS-7226.001.patch against trunk revision 2894433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8441//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8441//console This message is automatically generated. TestDNFencing.testQueueingWithAppend failed often in latest test Key: HDFS-7226 URL: https://issues.apache.org/jira/browse/HDFS-7226 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7226.001.patch Using tool from HADOOP-11045, got the following report: {code} [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j PreCommit-HDFS-Build -n 1 Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as listed below: .. Among 9 runs examined, all failed tests #failedRuns: testName: 7: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 6: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 3: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching .. {code} TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. Creating this jira for TestDNFencing.testQueueingWithAppend. Symptom: {code} Failed org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failing for the past 1 build (Since Failed#8390 ) Took 2.9 sec. Error Message expected:18 but was:12 Stacktrace java.lang.AssertionError: expected:18 but was:12 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test
[ https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174306#comment-14174306 ] Yongjun Zhang commented on HDFS-7226: - The remaining failed test TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. TestDNFencing.testQueueingWithAppend failed often in latest test Key: HDFS-7226 URL: https://issues.apache.org/jira/browse/HDFS-7226 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7226.001.patch Using tool from HADOOP-11045, got the following report: {code} [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j PreCommit-HDFS-Build -n 1 Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as listed below: .. Among 9 runs examined, all failed tests #failedRuns: testName: 7: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 6: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 3: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching .. {code} TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. Creating this jira for TestDNFencing.testQueueingWithAppend. Symptom: {code} Failed org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failing for the past 1 build (Since Failed#8390 ) Took 2.9 sec. Error Message expected:18 but was:12 Stacktrace java.lang.AssertionError: expected:18 but was:12 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test
[ https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174317#comment-14174317 ] Jing Zhao commented on HDFS-7226: - Thanks for working on this, Yongjun! So with the current fix, is it possible that the DN just happens to send out an IBR (after normal waiting) right after receiving the data? In that case, DN may still send out both the block receiving and received msg. Thus maybe we can still call {{triggerBlockReportForTests}} here in the tests to make sure a block receiving report is sent out. TestDNFencing.testQueueingWithAppend failed often in latest test Key: HDFS-7226 URL: https://issues.apache.org/jira/browse/HDFS-7226 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7226.001.patch Using tool from HADOOP-11045, got the following report: {code} [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j PreCommit-HDFS-Build -n 1 Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as listed below: .. Among 9 runs examined, all failed tests #failedRuns: testName: 7: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend 6: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress 3: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen 1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching .. {code} TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. Creating this jira for TestDNFencing.testQueueingWithAppend. Symptom: {code} Failed org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend Failing for the past 1 build (Since Failed#8390 ) Took 2.9 sec. Error Message expected:18 but was:12 Stacktrace java.lang.AssertionError: expected:18 but was:12 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7226) TestDNFencing.testQueueingWithAppend failed often in latest test
[ https://issues.apache.org/jira/browse/HDFS-7226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174367#comment-14174367 ] Yongjun Zhang commented on HDFS-7226: - Hi [~jingzhao], Thanks for the review and comments. I actually tried that before I came up with this new solution. The issue with calling {{triggerBlockReportForTests}} is that we would see 6 reports instead of the 3 the test expects, even though we only have 3 BlockReceiving entries. I think the reason lies in how {{triggerBlockReportForTests}} is implemented: it incurs a waiting loop over the 3-second heartbeat interval, during which it issues additional block reports beyond the original 3, and thus ends up with 6 reports instead of 3. But let me take a further look in this direction.
TestDNFencing.testQueueingWithAppend failed often in latest test Key: HDFS-7226 URL: https://issues.apache.org/jira/browse/HDFS-7226 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Yongjun Zhang Assignee: Yongjun Zhang Attachments: HDFS-7226.001.patch
Using the tool from HADOOP-11045, I got the following report:
{code}
[yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j PreCommit-HDFS-Build -n 1
Recently FAILED builds in url: https://builds.apache.org//job/PreCommit-HDFS-Build
THERE ARE 9 builds (out of 9) that have failed tests in the past 1 days, as listed below:
..
Among 9 runs examined, all failed tests #failedRuns: testName:
7: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
6: org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication.testFencingStress
3: org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot.testOpenFilesWithMultipleSnapshots
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testFailedOpen
1: org.apache.hadoop.hdfs.server.namenode.TestEditLog.testSyncBatching
..
{code}
TestDNFencingWithReplication.testFencingStress was reported as HDFS-7221. Creating this jira for TestDNFencing.testQueueingWithAppend. Symptom:
{code}
Failed
org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend
Failing for the past 1 build (Since Failed#8390 ) Took 2.9 sec.
Error Message
expected:<18> but was:<12>
Stacktrace
java.lang.AssertionError: expected:<18> but was:<12>
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.failNotEquals(Assert.java:743)
 at org.junit.Assert.assertEquals(Assert.java:118)
 at org.junit.Assert.assertEquals(Assert.java:555)
 at org.junit.Assert.assertEquals(Assert.java:542)
 at org.apache.hadoop.hdfs.server.namenode.ha.TestDNFencing.testQueueingWithAppend(TestDNFencing.java:448)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6581) Write to single replica in memory
[ https://issues.apache.org/jira/browse/HDFS-6581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174407#comment-14174407 ] Jitendra Nath Pandey commented on HDFS-6581: I am planning to merge this to branch-2 today, and subsequently to branch-2.6 by tomorrow. As agreed on HDFS-6919, In 2.6 we will indicate in the release notes that the memory for writes on RAM and the memory for caching in datanodes are independent, and a feature to manage them together will be added in the next release. Write to single replica in memory - Key: HDFS-6581 URL: https://issues.apache.org/jira/browse/HDFS-6581 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, hdfs-client, namenode Reporter: Arpit Agarwal Assignee: Arpit Agarwal Fix For: 3.0.0 Attachments: HDFS-6581.merge.01.patch, HDFS-6581.merge.02.patch, HDFS-6581.merge.03.patch, HDFS-6581.merge.04.patch, HDFS-6581.merge.05.patch, HDFS-6581.merge.06.patch, HDFS-6581.merge.07.patch, HDFS-6581.merge.08.patch, HDFS-6581.merge.09.patch, HDFS-6581.merge.10.patch, HDFS-6581.merge.11.patch, HDFS-6581.merge.12.patch, HDFS-6581.merge.14.patch, HDFS-6581.merge.15.patch, HDFSWriteableReplicasInMemory.pdf, Test-Plan-for-HDFS-6581-Memory-Storage.pdf, Test-Plan-for-HDFS-6581-Memory-Storage.pdf Per discussion with the community on HDFS-5851, we will implement writing to a single replica in DN memory via DataTransferProtocol. This avoids some of the issues with short-circuit writes, which we can revisit at a later time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6919) Enforce a single limit for RAM disk usage and replicas cached via locking
[ https://issues.apache.org/jira/browse/HDFS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174412#comment-14174412 ] Jitendra Nath Pandey commented on HDFS-6919: I am planning to merge the HDFS-6581 work to branch-2 today, and subsequently to branch-2.6 by tomorrow. As suggested earlier, in 2.6 we will indicate in the release notes that the memory for writes to RAM disk and the memory for caching in datanodes are managed independently, and a feature to manage them together will be added in the next release.
Enforce a single limit for RAM disk usage and replicas cached via locking - Key: HDFS-6919 URL: https://issues.apache.org/jira/browse/HDFS-6919 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Arpit Agarwal Assignee: Colin Patrick McCabe Priority: Blocker
The DataNode can have a single limit for memory usage which applies to both replicas cached via CCM and replicas on RAM disk. See comments [1|https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106025page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106025], [2|https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106245page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106245] and [3|https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106575page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106575] for discussion.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7240) Object store in HDFS
[ https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174416#comment-14174416 ] Jitendra Nath Pandey commented on HDFS-7240: I think HDFS-5477 takes us towards making the block management service generic enough to support different storage semantics and APIs. In that sense, the object store will be one more use case for block management. The object store design should work with the block management service.
Object store in HDFS Key: HDFS-7240 URL: https://issues.apache.org/jira/browse/HDFS-7240 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jitendra Nath Pandey Assignee: Jitendra Nath Pandey
This jira proposes to add object store capabilities into HDFS. As part of the federation work (HDFS-1052) we separated block storage into a generic storage layer. Using the Block Pool abstraction, new kinds of namespaces can be built on top of the storage layer, i.e., the datanodes. In this jira I will explore building an object store using the datanode storage, but independent of namespace metadata. I will soon update with a detailed design document.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7180) NFSv3 gateway frequently gets stuck
[ https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174429#comment-14174429 ] Brandon Li commented on HDFS-7180: -- Sorry for the late response. Thanks for filing the bug, [~ericzma]. The NFS gateway could be stuck in GC, causing its connection with the DN to time out, which makes the NFS gateway think the DN is bad. If this is the case, you can find lots of socket timeout exceptions in the DN logs. One cause of GC pressure is reordered writes arriving faster than they can be dumped to local disk; in that case, the NFS log should show nonSequentialWriteInMemory with a very large value (the trace level needs to be set to DEBUG). A sketch of this buffering pattern follows the log excerpt below. I will upload a patch soon.
NFSv3 gateway frequently gets stuck --- Key: HDFS-7180 URL: https://issues.apache.org/jira/browse/HDFS-7180 Project: Hadoop HDFS Issue Type: Bug Components: nfs Affects Versions: 2.5.0 Environment: Linux, Fedora 19 x86-64 Reporter: Eric Zhiqiang Ma Assignee: Brandon Li Priority: Critical
We are using Hadoop 2.5.0 (HDFS only) and start and mount the NFSv3 gateway on one node in the cluster to let users upload data with rsync. However, we find the NFSv3 daemon frequently gets stuck while HDFS itself works well (hdfs dfs -ls etc. work just fine). The last hang we found was after around 1 day of running and several hundred GBs of data uploaded. The NFSv3 daemon is started on one node and the NFS is mounted on the same node. From the node where the NFS is mounted, dmesg shows lines like this:
{code}
[1859245.368108] nfs: server localhost not responding, still trying
[1859245.368111] nfs: server localhost not responding, still trying
[1859245.368115] nfs: server localhost not responding, still trying
[1859245.368119] nfs: server localhost not responding, still trying
[1859245.368123] nfs: server localhost not responding, still trying
[1859245.368127] nfs: server localhost not responding, still trying
[1859245.368131] nfs: server localhost not responding, still trying
[1859245.368135] nfs: server localhost not responding, still trying
[1859245.368138] nfs: server localhost not responding, still trying
[1859245.368142] nfs: server localhost not responding, still trying
[1859245.368146] nfs: server localhost not responding, still trying
[1859245.368150] nfs: server localhost not responding, still trying
[1859245.368153] nfs: server localhost not responding, still trying
{code}
The mounted directory cannot be listed with `ls` and `df -hT` gets stuck too.
The latest lines from the nfs3 log in the hadoop logs directory:
{code}
2014-10-02 05:43:20,452 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
2014-10-02 05:43:20,461 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
2014-10-02 05:44:40,374 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:44:40,732 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:06,535 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:46:26,075 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:47:56,420 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:48:56,477 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:51:46,750 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:23,809 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:53:24,508 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:55:57,334 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:57:07,428 INFO org.apache.hadoop.hdfs.nfs.nfs3.OpenFileCtx: Have to change stable write to unstable write:FILE_SYNC
2014-10-02 05:58:32,609 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now
2014-10-02 05:58:32,610 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
2014-10-02 05:58:32,620 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 35
2014-10-02 05:58:32,628 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 54
2014-10-02 06:01:32,098 WARN org.apache.hadoop.hdfs.DFSClient: Slow ReadProcessor read fields took 60062ms (threshold=30000ms);
{code}
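To make the reordered-write cause above concrete, here is a minimal illustrative sketch, not the actual OpenFileCtx implementation: a gateway-style buffer that keeps out-of-order writes keyed by file offset and flushes only the contiguous prefix. When reordered writes arrive faster than the flush can drain them, the pending map grows and the resulting heap pressure can trigger the long GC pauses described above. Class and method names are invented for illustration.
{code}
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.TreeMap;

class PendingWrites {
  // Out-of-order writes, keyed by their file offset.
  private final TreeMap<Long, byte[]> pending = new TreeMap<>();
  private long nextOffsetToFlush = 0;

  synchronized void add(long offset, byte[] data) {
    pending.put(offset, data);
  }

  // Flush only writes that start exactly where the file currently ends;
  // anything after a gap stays buffered in memory.
  synchronized void flushSequentialPrefix(OutputStream out) throws IOException {
    Map.Entry<Long, byte[]> e;
    while ((e = pending.firstEntry()) != null && e.getKey() == nextOffsetToFlush) {
      out.write(e.getValue());
      nextOffsetToFlush += e.getValue().length;
      pending.pollFirstEntry();
    }
  }

  // Heap-pressure indicator, analogous in spirit to nonSequentialWriteInMemory.
  synchronized long bufferedBytes() {
    long total = 0;
    for (byte[] b : pending.values()) {
      total += b.length;
    }
    return total;
  }
}
{code}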
[jira] [Created] (HDFS-7256) Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation
Xiaoyu Yao created HDFS-7256: Summary: Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation Key: HDFS-7256 URL: https://issues.apache.org/jira/browse/HDFS-7256 Project: Hadoop HDFS Issue Type: Bug Components: encryption, security Affects Versions: 2.6.0 Reporter: Xiaoyu Yao
Hit a RemoteException: Key ezkey1 doesn't exist. when creating an EZ with a key created after the NN starts. I briefly checked the code and found that the KeyProvider is loaded by the FSN only at NN start. My workaround is to restart the NN, which triggers a reload of the KeyProvider. Is this expected?
Repro Steps:
Create a new key after the NN and KMS start:
hadoop/bin/hadoop key create ezkey1 -size 256 -provider jceks://file/home/hadoop/kms.keystore
List keys:
hadoop@SaturnVm:~/deploy$ hadoop/bin/hadoop key list -provider jceks://file/home/hadoop/kms.keystore -metadata
Listing keys for KeyProvider: jceks://file/home/hadoop/kms.keystore
ezkey1 : cipher: AES/CTR/NoPadding, length: 256, description: null, created: Thu Oct 16 18:51:30 EDT 2014, version: 1, attributes: null
key2 : cipher: AES/CTR/NoPadding, length: 128, description: null, created: Tue Oct 14 19:44:09 EDT 2014, version: 1, attributes: null
key1 : cipher: AES/CTR/NoPadding, length: 128, description: null, created: Tue Oct 14 17:52:36 EDT 2014, version: 1, attributes: null
Create an encryption zone:
hadoop/bin/hdfs dfs -mkdir /Ez1
hadoop@SaturnVm:~/deploy$ hadoop/bin/hdfs crypto -createZone -keyName ezkey1 -path /Ez1
RemoteException: Key ezkey1 doesn't exist.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7184) Allow data migration tool to run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174573#comment-14174573 ] Allen Wittenauer commented on HDFS-7184: +1
Allow data migration tool to run as a daemon Key: HDFS-7184 URL: https://issues.apache.org/jira/browse/HDFS-7184 Project: Hadoop HDFS Issue Type: Sub-task Components: balancer & mover, scripts Reporter: Benoy Antony Assignee: Benoy Antony Priority: Minor Attachments: HDFS-7184.patch, HDFS-7184.patch
Just like the balancer, it is sometimes required to run the data migration tool in daemon mode.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-7184) Allow data migration tool to run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174573#comment-14174573 ] Allen Wittenauer edited comment on HDFS-7184 at 10/17/14 1:10 AM: -- +1 Since I'm out of town, I'll let someone else commit it. If it isn't committed when I get back next week, I'll take care of it. :)
was (Author: aw): +1
Allow data migration tool to run as a daemon Key: HDFS-7184 URL: https://issues.apache.org/jira/browse/HDFS-7184 Project: Hadoop HDFS Issue Type: Sub-task Components: balancer & mover, scripts Reporter: Benoy Antony Assignee: Benoy Antony Priority: Minor Attachments: HDFS-7184.patch, HDFS-7184.patch
Just like the balancer, it is sometimes required to run the data migration tool in daemon mode.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7204) balancer doesn't run as a daemon
[ https://issues.apache.org/jira/browse/HDFS-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174578#comment-14174578 ] Allen Wittenauer commented on HDFS-7204: bq. Maybe we change the variable name daemon to something like run_via_dh (run via daemon handler) and add a comment like Allen summarized? Thanks.
Sure. Open a jira under hadoop common to rename daemon and I'll work something up. BTW, it's probably worth pointing out that if you look at hadoop-config.sh, you'll see where --daemon is specifically handled.
balancer doesn't run as a daemon Key: HDFS-7204 URL: https://issues.apache.org/jira/browse/HDFS-7204 Project: Hadoop HDFS Issue Type: Bug Components: scripts Affects Versions: 3.0.0 Reporter: Allen Wittenauer Assignee: Allen Wittenauer Priority: Blocker Labels: newbie Attachments: HDFS-7204-01.patch, HDFS-7204.patch
From HDFS-7184, minor issues with balancer:
* daemon isn't set to true in hdfs to enable daemonization
* start-balancer script has usage instead of hadoop_usage
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7256) Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation
[ https://issues.apache.org/jira/browse/HDFS-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174669#comment-14174669 ] Yi Liu commented on HDFS-7256: -- Thanks [~xyao] for testing this; this should not be an issue. Let me explain below. HDFS encryption at rest requires the user to configure a KMS, and the backing KeyProvider of the KMS can be a {{JavaKeyStoreProvider}} or a third-party keystore that implements the Hadoop {{KeyProvider}} interface. In your case, {{JavaKeyStoreProvider}} is used directly, so the FSN and the DFSClient each hold their own (separate) KeyProvider instance: the FSN uses its instance to get the encryption zone key and the encrypted data encryption keys, and the DFSClient uses its instance to decrypt the data encryption keys. {{JavaKeyStoreProvider}} uses a local java keystore file, so it cannot support access from multiple nodes. The hadoop key create ... command constructs its KeyProvider instance on the client side and creates/flushes the key to the java keystore file, but the FSN never reloads that file. That is why you see the exception. So please configure a KMS, whose backing KeyProvider can be a {{JavaKeyStoreProvider}}; for more information, please refer to the fs-encryption/KMS user docs. A sketch of the relevant configuration follows below.
Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation -- Key: HDFS-7256 URL: https://issues.apache.org/jira/browse/HDFS-7256 Project: Hadoop HDFS Issue Type: Bug Components: encryption, security Affects Versions: 2.6.0 Reporter: Xiaoyu Yao
Hit a RemoteException: Key ezkey1 doesn't exist. when creating an EZ with a key created after the NN starts. I briefly checked the code and found that the KeyProvider is loaded by the FSN only at NN start. My workaround is to restart the NN, which triggers a reload of the KeyProvider. Is this expected?
Repro Steps:
Create a new key after the NN and KMS start:
hadoop/bin/hadoop key create ezkey1 -size 256 -provider jceks://file/home/hadoop/kms.keystore
List keys:
hadoop@SaturnVm:~/deploy$ hadoop/bin/hadoop key list -provider jceks://file/home/hadoop/kms.keystore -metadata
Listing keys for KeyProvider: jceks://file/home/hadoop/kms.keystore
ezkey1 : cipher: AES/CTR/NoPadding, length: 256, description: null, created: Thu Oct 16 18:51:30 EDT 2014, version: 1, attributes: null
key2 : cipher: AES/CTR/NoPadding, length: 128, description: null, created: Tue Oct 14 19:44:09 EDT 2014, version: 1, attributes: null
key1 : cipher: AES/CTR/NoPadding, length: 128, description: null, created: Tue Oct 14 17:52:36 EDT 2014, version: 1, attributes: null
Create an encryption zone:
hadoop/bin/hdfs dfs -mkdir /Ez1
hadoop@SaturnVm:~/deploy$ hadoop/bin/hdfs crypto -createZone -keyName ezkey1 -path /Ez1
RemoteException: Key ezkey1 doesn't exist.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
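A minimal hdfs-site.xml sketch of the suggested setup, with a placeholder KMS host and port; {{dfs.encryption.key.provider.uri}} is the same property the reporter mentions using later in this thread:
{code}
<!-- Sketch only: point HDFS at the KMS rather than at a local keystore
     file; kms-host:16000 is a placeholder. -->
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@kms-host:16000/kms</value>
</property>
{code}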
[jira] [Resolved] (HDFS-7256) Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation
[ https://issues.apache.org/jira/browse/HDFS-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu resolved HDFS-7256. -- Resolution: Not a Problem I am marking this as Not a Problem; please feel free to reopen it if you have a different opinion.
Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation -- Key: HDFS-7256 URL: https://issues.apache.org/jira/browse/HDFS-7256 Project: Hadoop HDFS Issue Type: Bug Components: encryption, security Affects Versions: 2.6.0 Reporter: Xiaoyu Yao
Hit a RemoteException: Key ezkey1 doesn't exist. when creating an EZ with a key created after the NN starts. I briefly checked the code and found that the KeyProvider is loaded by the FSN only at NN start. My workaround is to restart the NN, which triggers a reload of the KeyProvider. Is this expected?
Repro Steps:
Create a new key after the NN and KMS start:
hadoop/bin/hadoop key create ezkey1 -size 256 -provider jceks://file/home/hadoop/kms.keystore
List keys:
hadoop@SaturnVm:~/deploy$ hadoop/bin/hadoop key list -provider jceks://file/home/hadoop/kms.keystore -metadata
Listing keys for KeyProvider: jceks://file/home/hadoop/kms.keystore
ezkey1 : cipher: AES/CTR/NoPadding, length: 256, description: null, created: Thu Oct 16 18:51:30 EDT 2014, version: 1, attributes: null
key2 : cipher: AES/CTR/NoPadding, length: 128, description: null, created: Tue Oct 14 19:44:09 EDT 2014, version: 1, attributes: null
key1 : cipher: AES/CTR/NoPadding, length: 128, description: null, created: Tue Oct 14 17:52:36 EDT 2014, version: 1, attributes: null
Create an encryption zone:
hadoop/bin/hdfs dfs -mkdir /Ez1
hadoop@SaturnVm:~/deploy$ hadoop/bin/hdfs crypto -createZone -keyName ezkey1 -path /Ez1
RemoteException: Key ezkey1 doesn't exist.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7221) TestDNFencingWithReplication fails consistently
[ https://issues.apache.org/jira/browse/HDFS-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174703#comment-14174703 ] Ming Ma commented on HDFS-7221: --- Thanks Yongjun and Charles for investigating this. I agree with the suggestion to increase the value of dfs.namenode.replication.max-streams-hard-limit. Please note that dfs.namenode.replication.max-streams is normally set less than or equal to dfs.namenode.replication.max-streams-hard-limit, since a larger value has no effect. So as part of this fix, you can change the value of dfs.namenode.replication.max-streams to 16 as well (a config sketch follows below). IMHO, per-node configuration is useful if you have heterogeneous nodes in the cluster, and its scope is much wider than these two properties; for example, there are other settings such as maxXceiverCount and balancer bandwidth. Heterogeneous storage might have addressed some of these issues. It should also be easy to manage, perhaps via some sort of label support in HDFS.
TestDNFencingWithReplication fails consistently --- Key: HDFS-7221 URL: https://issues.apache.org/jira/browse/HDFS-7221 Project: Hadoop HDFS Issue Type: Bug Components: test Affects Versions: 2.6.0 Reporter: Charles Lamb Assignee: Charles Lamb Priority: Minor Attachments: HDFS-7221.001.patch, HDFS-7221.002.patch
TestDNFencingWithReplication consistently fails with a timeout, both in jenkins runs and on my local machine.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
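A hypothetical hdfs-site.xml sketch of the values suggested above; 16 is the proposed setting for this test fix, not a shipped default:
{code}
<!-- Sketch only: keep max-streams less than or equal to the hard limit. -->
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>16</value>
</property>
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>16</value>
</property>
{code}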
[jira] [Commented] (HDFS-7180) NFSv3 gateway frequently gets stuck
[ https://issues.apache.org/jira/browse/HDFS-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174706#comment-14174706 ] Eric Zhiqiang Ma commented on HDFS-7180: [~brandonli]: Not at all, and many thanks for the analysis and confirmation! I checked the log on 10.0.3.176 and found the socket timeout exception between 10.0.3.172 and 10.0.3.176, as follows. --
{code}
2014-10-02 06:00:07,326 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 src: /10.0.3.172:37334 dest: /10.0.3.176:50010
2014-10-02 06:00:31,970 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 24097ms (threshold=300ms), isSync:true, flushTotalNanos=9424ns
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
at java.lang.Thread.run(Thread.java:745)
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2014-10-02 06:01:32,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1960069741-10.0.3.170-1410430543652:blk_1074363564_623643 received exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
2014-10-02 06:01:32,093 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: dstore-176:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.0.3.172:37334 dst: /10.0.3.176:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.3.176:50010 remote=/10.0.3.172:37334]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at
[jira] [Commented] (HDFS-7256) Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation
[ https://issues.apache.org/jira/browse/HDFS-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174744#comment-14174744 ] Xiaoyu Yao commented on HDFS-7256: -- Thanks [~hitliuyi] for the detailed explanation. I configured my test environment based on the HDFS-6134 proposal: https://issues.apache.org/jira/secure/attachment/12660368/HDFSDataatRestEncryption.pdf. Can you point me to the fs-encryption/KMS user doc if there is a different one? I do have a KMS set up with a JavaKeyStoreProvider pointing to the same java keystore file. Based on your suggestion, I just switched to using 'kms://http@localhost:16000/kms' instead of the java keystore file 'jceks://file/home/hadoop/kms.keystore' directly for 'dfs.encryption.key.provider.uri' in hdfs-site.xml and 'hadoop.security.crypto.jce.provider' in core-site.xml. Below are two follow-up questions from executing the 'hadoop key' command after the change. Can you confirm whether these are expected or not?
1. I have to specify -provider explicitly even though hadoop.security.crypto.jce.provider='kms://http@localhost:16000/kms' is configured in core-site.xml.
hadoop@hadoopdev:~/deploy$ hadoop/bin/hadoop key list
There are no non-transient KeyProviders configured. Use the -provider option to specify a provider. If you want to list a transient provider then you must use the -provider argument.
2. Keys are returned when -provider is specified, but a WARN message is logged in kms.log about an anonymous request. My understanding is that KMS should proxy user 'hadoop' based on the proxyuser setting below. Am I missing anything?
hadoop@hadoopdev:~/deploy$ hadoop/bin/hadoop key list -provider kms://http@localhost:16000/kms
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
key1
{code}
2014-10-16 22:08:38,386 WARN AuthenticationFilter - Authentication exception: Anonymous requests are disallowed
org.apache.hadoop.security.authentication.client.AuthenticationException: Anonymous requests are disallowed
at org.apache.hadoop.security.authentication.server.PseudoAuthenticationHandler.authenticate(PseudoAuthenticationHandler.java:184)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:330)
at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:507)
at org.apache.hadoop.crypto.key.kms.server.KMSAuthenticationFilter.doFilter(KMSAuthenticationFilter.java:129)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:745)
{code}
The client runs with user 'hadoop'.
The proxyuser and delegation token (using defaults) are set up in kms-site.xml:
{code}
<!-- proxyuser configuration for user named: hadoop -->
<property>
  <name>hadoop.kms.proxyuser.hadoop.users</name>
  <value>*</value>
</property>
...
{code}
Encryption Key created in Java Key Store after Namenode start unavailable for EZ Creation -- Key: HDFS-7256 URL: https://issues.apache.org/jira/browse/HDFS-7256 Project: Hadoop HDFS Issue Type: Bug Components: encryption, security Affects Versions: 2.6.0 Reporter: Xiaoyu Yao
Hit a RemoteException: Key ezkey1 doesn't exist. when creating an EZ with a key created after the NN starts. I briefly checked the code and found that the KeyProvider is loaded by the FSN only at NN start. My workaround is to restart the NN, which triggers a reload of the KeyProvider. Is this expected?
Repro Steps:
Create a new key after the NN and KMS start:
hadoop/bin/hadoop key create ezkey1 -size 256 -provider jceks://file/home/hadoop/kms.keystore
List keys:
hadoop@SaturnVm:~/deploy$ hadoop/bin/hadoop key list -provider
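Regarding question 1 above, one hedged possibility, assuming 2.6-era behavior: the hadoop key shell resolves its default provider from core-site.xml via hadoop.security.key.provider.path rather than hadoop.security.crypto.jce.provider, so a core-site.xml entry like the following sketch may make -provider unnecessary. This is an assumption about the reporter's setup, not a confirmed fix:
{code}
<!-- Hypothetical core-site.xml sketch: default provider for the key shell. -->
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@localhost:16000/kms</value>
</property>
{code}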